Fine-Grained Named Entity Typing over Distantly Supervised Data via Refinement in Hyperbolic Space
Muhammad Asif Ali, Yifang Sun, Bing Li, Wei Wang,
School of Computer Science and Engineering, UNSW, Australia; College of Computer Science and Technology, DGUT, China
{muhammadasif.ali, bing.li}@unsw.edu.au, {yifangs, weiw}@cse.unsw.edu.au

Abstract
Fine-Grained Named Entity Typing (FG-NET) aims at classifying entity mentions into a wide range of entity types (usually hundreds) depending upon the context. While distant supervision is the most common way to acquire supervised training data, it brings in label noise, as it assigns type labels to the entity mentions irrespective of the mentions' context. In attempts to deal with the label noise, leading research on FG-NET assumes that the fine-grained entity typing data possesses a euclidean nature, which restrains the ability of the existing models in combating the label noise. Given that the fine-grained type hierarchy exhibits a hierarchical structure, hyperbolic space is a natural choice to model the FG-NET data. In this research, we propose FGNET-RH, a novel framework that benefits from hyperbolic geometry in combination with graph structures to perform entity typing in a performance-enhanced fashion. FGNET-RH initially uses LSTM networks to encode the mention in relation with its context, and later forms a graph to distill/refine the mention's encodings in the hyperbolic space. Finally, the refined mention encoding is used for entity typing. Experimentation using different benchmark datasets shows that FGNET-RH improves the performance on FG-NET by up to 3.5% in terms of strict accuracy.
Keywords — FG-NET, Hyperbolic Geometry, Distant Supervision, Graph Convolution
Named Entity Typing (NET) is a fundamental operation in natural language processing; it aims at assigning discrete type labels to the entity mentions in the text. It has immense applications, including: knowledge base construction [7]; information retrieval [12]; question answering [18]; relation extraction [27], etc. Traditional NET systems work with only a coarse set of type labels, e.g., organization, person, location, etc., which severely limits their potential in the down-streaming tasks. In the recent past, the idea of NET has been extended to Fine-Grained Named Entity Typing (FG-NET), which assigns a wide range of correlated entity types to the entity mentions [13]. Compared to NET, FG-NET has shown a remarkable improvement in the subsequent applications. For example, Ling and Weld [13] showed that FG-NET can boost the performance of relation extraction by 93%.

FG-NET encompasses hundreds of correlated entity types with little contextual differences, which makes it labour-intensive and error-prone to acquire manually labeled training data. Therefore, distant supervision is widely used to acquire training data for this task. Distant supervision relies on: (i) automated routines to detect the entity mention, and (ii) using the type-hierarchy from existing knowledge bases, e.g., Probase [24], to assign type labels to the entity mention. However, it assigns type-labels to the entity mention irrespective of the mention's context, which results in label noise [20]. Examples in this regard are shown in Figure 1, where distant supervision assigns the labels {person, author, president, actor, politician} to the entity mention "Donald Trump", whereas, from a contextual perspective, it should be labeled as {person, president, politician} in S1, and {person, actor} in S2. Likewise, the entity mention "Vladimir Putin" should be labeled as {person, author} and {person, athlete} in S3 and S4 respectively. This label noise in turn propagates into the model learning and severely affects/limits the end-performance of the FG-NET systems.

Earlier research on FG-NET either ignored the label noise [13], or applied some heuristics to prune the noisy labels [8]. Ren et al. [19] bifurcated the training data into clean and noisy data samples, and used different sets of loss functions to model them. However, the modeling heuristics proposed by these models are not able to cope with the label noise, which limits the end-performance of the FG-NET systems relying on distant supervision. We, moreover, observe that these models are designed assuming a euclidean nature of the problem, which is inappropriate for FG-NET, as the fine-grained type hierarchy exhibits a hierarchical structure. Given that it is not possible to embed hierarchies in euclidean space [15], this assumption in turn limits the ability of the existing models to: (i) effectively represent FG-NET data, (ii) cater for label noise, and (iii) perform the FG-NET classification task in a robust way.

The inherent advantage of hyperbolic geometry for embedding hierarchies is well-established in the literature. It enforces the items at the top of the hierarchy to be placed close to the origin, and the items down in the hierarchy near infinity. This enables the embedding norm to cater to the depth in the hierarchy, while the distance between embeddings represents the similarity between the items.
[Figure 1: FG-NET training data acquired by distant supervision. (i) Entity "Donald Trump" with candidate entity types {person, author, president, actor, politician}; S2: "In his early career TV series, Donald Trump used to host the best clowns of time." (ii) Entity "Vladimir Putin" with candidate entity types {person, author, president, athlete, politician}; S3: "In his 2004 book: 'Judo: History, Theory, Practice' Putin discussed basics of Judo."; S4: "Vladimir Putin began judo classes in Russian capital, when he was just eleven." (iii) The type hierarchy T_ψ with base types PER, LOC, ORG and fine-grained types such as president, politician, actor, author, athlete.]
[Figure 2: (a) An illustration of how the entity type "President" shares the context of the entity type "Politician", which in turn shares the context of the entity type "Leader", and so on; (b) embedding FG-NET data in a 2-D Poincaré ball, where each disjoint type may be embedded along a different direction.]

Thus, the items sharing a parent node are close to each other in the embedding space. This makes the hyperbolic space a perfect paradigm for embedding the distantly supervised FG-NET data, as it explicitly allows label-smoothing by sharing the contextual information across noisy entity mentions corresponding to the same type hierarchy, as shown in Figure 2 (b) for a 2-D Poincaré ball. For example, given the type hierarchy "Person" ← "Leader" ← "Politician" ← "President", the hyperbolic embeddings, contrary to the euclidean embeddings, offer a perfect geometry for the entity type "President" to share and augment the context of "Politician", which in turn adds to the context of "Leader" and "Person", etc., as shown in Figure 2 (a). We hypothesize that such hierarchically-organized, contextually similar neighbours provide a robust platform for the end task, i.e., FG-NET over distantly supervised data, as also discussed in detail in Section 4.5.1.

Accordingly, we propose Fine-Grained Entity Typing with Refinement in Hyperbolic space (FGNET-RH), shown in Figure 3. FGNET-RH follows a two-stage process. Stage-I encodes the mention along with its context using multiple LSTM networks. Stage-II forms a graph to refine the mention's encoding from stage-I by sharing contextual information in the hyperbolic space. In order to maximize the benefits of using hyperbolic geometry in combination with the graph structure, FGNET-RH maps the mention encodings (from stage-I) to the hyperbolic space, and performs all the operations (linear transformation, type-specific contextual aggregation, etc.) in the hyperbolic space, as required for appropriate additive context-sharing along the type hierarchy to smoothen the noisy type-labels prior to entity typing. The major contributions of FGNET-RH are as follows:

1. FGNET-RH combines the benefits of graph structures and hyperbolic geometry to perform fine-grained entity typing over distantly supervised noisy data in a robust fashion.
2. FGNET-RH explicitly allows label-smoothing over the noisy training data by using graphs to combine the type-specific contextual information along the type-hierarchy in the hyperbolic space.
3. Experimentation using two models of the hyperbolic space, i.e., the Hyperboloid and the Poincaré-Ball, shows that FGNET-RH outperforms the existing research by up to 3.5% in terms of strict accuracy.

Existing research on FG-NET can be bifurcated into two major categories: (i) traditional feature-based systems, and (ii) embedding models.

Traditional feature-based systems rely on feature extraction, later using these features to train machine learning models for classification. Amongst them, Ling and Weld [13] developed FiGER, which uses hand-crafted features to develop a multi-label, multi-class perceptron classifier. Yosef et al. [29] developed HYENA, i.e., a hierarchical type classification model using hand-crafted features in combination with an SVM classifier. Gillick et al. [8] proposed context-dependent fine-grained typing using hand-crafted features along with a logistic regression classifier.
Shimaoka et al. [21] developed a neural architecture for fine-grained entity typing using a combination of automated and hand-crafted features.

Embedding models use widely available embedding resources with customized loss functions to form classification models. Yogatama et al. [28] used embeddings along with a Weighted Approximate Rank Pairwise (WARP) loss. Ren et al. [19] proposed AFET, which uses different sets of loss functions to model the clean and the noisy entity mentions. Abhishek et al. [1] proposed an end-to-end architecture to jointly learn the mention and the label embeddings. Xin et al. [25] used language models to compute the compatibility between the context and the entity type prior to entity typing. Choi et al. [4] proposed ultra-fine entity typing encompassing more than 10,000 entity types; they used crowd-sourced data along with the distantly supervised data for model training. Graph convolution networks were introduced in the recent past in order to extend the concept of convolutions from regular-structured grids to graphs [11]. Ali et al. [2] proposed an attentive convolutional network for fine-grained entity typing. Nickel and Kiela [15] illustrated the benefits of hyperbolic geometry for embedding graph-structured data. Chami et al. [3] combined graph convolutions with hyperbolic geometry. López et al. [14] used hyperbolic geometry for ultra-fine entity typing. To the best of our knowledge, we are the first to explore the combined benefits of graph convolution networks in relation with hyperbolic geometry for FG-NET over distantly supervised noisy data.
In this paper, we build a multi-class, multi-label entity typing system using distantly supervised data to classify an entity mention into a set of fine-grained entity types. Specifically, we propose attentive type-specific contextual aggregation in the hyperbolic space to fine-tune the mention's encodings learnt over noisy data prior to entity typing. We assume the availability of a training corpus $C_{train}$ acquired via distant supervision, and a manually labeled test corpus $C_{test}$. Each corpus $C_{(train/test)}$ encompasses a set of sentences. For each sentence, the contextual tokens $\{c_i\}_{i=1}^{N}$, the mention spans $\{m_i\}_{i=1}^{N}$ (corresponding to the entity mentions), and the candidate type labels $\{t_i\}_{i=1}^{N} \in \{0,1\}^{T}$ (a $T$-dimensional vector with $t_{i,x} = 1$ if the $x$-th type corresponds to the true label and zero otherwise) have been identified beforehand. The type labels are inferred from the type hierarchy in the knowledge base $\psi$ with the schema $T_{\psi}$. Similar to Ren et al. [19], we bifurcate the training data $D_{tr}$ into clean $D_{tr\text{-}clean}$ and noisy $D_{tr\text{-}noisy}$, depending on whether the corresponding mention's type-path follows a single path in the type-hierarchy $T_{\psi}$ or not. Following the type hierarchy in Figure 1 (iii), a mention with labels {person, author} will be considered as clean, whereas a mention with labels {person, president, author} will be considered as noisy.

Our proposed model, FGNET-RH, follows a two-step approach, labeled as stage-I and stage-II in Figure 3. Stage-I follows a text encoding pipeline to generate the mention's encoding in relation with its context. Stage-II is focused on label noise reduction; for this, we map the mention's encoding (from stage-I) to the hyperbolic space and use a graph to share aggregated type-specific contextual information along the type-hierarchy in order to refine the mention encoding. Finally, the refined mention encoding is embedded along with the label encodings in the hyperbolic space for entity typing. Details of each stage are given in the following sub-sections.
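To make the bifurcation criterion concrete, the following is a minimal Python sketch (ours, not the paper's code; the `parent` map and type names are illustrative) that tests whether a mention's candidate labels lie on a single path of $T_{\psi}$:

```python
def is_clean(labels, parent):
    """A mention is 'clean' if its label set lies on a single path of the
    type hierarchy T_psi, i.e., the ancestor path of the deepest label
    covers the whole label set; otherwise it is 'noisy'.
    `parent` maps each type to its parent type (None at the root)."""
    def ancestors(t):
        path = []
        while t is not None:
            path.append(t)
            t = parent.get(t)
        return path
    deepest = max(labels, key=lambda t: len(ancestors(t)))
    return set(labels) <= set(ancestors(deepest))

# Toy hierarchy echoing Figure 1: author/president/politician under person.
parent = {"author": "person", "president": "person", "politician": "person",
          "person": None}
print(is_clean({"person", "author"}, parent))               # True  -> clean
print(is_clean({"person", "president", "author"}, parent))  # False -> noisy
```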
Stage-I follows a standard text processing pipeline using multiple LSTM networks [9] to encode the entity mention in relation with its context. The individual components of stage-I are explained as follows:
Mention Encoding:
We use an LSTM network to encode the character sequence corresponding to the mention tokens. We use $\phi_e = [\overrightarrow{men}] \in \mathbb{R}^{e}$ to represent the encoded mention tokens.

Context Encoding:
For context encoding, we use multiple Bi-LSTM networks to encode the tokens corresponding to the left and the right context of the entity mention. We use $\phi_{c_l} = [\overleftarrow{c_l}; \overrightarrow{c_l}] \in \mathbb{R}^{c}$ and $\phi_{c_r} = [\overleftarrow{c_r}; \overrightarrow{c_r}] \in \mathbb{R}^{c}$ to represent the encoded left and right context respectively.

Position Encoding:
For position encoding, we use LSTM networks to encode the positions of the left and the right contextual tokens. We use $\phi_{p_l} = [\overleftarrow{l_p}] \in \mathbb{R}^{p}$ and $\phi_{p_r} = [\overrightarrow{r_p}] \in \mathbb{R}^{p}$ to represent the encoded positions corresponding to the mention's left and right context.

Mention Encodings:
Finally, we concatenate all the mention-specific encodings to get the $L$-dimensional noisy encoding $x_m \in \mathbb{R}^{L}$, where $L = e + 2c + 2p$:

$$x_m = [\phi_{p_l}; \phi_{c_l}; \phi_e; \phi_{c_r}; \phi_{p_r}] \quad (3.1)$$
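As an illustration of stage-I, here is a condensed PyTorch-style sketch of the encoders and the concatenation in Equation (3.1). It is a simplification under stated assumptions (a single shared context Bi-LSTM instead of separate left/right networks; layer names and sizes are illustrative), not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class StageIEncoder(nn.Module):
    """Sketch of stage-I: a character LSTM for the mention (phi_e), Bi-LSTMs
    for the left/right context (phi_cl, phi_cr), and position LSTMs
    (phi_pl, phi_pr), concatenated as in Eq. (3.1)."""
    def __init__(self, char_dim=64, word_dim=300, e=200, c=100, p=100):
        super().__init__()
        self.men_lstm = nn.LSTM(char_dim, e, batch_first=True)
        # Bidirectional with hidden size c//2 per direction, so phi_c* is in R^c.
        self.ctx_lstm = nn.LSTM(word_dim, c // 2, batch_first=True,
                                bidirectional=True)
        self.pos_lstm = nn.LSTM(1, p, batch_first=True)

    def forward(self, men_chars, left_ctx, right_ctx, left_pos, right_pos):
        _, (h_e, _) = self.men_lstm(men_chars)        # phi_e: (B, e)
        phi_cl = self.ctx_lstm(left_ctx)[0][:, -1]    # [<-c_l; ->c_l]: (B, c)
        phi_cr = self.ctx_lstm(right_ctx)[0][:, -1]   # [<-c_r; ->c_r]: (B, c)
        _, (h_pl, _) = self.pos_lstm(left_pos)        # phi_pl: (B, p)
        _, (h_pr, _) = self.pos_lstm(right_pos)       # phi_pr: (B, p)
        # Eq. (3.1): x_m = [phi_pl; phi_cl; phi_e; phi_cr; phi_pr]
        return torch.cat([h_pl[-1], phi_cl, h_e[-1], phi_cr, h_pr[-1]], dim=-1)

x_m = StageIEncoder()(torch.randn(4, 12, 64), torch.randn(4, 20, 300),
                      torch.randn(4, 20, 300), torch.randn(4, 20, 1),
                      torch.randn(4, 20, 1))
print(x_m.shape)  # torch.Size([4, 600]), i.e., L = e + 2c + 2p
```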
[Figure 3: Proposed model, i.e., FGNET-RH. Stage-I learns the mention's encodings based on local sentence-specific context: a character-level LSTM for the mention (e.g., "Trump" in "In my submissive opinion, the Trump cannot withstand such crowd."), Bi-directional LSTMs for the left and right context, and position LSTMs, yielding the noisy mention encoding (x_m). Stage-II takes the noisy encodings (X_m) and the adjacency matrix (A) as inputs and refines the encodings learnt in stage-I in the hyperbolic space, outputting the refined encodings (Φ_m) together with the label encodings.]

Stage-II is focused on alleviating the label noise. The underlying assumption in combating the label noise is that contextually similar mentions should get similar type labels. For this, we form a graph to cluster contextually-similar mentions and employ hyperbolic geometry to share the contextual information along the type-hierarchy. As shown in Figure 3, stage-II follows the pipeline below:
1. Construct a graph such that contextually and semantically similar mentions end up being neighbors in the graph.
2. Use the exponential map to project the noisy mention encodings from stage-I to the hyperbolic space.
3. In the hyperbolic space, use the corresponding exponential and logarithmic transformations to perform the core operations, i.e., (i) linear transformation, and (ii) contextual aggregation, required to fine-tune the encodings learnt in stage-I prior to entity typing.

We work with two models of the hyperbolic space, i.e., the Hyperboloid ($\mathbb{H}^d$) and the Poincaré-Ball ($\mathbb{D}^d$). In the following sub-sections, we provide the mathematical formulation for the Hyperboloid model of the hyperbolic space; a similar formulation can be derived for the Poincaré-Ball model.

The $d$-dimensional Hyperboloid model of the hyperbolic space (denoted by $\mathbb{H}^{d,K}$) is a space of constant negative curvature $-1/K$, with $\mathcal{T}_p\mathbb{H}^{d,K}$ as the euclidean tangent space at point $p$, such that:

$$\mathbb{H}^{d,K} = \{p \in \mathbb{R}^{d+1} : \langle p, p \rangle_{\mathcal{L}} = -K,\; p_0 > 0\}; \quad \mathcal{T}_p\mathbb{H}^{d,K} = \{r \in \mathbb{R}^{d+1} : \langle r, p \rangle_{\mathcal{L}} = 0\} \quad (3.2)$$

where $\langle \cdot, \cdot \rangle_{\mathcal{L}} : \mathbb{R}^{d+1} \times \mathbb{R}^{d+1} \rightarrow \mathbb{R}$ denotes the Minkowski inner product, with $\langle p, q \rangle_{\mathcal{L}} = -p_0 q_0 + p_1 q_1 + ... + p_d q_d$.

Geodesics and Distances:
For two points $p, q \in \mathbb{H}^{d,K}$, the distance between them is given by:

$$d^{K}_{\mathcal{L}}(p, q) = \sqrt{K}\, \mathrm{arccosh}(-\langle p, q \rangle_{\mathcal{L}} / K) \quad (3.3)$$
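For concreteness, a small numpy sketch (ours, not the paper's code) of the Minkowski inner product and the distance of Equation (3.3), with a clip guarding against numerical noise:

```python
import numpy as np

def minkowski_dot(p, q):
    """Minkowski inner product <p, q>_L = -p_0 q_0 + p_1 q_1 + ... + p_d q_d."""
    return -p[0] * q[0] + np.dot(p[1:], q[1:])

def hyp_distance(p, q, K=1.0):
    """Geodesic distance on the hyperboloid H^{d,K}, Eq. (3.3)."""
    inner = np.clip(-minkowski_dot(p, q) / K, 1.0, None)  # arccosh needs >= 1
    return np.sqrt(K) * np.arccosh(inner)

# The origin of H^{2,1} is o = (sqrt(K), 0, 0); its distance to itself is 0.
o = np.array([1.0, 0.0, 0.0])
print(hyp_distance(o, o))  # 0.0
```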
Exponential and Logarithmic maps: We use the exponential and logarithmic maps for mapping to and from the hyperbolic space and the tangent space respectively. Formally, given a point $p \in \mathbb{H}^{d,K}$ and a tangent vector $t \in \mathcal{T}_p\mathbb{H}^{d,K}$, the exponential map $\exp^{K}_{p} : \mathcal{T}_p\mathbb{H}^{d,K} \rightarrow \mathbb{H}^{d,K}$ assigns a point to $t$ such that $\exp^{K}_{p}(t) = \gamma(1)$, where $\gamma$ is the geodesic curve satisfying $\gamma(0) = p$ and $\dot{\gamma}(0) = t$. The logarithmic map ($\log^{K}_{p}$), being the bijective inverse, maps a point in the hyperbolic space to the tangent space at $p$. We use the following equations for the exponential and logarithmic maps:

$$\exp^{K}_{p}(v) = \cosh\Big(\frac{\|v\|_{\mathcal{L}}}{\sqrt{K}}\Big)\, p + \sqrt{K} \sinh\Big(\frac{\|v\|_{\mathcal{L}}}{\sqrt{K}}\Big) \frac{v}{\|v\|_{\mathcal{L}}} \quad (3.4)$$

$$\log^{K}_{p}(q) = d^{K}_{\mathcal{L}}(p, q)\, \frac{q + \frac{1}{K}\langle p, q \rangle_{\mathcal{L}}\, p}{\|q + \frac{1}{K}\langle p, q \rangle_{\mathcal{L}}\, p\|_{\mathcal{L}}} \quad (3.5)$$

Graph Construction: The end-goal of graph construction is to group the entity mentions in such a way that contextually similar mentions are clustered around each other by forming edges in the graph. Given that euclidean embeddings are better at capturing the semantic aspects of text data [6], we opt to use deep contextualized embeddings in the euclidean domain [17] for the graph construction. For each entity type, we average the embeddings of all corresponding mentions in the training corpus $C_{train}$ to learn a prototype vector for each entity type, i.e., $\{prototype_t\}_{t=1}^{T}$. Later, for each entity type $t$, we capture type-specific confident mention candidates $cand_t$, following the criterion: $cand_t = cand_t \cup men$ if $\cos(men, prototype_t) \geq \delta$, $\forall men \in C$, $\forall t \in T$, where $\delta$ is a threshold. Finally, we form pairwise edges among all the mention candidates corresponding to each entity type, i.e., $\{cand_t\}_{t=1}^{T}$, to construct the graph $G$ with adjacency matrix $A$.

Mapping to the Hyperbolic Space: The mention encodings learnt in stage-I are noisy, as they are learnt over distantly supervised data. These encodings lie in the euclidean space, and in order to refine them, we first map them to the hyperbolic space, where we may best exploit the fine-grained type hierarchy in relation with the type-specific context to fine-tune these encodings as an aggregate of contextually-similar neighbors. Formally, let $p^{E} = X_m \in \mathbb{R}^{N \times L}$ be the matrix corresponding to the noisy mention encodings in the euclidean domain. We consider $o = \{\sqrt{K}, 0, ..., 0\}$ as a reference point (origin) in a $d$-dimensional Hyperboloid with curvature $-1/K$ ($\mathbb{H}^{d,K}$), treat $(0, p^{E})$ as a point in the tangent space ($\mathcal{T}_o\mathbb{H}^{d,K}$), and map it to $p^{H} \in \mathbb{H}^{d,K}$ using the exponential map given in Equation (3.4), as follows:

$$p^{H} = \exp^{K}_{o}((0, p^{E})) = \Big(\sqrt{K} \cosh\Big(\frac{\|p^{E}\|_2}{\sqrt{K}}\Big),\; \sqrt{K} \sinh\Big(\frac{\|p^{E}\|_2}{\sqrt{K}}\Big) \frac{p^{E}}{\|p^{E}\|_2}\Big) \quad (3.6)$$

Linear Transformation: In order to perform the linear transformation operation on the noisy mention encodings, i.e., (i) multiplication by a weight matrix $W$, and (ii) addition of a bias vector $b$, we rely on the exponential and logarithmic maps. For multiplication with the weight matrix, we first apply the logarithmic map on the encodings in the hyperbolic space, i.e., $p^{H} \in \mathbb{H}^{d,K}$, in order to project them to $\mathcal{T}_o\mathbb{H}^{d,K}$. This projection is then multiplied by the weight matrix $W$, and the resultant vectors are projected back to the manifold using the exponential map. For a manifold with curvature constant $K$, these operations can be summarized as:

$$W \otimes p^{H} = \exp^{K}_{o}(W \log^{K}_{o}(p^{H})) \quad (3.7)$$
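The following numpy sketch (illustrative, with all maps anchored at the origin $o$ as in Equations (3.6)-(3.7)) implements the exponential and logarithmic maps of Equations (3.4)-(3.5) and the hyperbolic linear transformation of Equation (3.7):

```python
import numpy as np

def mdot(u, v):
    """Minkowski inner product <u, v>_L."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def mnorm(v):
    """Minkowski norm ||v||_L (real for tangent vectors)."""
    return np.sqrt(np.maximum(mdot(v, v), 0.0))

def exp_map(p, v, K=1.0):
    """Eq. (3.4): exponential map T_p H^{d,K} -> H^{d,K}."""
    n = mnorm(v)
    if n < 1e-12:
        return p
    return np.cosh(n / np.sqrt(K)) * p + np.sqrt(K) * np.sinh(n / np.sqrt(K)) * v / n

def log_map(p, q, K=1.0):
    """Eq. (3.5): logarithmic map H^{d,K} -> T_p H^{d,K} (inverse of exp_map)."""
    u = q + (1.0 / K) * mdot(p, q) * p
    dist = np.sqrt(K) * np.arccosh(np.clip(-mdot(p, q) / K, 1.0, None))
    return dist * u / mnorm(u)

def hyp_linear(W, p, K=1.0):
    """Eq. (3.7): W (x) p^H = exp_o^K(W log_o^K(p^H)); W acts on the spatial
    coordinates of the tangent vector at the origin o = (sqrt(K), 0, ..., 0)."""
    o = np.zeros(p.shape[0]); o[0] = np.sqrt(K)
    t = log_map(o, p, K)                     # tangent vector; t[0] == 0 at o
    t_spatial = W @ t[1:]                    # euclidean linear map in T_o
    return exp_map(o, np.concatenate(([0.0], t_spatial)), K)
```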
For bias addition, we rely on parallel transport. Let $b$ be the bias vector in $\mathcal{T}_o\mathbb{H}^{d,K}$; we parallel transport $b$ along the tangent space and finally map it to the manifold. Formally, let $T^{K}_{o \rightarrow p^{H}}$ represent the parallel transport of a vector from $\mathcal{T}_o\mathbb{H}^{d,K}$ to $\mathcal{T}_{p^{H}}\mathbb{H}^{d,K}$; we use the following equation for the bias addition:

$$p^{H} \oplus b = \exp^{K}_{p^{H}}(T^{K}_{o \rightarrow p^{H}}(b)) \quad (3.8)$$

Contextual Aggregation: Aggregation is a crucial step for noise reduction in FG-NET; it helps to smoothen the type-labels by refining/fine-tuning the noisy mention encodings, accumulating information from contextually similar neighbors lying at multiple hops. Given the graph $G$, with nodes ($V$) being the entity mentions, we use the pairwise embedding vectors along the edges of the graph to compute the attention weights $\eta_{ij} = \cos(men_i, men_j)\; \forall (i,j) \in V$. In order to perform the aggregation operation, we first use the logarithmic map to project the results of the linear transformation from the hyperbolic space to the tangent space. Later, we use the neighboring information contained in $G$ to compute the refined mention encoding as an attentive aggregate of the neighboring mentions. Finally, we map these results back to the manifold using the exponential map. Our methodology for contextual aggregation is summarized in the following equation:

$$\mathrm{AGG}_{cxtx}(p^{H})_i = \exp^{K}_{p^{H}_i}\Big(\sum_{j \in \mathcal{N}(i)} (\widetilde{\eta}_{ij} \odot A)\, \log^{K}(p^{H}_j)\Big) \quad (3.9)$$

where $\widetilde{\eta}_{ij} \odot A$ is the Hadamard product of the attention weights and the adjacency matrix $A$. It accommodates the degree of contextual similarity among the mention pairs in $G$.

Non-linear Activation: The contextually aggregated mention encoding is finally passed through a non-linear activation function $\sigma$ (ReLU in our case). For this, we follow similar steps, i.e., (i) map the encodings to the tangent space, (ii) apply the activation function in the tangent space, and (iii) map the results back to the hyperbolic space using the exponential map. These steps are summarized in the following equation:

$$\sigma(p^{H}) = \exp^{K}(\sigma(\log^{K}(p^{H}))) \quad (3.10)$$

Combining the above-mentioned steps, we get the refined mention encodings at the $l$-th layer, $z^{l,H}_{out}$, as follows:

$$p^{l,H} = W^{l} \otimes p^{l-1,H} \oplus b^{l}; \quad y^{l,H} = \mathrm{AGG}_{cxtx}(p^{l,H}); \quad z^{l,H}_{out} = \sigma(y^{l,H}) \quad (3.11)$$

Let $z^{l,H}_{out} \in \mathbb{H}^{d,K}$ correspond to the refined mention encodings hierarchically organized in the hyperbolic space. We embed them along with the fine-grained type label encodings $\{\phi_t\}_{t=1}^{T} \in \mathbb{H}^{d}$. For that, we learn a function $f(z^{l,H}_{out}, \phi_t) = \phi_t^{T} \times z^{l,H}_{out} + bias_t$, and separately learn the loss functions for the clean and the noisy mentions.
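Putting Equations (3.9)-(3.10) into code, the sketch below (reusing exp_map/log_map from the previous sketch) performs the attentive aggregation in the tangent space; for simplicity it anchors both maps at the origin $o$, whereas Equation (3.9) uses the point $p^{H}_i$ as the base of the exponential map:

```python
import numpy as np

def hyp_aggregate(P_H, A, men_euc, K=1.0):
    """Attentive aggregation, Eq. (3.9), sketched at the origin: project the
    hyperbolic encodings to the tangent space T_o, take the attention-weighted
    sum over graph neighbours, and map back to the manifold.
    P_H: (n, d+1) hyperbolic encodings; A: (n, n) 0/1 adjacency matrix of G;
    men_euc: (n, L) euclidean mention vectors, used only for the cosine
    attention weights eta_ij (names illustrative)."""
    norms = np.linalg.norm(men_euc, axis=1, keepdims=True)
    eta = (men_euc @ men_euc.T) / (norms * norms.T + 1e-12)  # eta_ij
    weights = eta * A                          # Hadamard product with A
    o = np.zeros(P_H.shape[1]); o[0] = np.sqrt(K)
    tangents = np.stack([log_map(o, p, K) for p in P_H])     # to tangent space
    return np.stack([exp_map(o, w @ tangents, K)             # attentive sum,
                     for w in weights])                      # back to manifold
```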
Loss for clean mentions: In order to model the clean entity mentions $D_{tr\text{-}clean}$, we use a margin-based loss to embed the refined mention encodings close to the true type labels ($T_y$) and push them away from the false type labels ($T_{y'}$). The loss function is summarized as follows:

$$\mathcal{L}_{clean} = \sum_{t \in T_y} \mathrm{ReLU}(1 - f(z^{l,H}_{out}, \phi_t)) + \sum_{t' \in T_{y'}} \mathrm{ReLU}(1 + f(z^{l,H}_{out}, \phi_{t'})) \quad (3.12)$$
Loss for noisy mentions: In order to model the noisy entity mentions $D_{tr\text{-}noisy}$, we use a variant of the above loss function to embed the mention close to the most relevant type label $t^{*} = \mathrm{argmax}_{t \in T_y} f(z^{l,H}_{out}, \phi_t)$ among the set of noisy type labels ($T_y$), and push it away from the irrelevant type labels ($T_{y'}$). The loss function is as follows:

$$\mathcal{L}_{noisy} = \mathrm{ReLU}(1 - f(z^{l,H}_{out}, \phi_{t^*})) + \sum_{t' \in T_{y'}} \mathrm{ReLU}(1 + f(z^{l,H}_{out}, \phi_{t'})) \quad (3.13)$$

Finally, we minimize $\mathcal{L}_{clean} + \mathcal{L}_{noisy}$ as the final loss function of FGNET-RH.
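A compact sketch of the two margin losses, Equations (3.12)-(3.13); the scores, mask, and function names are illustrative, not the authors' code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def typing_loss(scores, true_mask, clean):
    """Margin losses of Eqs. (3.12)-(3.13). scores[t] is f(z, phi_t) for each
    type t; true_mask[t] is 1 for candidate (distantly supervised) labels.
    Clean mentions pull *all* true labels above the margin; noisy mentions
    only pull the single most relevant label t* = argmax_{t in T_y} f(z, phi_t)."""
    pos, neg = scores[true_mask == 1], scores[true_mask == 0]
    if clean:
        loss_pos = relu(1.0 - pos).sum()      # Eq. (3.12): all true labels T_y
    else:
        loss_pos = relu(1.0 - pos.max())      # Eq. (3.13): only t*
    loss_neg = relu(1.0 + neg).sum()          # push away false labels T_y'
    return loss_pos + loss_neg

scores = np.array([2.3, 0.4, -1.7, -0.2])     # f(z, phi_t) for 4 types
mask   = np.array([1, 1, 0, 0])
print(typing_loss(scores, mask, clean=True))   # 1.4 (clean mention)
print(typing_loss(scores, mask, clean=False))  # 0.8 (noisy mention)
```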
We evaluate our model using a set of publicly available datasets for FG-NET. We chose these datasets because they contain a fairly large proportion of test instances, making the corresponding evaluation more concrete. Statistics of these datasets are shown in Table 1. The datasets are as follows:

BBN:
Its training corpus is acquired from the Wall Street Journal, annotated by [22] using DBpedia Spotlight.
OntoNotes:
It is acquired from the newswire documents contained in the OntoNotes corpus [23]. The training data is mapped to Freebase types via DBpedia Spotlight [5]. The testing data is manually annotated by Gillick et al. [8].
In order to set up a fair platform for comparative evaluation, we use the same data settings (training, dev and test splits) as used by all the models considered as baselines in Table 2. All the experiments are performed using an Intel Gold 6240 CPU with 256 GB main memory.
Dataset                        BBN      OntoNotes
Training Mentions              86078    220398
Testing Mentions               13187    9603
% clean mentions (training)    75.92    72.61
% clean mentions (testing)     100      94.0
Entity Types                   47       89
Table 1: Fine-Grained Named Entity Typing datasets
Model Parameters:
For stage-I, the hidden layer size of the context and position encoders is set to 100d. The hidden layer size of the mention character encoder is 200d. Character, position and label embeddings are randomly initialized. We report the model performance using 300d Glove [16] and 1024d deep contextualized embeddings [17]. For stage-II, we construct graphs with 5.4M and 0.6M edges for BBN and OntoNotes respectively. The curvature constant of the hyperbolic space is set to K = 1. All the models are trained using the Adam optimizer [10] with learning rate = 0.001.

Baseline Models: We evaluate FGNET-RH against the following baseline models: (i) FiGER [13]; (ii) HYENA [29]; (iii) AFET, AFET-NoCo and AFET-NoPa [19]; (iv) Attentive [21]; (v) FNET [1]; (vi) NFGEC + LME [25]; and (vii) FGET-RR [2]. For performance comparison, we use the scores reported in the original papers, as they are computed using the same data settings as ours. Note that we do not compare our model against [4, 14] because these models use crowd-sourced data in addition to the distantly supervised data for model training. Likewise, we exclude [26] from the evaluation because Xu and Barbosa changed the fine-grained problem definition from a multi-label to a single-label classification problem. This makes their problem settings different from ours, and the end results are no longer comparable.
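For reference, the hyperparameters reported above can be collected into a single configuration sketch (the dictionary layout is ours, not the authors'):

```python
# Hyperparameters reported for FGNET-RH (dictionary keys are illustrative).
CONFIG = {
    "stage1": {
        "context_hidden": 100,      # context and position encoders
        "position_hidden": 100,
        "char_hidden": 200,         # mention character encoder
        "word_emb": "Glove-300d",   # or 1024d deep contextualized (ELMO)
    },
    "stage2": {
        "graph_edges": {"BBN": 5.4e6, "OntoNotes": 0.6e6},
        "curvature_K": 1.0,         # hyperbolic curvature constant
    },
    "optimizer": {"name": "Adam", "lr": 0.001},
}
```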
The results of the proposed model are shown in Table 2. For each dataset, we boldface the best scores, with the existing state-of-the-art underlined. These results show that FGNET-RH outperforms the existing state-of-the-art models by a significant margin. For the BBN data, FGNET-RH achieves 3.5%, 1.2% and 1.5% improvement in strict accuracy, mac-F1 and mic-F1 respectively, compared to FGET-RR. For OntoNotes, FGNET-RH improves the mac-F1 and mic-F1 scores by 1.2% and 1.6%. These results show that FGNET-RH offers multi-faceted benefits, i.e., using the hyperbolic space in combination with graphs to encode the hierarchy, while at the same time catering to the noise in the best possible way. Especially, augmented context sharing along the hierarchy leads to a considerable improvement in performance compared to the baseline models.
In this section, we evaluate the impact of different model components on label de-noising. Specifically, we analyze the performance of FGNET-RH using variants of the adjacency graph, including: (i) a randomly generated adjacency graph of approximately the same size as $G$: FGNET-RH ($R$); (ii) an unweighted adjacency graph: FGNET-RH ($A$); and (iii) pairwise contextual similarity as the attention weights: FGNET-RH ($\widetilde{\eta} \odot A$). The results in Table 3 show that, for the given model architecture, the performance improvement (and correspondingly the noise reduction) can be attributed to using the appropriate adjacency graph. A drastic reduction in the model performance for FGNET-RH ($R$) shows that once the contextual similarity structure of the graph is lost, the label-smoothing is no longer effective. Likewise, the improvement in performance for the models FGNET-RH ($A$) and FGNET-RH ($\widetilde{\eta} \odot A$) implies that the adjacency graphs ($A$), and especially ($\widetilde{\eta} \odot A$), indeed incorporate the required type-specific contextual clusters at the needed level of granularity to effectively smoothen the noisy labels prior to entity typing.

In order to verify the effectiveness of refining the mention encodings in the hyperbolic space (stage-II), we perform label-wise performance analysis for the dominant labels in the BBN dataset. The corresponding results for the Hyperboloid and the Poincaré-Ball models (in Table 4) show that FGNET-RH outperforms the existing state-of-the-art, i.e., FGET-RR by Ali et al. [2], achieving higher F1-scores across all the labels. Note that FGNET-RH achieves higher performance for the base type labels (e.g., "/Person", "/Organization", "/GPE", etc.) as well as other type labels down in the hierarchy (e.g., "/Organization/Corporation", "/GPE/City", etc.). For "Organization" and "Corporation", FGNET-RH achieves higher F1=0.896 and F1=0.855 respectively, compared to F1=0.881 and F1=0.844 by FGET-RR. This is made possible because embedding in the hyperbolic space enables type-specific context sharing at each level of the type hierarchy by appropriately adjusting the norm of the label vector.

To further strengthen our claims regarding the effectiveness of using the hyperbolic space for FG-NET, we analyzed the context of the entity types along the type-hierarchy. We observed that, for the fine-grained type labels, the context is additive and may be arranged in a hierarchical structure, with the generic terms lying at the root and the specific terms lying along the children nodes. For example, "Government Organization", being a subtype of "Organization", adds tokens similar to {bill, treasury, deficit, fiscal, senate, etc.} to the context of "Organization". Likewise, "Hospital" adds tokens similar to {family, patient, kidney, stone, infection, etc.} to the context of "Organization".

This finding correlates with the norms of the label vectors, shown in Table 5 for the Poincaré-Ball model. The vector norm of the entity types deep in the hierarchy (e.g., "/Facility/Building", "/Facility/Bridge", "/Facility/Highway", etc.) is greater than that of the base entity type ("/Facility"). A similar trend is observed for the fine-grained types ("/Organization/Government", "/Organization/Political", etc.) compared to the base type ("/Organization").
This justifies that FGNET-RH indeed adjusts the norm of the label vector according to the depth of the type-label in the label-hierarchy, which allows the model to cluster the type-specific context along the hierarchy in an augmented fashion.

In addition, we also analyzed the entity mentions corrected specifically by the label-smoothing process, i.e., stage-II of FGNET-RH. For this, we examined the model performance with and without the label-smoothing, i.e., we separately built a classification model using the output of stage-I. For the BBN data, stage-II is able to correct about 18% of the mis-classifications made by stage-I. For example, in the sentence: "CNW Corp. said the final step in the acquisition of the company has been completed with the merger of CNW with a subsidiary of Chicago & amp.", the bold-faced entity mention
CNW is labeled {"/GPE"} by stage-I. However, after label-smoothing in stage-II, the label predicted by FGNET-RH is {"/Organization/Corporation"}, which indeed is the correct label. A similar trend was observed for the OntoNotes dataset. This analysis concludes that FGNET-RH, using a blend of contextual graphs and the hyperbolic space, incorporates the right geometry to embed the noisy FG-NET data with the lowest possible distortion. Compared to the euclidean space, the hyperbolic space, being non-euclidean, allows the graph volume (the number of nodes within a fixed radius) to grow exponentially along the hierarchy. This enables FGNET-RH to perform label-smoothing by forming type-specific contextual clusters across noisy mentions along the type hierarchy.

We analyzed the prediction errors of FGNET-RH and attribute them to the following factors:
Inadequate Context:
For these cases, the type-labels are dictated entirely by the mention tokens, with very little information contained in the context. For example, in the sentence: "The IRS recently won part of its long-running battle against John.", the entity mention "IRS" is labeled as {"/Organization/Corporation"} irrespective of any information contained in the mention's context. The limited information contained in the mention's context in turn limits the end-performance of FGNET-RH in predicting all possible fine-grained labels, thus affecting the recall. For the BBN dataset, more than 30% of the errors may be attributed to the inadequate mention context.
                                  OntoNotes                  BBN
Model                             strict  mac-F1  mic-F1    strict  mac-F1  mic-F1
FIGER [13]                        0.369   0.578   0.516     0.467   0.672   0.612
HYENA [29]                        0.249   0.497   0.446     0.523   0.576   0.587
AFET-NoCo [19]                    0.486   0.652   0.594     0.655   0.711   0.716
AFET-NoPa [19]                    0.463   0.637   0.591     0.669   0.715   0.724
AFET-CoH [19]                     0.521   0.680   0.609     0.657   0.703   0.712
AFET [19]                         0.551   0.711   0.647     0.670   0.727   0.735
Attentive [21]                    0.473   0.655   0.586     0.484   0.732   0.724
FNET-AllC [1]                     0.514   0.672   0.626     0.655   0.736   0.752
FNET-NoM [1]                      0.521   0.683   0.626     0.615   0.742   0.755
FNET [1]                          0.522   0.685   0.633     0.604   0.741   0.757
NFGEC+LME [25]                    0.529   0.724   0.652     0.607   0.743   0.760
FGET-RR [2] (Glove)               0.567   0.737   0.680     0.740   0.811   0.817
FGET-RR [2] (ELMO)                0.577   0.743   0.685     0.703   0.819   0.823
FGNET-RH (Hyperboloid + Glove)    –       –       –         –       –       –
FGNET-RH (Hyperboloid + ELMO)     0.575   –       –         –       –       –
FGNET-RH (Poincaré-Ball + Glove)  0.579   0.741   0.684     0.760   –       –
FGNET-RH (Poincaré-Ball + ELMO)   0.573   0.740   0.685     0.698   0.828   0.830
Table 2: FG-NET performance comparison against baseline models
Correlated Context:
The FG-NET type hierarchy encompasses semantically correlated entity types, e.g., {"Organization" vs "Corporation"}; {"Actor" vs "Artist"}; {"Actor" vs "Director"}; {"Ship" vs "Spacecraft"}; {"Coach" vs "Athlete"}, etc., with highly convoluted context. For example, the contexts of the entity types {"actor"} and {"artist"} overlap heavily; both contain semantically-related tokens like {direct, dialogue, dance, acting, etc.}. This high contextual overlap makes it hard for FGNET-RH to delineate the decision boundary across these correlated entity types. It leads to false predictions by the model, thus affecting the precision. For the BBN data set, more than 35% of the errors may be attributed to the correlated context.

Label Bias:
Label bias originating from the distant supervision may render the label-smoothing ineffective. This occurs specifically when all the labels originating from the distant supervision are incorrect. For the BBN data, approximately 5% of the errors may be attributed to the label bias.
[Table 3: FGNET-RH performance comparison (strict accuracy, mac-F1 and mic-F1 on OntoNotes and BBN) using the different adjacency matrices R, A, and η̃ ⊙ A, for both the Hyperboloid (H^d) and the Poincaré-Ball (D^d) models, with Glove embeddings.]
Labels          Support    FGET-RR [2]            FGNET-RH (Poincaré-Ball)   FGNET-RH (Hyperboloid)
                           Prec   Rec    F1       Prec   Rec    F1           Prec   Rec    F1
/Organization   45.30%     0.924  0.842  0.881    0.916  0.876  0.896        –      –      –
/GPE/City       9.17%      0.802  0.767  0.784    0.806  0.750  0.777        0.804  0.795  –
Table 4: Label-wise Precision, Recall and F1 scores for the BBN data, compared with FGET-RR [2]
Label                       Norm     Label                Norm
/Organization               0.855    /Facility            0.643
/Organization/Religious     0.860    /Facility/Building   0.725
/Organization/Government    0.870    /Facility/Bridge     0.745
/Organization/Political     0.875    /Facility/Highway    0.815
Table 5: FGNET-RH label-norms for the Poincaré-Ball model; the norm for the base type-labels is lower than that of the type-labels deep in the hierarchy

The rest of the errors may be attributed to the inability of FGNET-RH to explicitly deal with different word senses, the lack of in-depth syntactic analysis, the inadequacy of the underlying embedding models to handle semantics, etc. We plan to accommodate these aspects in future work.
In this paper, we introduced FGNET-RH, a novel approach that combines the benefits of graph structures and hyperbolic geometry to perform entity typing in a robust fashion. FGNET-RH initially learns noisy mention encodings using LSTM networks and constructs a graph to cluster contextually similar mentions using embeddings in the euclidean domain; later, it performs label-smoothing in the hyperbolic domain to refine the noisy encodings prior to entity typing. Performance evaluation using the benchmark datasets shows that FGNET-RH offers a perfect geometry for context sharing across distantly supervised data, and in turn outperforms the existing research on FG-NET by a significant margin.
References

[1] Abhishek, Ashish Anand, and Amit Awekar. Fine-grained entity type classification by jointly learning representations and label embeddings. In EACL (1), pages 797–807. Association for Computational Linguistics, 2017.
[2] Muhammad Asif Ali, Yifang Sun, Bing Li, and Wei Wang. Fine-grained named entity typing over distantly supervised data based on refined representations. In AAAI, pages 7391–7398. AAAI Press, 2020.
[3] Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. In NeurIPS, pages 4869–4880, 2019.
[4] Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. Ultra-fine entity typing. In ACL (1), pages 87–96. Association for Computational Linguistics, 2018.
[5] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS, pages 121–124. ACM, 2013.
[6] Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, and George E. Dahl. Embedding text in hyperbolic spaces. In TextGraphs@NAACL-HLT, pages 59–69. Association for Computational Linguistics, 2018.
[7] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In KDD, pages 601–610. ACM, 2014.
[8] Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. Context-dependent fine-grained entity type tagging. CoRR, abs/1412.1820, 2014.
[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
[11] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR (Poster). OpenReview.net, 2017.
[12] Ni Lao and William W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53–67, 2010.
[13] Xiao Ling and Daniel S. Weld. Fine-grained entity recognition. In AAAI. AAAI Press, 2012.
[14] Federico López, Benjamin Heinzerling, and Michael Strube. Fine-grained entity typing in hyperbolic space. In RepL4NLP@ACL, pages 169–180. Association for Computational Linguistics, 2019.
[15] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NIPS, pages 6338–6347, 2017.
[16] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543. ACL, 2014.
[17] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, pages 2227–2237. Association for Computational Linguistics, 2018.
[18] Deepak Ravichandran and Eduard Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 41–47. Association for Computational Linguistics, 2002.
[19] Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. AFET: automatic fine-grained entity typing by hierarchical partial-label embedding. In EMNLP, pages 1369–1378. The Association for Computational Linguistics, 2016.
[20] Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. Label noise reduction in entity typing by heterogeneous partial-label embedding. In KDD, pages 1825–1834. ACM, 2016.
[21] Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. An attentive neural architecture for fine-grained entity type classification. In AKBC@NAACL-HLT, pages 69–74. The Association for Computer Linguistics, 2016.
[22] Ralph Weischedel and Ada Brunstein. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia, 112, 2005.
[23] Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, et al. OntoNotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium, 2011.
[24] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Qili Zhu. Probase: a probabilistic taxonomy for text understanding. In SIGMOD Conference, pages 481–492. ACM, 2012.
[25] Ji Xin, Hao Zhu, Xu Han, Zhiyuan Liu, and Maosong Sun. Put it back: Entity typing with language model enhancement. In EMNLP, pages 993–998. Association for Computational Linguistics, 2018.
[26] Peng Xu and Denilson Barbosa. Neural fine-grained entity type classification with hierarchy-aware loss. In NAACL-HLT, pages 16–25. Association for Computational Linguistics, 2018.
[27] Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. Noise mitigation for neural entity typing and relation extraction. arXiv preprint arXiv:1612.07495, 2016.
[28] Dani Yogatama, Daniel Gillick, and Nevena Lazic. Embedding methods for fine grained entity type classification. In ACL (2), pages 291–296. The Association for Computer Linguistics, 2015.
[29] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA-live: Fine-grained online entity type classification from natural-language text. In