Enquire One's Parent and Child Before Decision: Fully Exploit Hierarchical Structure for Self-Supervised Taxonomy Expansion
Suyuchen Wang∗, [email protected], RALI & DIRO, Université de Montréal, Montréal, Québec, Canada
Ruihui Zhao, [email protected], Tencent Jarvis Lab, Shenzhen, Guangdong, China
Xi Chen, [email protected], Tencent Jarvis Lab, Shenzhen, Guangdong, China
Yefeng Zheng, [email protected], Tencent Jarvis Lab, Shenzhen, Guangdong, China
Bang Liu†, [email protected], RALI & DIRO, Université de Montréal, Montréal, Québec, Canada

ABSTRACT
Taxonomy is a hierarchically structured knowledge graph that plays a crucial role in machine intelligence. The taxonomy expansion task aims to find a position for a new term in an existing taxonomy, so as to capture emerging knowledge and keep the taxonomy dynamically updated. Previous taxonomy expansion solutions neglect the valuable information brought by the hierarchical structure and evaluate the correctness of merely an added edge, which downgrades the problem to node-pair scoring or mini-path classification. In this paper, we propose the Hierarchy Expansion Framework (HEF), which fully exploits the hierarchical structure's properties to maximize the coherence of the expanded taxonomy. HEF makes use of the taxonomy's hierarchical structure in multiple aspects: i) HEF utilizes subtrees containing the most relevant nodes as self-supervision data for a complete comparison of parental and sibling relations; ii) HEF adopts a coherence modeling module to evaluate the coherence of a taxonomy's subtree by integrating hypernymy relation detection and several tree-exclusive features; iii) HEF introduces the Fitting Score for position selection, which explicitly evaluates both path and level selections and takes full advantage of parental relations to interchange information for disambiguation and self-correction. Extensive experiments show that by better exploiting the hierarchical structure and optimizing the taxonomy's coherence, HEF vastly surpasses the prior state of the art on three benchmark datasets, with average improvements of 46.7% in accuracy and 32.3% in mean reciprocal rank.
KEYWORDS
taxonomy expansion, self-supervised learning, hierarchical structure
Taxonomy is a particular type of hierarchical knowledge graph that portrays the hypernym-hyponym or "is-A" relations of various concepts and entities. Taxonomies have been adopted as the underlying infrastructure of a wide range of online services in various domains, such as product catalogs for e-commerce [14, 22], scientific indices like MeSH [19], and lexical databases like WordNet [25]. A well-constructed taxonomy can assist various downstream tasks,

∗ Work done during an internship at Tencent Jarvis Lab.
† Corresponding author.
[Figure 1 here depicts a seed food taxonomy (food, beverage, nutriment, coffee, tea, vitamin, ...) expanded with the queries "cola", "espresso", and "vitamin C" by the Hierarchy Expansion Framework (HEF). It highlights the new goal, maximizing the coherence of the expanded taxonomy, and the new approaches: self-supervision on ego-trees, coherence modeling with hierarchy-exclusive features, and multi-dimensional evaluation of path and level selection via the Fitting Score.]
Figure 1: An illustration of the taxonomy expansion task and the contributions of the proposed HEF model.

including web content tagging [20, 27], web searching [46], personalized recommendation [10], and helping users achieve quick navigation in web applications [9]. Manually constructing and maintaining a taxonomy is laborious, expensive, and time-consuming. It is also highly inefficient and detrimental to downstream tasks to reconstruct a taxonomy from scratch [7, 41] whenever new terms need to be added. A more realistic strategy is to insert new terms ("queries") into an existing taxonomy, i.e., the seed taxonomy, as children of existing nodes ("anchors") without modifying the original structure, so as to best preserve its design. This problem is called taxonomy expansion [13].

Early taxonomy expansion approaches use terms that do not exist in the seed taxonomy, paired with their best-suited positions in the seed taxonomy, as training data [12]. However, this suffers from the insufficiency of training data and the lack of taxonomy structure supervision. More recent solutions adopt self-supervision and try to exploit the information of nodes in the seed taxonomy (seed nodes) to perform node-pair matching [34] or classification along mini-paths in the taxonomy [47]. However, these approaches do not fully utilize the characteristics of the taxonomy's hierarchical structure, and they neglect the coherence of the extended taxonomy, which ought to be the core of the taxonomy expansion task. More specifically, existing approaches do not model a hierarchical structure identical to the taxonomy. Instead, they use ego-nets [34] or mini-paths [47] and feature few or no tree-exclusive information, making them unable to extract or learn the complete hierarchical design of a taxonomy.
Besides, they do not consider the coherence of a taxonomy. They manage to find the most suitable node in a limited subgraph and only evaluate the correctness of a single edge instead of the expanded taxonomy, which downgrades the taxonomy expansion task to a hypernymy detection task. Lastly, their scoring approaches regard the anchor node as an individual node without considering hierarchical context information. However, the hierarchical structure provides multi-aspect criteria to evaluate a node, such as its path or level correctness. The structure also marks the nodes that are most likely to be wrongly chosen as the parent in a specific parental relation.

To solve the stated flaws of previous works, we propose the Hierarchy Expansion Framework (HEF), which aims to maximize the coherence of the expanded taxonomy instead of the fitness of a single edge, by fully exploiting the hierarchical structure of a taxonomy for self-supervised training as well as for modeling and evaluating the structure of the taxonomy.
HEF's designs and goals are illustrated in Fig. 1. Specifically, we make the following contributions.

Firstly, we design an innovative hierarchical data structure for self-supervision to mimic how humans construct a taxonomy. Relations in a taxonomy include hypernymy relations along a root-path and similarity among siblings. To find the most suitable parent node for the query term, human experts need to compare an anchor node with all its ancestors to distinguish the most appropriate one, and compare the query with its potential siblings to verify their similarity. For example, to choose the parent for the query "black tea" in the food taxonomy, the most appropriate anchor "tea" can only be selected by distinguishing it from its ancestors "beverage" and "food", which are all hypernyms of "black tea", and by comparing the query "black tea" with "tea"'s children like "iced tea" and "oolong" to guarantee similarity among siblings. Thus, we design a new structure called the "ego-tree" for self-supervision, which contains all ancestors and a sample of children of a node for taxonomy structure learning. Our ego-tree incorporates richer topological context information for attaching a query term to a candidate parent, with minimal computation cost compared to previous approaches based on node-pair matching or path information.

Secondly, we design a new modeling strategy that performs explicit ego-tree coherence modeling apart from traditional node-pair hypernymy detection. Instead of merely modeling the correctness of the added edge, we adopt a more comprehensive approach that detects whether the anchor's ego-tree, after adding the query, maintains the original design of the seed taxonomy.
The design of a taxonomy includes natural hypernymy relations, which require the representation of node-pair relations, and expert-curated level configurations, such as that species must be placed at the eighth level of a biological taxonomy, or that adding one more adjective to a term means exactly one level deeper in an e-commerce taxonomy. We adopt a coherence modeling module to detect these two aspects of coherence: i) for natural hypernymy relations, we adopt a hypernymy detection module to represent the relation between the query and each node in the anchor's ego-tree; ii) for expert-curated designs, we integrate hierarchy-exclusive features, such as embeddings of a node's absolute level and of its level relative to the anchor, into the coherence modeling module.

Thirdly, we design a multi-dimensional evaluation to score the coherence of the expanded taxonomy. The hierarchical structure of a taxonomy allows the model to evaluate the correctness of path selection and level selection separately, and the parental relationships in a hierarchy not only allow the model to disambiguate the most similar terms but also enable it to self-correct its level selection by deciding whether the current anchor's granularity is too coarse or too fine. We introduce the Fitting Score for the coherence evaluation of the expanded ego-tree, using a Pathfinder and a Stopper to score path correctness and level correctness, respectively. The Fitting Score calculation also disambiguates the most appropriate anchor from its parent and children and self-corrects the level selection by bringing the level suggestions from the anchor's parent and one of its children into consideration. The Fitting Score's optimization adopts a self-supervised multi-task training paradigm for the Pathfinder and Stopper, which automatically generates training data from the seed taxonomy to fully utilize its information.

We conduct extensive evaluations on three benchmark datasets to compare our method with state-of-the-art baseline approaches.
The results suggest that the proposed HEF model significantly surpasses the previous solutions on all three datasets, with average improvements of 46.7% in accuracy and 32.3% in mean reciprocal rank. A series of ablation studies further demonstrates that HEF can effectively perform the taxonomy expansion task.
Taxonomy Construction. Taxonomy construction aims to create a tree-structured taxonomy from a set of terms (such as concepts and entities) from scratch, integrating hypernymy discovery and tree structure alignment. It can be further separated into two subdivisions. The first focuses on topic-based taxonomies, where each node is a cluster of several terms sharing the same topic [32, 48]. The other subdivision tackles term-based taxonomy construction, in which each node represents a single term [3, 24, 35]. A typical pipeline for this task first extracts "is-A" relations with a hypernymy detection model, using either a pattern-based model [1, 8, 11, 28] or a distributional model [4, 18, 42, 45], and then integrates and prunes the mined hypernym-hyponym pairs into a single directed acyclic graph (DAG) or tree [7]. More recent solutions utilize hyperbolic embeddings [17] or transfer learning [31] to boost performance.
Taxonomy Expansion. In the taxonomy expansion task, an expert-curated seed taxonomy like MeSH [19] is provided as both the guidance and the base for adding new terms. Taxonomy expansion is a ranking task that maximizes a score between a node and its ground-truth parent in the taxonomy. Wang et al. [43] adopted a Dirichlet distribution to model parental relations. ETF [40] trained a learning-to-rank framework with handcrafted structural and semantic features. Arborist [23] calculated the ranking score in a bilinear form and adopted a margin ranking loss. TaxoExpan [34] modeled the anchor node by passing messages over its egonet instead of considering a single node, and scored by feeding a concatenation of the egonet representation and the query embedding to a feed-forward layer. STEAM [47] transformed the scoring task into a classification task on mini-paths and ensembled three sub-models processing distributional, contextual, and lexical-syntactic features, respectively. However, existing approaches mostly neglect the characteristics of the taxonomy's hierarchical structure and only evaluate the correctness of a single edge from anchor to query. In contrast, our method utilizes the features and relations brought by the hierarchical structure and aims to enhance the expanded taxonomy's overall coherence.

Modeling of Tree-Structured Data. Taxonomy expansion involves modeling a tree or graph structure. Plenty of works have been devoted to extending recurrent models to tree structures, like Tree-LSTM [38]. For explicit tree-structure modeling, previous approaches include modeling the likelihood of a Bayesian network [6, 43] or using graph neural network variants [34, 47]. Recently, Transformers [39] achieved state-of-the-art performance on the program translation task by designing a novel positional encoding related to paths in the tree [36] or by simply transforming a tree into a sequence by traversing its nodes [15].
In our work, we model the tree structure with a Transformer encoder, which, to the best of our knowledge, is the first use of the Transformer for taxonomy modeling. We adopt a more natural setting than [36] by using two different embeddings, one for a node's absolute level and one for its level relative to the query, to denote positions.
In this section, we provide the formal definition of the taxonomy expansion task and an explanation of key concepts that will occur in the following sections.
Definition and Concepts about Taxonomy.
A taxonomy T = (N, E) is an arborescence that presents hypernymy relations among a set of nodes. Each node n ∈ N represents a term, usually a concept mined from a large online corpus or an artificially extracted phrase. Each edge ⟨n_p, n_c⟩ ∈ E points from a node's most exact hypernym to the node itself, where n_p is n_c's parent node and n_c is n_p's child node. Since the hypernymy relation is transitive [29], such relations exist not only in node pairs connected by a single edge, but also in node pairs connected by a path in the taxonomy. Thus, for a node n in the taxonomy, its hypernym set and hyponym set consist of its ancestors A_n and its descendants D_n, respectively.

Definition of the Taxonomy Expansion Task.
Given a seed taxonomy T = (N, E) and a set of terms C to be added to it, the model outputs the expanded taxonomy T′ = (N ∪ C, E ∪ R), where R is the set of newly added relations from seed nodes in N to new terms in C. More specifically, during the inference phase of a taxonomy expansion model, given a query node q ∈ C, the model finds its best-suited parent node by iterating over each node in the seed taxonomy as an anchor node a ∈ N, calculating a score f(a, q) representing the suitability of adding the edge ⟨a, q⟩, and deciding q's parent p_q in the taxonomy by p_q = argmax_{a ∈ N} f(a, q).

Accessible External Resources.
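This inference procedure can be sketched as a simple argmax over anchors (the scoring function f is assumed to be given; all names here are illustrative, not from the paper):

```python
def expand_taxonomy(seed_nodes, queries, score_fn):
    """Attach each query under the seed node that maximizes score_fn(anchor, query)."""
    parents = {}
    for q in queries:
        # p_q = argmax_{a in N} f(a, q)
        parents[q] = max(seed_nodes, key=lambda a: score_fn(a, q))
    return parents
```

The framework's contribution lies entirely in how f(a, q) is defined; the outer loop stays this simple.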
As a term's surface name is usually insufficient to convey the semantic information needed for hypernymy relationship detection, previous research usually utilizes term definitions [13, 34] or related web pages [16, 43] to learn term representations. Besides, existing hypernymy detection solutions usually use large external corpora to discover lexical or syntactic patterns [37, 47]. As for the SemEval-2016 Task 13 datasets [2] used for our model's evaluation, utilizing the WordNet [25] definitions is allowed by the original task, which guarantees a fair comparison with previous solutions.
In this section, we introduce the design of the Hierarchy Expansion Framework (HEF). An illustration of HEF is shown in Fig. 2. We first introduce the way HEF models the coherence of a tree structure, including two components for node-pair hypernymy detection and ego-tree coherence modeling, respectively. Then, we discuss how HEF further exploits the hierarchical structure for multi-dimensional evaluation via the Pathfinder and Stopper modules, and the self-supervised paradigm that trains the model for the Fitting Score calculation.
We first introduce the hypernymy detection module of HEF, which detects the hypernymy relationship between two terms. Unlike previous approaches that manually design a set of classical lexical-syntactic features, we accomplish the task more directly and automatically by expanding the surface names of terms to their descriptions and utilizing pre-trained language models to represent the relationship between the two terms.

Given a seed term n ∈ N and a query term q ∈ N during training or q ∈ C during inference, the hypernymy detection module outputs a representation r_{n,q} suggesting how well these two terms form a hypernymy relation. Note that n might not be identical to the anchor a. Since the surface names of terms do not contain sufficient information for relation detection, we expand the surface names to their descriptions, enabling the model to better understand the semantics of new terms. We utilize the WordNet [25] concept definitions for this task. However, WordNet cannot explain all terms in a taxonomy due to its limited coverage. Besides, many terms used in taxonomies are complex phrases like "adaptation to climate change" or "bacon lettuce tomato sandwich". Therefore, we further develop a description generation algorithm descr(·), which generates meaningful and domain-related descriptions for a given term based on WordNet. Specifically, descr(·) is a dynamic programming algorithm that tends to integrate tokens into longer, explainable noun phrases. It describes each noun phrase by the description most relevant to the taxonomy root's surface name for domain relevance. The details are shown in Alg. 2 in the appendix. The input for hypernymy detection is organized in the input format of a Transformer:

D_{n,q} = [⟨CLS⟩; descr(n); ⟨SEP⟩; descr(q); ⟨SEP⟩],   (1)
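The pair input can be sketched as plain token assembly (the bracketed-token form of D_{n,q} above is reconstructed from standard BERT-style pair inputs, and `descr` is stubbed here; both are illustrative):

```python
def build_pair_input(term_a, term_b, descr, max_tokens=64):
    """Assemble a [CLS] descr(n) [SEP] descr(q) [SEP] token sequence,
    truncated or padded to a fixed length as in the paper's setup."""
    tokens = ["[CLS]", *descr(term_a).split(), "[SEP]", *descr(term_b).split(), "[SEP]"]
    tokens = tokens[:max_tokens]                    # truncate long descriptions
    tokens += ["[PAD]"] * (max_tokens - len(tokens))  # pad short ones
    return tokens
```

A real implementation would use the pre-trained encoder's own tokenizer rather than whitespace splitting.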
Figure 2: Illustration of the HEF model. Each circle denotes a seed node or a query node. The "anchor's child*" in the Fitting Score calculation denotes the anchor's child with the maximum Pathfinder Score S_p.

We utilize a pre-trained DistilBERT encoder as the node-pair encoder to learn representations of cross-text relationships. Specifically, we first encode D_{n,q} by DistilBERT(·) with positional encoding. Then we take the final-layer representation of the ⟨CLS⟩ token as r_{n,q}.

The input of the coherence modeling module sums the following embeddings for each node n in the anchor's ego-tree and the query:

• Node-pair relation representation. The representation r_{n,q} is the output of the hypernymy detection module described in Sec. 4.1. It suggests the node pair's relation.
• Absolute level embedding.
The absolute level embedding l_{n,q} = AbsLvlEmb(d_n), where d_n is the depth of n in the expanded taxonomy. When n = q, l_{q,q} = AbsLvlEmb(d_a + 1). It assists the modeling of expert-curated designs about the granularity of a certain level.
• Relative level embedding.
The relative level embedding e_{n,q} = RelLvlEmb(d_n − d_q), where d_n and d_q are the depths of n and q in the expanded taxonomy. It assists the modeling of expert-curated designs about cross-level comparison.
• Segment embedding.
The segment embedding of ⟨n, q⟩, g_{n,q} = SegEmb(segment(n)), distinguishes the anchor and the query from the other nodes in the ego-tree, where:

segment(n) = 0, if n is the anchor; 1, if n is the query; 2, otherwise.

[Figure 3 here illustrates the self-supervision data labels for the Pathfinder and Stopper on a food taxonomy: anchors off the query's root-path receive Pathfinder Score 0 (wrong path) and anchors on it receive 1 (right path); the Stopper tag is FORWARD when a child is the better anchor, CURRENT when the level is right, and BACKWARD when the parent is the better anchor.]

The input of the coherence modeling module, R_{a,q} ∈ R^{(|H_a|+3)×d}, is the sum of the above embeddings calculated over the anchor's ego-tree H_a and the query, organized as the input of a Transformer:

R_{a,q} = e_{⟨CLS⟩1} ⊕ e_{⟨CLS⟩2} ⊕ (⊕_{n ∈ H_a ∪ {q}} (r_{n,q} + l_{n,q} + e_{n,q} + g_{n,q})),   (2)

where d is the embedding dimension, e_{⟨CLS⟩1} and e_{⟨CLS⟩2} are randomly initialized placeholder vectors for obtaining the ego-tree's path and level coherence representations, and ⊕ denotes concatenation.

We implement the coherence modeling module with a Transformer encoder. Transformers are powerful sequence models, but they lack positional information for processing relations among nodes in graphs. In a hierarchy like a taxonomy, however, the level of a node can serve as positional information, which simultaneously eliminates the positional difference between nodes on the same level. Transformers are also strong at integrating multi-source information by adding the corresponding embeddings, which makes them well suited for modeling tree structures. In our HEF model, as shown in Fig. 2, the two ⟨CLS⟩ placeholders yield the path coherence representation p_{a,q} and the level coherence representation d_{a,q} from the encoder's final layer.

We now introduce how HEF performs multi-dimensional evaluation with a Pathfinder for path selection and a Stopper for level selection, as well as a new self-supervised learning algorithm for training and the Fitting Score calculation for inference.
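Assembling the coherence module's input in Eqn. (2) can be sketched as follows (the dimension, table sizes, and depth offset are illustrative placeholders, not the paper's actual parameters):

```python
import numpy as np

D = 8                                  # embedding dimension (illustrative; the paper uses 768)
rng = np.random.default_rng(0)
abs_lvl = rng.normal(size=(16, D))     # AbsLvlEmb, indexed by absolute depth
rel_lvl = rng.normal(size=(33, D))     # RelLvlEmb, indexed by (d_n - d_q), shifted by +16
seg_emb = rng.normal(size=(3, D))      # SegEmb: 0 = anchor, 1 = query, 2 = other
cls_path, cls_level = rng.normal(size=D), rng.normal(size=D)

def build_R(ego_tree, anchor, anchor_depth, rel_repr):
    """Assemble R_{a,q}: two <CLS> placeholders, then the summed embeddings
    of every ego-tree node and the query (Eqn. 2).
    ego_tree: list of (name, depth); rel_repr: name -> r_{n,q} vector."""
    d_q = anchor_depth + 1             # the query attaches one level below the anchor
    rows = [cls_path, cls_level]
    for name, d_n in ego_tree + [("<query>", d_q)]:
        s = 0 if name == anchor else (1 if name == "<query>" else 2)
        rows.append(rel_repr[name] + abs_lvl[d_n] + rel_lvl[d_n - d_q + 16] + seg_emb[s])
    return np.stack(rows)              # shape: (|H_a| + 3, D)
```

The stacked matrix is what the Transformer encoder consumes; the two leading rows end up as p_{a,q} and d_{a,q}.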
Pathfinder. The Pathfinder detects whether the query is positioned on the right path. This module performs a binary classification using p_{a,q}. The target Pathfinder Score S_p is 1 when a and q are on the same root-path, and 0 otherwise:

S_p(a, q) = σ(W_p2 tanh(W_p1 p_{a,q} + b_p1) + b_p2),   (3)

where σ is the sigmoid function, and W_p1, W_p2, b_p1, b_p2 are trainable parameters of a multi-layer perceptron.

Stopper.
The Stopper detects whether the query q is placed on the right level, i.e., under the most appropriate anchor a on a particular path. Selecting the right level is not identical to selecting the right path, since levels are ordered. The order of nodes on a path enables us to design a more expressive module that further classifies whether the current level is too high (anchor a is a coarse-grained ancestor of q) or too low (a is a descendant of q). Thus, the Stopper module uses d_{a,q} to perform a 3-class classification: the search for a better anchor node needs to go Forward, remain Current, or go Backward in the taxonomy:

[S_f(a, q), S_c(a, q), S_b(a, q)] = softmax(W_s2 tanh(W_s1 d_{a,q} + b_s1) + b_s2),   (4)

where W_s1, W_s2, b_s1, b_s2 are trainable parameters of a multi-layer perceptron. The Forward Score S_f, Current Score S_c, and Backward Score S_b are collectively called Stopper Scores.

Self-Supervised Training.
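The two scoring heads in Eqns. (3) and (4) can be sketched as two-layer perceptrons (weights here are randomly initialized placeholders; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 6                               # input/hidden sizes (the paper uses 768/300)

def head_params(out_dim):
    return rng.normal(size=(H, D)), np.zeros(H), rng.normal(size=(out_dim, H)), np.zeros(out_dim)

W_p1, b_p1, W_p2, b_p2 = head_params(1)   # Pathfinder head
W_s1, b_s1, W_s2, b_s2 = head_params(3)   # Stopper head

def pathfinder_score(p_aq):
    """Eqn. 3: sigmoid over a two-layer MLP on the path representation."""
    z = W_p2 @ np.tanh(W_p1 @ p_aq + b_p1) + b_p2
    return float(1.0 / (1.0 + np.exp(-z[0])))

def stopper_scores(d_aq):
    """Eqn. 4: softmax over {Forward, Current, Backward} on the level representation."""
    z = W_s2 @ np.tanh(W_s1 @ d_aq + b_s1) + b_s2
    e = np.exp(z - z.max())                # numerically stable softmax
    return e / e.sum()
```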
Training the HEF model requires data labels for both the Pathfinder and the Stopper. The tagging scheme is illustrated in Fig. 3. There are in total four kinds of Pathfinder-Stopper label combinations, since the Pathfinder Score is always 1 when the Stopper tag is Forward or Current. The training process of HEF is shown in Alg. 1. Specifically, we sample the ego-trees of all four types of nodes for a query: q's parent a, a's ancestors, a's descendants, and other nodes, as a mini-batch for training the Pathfinder and Stopper simultaneously.

The optimization of the Pathfinder and the Stopper can be regarded as a multi-task learning process. The loss L_q in Alg. 1 is a linear combination of the losses from the Pathfinder and the Stopper:

L_q = (η / |X_q|) Σ_{a ∈ X_q} BCELoss(Ŝ_p(a, q), S_p(a, q)) − ((1 − η) / |X_q|) Σ_{a ∈ X_q} Σ_{k ∈ {f,c,b}} Ŝ_k(a, q) log S_k(a, q),   (5)

where BCELoss(·) denotes the binary cross entropy, Ŝ denotes the ground-truth labels, and η is the multi-task learning weight.

Fitting Score-based Inference.
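Eqn. (5) can be sketched in pure Python (the anchor-set layout and field names are illustrative):

```python
import math

def hef_loss(anchors, eta=0.9):
    """Weighted sum of the Pathfinder's binary cross entropy and the Stopper's
    cross entropy, averaged over the sampled anchor set X_q (Eqn. 5).
    Each anchor dict holds predicted scores s_p, s_fcb and labels y_p, y_fcb."""
    bce = ce = 0.0
    for a in anchors:
        # binary cross entropy for the Pathfinder
        bce += -(a["y_p"] * math.log(a["s_p"]) + (1 - a["y_p"]) * math.log(1 - a["s_p"]))
        # cross entropy for the Stopper over {Forward, Current, Backward}
        ce += -sum(y * math.log(s) for y, s in zip(a["y_fcb"], a["s_fcb"]) if y > 0)
    n = len(anchors)
    return eta * bce / n + (1 - eta) * ce / n
```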
During inference, the evaluation of an anchor-query pair ⟨a, q⟩ should consider both the Pathfinder's path evaluation and the Stopper's level evaluation. However, instead of merely using S_p and S_c, the multi-class Stopper also enables the HEF model to disambiguate the most suited anchor from its neighbors (its direct parent and children) and to self-correct its level prediction by exchanging scores with its neighbors, finding the best position for maintaining the taxonomy's coherence.

Algorithm 1: Self-Supervised Training Process of HEF.
procedure TrainEpoch(T, Θ)
  for q ∈ N − root(T) do                       ▷ Root is not used as query
    X_q ← {}                                   ▷ Initialize anchor set
    p ← parent(q)                              ▷ Reference node for labeling
    X_q ← X_q ∪ {p}                            ▷ Ground-truth parent: S_p = 1, S_c = 1
    X_q ← X_q ∪ sample(A_p)                    ▷ Ancestors: S_p = 1, S_f = 1
    X_q ← X_q ∪ sample(D_p)                    ▷ Descendants: S_p = 1, S_b = 1
    X_q ← X_q ∪ sample(N − {p} − A_p − D_p)    ▷ Other nodes: S_p = 0, S_b = 1
    for a ∈ X_q do
      Compute S_p(a, q) using Eqn. 3
      Compute S_f(a, q), S_c(a, q), S_b(a, q) using Eqn. 4
    end for
    Compute L_q with S_p, S_f, S_c, S_b using Eqn. 5
    Θ ← optimize(Θ, L_q)
  end for
  return Θ
end procedure

Thus, we introduce the Fitting Score function for inference. For a new query term q ∈ C, we first obtain the Pathfinder Scores and Stopper Scores of all node pairs ⟨a, q⟩, a ∈ N. For each anchor node a, we compute its Fitting Score as the product of the following four items:

• a's Pathfinder Score S_p(a, q), which suggests whether a is on the right path.
• a's parent's Forward Score S_f(parent(a), q), which distinguishes a from a's parent and rectifies a's Current Score. When a is the root node, we assign this item a small positive constant.
• a's Current Score S_c(a, q), which suggests whether a is on the right level.
• The Backward Score of a's child with the maximum Pathfinder Score, S_b(c*_a, q) with c*_a = argmax_{c_a ∈ child(a)} S_p(c_a, q), which distinguishes a from a's children and rectifies a's Current Score. Since a might have multiple children, we pick the child with the maximum Pathfinder Score, as a larger S_p indicates a better hypernymy relation to q. When a is a leaf node, we assign this item the proportion of leaf nodes in the seed taxonomy to preserve its overall design.

The Fitting Score of ⟨a, q⟩ is given by:

F(a, q) = S_p(a, q) · S_f(parent(a), q) · S_c(a, q) · S_b(c*_a, q),   (6)
c*_a = argmax_{c_a ∈ child(a)} S_p(c_a, q).

The Fitting Score can be computed from the ordered S_p, S_f, S_c, S_b arrays and the seed taxonomy's adjacency matrix. Since a tree's adjacency matrix is sparse, the time complexity of the Fitting Score computation is low. After calculating the Fitting Scores between all seed nodes and the query, we select the seed node with the highest Fitting Score as the query's parent in the expanded taxonomy:

parent(q) := argmax_{a ∈ N} F(a, q).   (7)

Table 1: Statistics of datasets. |N| and |E_O| are the numbers of nodes and edges in the original datasets, respectively; D is the depth of the taxonomy. We adopt the spanning tree of each dataset, and |E| is the number of remaining edges.

Dataset          |N|   |E_O|  |E|   D
SemEval16-Env    261   261    260   6
SemEval16-Sci    429   452    428   8
SemEval16-Food
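The Fitting Score computation of Eqns. (6)-(7) can be sketched as follows (the tree layout, the small root constant `eps`, and all score values are illustrative assumptions):

```python
def fitting_scores(tree, S_p, S_f, S_c, S_b, leaf_ratio, eps=1e-8):
    """Eqn. 6: F(a,q) = S_p(a) * S_f(parent(a)) * S_c(a) * S_b(best child of a).
    tree: node -> {"parent": node or None, "children": [...]};
    S_*: node -> score for the current query; the root uses the constant eps,
    leaves use the proportion of leaf nodes in the seed taxonomy."""
    F = {}
    for a, info in tree.items():
        f_term = S_f[info["parent"]] if info["parent"] is not None else eps
        kids = info["children"]
        if kids:
            c_star = max(kids, key=lambda c: S_p[c])  # child with max Pathfinder Score
            b_term = S_b[c_star]
        else:
            b_term = leaf_ratio
        F[a] = S_p[a] * f_term * S_c[a] * b_term
    return F
```

The predicted parent is then `max(F, key=F.get)`, matching Eqn. (7).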
In this section, we first introduce our experimental setup, including datasets, implementation details, evaluation criteria, and a brief description of the compared baseline methods. Then, we provide extensive evaluation results covering overall model performance, the performance contribution of each design, and a sensitivity analysis of the multi-task learning weight η in Eqn. 5. In-depth visualizations of the hypernymy detection and coherence modeling modules are provided to analyze the model's inner behavior. We also provide a case study in the appendix.

We evaluate HEF on three public benchmark datasets retrieved from SemEval-2016 Task 13 [2]. This task contains three taxonomies in the domains of Environment (SemEval16-Env), Science (SemEval16-Sci), and Food (SemEval16-Food). The statistics of the benchmark datasets are provided in Table 1. Note that the original datasets may not form trees; in this case, we use a spanning tree of the taxonomy instead of the original graph to match the problem definition. The pruning process removes less than 6% of the total edges, keeping the taxonomy's information and avoiding multiple ground-truth parents for a single node.

Since HEF and the compared baselines [34, 47] are all limited to adding new terms without modifying the seed taxonomy, nodes in the test and validation sets can only be sampled from leaf nodes, to guarantee that the parents of test or validation nodes exist in the seed taxonomy. This is also the sampling strategy of TaxoExpan [34]. Following the previous state-of-the-art model STEAM [47], we exclude 20% of the nodes in each dataset, of which ten nodes per dataset are set aside as the validation set for early stopping, and the rest form the test set. The nodes not included in the validation and test sets are seed nodes for self-supervision in the training phase and potential anchor nodes in the inference phase. Note that pruning the dataset does not affect the node count, so the scale of each dataset remains identical to our baselines' settings.
We compare our proposed HEF model with the following baseline approaches:

• BERT+MLP: This method utilizes BERT [5] to perform hypernymy detection. The model's input is the terms' surface names, and the representation of BERT's classification token ⟨CLS⟩ is fed into a feed-forward layer to score whether the first term is the ground-truth parent.
• HypeNet [37]: HypeNet is an LSTM-based hypernym extraction model that scores a term pair by representing node paths in the dependency tree.
• TAXI [26]: TAXI was the top solution of SemEval-2016 Task 13. It explicitly splits the task into a pipeline of hypernym detection, using substring matching and pattern extraction, and hypernym pruning to avoid multiple parents.
• TaxoExpan [34]: TaxoExpan is a self-supervised taxonomy expansion model. The anchor's representation is modeled by a graph network over its egonet with consideration of relative levels, and the parental relationship is scored by a feed-forward layer. BERT embeddings are used as its input instead of the model's original configuration.
• STEAM [47]: STEAM is the state-of-the-art self-supervised taxonomy expansion model, which scores parental relations by ensembling three classifiers considering graph, contextual, and hand-crafted lexical-syntactic features, respectively.

The code will be available at https://github.com/sheryc/HEF.

Table 2: Comparison of the proposed method against state-of-the-art methods. All metrics are presented in percentages (%). The best results for each metric of each dataset are marked in bold. Reported performance is the average of three runs using different random seeds. The MRR of TAXI [26] is inaccessible since it outputs the whole taxonomy instead of node rankings. The performance of the baseline methods is retrieved from [47].

                 SemEval16-Env      SemEval16-Sci      SemEval16-Food
Method           Acc  MRR  Wu&P    Acc  MRR  Wu&P    Acc  MRR  Wu&P
BERT+MLP         11.1 21.5 47.9    11.5 15.7 43.6    10.5 14.9 47.0
TAXI [26]        16.7 -    44.7    13.0 -    32.9    18.2 -    39.2
HypeNet [37]     16.7 23.7 55.8    15.4 22.6 50.7    20.5 27.3 63.2
TaxoExpan [34]   11.1 32.3 54.8    27.8 44.8 57.6    27.6 40.5 54.2
STEAM [47]       36.1 46.9 69.6    36.5 48.3 68.2    34.2 43.4 67.0
HEF
In our setting, the coherence modeling module is a 3-layer, 6-head, 768-dimensional Transformer encoder initialized from a Gaussian distribution. The first hidden layers of the Pathfinder and the Stopper are both 300-dimensional. The input to the hypernymy detection module is either truncated or padded to a length of 64. Each training step processes the query-ego-tree pair sets of 32 query nodes using gradient accumulation, with each query's anchor set containing one ground-truth parent (S_p = 1, S_c = 1), sampled ancestors (S_p = 1, S_f = 1), sampled descendants (S_p = 1, S_b = 1), and other nodes (S_p = 0, S_b = 1), and with the constant ε set to a small value. A linear warm-up is adopted: the learning rate rises linearly from 0 to 5e-5 over the first 10% of total training steps and then drops linearly to 0 at the end of 150 epochs. The multi-task learning weight η is set to 0.9. After each epoch, we validate the model and save the checkpoint with the best performance on the validation set. These hyperparameters were tuned on SemEval16-Env's validation set and are used across all datasets and experiments unless specified otherwise in the ablation studies or sensitivity analysis.
Assume k ≔ |C| to be the number of terms in the test set, {p₁, p₂, ..., p_k} to be the predicted parents for the test-set queries, and {p̂₁, p̂₂, ..., p̂_k} to be the corresponding ground-truth parents. Following previous solutions [23, 40, 47], we adopt the following three metrics as evaluation criteria.
• Accuracy (Acc): it measures the proportion of test nodes whose predicted parent exactly matches the ground-truth parent:
    Acc = Hit@1 = (1/k) Σᵢ₌₁ᵏ I(pᵢ = p̂ᵢ),
where I(·) denotes the indicator function.
• Mean Reciprocal Rank (MRR): it calculates the average reciprocal rank of each test node's ground-truth parent:
    MRR = (1/k) Σᵢ₌₁ᵏ 1 / rank(p̂ᵢ).
• Wu & Palmer Similarity (Wu&P) [44]: it is a tree-based measurement that judges how close the predicted and ground-truth parents are in the seed taxonomy:
    Wu&P = (1/k) Σᵢ₌₁ᵏ 2 · depth(LCA(pᵢ, p̂ᵢ)) / (depth(pᵢ) + depth(p̂ᵢ)),
where depth(·) is a node's depth in the seed taxonomy and LCA(·, ·) is the least common ancestor of two nodes. The performance of
HEF is shown in Table 2.
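The three metrics above can be computed directly from the predicted and gold parents plus the seed taxonomy. A minimal sketch follows; representing the taxonomy as a child-to-parent dictionary is an assumption of this sketch, not the paper's data format:

```python
def depth(node, parent, root):
    """Depth of a node in the seed taxonomy (root has depth 1)."""
    d = 1
    while node != root:
        node = parent[node]
        d += 1
    return d

def lca(a, b, parent, root):
    """Least common ancestor of nodes a and b."""
    ancestors = {a}
    while a != root:
        a = parent[a]
        ancestors.add(a)
    while b not in ancestors:
        b = parent[b]
    return b

def evaluate(pred, gold, ranks, parent, root):
    """pred/gold: predicted and ground-truth parents per test query;
    ranks: rank of the gold parent in each query's candidate list."""
    k = len(pred)
    acc = sum(p == g for p, g in zip(pred, gold)) / k
    mrr = sum(1.0 / r for r in ranks) / k
    wup = sum(2 * depth(lca(p, g, parent, root), parent, root)
              / (depth(p, parent, root) + depth(g, parent, root))
              for p, g in zip(pred, gold)) / k
    return acc, mrr, wup
```

For example, on a toy taxonomy a → {b, c}, b → {d}, predicting d where b is correct still earns a Wu&P credit of 0.8 because the two nodes share the ancestor b.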
HEF achieves the best performance on all datasets and surpasses previous state-of-the-art models by a significant margin on all metrics.
From the table, we get an overview of how taxonomy expansion models have evolved. The BERT+MLP solution does not utilize any structural or lexical-syntactic features of terms, and this lack of information accounts for its poor results. First-generation models like TAXI and HypeNet utilize lexical, syntactic, or contextual information to achieve better results, mainly on the hypernymy detection part of the task. However, these two models do not utilize any structural information of the taxonomy; hence they are unable to maintain the taxonomy's structural design. Second-generation models such as TaxoExpan and STEAM begin to exploit the taxonomy's structure and achieve further improvements.

Table 3: Ablation experiment results on the SemEval16-Env dataset. All metrics are presented in percentages (%). For each ablation experiment setting, only the best result is reported. (Columns: Abl. Type, Setting, Acc, MRR, Wu&P; the first row reports the original HEF.)
HEF further improves on both previous generations' strengths by proposing a new approach that better fits the taxonomy expansion task. Moreover, it introduces a new goal for the task: to best preserve the taxonomy's coherence after expansion. We propose the description generation algorithm to generate accurate, domain-specific descriptions for complex and rare terms, incorporating lexical-syntactic features for hypernymy detection. Aided by DistilBERT's power of sentence-pair representation, HEF can mine hypernymy features more automatically and accurately. HEF also aims to fully exploit the information carried by the taxonomy's hierarchical structure to boost performance. HEF uses ego-trees to perform a thorough comparison along the root path and among siblings, injects tree-exclusive features to assist in modeling expert-curated taxonomy designs, and explicitly evaluates both path and level for the anchor node as well as its parent and child. The experimental results suggest that these designs are capable of modeling and maximizing the coherence of the taxonomy in different aspects, which results in a vast performance increase on the taxonomy expansion task.
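The position-selection rule these designs culminate in, the Fitting Score of Equation 6 combining the Pathfinder Score with the three Stopper Scores, can be sketched as follows. The scorers are injected as plain functions here, whereas in the paper they are neural modules, and the leaf-node handling in this sketch (skipping the Backward term) is a simplification:

```python
def fitting_score(anchor, query, S_p, S_c, S_f, S_b, parent, children):
    """F(a, q) = S_p(a, q) * S_c(a, q) * S_f(parent(a), q) * S_b(c*_a, q),
    where c*_a is the child of a that the Pathfinder favors most."""
    f = S_p(anchor, query) * S_c(anchor, query)
    # Forward Score: ask the anchor's parent whether query fits below it.
    f *= S_f(parent[anchor], query)
    kids = children.get(anchor, [])
    if kids:
        # Backward Score: ask the anchor's most path-relevant child.
        best_child = max(kids, key=lambda c: S_p(c, query))
        f *= S_b(best_child, query)
    return f

def select_anchor(candidates, query, **scorers):
    # The predicted position is the candidate with the highest Fitting Score.
    return max(candidates, key=lambda a: fitting_score(a, query, **scorers))
```

This makes the disambiguation mechanism concrete: a candidate anchor only scores high when its parent and best child also agree with the placement.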
We discuss how exploiting different characteristics of the taxonomy's hierarchical structure improves performance through a series of ablation studies. We substitute some designs of HEF in the dataflow and the score function with a vanilla setting and rerun the experiments. The results of the ablation studies are shown in Table 3.
• - WordNet Descriptions: WordNet descriptions are substituted with the term's surface name as the hypernymy detection module's input.
• - Ego-tree + Egonet: the Egonet from TaxoExpan [34] is used instead of the ego-tree for modeling the tree structure.
• - Relative Level Emb.: the relative level embedding for the coherence modeling module is removed.
• - Absolute Level Emb.: the absolute level embedding for the coherence modeling module is removed.
• Stopper Only: only the Stopper Scores are used for the Fitting Score calculation. More specifically, η = 0, and the Fitting Score in Equation 6 becomes:
    F(a, q) = S_f(parent(a), q) · S_c(a, q) · S_b(c*_a, q),  c*_a = argmax_{c_a ∈ child(a)} S_p(c_a, q).
• Pathfinder + Current Only: only the Pathfinder Score and the Current Score are used for the Fitting Score calculation. More specifically, the Fitting Score in Equation 6 and the loss in Equation 5 become:
    F(a, q) = S_p(a, q) · S_c(a, q),
    L_q = −(η/|X_q|) Σ_{a ∈ X_q} BCELoss(Ŝ_p(a, q), S_p(a, q)) − ((1−η)/|X_q|) Σ_{a ∈ X_q} BCELoss(Ŝ_c(a, q), S_c(a, q)).
• Current Only: only the Current Score is used for the Fitting Score calculation. This scoring strategy is identical to prior arts [34, 47]. More specifically, the Fitting Score in Equation 6 and the loss in Equation 5 become:
    F(a, q) = S_c(a, q),
    L_q = −(1/|X_q|) Σ_{a ∈ X_q} BCELoss(Ŝ_c(a, q), S_c(a, q)).
We notice that by changing the design of the dataflows, the performance of the
HEF model suffers from various deteriorations. Substituting WordNet descriptions with a term's surface name surprisingly retains relatively high performance, which might be attributed to the representation power of the DistilBERT model. Using Egonets rather than ego-trees for coherence modeling also affects the performance: although Egonets can capture the local structure of the taxonomy, ego-trees are more capable of modeling the complete construction of a hierarchy. Regarding the level embeddings, the results show that removing either of the two level embeddings from the coherence modeling module hurts the learning of the taxonomy's design. This is in accordance with previous research on the importance of using both absolute and relative position information in Transformers [33] and confirms our assumption that taxonomies have intrinsic designs regarding both absolute and relative levels.
Changes to the score function bring a smaller negative impact on the model than the dataflow changes, except for the setting that uses merely the Current Score. When using only the Current Score, the model loses the ability to disambiguate with its neighbors and the capacity to directly choose the right path, downgrading the problem to a series of independent scoring problems like the previous solutions. Adding the Backward Score and the Forward Score into the Fitting Score calculation allows the model to distinguish the ground truth from its neighbors, bringing a boost to accuracy. However, without the Pathfinder, the "Stopper Only" setting only explicitly focuses on choosing the right level without considering the path and is inferior to the original HEF model.

Figure 4: Sensitivity analysis of model performance under different multi-task learning weight η. (a) Accuracy on all 3 datasets. (b) MRR on all 3 datasets. (c) Wu&P on all 3 datasets.

Figure 5: Illustration of one self-attention head in the last layer of the hypernymy detection module, showing how the hypernymy detection module detects hypernymy relations. In this example, the seed node is "tea" and the query is "oolong".

However, we observe that although changing several designs of the dataflow or scoring function deteriorates the performance, our method can still surpass the previous state-of-the-art in Acc and MRR, suggesting that the
HEF model introduces improvements in multiple aspects, which also testifies that maximizing the taxonomy's coherence is a better goal for the taxonomy expansion task.

Sensitivity Analysis of η
In this section, we discuss the impact of η in Equation 5 through a sensitivity analysis. Since η controls the proportion of the loss from the path-oriented Pathfinder and the level-oriented Stopper, this hyperparameter affects HEF's tendency to prioritize path or level selection. The results on all three datasets are shown in Fig. 4.
From the results, we notice that η cannot be set too low, which means that explicit path selection contributes a lot to the model's performance. This is in accordance with the fact that taxonomies are based on hypernymy relations, and selecting the right path is the essential guidance for anchor selection. A better setting of η lies in an intermediate range, in which the model balances path and level selections, resulting in better performance. Surprisingly, setting η to a high value like 0.9 also brings a performance boost, and sometimes even achieves the best result. This phenomenon persists when changing random seeds. However, setting η to 1 means using merely the Pathfinder, which cannot distinguish the ground truth from other nodes and breaks the model. This discovery further testifies to the importance of explicitly evaluating path selection in the taxonomy expansion task. To illustrate how the hypernymy detection module works, we show the weights of one of the attention heads of the hypernymy detection module's last Transformer encoder layer in Fig.
5. By expanding a term into its description, the model is capable of understanding the term "oolong" through its description, which cannot be achieved by constructing rule-based lexical-syntactic features, since "oolong" and "tea" have no similarity in their surface names. Furthermore, by adopting the pretrained DistilBERT, the hypernymy detection module can also discover more latent patterns, such as the relation between "leaves" and "tree", allowing the model to discover more in-depth hypernymy relations.
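The coherence modeling module discussed next consumes ego-tree node representations augmented with both an absolute-level and a relative-level embedding, the two components examined in the ablation studies. A minimal numpy sketch of that input construction follows; the lookup-table sizes and the additive combination are assumptions of this sketch, while the 768-dimensional hidden size matches the implementation details above:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # hidden size of the coherence modeling module

# Hypothetical lookup tables: absolute levels are depths from the taxonomy
# root; relative levels are offsets from the anchor node in the ego-tree.
abs_level_emb = rng.normal(size=(16, DIM))  # absolute levels 0..15
rel_level_emb = rng.normal(size=(9, DIM))   # relative offsets -4..+4

def node_input(term_emb: np.ndarray, abs_level: int, rel_level: int) -> np.ndarray:
    """Sum the term embedding with both level embeddings before the encoder."""
    return term_emb + abs_level_emb[abs_level] + rel_level_emb[rel_level + 4]
```

Removing either table here corresponds to the "- Absolute Level Emb." and "- Relative Level Emb." ablation settings.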
To illustrate how the coherence modeling module compares the nodes in the ego-tree to maintain the taxonomy's coherence, we present the weights of an attention head of the module's first Transformer encoder layer in Fig. 6. Since the last layer's attention mostly focuses on the anchor node ("herb"), the first layer better illustrates the model's comparison among ego-tree nodes.
CONCLUSION
In this paper, we proposed HEF, a self-supervised taxonomy expansion model that fully exploits the hierarchical structure of a taxonomy for better hierarchical structure modeling and taxonomy coherence maintenance. Compared to previous methods that evaluate the anchor by merely a new edge in an ordinary graph, neglecting the tree structure of the taxonomy, we showed through extensive experiments that evaluating a tree structure for coherence maintenance and mining multiple tree-exclusive features of the taxonomy, including hypernymy relations from parent-child relations, term similarity from sibling relations, absolute and relative levels, path- and level-based multi-dimensional evaluation, and disambiguation based on parent-current-child chains, all bring performance boosts. This indicates the importance of using tree-structure information for the taxonomy expansion task. We also proposed a framework for injecting these features, introduced our implementation of the framework, and surpassed the previous state-of-the-art. We believe that these novel designs and their motivations will not only benefit the taxonomy expansion task, but also be influential for tasks involving hierarchical or tree structure modeling and evaluation. Future work includes modeling and utilizing these or new tree-exclusive features to boost other taxonomy-related tasks, and better implementations of each module in HEF.
ACKNOWLEDGMENTS
Thanks to everyone in the Tencent Jarvis Lab who helped with this paper, to my family, and to my loved one.
REFERENCES
[1] Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting Relations From Large Plain-Text Collections. In Proceedings of JCDL. 85–94.
[2] Georgeta Bordea, Els Lefever, and Paul Buitelaar. 2016. SemEval-2016 Task 13: Taxonomy Extraction Evaluation (TExEval-2). In Proceedings of SemEval-2016. 1081–1091.
[3] Anne Cocos, Marianna Apidianaki, and Chris Callison-Burch. 2018. Comparing Constraints for Taxonomic Organization. In Proceedings of NAACL. 323–333.
[4] Sarthak Dash, Md Faisal Mahbub Chowdhury, Alfio Gliozzo, Nandana Mihindukulasooriya, and Nicolas Rodolfo Fauceglia. 2020. Hypernym Detection Using Strict Partial Order Networks. In Proceedings of AAAI. 7626–7633.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL. 4171–4186.
[6] Trevor Fountain and Mirella Lapata. 2012. Taxonomy Induction Using Hierarchical Random Graphs. In Proceedings of NAACL. 466–476.
[7] Amit Gupta, Rémi Lebret, Hamza Harkous, and Karl Aberer. 2017. Taxonomy Induction Using Hypernym Subsequences. In Proceedings of CIKM. 1329–1338.
[8] Marti A. Hearst. 1992. Automatic Acquisition of Hyponyms From Large Text Corpora. In Proceedings of ACL. 539–545.
[9] Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou. 2017. Understand Short Texts by Harvesting and Analyzing Semantic Knowledge. IEEE Transactions on Knowledge and Data Engineering (2017), 499–512.
[10] Jin Huang, Zhaochun Ren, Wayne Xin Zhao, Gaole He, Ji-Rong Wen, and Daxiang Dong. 2019. Taxonomy-Aware Multi-Hop Reasoning Networks for Sequential Recommendation. In Proceedings of WSDM. 573–581.
[11] Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M. Kaplan, Timothy P. Hanratty, and Jiawei Han. 2017. MetaPAD: Meta Pattern Discovery From Massive Text Corpora. In Proceedings of KDD. 877–886.
[12] David Jurgens and Mohammad Taher Pilehvar. 2015. Reserating the Awesometastic: An Automatic Extension of the WordNet Taxonomy for Novel Terms. In Proceedings of NAACL. 1459–1465.
[13] David Jurgens and Mohammad Taher Pilehvar. 2016. SemEval-2016 Task 14: Semantic Taxonomy Enrichment. In Proceedings of SemEval-2016. 1092–1102.
[14] Giannis Karamanolakis, Jun Ma, and Xin Luna Dong. 2020. TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories. (2020).
[15] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2020. Code Prediction by Feeding Trees to Transformers. (2020).
[16] Zornitsa Kozareva and Eduard Hovy. 2010. A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web. In Proceedings of EMNLP. 1110–1118.
[17] Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. 2019. Inferring Concept Hierarchies From Text Corpora via Hyperbolic Embeddings. In Proceedings of ACL. 3231–3241.
[18] Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of ICML. 296–304.
[19] Carolyn E. Lipscomb. 2000. Medical Subject Headings (MeSH). Bulletin of the Medical Library Association (2000), 265–266.
[20] Bang Liu, Weidong Guo, Di Niu, Chaoyue Wang, Shunnan Xu, Jinghong Lin, Kunfeng Lai, and Yu Xu. 2019. A User-Centered Concept Mining System for Query and Document Understanding at Tencent. In Proceedings of KDD. 1831–1841.
[21] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. (2019).
[22] Xusheng Luo, Luxin Liu, Yonghua Yang, Le Bo, Yuanpeng Cao, Jinghang Wu, Qiang Li, Keping Yang, and Kenny Q. Zhu. 2020. AliCoCo: Alibaba E-Commerce Cognitive Concept Net. In Proceedings of SIGMOD. 313–327.
[23] Emaad Manzoor, Rui Li, Dhananjay Shrouty, and Jure Leskovec. 2020. Expanding Taxonomies With Implicit Edge Semantics. In Proceedings of TheWebConf. 2044–2054.
[24] Yuning Mao, Xiang Ren, Jiaming Shen, Xiaotao Gu, and Jiawei Han. 2018. End-To-End Reinforcement Learning for Automatic Taxonomy Induction. In Proceedings of ACL. 2462–2472.
[25] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM (1995), 39–41.
[26] Alexander Panchenko, Stefano Faralli, Eugen Ruppert, Steffen Remus, Hubert Naets, Cédrick Fairon, Simone Paolo Ponzetto, and Chris Biemann. 2016. TAXI at SemEval-2016 Task 13: A Taxonomy Induction Method Based on Lexico-Syntactic Patterns, Substrings and Focused Crawling. In Proceedings of SemEval-2016. 1320–1327.
[27] Hao Peng, Jianxin Li, Senzhang Wang, Lihong Wang, Qiran Gong, Renyu Yang, Bo Li, Philip Yu, and Lifang He. 2019. Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification. IEEE Transactions on Knowledge and Data Engineering (2019).
[28] Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst Patterns Revisited: Automatic Hypernym Detection From Large Text Corpora. In Proceedings of ACL. 358–363.
[29] Erik Tjong Kim Sang. 2007. Extracting Hypernym Pairs From the Web. In Proceedings of ACL. 165–168.
[30] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. (2020).
[31] Chao Shang, Sarthak Dash, Md. Faisal Mahbub Chowdhury, Nandana Mihindukulasooriya, and Alfio Gliozzo. 2020. Taxonomy Construction of Unseen Domains via Graph-Based Cross-Domain Knowledge Transfer. In Proceedings of ACL. 2198–2208.
[32] Jingbo Shang, Xinyang Zhang, Liyuan Liu, Sha Li, and Jiawei Han. 2020. NetTaxo: Automated Topic Taxonomy Construction From Text-Rich Network. In Proceedings of TheWebConf. 1908–1919.
[33] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention With Relative Position Representations. In Proceedings of NAACL. 464–468.
[34] Jiaming Shen, Zhihong Shen, Chenyan Xiong, Chi Wang, Kuansan Wang, and Jiawei Han. 2020. TaxoExpan: Self-Supervised Taxonomy Expansion With Position-Enhanced Graph Neural Network. In Proceedings of TheWebConf. 486–497.
[35] Jiaming Shen, Zeqiu Wu, Dongming Lei, Chao Zhang, Xiang Ren, Michelle T. Vanni, Brian M. Sadler, and Jiawei Han. 2018. HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion. In Proceedings of KDD. 2180–2189.
[36] Vighnesh Shiv and Chris Quirk. 2019. Novel Positional Encodings to Enable Tree-Based Transformers. In Advances in Neural Information Processing Systems 32. 12081–12091.
[37] Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving Hypernymy Detection With an Integrated Path-Based and Distributional Method. In Proceedings of ACL. 2389–2398.
[38] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of ACL/IJCNLP. 1556–1566.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30. 5998–6008.
[40] Nikhita Vedula, Patrick K. Nicholson, Deepak Ajwani, Sourav Dutta, Alessandra Sala, and Srinivasan Parthasarathy. 2018. Enriching Taxonomies With Functional Domain Knowledge. In Proceedings of SIGIR. 745–754.
[41] Paola Velardi, Stefano Faralli, and Roberto Navigli. 2013. OntoLearn Reloaded: A Graph-Based Algorithm for Taxonomy Induction. Computational Linguistics (2013), 665–707.
[42] Chengyu Wang, Yan Fan, Xiaofeng He, and Aoying Zhou. 2019. A Family of Fuzzy Orthogonal Projection Models for Monolingual and Cross-Lingual Hypernymy Prediction. In Proceedings of WWW. 1965–1976.
[43] Jingjing Wang, Changsung Kang, Yi Chang, and Jiawei Han. 2014. A Hierarchical Dirichlet Model for Taxonomy Expansion for Search Engines. In Proceedings of WWW. 961–970.
[44] Zhibiao Wu and Martha Palmer. 1994. Verbs Semantics and Lexical Selection. In Proceedings of ACL. 133–138.
[45] Wenpeng Yin and Dan Roth. 2018. Term Definitions Help Hypernymy Detection. In Proceedings of *SEM. 203–213.
[46] Xiaoxin Yin and Sarthak Shah. 2010. Building Taxonomy of Web Search Intents for Name Entity Queries. In Proceedings of WWW. 1001–1010.
[47] Yue Yu, Yinghao Li, Jiaming Shen, Hao Feng, Jimeng Sun, and Chao Zhang. 2020. STEAM: Self-Supervised Taxonomy Expansion With Mini-Paths. In Proceedings of KDD. 1026–1035.
[48] Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian Sadler, Michelle Vanni, and Jiawei Han. 2018. TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering. In Proceedings of KDD. 2701–2709.
A CASE STUDY
To understand how different Fitting Score components contribute to HEF's performance, we conduct a case study on the SemEval16-Food dataset and show the detailed results in Table 4.
The first two rows of Table 4 show two cases where HEF successfully predicts the query's parent. We can see that the Pathfinder Score and the three Stopper Scores all contribute to the correct selection, which testifies to the effectiveness of the Fitting Score design. The last two rows of Table 4 show situations where HEF fails to select the correct parent. In the third row, "bourguignon" is described as "reduced red wine", so the model attaches it to the node "wine". However, "bourguignon" is also a sauce for cooking beef. Such ambiguity consequently distorts the meaning of a term by assigning it an incorrect description, which hurts the model's performance. In the last row, although "hot fudge sauce"'s description contains "chocolate sauce", the node "chocolate sauce" still receives a low Current Score. In HEF, the design of the Stopper Scores enables the model to self-correct occasionally wrong Current Scores through a larger Forward Score from a node's parent and a larger Backward Score from one of the node's children. However, since "chocolate sauce" is a leaf node, its child's Backward Score is assigned to be the proportion of leaf nodes in the seed taxonomy, which is 0.07 in the SemEval16-Food dataset. This indicates that future work includes designing a more reasonable Backward Score function for leaf nodes to improve the model's robustness.
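The leaf fallback described above, assigning a leaf anchor's Backward Score as the proportion of leaf nodes in the seed taxonomy, can be computed as follows (a sketch over a child-list representation of the taxonomy; the function name is illustrative):

```python
def leaf_backward_score(children: dict) -> float:
    """Proportion of leaf nodes among all nodes of the seed taxonomy,
    used as the default Backward Score when the anchor has no children."""
    # Collect every node: internal nodes are keys, leaves appear only as values.
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)
    leaves = [n for n in nodes if not children.get(n)]
    return len(leaves) / len(nodes)
```

On SemEval16-Food this ratio is 0.07, which is why a correct leaf-node parent such as "chocolate sauce" cannot be rescued by its (nonexistent) child's Backward Score.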
B DESCRIPTION GENERATION ALGORITHM
Algorithm 2 shows the description generation algorithm descr(·) used in HEF's hypernymy detection module. descr(·) utilizes WordNet descriptions to generate domain-related term descriptions by dynamic programming. In this algorithm, WordNetNounDescr(·) denotes the set of a concept's noun descriptions from WordNet [25], and CosSimilarity(t, n_root) calculates the average token cosine similarity of word vectors between a candidate description t and the surface name of the taxonomy's root term n_root.

Algorithm 2: Description Generation Algorithm for the Hypernymy Detection Module
procedure Descr(n)                              ⊲ Input: term n
    N ← split(n)
    for i ← 0, ..., length(N) do
        S[i] ← 0                                ⊲ Initialize score array
        C[i] ← 0                                ⊲ Initialize splitting positions
    end for
    for i ← 0, ..., length(N) − 1 do
        for j ← 0, ..., i do
            if |WordNetNounDescr(N[j : i+1])| > 0 then
                s_ij ← (i − j + 1)² + 1         ⊲ Prefer longer concepts
            else
                s_ij ← 0
            end if
            if S[j] + s_ij > S[i+1] then
                S[i+1] ← S[j] + s_ij            ⊲ Save max score
                C[i] ← j                        ⊲ Save splitting position
            end if
        end for
    end for
    D ← ""                                      ⊲ Initialize description
    p ← length(N) − 1                           ⊲ Split pointer
    while p ≠ −1 do
        D_WN ← WordNetNounDescr(N[C[p] : p+1])
        if len(D_WN) > 0 then                   ⊲ Noun or noun phrase
            d ← argmax_{t ∈ D_WN} CosSimilarity(t, n_root)
        else                                    ⊲ Preposition or adjective
            d ← join(N[C[p] : p+1])
        end if
        D ← d + D                               ⊲ Prepend new description
        p ← C[p] − 1                            ⊲ Go to next split
    end while
end procedure

Table 4: Examples of HEF's prediction, with detailed Fitting Score composition and comparison between the ground truth and the predicted parent. Scores in this table correspond to the node in the same tabular cell with the score.
Case 1 (correct). Query q: "paddy" — rice in the husk, either gathered or still in the field. Ground truth p̂ = prediction p: "rice" — grains used as food, either unpolished or more often polished; its parent: "starches" — a commercial preparation of starch that is used to stiffen textile fabrics in laundering; its best child c*: "white rice" — having husk or outer brown layers removed. p̂'s ranking: 1.
Case 2 (correct). Query q: "fish meal" — ground dried fish used as fertilizer and as feed for domestic livestock. Ground truth p̂ = prediction p: "feed" — food for domestic livestock; its parent: "food" — any substance that can be metabolized by an animal to give energy and build tissue; its best child c*: "mash" — mixture of ground animal feeds. p̂'s ranking: 1.
Case 3 (incorrect). Query q: "bourguignon" — reduced red wine with onions and parsley and thyme and butter. Ground truth p̂: "sauce" — flavorful relish or dressing or topping served as an accompaniment; its parent: "condiment" — a preparation (a sauce or relish or spice) to enhance flavor or enjoyment; its best child c*: "bercy" — butter creamed with white wine and shallots and parsley. Prediction p: "wine" — a red as dark as red wine; its parent: "alcohol" — any of a series of volatile hydroxyl compounds that are made from hydrocarbons by distillation; its best child c*: "red wine" — wine having a red color derived from skins of dark-colored grapes. p̂'s ranking: 328.
Case 4 (incorrect). Query q: "hot fudge sauce" — hot thick chocolate sauce served hot. Ground truth p̂: "chocolate sauce" — sauce made with unsweetened chocolate or cocoa; its parent: "sauce" — flavorful relish or dressing or topping served as an accompaniment; its best child c*: none (leaf node). Prediction p: "sauce"; its parent: "condiment" — a preparation (a sauce or relish or spice) to enhance flavor or enjoyment; its best child c*: "lyonnaise sauce" — brown sauce with sauteed chopped onions and parsley. p̂'s ranking: 20.