Enhancing Hierarchical Information by Using Metric Cones for Graph Embedding
Daisuke Takehara, Kei Kobayashi

Abstract
Graph embedding is becoming an important method with applications in various areas, including social networks and knowledge graph completion. In particular, Poincaré embedding has been proposed to capture the hierarchical structure of graphs, and its effectiveness has been reported. However, most of the existing methods have isometric mappings in the embedding space, and the choice of the origin point can be arbitrary. This fact is not desirable when the distance from the origin is used as an indicator of hierarchy, as in the case of Poincaré embedding. In this paper, we propose graph embedding in a metric cone to solve such a problem, and we gain further benefits: 1) we provide an indicator of hierarchical information that is both geometrically and intuitively natural to interpret, 2) we can extract the hierarchical structure from a graph embedding output of other methods by learning additional one-dimensional parameters, and 3) we can change the curvature of the embedding space via a hyperparameter.
1. Introduction
In recent years, machine learning for graph-structured data has attracted significant attention. A recently developed approach uses graph convolutional neural networks (Kipf & Welling, 2017), which apply convolutional neural networks to graphs. In this study, we focus on graph embedding, which has become an important method with applications in social networks (Hoff et al., 2002), knowledge graph completion (Bordes et al., 2013), and other fields. In particular, various methods of embedding in non-Euclidean spaces have been proposed to capture the hierarchical structure of graphs. In (Klimovskaia et al., 2020), Poincaré embedding (Nickel & Kiela, 2017) has been applied to extract hierarchies from biological cell data.

However, there are three major problems with existing graph embedding methods that are used for extracting hierarchical structures.

The first problem concerns the selection of the origin point. The distance from the origin is used as an indicator of hierarchy when the embedding space admits isometric maps. In non-Euclidean spaces, existing graph embedding methods use a loss function that depends only on the distance between two nodes in the embedding. Because an isometric transformation changes the distance from the origin while leaving the value of the loss function unchanged, it is not appropriate to use the distance from the origin as an indicator of hierarchy. In other words, an indicator of hierarchy defined in a space that admits isometric maps must be invariant under those maps.

The second problem is a lack of scalability. If we are interested in information other than the hierarchical structure, the existing methods for hierarchical embedding require learning another embedding or solving the problem by an independent method, because the embedding does not capture that information.
Therefore, we cannot directly apply embedding methods that use non-Euclidean spaces for extracting the hierarchical structure.

The third problem is that the curvature of the space needs to be considered in graph embedding in non-Euclidean spaces. For example, because the curvature of the Poincaré ball changes depending on the scaling of the space, the embedding result may change if the scaling is varied.

In this paper, we propose a method of embedding graphs in metric cones as a solution to the three problems described above. A metric cone can be defined on any length metric space, a wide class of metric spaces (including normed vector spaces, manifolds, and metric graphs). The dimension of the cone is one larger than that of the original space. It is known that the curvature of this space can be varied, and a method of changing the structure of the data space for analysis has also been proposed (Kobayashi & Wynn, 2020).

First, we show that it is effective to use the coordinate corresponding to "the height of the metric cone" (a one-dimensional parameter added to the original space) as an indicator of hierarchy. Second, we demonstrate that the hierarchical structure can be extracted by learning the embedding in the metric cone, optimizing only the one-dimensional parameter, in the case where a pretrained graph embedding is available. By keeping the coordinates of the original space fixed, only the one-dimensional parameter corresponding to the height of the cone is learned, and the computation time is reduced. Additionally, if we use a Euclidean model for the pretrained embedding, our model is scalable to solving other tasks, since the original space structure is preserved. Third, we show that the curvature can be optimized for the data by taking advantage of the fact that the curvature of the embedding space can be varied. The curvature of a metric cone varies with one parameter that corresponds to the generatrix of the cone. If we have a pretrained embedding, the learning method is also computationally efficient, because the curvature can be optimized by learning one-dimensional parameters without changing the structure of the original space (i.e., without relearning the embedding in the original space).

The remainder of this paper is organized as follows. Section 2 describes related research.

The first author was supported by RIKEN AIP Japan. The second author was supported by JSPS KAKENHI (JP19K03642, JP19K00912) and RIKEN AIP Japan. Faculty of Science and Technology, Keio University, Yokohama, Japan.
In Section 3, we propose the method of graph embedding in a metric cone, with the introduction of 1) graph embedding in non-Euclidean spaces and 2) the definition and properties of metric cones. Section 4 presents the experimental results using graph data, followed by a conclusion and future perspectives in Section 5.
2. Related Work
DeepWalk (Perozzi et al., 2014) samples series by random walks on graphs and applies a method for embedding series data, such as the skip-gram model (word2vec (Mikolov et al., 2013)). The sampling method of DeepWalk's random-walk series has been modified to depend on edge weights in node2vec (Grover & Leskovec, 2016), and a method that simultaneously optimizes hyperparameters (such as random-walk length) has been proposed in (Abu-El-Haija et al., 2018). Unlike our proposed method, these methods were not designed to represent a hierarchical structure but were simply designed to embed the structure of a graph with high accuracy, as evaluated by predicting edge links.
Dimensionality reduction methods exist for data in Euclidean space, such as multidimensional scaling (MDS) (Kruskal, 1964), IsoMAP (Tenenbaum et al., 2000), and locally linear embedding (LLE) (Roweis & Saul, 2000). By applying such a method to the adjacency matrix of a graph, an embedded representation of the graph can be obtained. There are also methods that obtain an embedded representation of a graph by applying dimensionality reduction to the graph's Laplacian matrix (Belkin & Niyogi, 2002) instead of the adjacency matrix. Like the random-walk models, these graph embedding methods are used to accurately embed the structure of the graph, not to represent a hierarchical structure.
Methods that compress the dimensionality of adjacency matrices (Cao et al., 2016; Wang et al., 2016) or the graph's Laplacian matrix (Kipf & Welling, 2016) using a neural network (autoencoder) to obtain a graph embedding representation have also been proposed. These methods can be easily applied to other tasks. However, as with the above categories, these methods are designed simply to embed the structure of a graph, not to represent a hierarchical structure.
Poincaré embedding (Nickel & Kiela, 2017) is a method that embeds the adjacency matrix of a graph with a skip-gram model, in the manner of large-scale information network embedding (LINE) (Tang et al., 2015), but constructed on the Poincaré ball. In addition, there are methods that embed graphs into Lorentz models (Nickel & Kiela, 2018) and that embed each node of a graph as a cone instead of a point (Ganea et al., 2018); the latter describes the hierarchy of each graph according to the inclusion relation of the cones. In previous research, the hierarchical structure could be accurately captured, but the problem of invariance under isometric maps prevented the natural definition of a metric that would represent the hierarchical structure. Moreover, embedding parameters in non-Euclidean spaces makes them difficult to use in other tasks. In contrast, our proposed method can use a Euclidean embedded representation for all parameters except the one-dimensional height parameters. The purpose is to provide an embedding method that is easy to use for other tasks and provides a natural indicator of hierarchical information.
3. Methods
From this point onward, the set of edges in an undirected graph G is denoted by E, the set of vertices by V, and the embedding space by X. Following Poincaré embedding, we learn the embedding of a graph G by maximizing the following objective function:

L = Σ_{(u,v)∈E} log [ exp(−d(u,v)) / Σ_{v'∈N(u)} exp(−d(u,v')) ],   (1)

where N(u) := {v' ∈ V | (u,v') ∉ E} denotes the set of points not adjacent to node u (including u itself), and d denotes the distance function of the embedding space (for Poincaré embedding, the distance of the Poincaré ball). This objective function is a negative-sampling approximation of a model in which the similarity is −1 times the distance, and the probability of the existence of each edge is represented by a softmax function of the similarity:

Σ_{(u,v)∈E} log [ exp(−d(u,v)) / Σ_{v'∈V} exp(−d(u,v')) ] ≈ Σ_{(u,v)∈E} log [ exp(−d(u,v)) / Σ_{v'∈N(u)} exp(−d(u,v')) ].   (2)

The objective function is maximized by stochastic gradient descent on Riemannian manifolds (Riemannian SGD). Stochastic gradient descent over Euclidean space updates the parameters as follows:

u ← u − η ∇_u L(u),   (3)

where η is the learning rate. However, in non-Euclidean spaces the sum of vectors is not defined, and ∇_u L(u) is a point of the tangent space T_u X at u; hence, SGD cannot be applied directly. Therefore, we update the parameters by using the exponential map instead of the sum:

u ← exp_u(−η ∇^R_u L(u)).   (4)

With the metric of the embedding space denoted g_u (u ∈ V), the gradient on the Riemannian manifold ∇^R_u L(u) is the rescaled Euclidean gradient:

∇^R_u L(u) = g_u^{−1} ∇_u L(u).   (5)

The metric cone is similar to ordinary cones (e.g., circular cones) in the sense that it is defined as a collection of line segments connecting an apex point to a given set.
However, the metric cone has the notable property that every point of the original set is embedded at an equal distance from the apex point, which is desirable for hierarchical structure extraction. The metric cone has been studied as an analogue, for length metric spaces, of the tangent cone for differential manifolds with singularities. A length metric space is a metric space in which the distance between any two points equals the length of the shortest curve connecting them. Length metric spaces include Euclidean spaces, normed vector spaces, manifolds (e.g., the Poincaré ball and the sphere), metric graphs, and many other metric spaces. Assume the original space Z is a length metric space; then the metric cone generated by Z is X := Z × [0,1] / (Z × {0}), with a distance function determined as follows:

d̄_β((x,s),(y,t)) = β √( t² + s² − 2ts cos(π min(d_Z(x,y)/β, 1)) ),   (6)

where β > 0 is a hyperparameter corresponding to the length of the conical generatrix. Note that the metric cone itself also becomes a length metric space, and it embeds the original space (i.e., the space is one dimension larger than the original space). The distance in the metric cone corresponds to the length of the shortest curve on the circle section (the blue line segment(s) in the right two subfigures in Figure 1) whose bottom arc length is the distance in the original space Z and whose radius is β. When the curvature is measured in the sense of the CAT(k) property, a curvature measure for general length metric spaces, the curvature value k can be controlled by β. Other properties of the metric cone are examined in (Sturm, 2003) and (Deza & Deza, 2009). Because the metric cone can change the curvature of the space by changing the parameter β, its usefulness has been reported in an analysis using the structure of the data space (Kobayashi & Wynn, 2020).
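As a concrete illustration, the cone distance of Eq. (6) can be computed directly. The sketch below is ours, not the authors' code; it assumes the original space Z is Euclidean and that points are given as coordinate tuples with their heights:

```python
import math

def cone_distance(x, s, y, t, beta, d_Z=math.dist):
    """Cone distance of Eq. (6): law of cosines on the circle section of
    radius beta, with the angle proportional to the original-space
    distance d_Z(x, y) and capped at pi."""
    angle = math.pi * min(d_Z(x, y) / beta, 1.0)
    return beta * math.sqrt(t**2 + s**2 - 2.0 * t * s * math.cos(angle))
```

In particular, the distance from any point (x, s) to the apex (height 0) evaluates to βs, independent of x, which is exactly the equal-distance-from-the-apex property noted above.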
Figure 1. The left figure depicts a conceptual image of an original space and its metric cone. The circle section used to compute the distance in the metric cone is depicted in the middle figure (when the apex angle is < π) and the right figure (when the apex angle is ≥ π).

The metric ḡ of a metric cone is obtained by calculating the second derivative of the distance as follows (see Appendix A for more details):

ḡ_{(x,s)} = ( π²s² g_x   0 ; 0   β² ),   (7)

where g_x represents the metric of Z at x.

3.2.1. Score Function of Hierarchy
Poincaré embedding defines an index, intended as an indicator of the hierarchical structure, that depends on the distance from the origin:

score(is-a(u, v)) = −(1 + α(‖v‖ − ‖u‖)) d(u, v).   (8)

This score function is penalized by the term after α, so if v is closer to the origin than u, it is easier to obtain larger values. In other words, v is higher in the hierarchy than u (i.e., the "u is a v" relationship holds). However, it is not appropriate to use this indicator for Poincaré embedding. The model learns the embedding by maximizing the loss function

L = Σ_{(u,v)∈E} log [ exp(−d_H(u,v)) / Σ_{v'∈N(u)} exp(−d_H(u,v')) ],   (9)

where

d_H(x,y) := arcosh( 1 + 2 ‖x − y‖² / ((1 − ‖x‖²)(1 − ‖y‖²)) ).   (10)

This loss function depends only on the distance between the two embeddings. However, an isometric transformation of the Poincaré ball exists, known as the Möbius transformation (Loustau, 2020). A Möbius transformation is a map f: Bⁿ (the open unit ball) → Bⁿ that can be written as a product of inversions of R̂ⁿ (:= Rⁿ ∪ {∞}) through spheres that preserve Bⁿ. In contrast to the Poincaré ball, no isometric transformation exists on the metric cone when the coordinates in the original space are fixed (we prove this property in Section 3.2.2). When we embed a graph into a metric cone, we define an indicator of the hierarchical structure by replacing the norm with the coordinate corresponding to the height of the cone (the one-dimensional parameter added to the original space). A point closer to the apex of the cone is higher in the hierarchy; by analogy, a point closer to the bottom of the cone is lower in the hierarchy. Therefore, we have a natural indicator of the hierarchical structure.
3.2.2. Identifiability of the Heights in Cone Embedding
Let Z be an original embedding space (a length metric space), and let X be the metric cone of Z with a parameter β > 0. We assume that each data point z_i ∈ Z (i = 1, …, n) has its specific "height" t_i ∈ [0,1] in the metric cone X. Our proposed method embeds data points into a metric cone based on the estimated distances d̄_β(x_i, x_j) (i, j = 1, …, n) and tries to compute the heights t_1, …, t_n as a measure of hierarchy level. However, it is not evident whether these heights are identifiable only from the information of the original data points in Z and the distances d̄_β(x_i, x_j) (i, j = 1, …, n) in the metric cone. The following theorem guarantees some identifiability.

Theorem 1
(a) Let n ≥ 3 and assume that z_1, …, z_n are not all aligned on a geodesic in Z. Then, the heights t_1, …, t_n are identifiable up to at most four candidates.
(b) Let n ≥ 4 and assume z_1, …, z_n and t_1, …, t_n take "general" positions and heights, respectively. Then, the heights t_1, …, t_n are identifiable uniquely.
(c) If d_Z(z_i, z_j) ≥ β/2 for all i, j = 1, …, n, i ≠ j, then the heights t_1, …, t_n are identifiable uniquely.

A rigorous version of Theorem 1, including the precise meaning of "general" in (b), is given in Appendix C. Theorem 1(a) indicates that the candidate heights are finite in number, and we can expect the algorithm to converge to one of them, except for very special data distributions in the original space Z. Moreover, by (b), even uniqueness can be proved under very mild conditions. The statement in (c) implies that uniqueness holds for arbitrary data distributions when we set β sufficiently small.

Remark that the assumption of "general" positions in Theorem 1(b) is satisfied easily for most data distributions. For example, if both z_1, …, z_n ∈ R^d and t_1, …, t_n ∈ [0,1] are i.i.d.
from a probability distribution whose density function exists with respect to the Lebesgue measure, then it is easy to see that the assumption holds almost surely, and therefore uniqueness of the solution is guaranteed. Note that for n = 3 under the same setting, there can be multiple solutions with positive probability.

3.2.3. Using Pretrained Model for Computational Efficiency and Scalability
Consider a situation where we already have a trained graph embedding in a Euclidean space (e.g., LINE), and we try to learn the embedding in a metric cone of that Euclidean space to extract information about the hierarchical structure. In this case, we can reduce the computational cost by fixing the coordinates corresponding to the original Euclidean space and learning only the one-dimensional parameters corresponding to heights in the metric cone, because the metric cone is one dimension larger than the original space. In addition, because the embedding in the original Euclidean space is preserved, it can be used as input to a neural network (when the task involves hierarchical information, the added one-dimensional parameters are also used as input) and can be easily applied to other tasks. In contrast, other non-Euclidean embedding methods for extracting hierarchical structures are not scalable, because they cannot be applied directly to other tasks. For example, deep neural networks cannot use a non-Euclidean embedding as input, because the sum of two vectors and the scalar product are not generally defined in such spaces.
3.2.4. Variable Curvature
One of the essences of Poincaré embedding is that the negative curvature of the Poincaré ball is suitable for embedding tree graphs. The curvature of a metric cone has a similar property: a metric cone has a more negative curvature than the original space, and furthermore, the curvature can be controlled by the hyperparameter β. We verify these facts mathematically from two different aspects: (i) the scalar and Ricci curvatures of a Riemannian manifold and (ii) the CAT(k) property of a length metric space.

First, assume the original space M is an n-dimensional Riemannian manifold with a metric g. Then the metric g̃ of the corresponding metric cone with parameter β can be defined everywhere except the apex, and it is given by (7). Let 1, …, n be the coordinate indices corresponding to x ∈ M and 0 the index corresponding to s ∈ [0,1]. The Ricci curvatures R̃_{ij} and the scalar curvature R̃ at (x,s) become

R̃_{αγ} = R_{αγ} − π²(n−1)β^{−2} g_{αγ},   (11)
R̃_{α0} = R̃_{0α} = R̃_{00} = 0,   (12)
R̃ = { π^{−2} R − n(n−1)β^{−2} } s^{−2},   (13)

where α, γ are coordinate indices in 1, …, n, and R_{ij} and R are the Ricci curvatures and the scalar curvature of M, respectively. See Appendix B for the derivation of these curvatures. The scalar curvature and the Ricci curvatures R̃_{αγ} become more negative than (a constant times) the original curvatures for β < ∞ and n ≥ 2. Moreover, a smaller value of β makes the curvature more negative; thus it becomes possible to control the curvature by tuning β. Note that the closer to the apex, i.e., the smaller the value of s, the greater the change in the scalar curvature.

Second, assume the original space M is a length metric space. This does not require a differentiable structure and is more general than a Riemannian manifold. In this case, we cannot discuss the curvatures using a Riemannian metric, but the CAT(k) property can be used instead.
In (Kobayashi & Wynn, 2020), it is proved that the curvature of the metric cone is more negative than or equal to the curvature of the original space and that it can be controlled by β in the sense of the CAT(k) property.
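To make the dependence of the scalar curvature (13) on β and s concrete, it can be evaluated numerically. The following is a sketch under our reading of (13); the function name and example values are ours:

```python
import math

def cone_scalar_curvature(R, n, beta, s):
    """Scalar curvature of the metric cone at height s (Eq. 13), where R and
    n are the scalar curvature and dimension of the original space."""
    return (R / math.pi**2 - n * (n - 1) / beta**2) / s**2
```

For a flat original space (R = 0), the cone curvature is strictly negative for n ≥ 2; shrinking β makes it more negative, and the effect is amplified near the apex (s → 0), matching the discussion above.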
4. Experiments
The claim of this paper is that a hierarchical structure can be captured by adding a one-dimensional parameter and embedding in a metric cone. Therefore, we evaluate the proposed method in two experiments:
• embedding graphs (e.g., a social network);
• embedding taxonomies (e.g., WordNet).
We compare the proposed method with two other methods: Poincaré embedding, which is known to capture the hierarchical structure of graphs, and ordinary embedding in Euclidean space.
Here, the accuracy of the metric cone embedding is evaluated by graph reconstruction. Following the experiments for Poincaré embedding in (Nickel & Kiela, 2017), we evaluate the proposed method on four networks of coauthors of research papers: ASTROPH, CONDMAT, GRQC, and HEPPH. The graphs are designed so that each node represents a researcher; if there is a paper co-authored by two researchers, there is an edge between the corresponding nodes. In the graph embedding, all of the graph data are used for training, and the results are evaluated according to the accuracy with which the graph is reconstructed from the learned embedding. Because the same data are used for training and evaluation, we evaluate the fit of the embedding method to the data (not generalization performance).

The accuracy is calculated as follows:
1. Fix one node and calculate the distance to all other nodes. Sort the nodes in order of proximity.
2. Calculate the average of the rankings of the neighboring nodes and the average precision.
3. Perform the above two operations on all nodes and take the average.

The experimental results are shown in Table 1 and Table 2, where MR is the mean rank and MAP is the mean average precision. The tables confirm the effectiveness of the embedding method in a metric cone. See Appendix D for further experimental results.
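The evaluation steps above can be sketched as follows. This is our illustrative implementation (not the authors' code); it computes MR and MAP from a precomputed pairwise distance matrix and a boolean adjacency matrix:

```python
import numpy as np

def mean_rank_and_map(dist, adj):
    """For each node, rank all other nodes by embedding distance, then average
    the ranks of the true neighbours (MR) and the average precision (MAP)."""
    n = dist.shape[0]
    all_ranks, aps = [], []
    for u in range(n):
        order = [v for v in np.argsort(dist[u]) if v != u]   # exclude self
        rank_of = {v: r + 1 for r, v in enumerate(order)}
        neigh = sorted(rank_of[v] for v in range(n) if adj[u, v] and v != u)
        if not neigh:
            continue
        all_ranks.extend(neigh)
        # average precision over the sorted neighbour ranks
        aps.append(np.mean([(k + 1) / r for k, r in enumerate(neigh)]))
    return float(np.mean(all_ranks)), float(np.mean(aps))
```

A perfect embedding, where every neighbour is nearer than every non-neighbour, yields MR = 1 and MAP = 1.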
Model          Metric |            GRQC              |           CONDMAT
                      |   10     20     50    100    |   10     20     50    100
Euclidean      MR     |  3.16   3.16   3.16   3.16   | 29.37   8.18   8.16   8.16
               MAP    | 0.999  0.999  0.999  0.999   | 0.745  0.990  0.996  0.997
Poincaré       MR     | 27.77  26.62  25.74  26.12   | 415.34 389.34 382.65 380.41
               MAP    | 0.879  0.881  0.883  0.882   | 0.721  0.727  0.729  0.729
Our Model      MR     |  3.16   3.16   3.16   3.16   | 27.85   8.18   8.16   8.16
(Metric Cone)  MAP    | 0.999  0.999  0.999  0.999   | 0.746  0.990  0.996  0.997
Table 1. MAP and rank score for Reconstruction and Link Prediction
Model          Metric |           ASTROPH            |           HEPPH
                      |   10     20     50    100    |   10     20     50    100
Euclidean      MR     | 117.67  29.43   3.83   3.83  | 14.11   2.73   2.61   2.61
               MAP    | 0.597  0.882  0.998  0.999   | 0.776  0.994  0.999  0.999
Poincaré       MR     | 671.59 654.60 641.54 637.66  | 194.23 190.90 189.09 188.60
               MAP    | 0.536  0.540  0.544  0.545   | 0.658  0.661  0.661  0.661
Our Model      MR     | 108.57  26.82   3.83   3.83  | 13.58   2.71   2.61   2.61
(Metric Cone)  MAP    | 0.611  0.885  0.998  0.999   | 0.775  0.994  0.999  0.999
Table 2. MAP and rank score for Reconstruction and Link Prediction
Following the Poincaré embedding experiments, we evaluate the embedding accuracy of the hierarchical structure using WordNet. We embed the nouns in WordNet into a metric cone and use the following score function, where the height in the cone is an indicator of hierarchy:

score(is-a((u, s), (v, t))) = −(1 + α(t − s)) d(u, v).   (14)

This score function is penalized by the term after α, so if t is closer to the apex than s, it is easier to obtain larger values. In other words, (v, t) is higher in the hierarchy than (u, s) (i.e., the "(u, s) is a (v, t)" relationship holds). In addition, the first term is used to avoid misjudging the relationship between words with no relationship. Note that hyperparameter α was set to . The output of this score function and the correlation coefficient on the HyperLex dataset (Spearman's ρ) are used to evaluate the ability of the model to represent the hierarchical structure. The correlation coefficients and the embedding accuracy (mean rank (MR) and mean average precision (MAP)) are shown in Table 3. The table shows that our proposed model improves the score and captures the hierarchical structure better than the other embedding methods.

Furthermore, an example visualization of the hierarchical structure of the embedding vectors obtained by training is shown in Figure 2. As the figure illustrates, the closer the coordinate corresponding to the height in the cone is to zero (closer to the top of the cone), the higher the noun is located in the hierarchy of the embedded representation. For visualization, the embedding vectors in Euclidean space were reduced to two dimensions by principal component analysis.

Model          Metric |    10       20      50     100
Euclidean      MR     | 1471.70  232.88    2.51    1.82
               MAP    |  0.070    0.122   0.838   0.899
Poincaré       MR     |  19.94   19.62   19.47   19.36
               MAP    |  0.528   0.534   0.537   0.538
Our Model      MR     | 1401.28  209.11    2.30    1.79
(Metric Cone)  MAP    |  0.052    0.126   0.853   0.902
               corr   |  0.183    0.372   0.409   0.411

Table 3.
Embedding accuracy for WordNet (corr represents Spearman's ρ on HyperLex)

Figure 2.
Visualization of the WordNet embedding using a metric cone: each point is a word, and two points connected by a directed edge indicate that the word at the end of the arrow is a hyponym of the word at the start.
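The cone score function of Eq. (14) can be sketched as follows. This is our illustration, not the authors' code: the base space is taken to be Euclidean, α = 1000 is an arbitrary choice, and the sign convention is the one used here, under which the score increases when the hypernym candidate lies nearer the apex (mirroring Eq. (8)):

```python
import math

def cone_distance(x, s, y, t, beta):
    # cone distance of Eq. (6), with a Euclidean original space
    angle = math.pi * min(math.dist(x, y) / beta, 1.0)
    return beta * math.sqrt(t**2 + s**2 - 2.0 * t * s * math.cos(angle))

def is_a_score(x, s, y, t, beta, alpha=1000.0):
    """Eq. (14): a larger score supports '(x, s) is a (y, t)', i.e. y sits
    nearer the apex (t < s) and therefore higher in the hierarchy."""
    return -(1.0 + alpha * (t - s)) * cone_distance(x, s, y, t, beta)
```

Given two embedded words, comparing the score in both directions then indicates which word is the likely hypernym.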
In conducting experiments on our proposed method, we tuned the batch size, the number of epochs, the number of negative samples, the learning rate, and the metric cone parameter β. For tuning, we performed a grid search on the HEPPH dataset, which is relatively small among the datasets used in this experiment, and trained on the other graph datasets (GRQC, ASTROPH, CONDMAT) and WordNet using the tuned parameters. The tuning was conducted with the following settings:
• Batch sizes: 2048, 1024, 512, 256
• Number of epochs: 2000, 1500, 1000, 500
• Number of negative samples: 50, 20, 10, 5
• Learning rate: 100.0, 10.0, 1.0, 0.1
• Metric cone parameter β: 50, 10, 1, 0.1
Based on the results of the search with these settings, the experiments reported in the tables use batch size 256, 2000 epochs, 50 negative samples, learning rate 10.0 (when embedding graphs) and 1.0 (when embedding taxonomies), and β = 50. As for the batch size, the number of epochs, and the number of negative samples, as with the tuning of an ordinary neural network, a smaller batch size, more epochs, and more negative samples gave better results. As for the learning rate, because of the differing sparsity of the datasets used in this study, we obtained better results by varying the learning rate according to the data. The experimental results of our proposed method were highly dependent on β, the parameter of the metric cone. One possible reason is that fixing β upper-bounds the distance between two points in the space:

d̄_β((x, s), (y, t)) ≤ 2β.   (15)

The curvature of the data space needs to be optimized in order to embed graph data efficiently, since the curvature of the space changes with β. Remark that, by (11) and (13), the effect of changes in parameter β on the Ricci and scalar curvatures is greater if the original space has higher dimension.
Therefore, delicate tuning of the parameter β is required, especially in the case of high dimensions.
5. Discussion and Future Work
In this study, we have demonstrated that graph embedding in a metric cone, one dimension larger than the original embedding space, has the following advantages: 1) we naturally define an index (score function) as an indicator of hierarchy, 2) we improve computational efficiency and scalability to various tasks, and 3) we enhance accuracy by changing the curvature.

Future research topics include 1) efficient optimization of the curvature, 2) development of an embedding method that updates existing embeddings during learning, and 3) discovery of applications to other tasks. In this study, we measured the improvement in accuracy from curvature optimization by grid search; however, more efficient methods, such as gradient-based methods, could be used to optimize the embedding space. We also learned the embedding in a metric cone under the constraint of not updating existing embeddings. When we instead learned the metric cone embedding from a random initialization, the training accuracy was not satisfactory. One possible reason is that optimization in metric cones is difficult and tends to fall into local solutions. Therefore, to optimize the entire embedding, an efficient method for optimizing functions on a metric cone (or Riemannian manifold) should be developed in future work.
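The constrained training discussed above (a frozen pretrained base embedding, with only the per-node heights updated) can be sketched as a toy example. This is our illustration, not the authors' implementation: it uses a Euclidean base space, a plain sum of edge distances instead of the negative-sampling objective (1), and a numerical gradient instead of Riemannian SGD:

```python
import numpy as np

def train_heights(base, edges, beta, lr=0.05, epochs=50, eps=1e-4):
    """Toy sketch: keep the pretrained Euclidean coordinates `base` frozen
    and learn only the per-node heights by descending the summed cone
    distances over the edges."""
    h = np.full(len(base), 0.5)  # heights: the only trainable parameters

    def d(i, j, hi, hj):
        ang = np.pi * min(np.linalg.norm(base[i] - base[j]) / beta, 1.0)
        return beta * np.sqrt(hi**2 + hj**2 - 2.0 * hi * hj * np.cos(ang))

    for _ in range(epochs):
        for i, j in edges:
            # central-difference gradient with respect to h[i] only
            g = (d(i, j, h[i] + eps, h[j]) - d(i, j, h[i] - eps, h[j])) / (2 * eps)
            h[i] = np.clip(h[i] - lr * g, 1e-3, 1.0)
    return h
```

Because only n scalar parameters are updated and the base coordinates never move, each epoch is cheap, which is the computational advantage argued for in Section 3.2.3.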
6. Acknowledgement
The idea of applying metric cones to data science was born during a collaboration with Henry P. Wynn.
References
Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., and Alemi, A. A. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, pp. 9180–9190, 2018.

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pp. 585–591, 2002.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795, 2013.

Cao, S., Lu, W., and Xu, Q. Deep neural networks for learning graph representations. In AAAI, volume 16, pp. 1145–1152, 2016.

Cox, D., Little, J., and O'Shea, D. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer Science & Business Media, 2013.

Deza, M. M. and Deza, E. Encyclopedia of distances. In Encyclopedia of Distances, pp. 1–583. Springer, 2009.

Ganea, O., Becigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of Machine Learning Research, volume 80, pp. 1646–1655, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/ganea18a.html.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, 2016.

Hoff, P. D., Raftery, A. E., and Handcock, M. S. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

Janson, S. Riemannian geometry: some examples, including map projections. Notes, 2015.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

Klimovskaia, A., Lopez-Paz, D., Bottou, L., and Nickel, M. Poincaré maps for analyzing complex hierarchies in single-cell data. Nature Communications, 11(1):1–9, 2020.

Kobayashi, K. and Wynn, H. P. Empirical geodesic graphs and CAT(k) metrics for data analysis. Statistics and Computing, 30(1):1–18, 2020.

Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

Loustau, B. Hyperbolic geometry. arXiv preprint arXiv:2003.11180, 2020.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347, 2017.

Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Proceedings of Machine Learning Research, volume 80, pp. 3779–3788, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/nickel18a.html.

Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, 2014.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Sturm, K.-T. Probability measures on metric spaces of nonpositive curvature. Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces: Lecture Notes from a Quarter Program on Heat Kernels, Random Walks, and Analysis on Manifolds and Graphs, April 16–July 13, 2002, Émile Borel Centre of the Henri Poincaré Institute, Paris, France, 338:357, 2003.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077, 2015.
Proceedings of the 24th international conference on world wide web , pp. 1067–1077, 2015.Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science , 290(5500):2319–2323, 2000.Wang, D., Cui, P., and Zhu, W. Structural deep network embedding. In
Proceedings of the 22nd ACM SIGKDD internationalconference on Knowledge discovery and data mining , pp. 1225–1234, 2016. nhancing Hierarchical Information by Using Metric Cones for Graph Embedding
Appendix
A. Derivation of the metric tensor of a metric cone
Let $M$ be an $n$-dimensional Riemannian manifold with a metric $g$. Then the metric $\bar g$ of the corresponding metric cone $\tilde M = \tilde M_\beta$ can be defined everywhere except at the apex. Denote the square of the infinitesimal distance in $\tilde M$ by $|d\tilde s|^2$; then
$$
\begin{aligned}
|d\tilde s|^2 &= \bar d_\beta\bigl((x, r), (x + dx, r + dr)\bigr)^2 \\
&= \beta^2 \Bigl( r^2 + 2r\,dr + dr^2 - 2(r + dr)\,r \cos\bigl(\pi \min(d_M(x, x + dx)/\beta, 1)\bigr) \Bigr) \\
&\approx \beta^2 \Bigl( r^2 + 2r\,dr + dr^2 - 2(r^2 + r\,dr)\Bigl(1 - \tfrac{1}{2}\bigl(\pi d_M(x, x + dx)/\beta\bigr)^2\Bigr) \Bigr) \\
&\approx \beta^2 dr^2 + \pi^2 r^2 \sum_{i,j} g_{ij}\,dx^i dx^j + \pi^2 r\,dr \sum_{i,j} g_{ij}\,dx^i dx^j \\
&\approx \begin{pmatrix} dx \\ dr \end{pmatrix}^{\!\top}
\begin{pmatrix} (\pi^2 r^2 g_{ij}) & 0 \\ 0 & \beta^2 \end{pmatrix}
\begin{pmatrix} dx \\ dr \end{pmatrix}. \qquad (16)
\end{aligned}
$$
Therefore, the metric tensor $\bar g$ becomes
$$
\bar g = \begin{pmatrix} \pi^2 r^2 g & 0 \\ 0 & \beta^2 \end{pmatrix}. \qquad (17)
$$

B. Derivation of the Ricci and the scalar curvatures of a metric cone
We will derive the Ricci and scalar curvatures of the metric cone $\tilde M$. Let $0, 1, \ldots, n$ be the coordinate indices of the metric cone $\tilde M$, where $0$ corresponds to the radial coordinate $s \in (0, 1]$ and $1, \ldots, n$ correspond to $x \in M$.

Claim 2
The Ricci curvatures $\tilde R_{ij}$ and the scalar curvature $\tilde R$ become
$$
\tilde R_{\alpha\gamma} = R_{\alpha\gamma} - \pi^2 (n - 1)\beta^{-2} g_{\alpha\gamma}, \qquad
\tilde R_{0\alpha} = \tilde R_{\alpha 0} = \tilde R_{00} = 0, \qquad
\tilde R = \bigl\{ \pi^{-2} R - n(n - 1)\beta^{-2} \bigr\} s^{-2}, \qquad (18)
$$
where $\alpha$ and $\gamma$ are coordinate indices in $1, \ldots, n$ and $R_{ij}$ and $R$ are the Ricci curvatures and the scalar curvature of $M$, respectively.

Proof.
By Example 4.6 of (Janson, 2015), if the metric of $\tilde M$ is defined by the squared infinitesimal distance $|ds|^2$ in $M$ and a $C^2$-class function $w$ on an open interval $J \subset \mathbb{R}$ as
$$
|d\tilde s|^2 = |dr|^2 + w(r)^2 |ds|^2, \qquad (19)
$$
the Ricci curvature tensor becomes
$$
\tilde R_{\alpha\gamma} = R_{\alpha\gamma} - \left( (n - 1)\left(\frac{w'}{w}\right)^2 + \frac{w''}{w} \right) \tilde g_{\alpha\gamma}
= R_{\alpha\gamma} - \left( (n - 1)\left(\frac{w'}{w}\right)^2 + \frac{w''}{w} \right) w^2 g_{\alpha\gamma}, \qquad (20)
$$
$$
\tilde R_{0\alpha} = 0, \qquad \tilde R_{00} = -n \frac{w''}{w}, \qquad (21)
$$
and the scalar curvature becomes
$$
\tilde R = w^{-2} \bigl( R - n(n - 1)(w')^2 - 2 n w w'' \bigr). \qquad (22)
$$
Since the metric of a metric cone $\tilde M$ is given by
$$
|d\tilde s|^2 = \beta^2 |dr|^2 + \pi^2 r^2 |ds|^2, \qquad (23)
$$
by setting $\tilde r := \beta r$ and $w(\tilde r) := \pi \beta^{-1} \tilde r$, we obtain the following form similar to (19):
$$
|d\tilde s|^2 = |d\tilde r|^2 + w(\tilde r)^2 |ds|^2. \qquad (24)
$$
By substituting $w(\tilde r) = \pi \beta^{-1} \tilde r = \pi r$, $w'(\tilde r) = \pi \beta^{-1}$ and $w''(\tilde r) = 0$, we obtain the Ricci and scalar curvatures in Claim 2. $\square$

C. Identifiability of the heights in the cone embedding
In this section, we will prove Theorem 1 of the main article. Let us begin by rewriting Theorem 1 in a longer but more theoretically rigorous form.
Theorem 3 (A rigorous restatement of Theorem 1)
Let $Z$ be a length metric space and $X$ be a metric cone of $Z$ with a parameter $\beta > 0$. Let $n$ be an integer at least 3. Fix $z_i \in Z$ and $x_i := (z_i, t_i) \in X$ with $t_i \in [0, 1]$ for $i = 1, \ldots, n$. Denote a matrix $\tilde D := [\tilde d_\beta(x_i, x_j)]_{i,j=1}^n$.
(a) Assume $z_1, \ldots, z_n$ are not all aligned in a geodesic. Given $z_1, \ldots, z_n$ and $\tilde D$, the number of possible values of $(t_1, \ldots, t_n)$ is at most four.
(b) Let $n \ge 4$ and assume $z_1, \ldots, z_n$ and $t_1, \ldots, t_n$ are in a "general" position. Here "general" position means that, besides the assumption in (a), for any distinct four points $z_i, z_j, z_k, z_l \in Z$ and corresponding heights $t_i, t_j, t_k \in [0, 1]$, the height $t_l$ avoids a certain finite set of exceptional values (but still $t_l$ can take infinitely many values). Then $t_1, \ldots, t_n$ are determined uniquely by $z_1, \ldots, z_n$ and $\tilde D$.
(c) If $d(z_i, z_j) \ge \beta/2$ for all $i, j = 1, \ldots, n$, $i \ne j$, then $t_1, \ldots, t_n$ are determined uniquely by $z_1, \ldots, z_n$ and $\tilde D$.

Before the proof, we will state some remarks.
If $n = 2$, the identifiability problem reduces to an elementary geometric question: given a circle sector as in the right two subfigures of Figure 1 of the main paper and the length of the blue line segment(s) connecting $(x, s)$ and $(y, t)$, can $s$ and $t$ be determined uniquely? The answer is evidently no. But it is notable that there are two types of counterexamples. The first type is as in Figure 3(A): one point moves "up" and the other moves "down". The other type, as in Figure 3(B), is maybe counter-intuitive: both move "up" or "down". Note that the second case does not happen if the angle $\theta$ is larger than or equal to $\pi/2$.
If $n = 3$, the picture becomes a tetrahedron as in Figure 4. Here the angles and edge lengths are defined by
$$
\begin{aligned}
\theta_1 &:= \pi \min(d_Z(z_1, z_2)/\beta, 1), & a_1 &:= \tilde d_\beta(x_1, x_2), \\
\theta_2 &:= \pi \min(d_Z(z_1, z_3)/\beta, 1), & a_2 &:= \tilde d_\beta(x_1, x_3), \qquad (25) \\
\theta_3 &:= \pi \min(d_Z(z_2, z_3)/\beta, 1), & a_3 &:= \tilde d_\beta(x_2, x_3),
\end{aligned}
$$
and $\theta_1 + \theta_2 + \theta_3$ is assumed to be at most $2\pi$.
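The two-solution ambiguity for $n = 2$ described in the remark above can be illustrated numerically. The following is a minimal sketch in plain Python with hypothetical sample values: for a fixed angle $\theta < \pi/2$, two distinct non-negative heights of one endpoint give the same chord length, while for $\theta \ge \pi/2$ the second root of the law-of-cosines quadratic becomes negative, so the Figure 3(B)-type ambiguity disappears.

```python
import numpy as np

def chord(s, t, theta):
    """Chord length between heights s and t on two cone rays meeting at angle
    theta (the law-of-cosines expression underlying the metric-cone distance)."""
    return np.sqrt(s**2 + t**2 - 2 * s * t * np.cos(theta))

theta = np.pi / 3                   # sample angle smaller than pi/2
t, s1 = 0.8, 0.2                    # sample heights (hypothetical values)
s2 = 2 * t * np.cos(theta) - s1     # second root of s^2 - 2 t cos(theta) s + const

print(chord(s1, t, theta), chord(s2, t, theta), s2)   # equal distances, s2 > 0

theta_wide = 2 * np.pi / 3          # an angle >= pi/2
s2_wide = 2 * t * np.cos(theta_wide) - s1
print(s2_wide)                      # negative: no second admissible height
```

Both `(s1, t)` and `(s2, t)` realize the same chord length, so the heights are not identifiable from distances alone when the angle is acute; this is exactly the case excluded by the assumption $d(z_i, z_j) \ge \beta/2$ in part (c).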
Then the geometrical question becomes: "when the apex angles $\theta_1, \theta_2, \theta_3$ and the edge lengths $a_1, a_2, a_3$ of the triangle $\triangle x_1 x_2 x_3$ are given, can the positions of the points $x_1, x_2$ and $x_3$ be determined uniquely?" If it is not unique and there are two different positions of $x_1, x_2$ and $x_3$, at least one edge should move as in Figure 3(B), since it is impossible to move all three edges as in Figure 3(A). But if all of the angles are larger than or equal to $\pi/2$, this cannot happen. This actually gives a geometrical proof of Theorem 3(c).
If $\theta_1 + \theta_2 + \theta_3$ is larger than $2\pi$, the geometric arguments become complicated. We do not need this kind of case analysis when we use algebraic arguments as in the following proof.

Figure 3. Two types of movement for a line segment of constant length.

Figure 4. Metric cone generated by three points $z_1, z_2, z_3 \in Z$.

Now we will prove the theorem. In the proof, we use the Gröbner basis as a tool of computational algebra. See, for example, (Cox et al., 2013) for the definition and applications of the Gröbner basis.
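The Gröbner-basis computation at the heart of the following proof can be reproduced with open-source tools. The sketch below uses SymPy instead of Mathematica, with hypothetical sample values (heights $t = (1, 2, 3)$ and all apex angles $\pi/3$, so $a = b = c = 1$ in the notation of Note 6). It checks that the resulting ideal is zero-dimensional, i.e., that the law-of-cosines system has finitely many complex solutions, and that the $z^4$ coefficient of the last basis element in Note 6, $4 - a^2 - b^2 - c^2 + abc$, equals $4\,v(\theta_1, \theta_2, \theta_3)$ for these values.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
a = b = c = 1                        # 2*cos(pi/3); hypothetical sample angles
d, e, f = 3, 7, 7                    # a_i^2 computed from sample heights t = (1, 2, 3)

# The law-of-cosines system as polynomials (cf. Note 6).
polys = [x**2 + y**2 - a * x * y - d,
         x**2 + z**2 - b * x * z - e,
         y**2 + z**2 - c * y * z - f]

# Graded lexicographic order, matching the deglex order used in the proof.
G = sp.groebner(polys, x, y, z, order='grlex')
print(G.is_zero_dimensional)         # True: finitely many complex solutions

# The heights used to build d, e, f do solve the system.
print([p.subs({x: 1, y: 2, z: 3}) for p in polys])

# z^4 coefficient of the last basis element in Note 6 vs 4*v from (27).
cos_t = sp.Rational(1, 2)            # cos(pi/3)
v = 1 + 2 * cos_t**3 - 3 * cos_t**2
print(4 - a**2 - b**2 - c**2 + a * b * c, 4 * v)
```

This is only a numeric spot check of one instance, not a replacement for the symbolic computation in Note 6, but the zero-dimensionality criterion it tests is exactly the one invoked in the proof of (a).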
Proof. (a) Since the maximum number of possible values of $(t_1, \ldots, t_n)$ does not increase with $n$, it is enough to prove the statement for $n = 3$. We set $\theta_1, \theta_2, \theta_3 \in [0, \pi]$ and $a_1, a_2, a_3 \ge 0$ as in (25). Then, by the law of cosines,
$$
\begin{aligned}
t_1^2 + t_2^2 - 2 t_1 t_2 \cos\theta_1 &= a_1^2, \\
t_1^2 + t_3^2 - 2 t_1 t_3 \cos\theta_2 &= a_2^2, \qquad (26) \\
t_2^2 + t_3^2 - 2 t_2 t_3 \cos\theta_3 &= a_3^2.
\end{aligned}
$$
We consider this as a system of polynomial equations with variables $t_1, t_2, t_3$ and compute the Gröbner basis of the ideal generated by the corresponding polynomials in the degree lexicographic monomial order (deglex) with $t_1 > t_2 > t_3$ by Mathematica. Then the output becomes as in Note 6 and the basis includes $-t_1^2 + (2\cos\theta_2)\, t_1 t_3 - t_3^2 + a_2^2$, $-t_2^2 + (2\cos\theta_3)\, t_2 t_3 - t_3^2 + a_3^2$ and $v(\theta_1, \theta_2, \theta_3)\, t_3^4 + (\text{terms of degree} \le 3)$, where
$$
v(\theta_1, \theta_2, \theta_3) := 1 + 2 \cos\theta_1 \cos\theta_2 \cos\theta_3 - \cos^2\theta_1 - \cos^2\theta_2 - \cos^2\theta_3. \qquad (27)
$$
Note that when $\theta_1 + \theta_2 + \theta_3 \le 2\pi$, $\frac{t_1 t_2 t_3}{6}\sqrt{v(\theta_1, \theta_2, \theta_3)}$ is a formula for the volume of the tetrahedron whose base triangle is $\triangle x_1 x_2 x_3$ and, therefore, $v(\theta_1, \theta_2, \theta_3)$ has a positive value unless the tetrahedron degenerates. By the assumption, $z_1, z_2, z_3$ are not aligned in a geodesic and therefore the tetrahedron does not degenerate and $v(\theta_1, \theta_2, \theta_3)$ must be nonzero. Note that this becomes negative when $\theta_1 + \theta_2 + \theta_3 > 2\pi$.
On the other hand, it is known that a system of polynomial equations with a Gröbner basis $G$ has a finite number of (complex) solutions if and only if, for each variable $x$, $G$ contains a polynomial with a leading monomial that is a power of $x$. Now all variables $t_1, t_2$ and $t_3$ satisfy this property; thus we conclude there are at most a finite number of solutions. Then, by Bézout's theorem, the number of solutions is at most the product of the degrees of the three polynomial equations, i.e. $2 \times 2 \times 2 = 8$. But if $(t_1, t_2, t_3)$ is a solution, $(-t_1, -t_2, -t_3)$ is also a solution, and only one of each pair can satisfy $t_1, t_2, t_3 \ge 0$.
Thus we conclude the number of possible values of $(t_1, t_2, t_3)$ is at most four.
(b) By the assumptions in (a), without loss of generality, we can assume $z_1, z_2, z_3$ are not aligned in a geodesic. By the result of (a), given $z_1, z_2, z_3$ and the distances $\tilde d_\beta(x_1, x_2)$, $\tilde d_\beta(x_1, x_3)$, $\tilde d_\beta(x_2, x_3)$, there are at most four variations of the values of $(t_1, t_2, t_3)$. Here we assume $t_1$ can take multiple values, including $\hat t_1$ and $\check t_1$.
Suppose, in addition to the above, the values of $z_4$ and $\tilde d_\beta(x_1, x_4) (=: a_4)$ are given and let $\theta_4 := \pi \min(d_Z(z_1, z_4)/\beta, 1)$. Then both $\hat t_1$ and $\check t_1$ satisfy $t_1^2 + t_4^2 - 2 t_1 t_4 \cos\theta_4 = a_4^2$ and therefore $2 t_4 \cos\theta_4 = \hat t_1 + \check t_1$ must hold. Since $\hat t_1$ and $\check t_1$ are different non-negative values, $\hat t_1 + \check t_1 > 0$ and, therefore, $\cos\theta_4 \ne 0$. Hence we obtain $t_4 = (\hat t_1 + \check t_1)/(2 \cos\theta_4)$.
This means that if $t_4$ takes any value except $(\hat t_1 + \check t_1)/(2 \cos\theta_4)$, at most one of $\hat t_1$ and $\check t_1$ can be a solution. We can reduce each pairwise ambiguity of the (at most) four possibilities of $(t_1, t_2, t_3)$ one by one similarly. Finally, $(t_1, t_2, t_3)$ are determined uniquely for all except at most $\binom{4}{2} = 6$ values of $t_4$. But such finite values of $t_4$ can be neglected thanks to the assumption of "general" position in the theorem. Since the same argument holds for any triplets, the statement has been proved.
(c) If $(t_1, \ldots, t_n)$ can take multiple values, without loss of generality we can assume $(t_1, t_2, t_3)$ takes multiple values. By the assumption in the theorem, $\theta_1, \theta_2, \theta_3 \ge \pi/2$ and therefore all coefficients in each equation of (26) become positive. Thus, if $t_i$ increases/decreases then $t_j$ must decrease/increase for $(i, j) = (1, 2), (2, 3), (3, 1)$, but this cannot happen simultaneously. Hence $(t_1, t_2, t_3)$ cannot take multiple values.
Note that all of this proof works even when $\theta_1 + \theta_2 + \theta_3$ is larger than $2\pi$. $\square$

Remark 4
The assumption in Theorem 3(a) is necessary. If the assumption fails, the tetrahedron degenerates and $x_1, x_2, x_3$ and the apex $O$ are all in a plane. When $O$ happens to be on a circle passing through $x_1, x_2$ and $x_3$, move $O$ to another point $O'$ on the same circle. Then the angles corresponding to $\theta_1, \theta_2, \theta_3$ do not change, by the inscribed angle theorem. By an elementary geometrical argument, the new position of $x_1, x_2, x_3$ and $O'$ gives another solution of $t_1, t_2, t_3$. Hence there are obviously an infinite number of solutions.

Remark 5
The assumption of "general" positions of $z_1, \ldots, z_n$ in Theorem 3(b) is satisfied easily for most data distributions. For example, if both $z_1, \ldots, z_n \in \mathbb{R}^d$ and $t_1, \ldots, t_n \in [0, 1]$ are i.i.d. from a probability distribution whose density function exists with respect to the Lebesgue measure, then it is easy to see that the assumption holds almost surely, and therefore uniqueness of the solution is guaranteed. Note that for $n = 3$ under the same setting, there can be multiple solutions with positive probability.

Note 6
Computation of the Gröbner basis by Mathematica:
For simplicity, we put $x := t_1$, $y := t_2$, $z := t_3$, $a := 2\cos\theta_1$, $b := 2\cos\theta_2$, $c := 2\cos\theta_3$, $d := a_1^2$, $e := a_2^2$ and $f := a_3^2$. Note that the second, first and last polynomials in the output correspond to $-t_1^2 + (2\cos\theta_2)\, t_1 t_3 - t_3^2 + a_2^2$, $-t_2^2 + (2\cos\theta_3)\, t_2 t_3 - t_3^2 + a_3^2$ and $v(\theta_1, \theta_2, \theta_3)\, t_3^4 + (\text{terms of degree} \le 3)$ in the proof, respectively.
--------------------------------------------------------------
In := GroebnerBasis[{x^2 + y^2 - a*x*y - d, x^2 + z^2 - b*x*z - e,
  y^2 + z^2 - c*y*z - f}, {x, y, z},
  MonomialOrder -> DegreeLexicographic]
Out = {f - y^2 + c y z - z^2,
 e - x^2 + b x z - z^2,
 d - x^2 + a x y - y^2,
 d x - e x + a e y - x y^2 - b d z + b y^2 z + x z^2 - a y z^2,
 -c d x + c e x - a c e y + b f y + c x y^2 - b y^3 + b c d z - a f z + a y^2 z - c x z^2 - b y z^2 + a z^3,
 a f x + d y - f y - x^2 y - c d z + c x^2 z - a x z^2 + y z^2,
 b f x - c e y + c x^2 y - b x y^2 + e z - f z - x^2 z + y^2 z,
 -c e x + a b f x + c x^3 + b d y - b f y - b x^2 y - b c d z + a e z - a x^2 z + c x z^2 + b y z^2 - a z^3,
 a c d x - a c e x + b f x + c d y - c e y + a^2 c e y - a b f y - b x y^2 + a b y^3 - c y^3 - a b c d z + e z - f z + a^2 f z - x^2 z + y^2 z - a^2 y^2 z + a c x z^2 + a b y z^2 - a^2 z^3,
 -a e f - d x y + e x y + f x y + c d x z - c e x z + b d y z - b f y z - b c d z^2 + a e z^2 + a f z^2 - 2 x y z^2 + c x z^3 + b y z^3 - a z^4,
 -c d e + c e^2 - a b e f + c d x^2 - c e x^2 - b d x y + b e x y + b f x y - a f x z - d y z + b^2 d y z + 2 e y z + f y z - b^2 f y z - x^2 y z + 2 c d z^2 - b^2 c d z^2 + a b e z^2 - 2 c e z^2 + a b f z^2 + a x z^3 - 3 y z^3 + b^2 y z^3 - a b z^4 + c z^4,
 -d x + a^2 d x + a b c d x + 2 e x - a^2 e x - a b c e x - a^2 f x + b^2 f x - x^3 + b c d y - a e y + a^3 e y - b c e y + a^2 b c e y + a f y - a b^2 f y + x y^2 - b^2 x y^2 - a y^3 + a b^2 y^3 - b c y^3 + b d z - a^2 b d z + a c d z - a b^2 c d z + b e z - a c e z - b f z + a^2 b f z - 2 x z^2 + 2 a^2 x z^2 - a^3 y z^2 + a b^2 y z^2 - a^2 b z^3 + a c z^3,
 -c d^2 + c d e + a b d f + c d x^2 - c e x^2 + b f x y - a b d y^2 + 2 c d y^2 - 2 c e y^2 + a^2 c e y^2 - a b f y^2 - b x y^3 + a b y^4 - c y^4 - a d x z + a e x z - a f x z - 2 d y z + e y z - a^2 e y z - f y z + a^2 f y z + x^2 y z + 3 y^3 z - a^2 y^3 z,
 d^2 - 2 d e + c^2 d e + e^2 - c^2 e^2 - 2 d f + b^2 d f + 2 e f - a^2 e f + a b c e f + f^2 - b^2 f^2 - c^2 d x^2 + c^2 e x^2 + b c d x y - b c e x y - b c f x y - b^2 d y^2 + b^2 f y^2 + a c d x z - a c e x z + a c f x z + a b d y z + a b e y z - a b f y z + 4 d z^2 - 2 b^2 d z^2 - a b c d z^2 - 2 c^2 d z^2 + b^2 c^2 d z^2 - 4 e z^2 + a^2 e z^2 - a b c e z^2 + 2 c^2 e z^2 - 4 f z^2 + a^2 f z^2 + 2 b^2 f z^2 - a b c f z^2 + 4 z^4 - a^2 z^4 - b^2 z^4 + a b c z^4 - c^2 z^4}
--------------------------------------------------------------

D. Additional results on the metric cone embedding
In this section, we will examine how the learning of an embedding in a metric cone proceeds. We use the GrQc dataset, which is the smallest dataset used in the paper. The experiments were conducted by learning the Euclidean embedding and the metric-cone embedding for 2-9 dimensions. The hyperparameters were set as in the paper:

Figure 5.
Changes in the distribution of the heights of data points: the smaller the value on the x-axis, the higher the point is embedded in the metric cone. For visualization, the display of the y-axis (frequency) is limited to between 0 and 300.

• Learning rate: 10.0
• Epochs: 2000
• Negative sampling rate: 50
• Batch size: 256

The ratio between the scaling of the original space $M$, which is a Euclidean space for now, and the value of the parameter $\beta$ affects the accuracy of the metric-cone learning significantly. Since the distance between two points in a metric cone is bounded by a constant times $\beta$, we set $\beta$ as the maximum norm of the embedding in the original Euclidean space. All initial values of the height are set to 1.

Model                     Metric   2      3      4      5      6      7      8      9
Euclidean                 MR       88.99  37.17  17.15  9.42   5.78   4.27   3.42   3.18
                          MAP      0.375  0.488  0.600  0.719  0.842  0.929  0.983  0.998
Our Model (Metric Cone)   MR       72.35  26.39  14.50  8.65   5.50   4.16   3.40   3.18
                          MAP      0.450  0.551  0.614  0.726  0.851  0.935  0.986  0.998

Table 4.
Results of GrQc embedding into low-dimensional space
Table 4 shows the experimental results. In the experiments using GrQc in the paper, learning in the original Euclidean embedding was close to overfitting, so there was almost no difference in accuracy between the Euclidean embedding and the metric-cone embedding. On the other hand, when the training accuracy of the Euclidean embedding does not hit the ceiling, the embedding in a metric cone can improve the test accuracy. In other words, the embedding into a metric cone can represent the hierarchical structure of the data more efficiently.
Figure 5 shows how the distribution of the heights of the data points in a metric cone changes as the learning progresses. It visualizes the embedding after 20 epochs (equal to the Euclidean embedding), 100 epochs, and 500 epochs of training.
Figure 6 is a visualization of the distribution of the heights of the data. The embedding vectors in Euclidean space were reduced to two dimensions by PCA. From the figure, it can be seen that as the learning progresses, the points closer to the center of the dense area gradually move toward the top of the cone.
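The $\beta$ heuristic described above (setting $\beta$ to the maximum embedding norm and initializing all heights to 1) can be sketched in NumPy. The embedding vectors below are random stand-ins for a learned Euclidean embedding, not the actual GrQc output:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 3))           # hypothetical embedding: 6 nodes in 3-d
beta = np.linalg.norm(emb, axis=1).max()  # beta = maximum norm of the embedding
heights = np.ones(len(emb))             # all initial heights set to 1

# Pairwise metric-cone distance matrix over the Euclidean base space.
base = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
theta = np.pi * np.minimum(base / beta, 1.0)
r, t = heights[:, None], heights[None, :]
D = beta * np.sqrt(np.maximum(r**2 + t**2 - 2 * r * t * np.cos(theta), 0.0))

# Cone distances are bounded by a constant times beta (here 2*beta, since the
# heights are at most 1), which is why the scaling of the base space matters.
print(D.max() <= 2 * beta, np.allclose(D, D.T), np.allclose(np.diag(D), 0))
```

This only illustrates the initialization; in the actual experiments the heights are then trained as the additional one-dimensional parameters.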
Figure 6. (a) The Euclidean embedding and edges. (b) The Euclidean embedding and the heights computed by the metric-cone embedding.