PageRank algorithm for Directed Hypergraph
Loc Tran, Tho Quan, and An Mai
John von Neumann Institute, VNU-HCM, Ho Chi Minh City, Vietnam (E-mail: [email protected])
Ho Chi Minh City University of Technology, VNU-HCM, Ho Chi Minh City, Vietnam (E-mail: [email protected])
John von Neumann Institute, VNU-HCM, Ho Chi Minh City, Vietnam (E-mail: [email protected])
Preprint submitted to RGN Publications on
Over the last two decades, the link structure of the World Wide Web has been modeled as a directed graph. In this paper, we model the World Wide Web's link structure as a directed hypergraph and develop the PageRank algorithm for this directed hypergraph. Because no World Wide Web directed hypergraph datasets are available, we apply the PageRank algorithm to a metabolic network, which is itself a directed hypergraph. The experiments show that our novel PageRank algorithm can be applied successfully to this metabolic network.
Research article
1 Introduction
With the rapid growth of the World Wide Web during the last two decades, information retrieval poses growing theoretical and practical challenges. With a huge amount of information flowing into the World Wide Web every second, it becomes increasingly difficult to retrieve information from the web. This explains why the existence of a search engine is as important as the existence of the web itself. Since the appearance of the web, there has been a fundamental discussion in the web research community about how to develop fast, effective, and precise search engines.
This paper chiefly discusses the most common search engine nowadays, which is Google. The mathematical theory behind the Google search engine is the PageRank algorithm, which was presented by Sergey Brin and Lawrence Page [1]. In 1998, Brin and Page were Ph.D. students at Stanford University, USA. They then took a leave of absence from their Ph.D. to concentrate on developing their Google model. The PageRank algorithm described in their original paper is used nowadays by Google to create the rankings of the web pages of the World Wide Web.
A search engine contains three important modules: a crawler, an indexer, and a query engine [2]. The crawler gathers and stores data from the web. The data is stored in the indexer, which mines information from the data gathered by the crawler. The query engine responds to the queries from users. The PageRank algorithm (i.e. one of the ranking algorithms), part of the query engine, ranks the web pages in the order of their "importance" to the query. The ranking is obtained by assigning a score to each web page of the World Wide Web. The PageRank algorithm exploits the link structure of the web. The World Wide Web's link structure forms a directed graph where the web pages are the nodes and the links are the directed edges.
A web page (i.e. a node of the directed graph) is considered "important" if it is pointed to by other "important" web pages.
During the last two decades, the World Wide Web's link structure has been modeled as a directed graph, and the PageRank algorithm has been developed for this directed graph only. However, the directed graph is neither the best nor the most general model of the World Wide Web's link structure. In this paper, we model the World Wide Web's link structure as a directed hypergraph [3, 4]. To the best of our knowledge, this has not been investigated before. However, due to the lack of World Wide Web directed hypergraph datasets, we use the metabolic network dataset that is available from [5]. This metabolic network can be represented as either a directed graph or a directed hypergraph. We will show clearly how to construct the directed graph and the directed hypergraph from this metabolic network. Our next task is then to develop the PageRank algorithm for the directed hypergraph, which is the novel contribution of this work. Moreover, we define the un-normalized and the symmetric normalized directed hypergraph Laplacian in this paper. In the future, if World Wide Web directed hypergraph datasets become available, directed hypergraph Laplacian based semi-supervised learning will be developed in order to solve the spam detection problem. The directed hypergraph thus has a wide range of potential applications.
The paper is organized as follows: Section 2 introduces the preliminary notations and definitions used in this paper. Section 3 introduces the PageRank algorithm for the directed hypergraph. Section 4 introduces the definitions of the un-normalized and symmetric normalized directed hypergraph Laplacian and their applications. Section 5 shows the experimental results. Section 6 concludes this paper and discusses future research directions.
2 Preliminary notations and definitions
Given the directed hypergraph $H = (V, E)$, where $V$ is the set of vertices and $E$ is the set of hyper-arcs, each hyper-arc $e \in E$ is written as $e = (e_{Tail}, e_{Head})$. The vertices of $e$ are denoted by $\boldsymbol{e} = e_{Tail} \cup e_{Head}$. $e_{Tail}$ is called the tail of the hyper-arc $e$ and $e_{Head}$ is called the head of the hyper-arc $e$. Please note that $e_{Tail} \neq \emptyset$, $e_{Head} \neq \emptyset$, and $e_{Tail} \cap e_{Head} = \emptyset$.
The directed hypergraph $H = (V, E)$ can be represented by two incidence matrices $H_{Tail}$ and $H_{Head}$. These two incidence matrices can be defined as follows
$$h_{Tail}(v, e_{Tail}) = \begin{cases} 1 & \text{if } v \in e_{Tail} \\ 0 & \text{otherwise} \end{cases}$$
$$h_{Head}(v, e_{Head}) = \begin{cases} 1 & \text{if } v \in e_{Head} \\ 0 & \text{otherwise} \end{cases}$$
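As a minimal sketch of the definition above, the two incidence matrices can be built from a list of (tail, head) pairs. The toy hypergraph, the function name `incidence_matrices`, and the use of plain Python lists are illustrative assumptions and are not taken from the dataset used later in the paper.

```python
# Hypothetical toy directed hypergraph: 3 vertices and 3 hyper-arcs,
# each hyper-arc given as a (tail set, head set) pair.
vertices = [0, 1, 2]
hyperarcs = [
    ({0}, {1, 2}),  # e0: tail {0}, head {1, 2}
    ({0, 1}, {2}),  # e1: tail {0, 1}, head {2}
    ({2}, {0}),     # e2: tail {2}, head {0}
]


def incidence_matrices(vertices, hyperarcs):
    """Return (H_tail, H_head) as |V| x |E| nested lists of 0/1 entries."""
    for tail, head in hyperarcs:
        # Enforce the constraints of the definition: both parts are
        # non-empty and they do not share a vertex.
        if not tail or not head or tail & head:
            raise ValueError("tail and head must be non-empty and disjoint")
    h_tail = [[1 if v in tail else 0 for tail, _ in hyperarcs] for v in vertices]
    h_head = [[1 if v in head else 0 for _, head in hyperarcs] for v in vertices]
    return h_tail, h_head


H_tail, H_head = incidence_matrices(vertices, hyperarcs)
```

Row v and column e of `H_tail` hold the tail incidence of vertex v in hyper-arc e, and likewise for `H_head`.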
Let $w(e)$ be the weight of the hyper-arc $e$. Let $W$ be the diagonal matrix containing the weights of hyper-arcs in its diagonal entries. From the above definitions, we can define the tail and head degrees of the vertex $v$ and the tail and head degrees of the hyper-arc $e$ as follows
$$d_{Tail}(v) = \sum_{e \in E} w(e) h_{Tail}(v, e_{Tail})$$
$$d_{Head}(v) = \sum_{e \in E} w(e) h_{Head}(v, e_{Head})$$
$$d_{Tail}(e) = \sum_{v \in V} h_{Tail}(v, e_{Tail})$$
$$d_{Head}(e) = \sum_{v \in V} h_{Head}(v, e_{Head})$$
Let $D_{vTail}$, $D_{vHead}$, $D_{eTail}$, and $D_{eHead}$ be four diagonal matrices containing the tail and head degrees of vertices and the tail and head degrees of hyper-arcs in their diagonal entries respectively. Please note that $D_{vHead}$ and $D_{vTail}$ are $\mathbb{R}^{|V| \times |V|}$ matrices and $D_{eHead}$ and $D_{eTail}$ are $\mathbb{R}^{|E| \times |E|}$ matrices.
3 PageRank algorithm for the directed hypergraph
From [4], we know that the transition probability of the random walk on the directed hypergraph can be defined as follows
$$p(u, v) = \sum_{e \in E} w(e) \frac{h_{Tail}(u, e_{Tail})}{d_{Tail}(u)} \frac{h_{Head}(v, e_{Head})}{d_{Head}(e)}$$
From the above definition, the transition probability matrix $P$ of the random walk on the directed hypergraph can be defined as follows
$$P = D_{vTail}^{-1} H_{Tail} W D_{eHead}^{-1} H_{Head}^{T}$$
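The matrix product above can be sketched entry by entry as follows. This is only an illustration: the function name `transition_matrix`, the toy incidence matrices, and the unit weights are assumptions, and plain Python lists are used instead of a matrix library to keep the sketch dependency-free.

```python
def transition_matrix(h_tail, h_head, weights):
    """Compute P = D_vTail^{-1} H_Tail W D_eHead^{-1} H_Head^T entry by entry."""
    n_v, n_e = len(h_tail), len(weights)
    # Tail degree of each vertex: sum over e of w(e) * h_Tail(v, e).
    d_tail_v = [sum(weights[e] * h_tail[v][e] for e in range(n_e))
                for v in range(n_v)]
    # Head degree of each hyper-arc: number of vertices in its head.
    d_head_e = [sum(h_head[v][e] for v in range(n_v)) for e in range(n_e)]
    if min(d_tail_v) == 0 or min(d_head_e) == 0:
        raise ValueError("a vertex has zero tail degree or a hyper-arc has an empty head")
    return [
        [
            sum(weights[e] * h_tail[u][e] / d_tail_v[u] * h_head[v][e] / d_head_e[e]
                for e in range(n_e))
            for v in range(n_v)
        ]
        for u in range(n_v)
    ]


# Hypothetical toy hypergraph with unit weights:
# e0 = ({0}, {1, 2}), e1 = ({0, 1}, {2}), e2 = ({2}, {0}).
H_tail = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
H_head = [[0, 0, 1], [1, 0, 0], [1, 1, 0]]
w = [1.0, 1.0, 1.0]
P = transition_matrix(H_tail, H_head, w)
# P is [[0, 0.25, 0.75], [0, 0, 1], [1, 0, 0]]
```

The explicit check for zero degrees reflects that the formula divides by the tail degree of each vertex and by the head degree of each hyper-arc.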
Next, we need to prove that every row sum of $P$ is 1. We have that
$$\begin{aligned}
\sum_{v \in V} p(u, v) &= \sum_{v \in V} \sum_{e \in E} w(e) \frac{h_{Tail}(u, e_{Tail})}{d_{Tail}(u)} \frac{h_{Head}(v, e_{Head})}{d_{Head}(e)} \\
&= \sum_{e \in E} w(e) \frac{h_{Tail}(u, e_{Tail})}{d_{Tail}(u)} \frac{1}{d_{Head}(e)} \sum_{v \in V} h_{Head}(v, e_{Head}) \\
&= \sum_{e \in E} w(e) \frac{h_{Tail}(u, e_{Tail})}{d_{Tail}(u)} \\
&= \frac{1}{d_{Tail}(u)} \sum_{e \in E} w(e) h_{Tail}(u, e_{Tail}) \\
&= 1
\end{aligned}$$
The third line follows from the definition of $d_{Head}(e)$ and the last line follows from the definition of $d_{Tail}(u)$.
So, we see that $p(u, v) \geq 0, \forall u, v$ and $\sum_{v \in V} p(u, v) = 1, \forall u$. Then we can conclude that $P$ is a stochastic matrix. Following the work from [1], the PageRank vector $\pi$ of the directed hypergraph is the left dominant eigenvector of the transition probability matrix $P$ of the random walk on the directed hypergraph. In other words, the PageRank vector $\pi$ of the directed hypergraph is the solution of the following equation
$$\pi^{T} = \pi^{T} P$$
Moreover, the above equation can be solved by the power method.
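A minimal sketch of the power method for this equation is given below. It assumes that the iteration starts from the uniform vector, that the PageRank vector is scaled so that its entries sum to 1, and that the random walk is irreducible and aperiodic so that the iteration converges. The function name `pagerank_power_method` and the small matrix are hypothetical.

```python
def pagerank_power_method(P, tol=1e-12, max_iter=10_000):
    """Iterate pi^T <- pi^T P until the L1 change between iterates is below tol."""
    n = len(P)
    pi = [1.0 / n] * n  # assumed starting point: the uniform vector
    for _ in range(max_iter):
        # Left multiplication: new_pi[v] = sum over u of pi[u] * P[u][v].
        new_pi = [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]
        total = sum(new_pi)
        new_pi = [x / total for x in new_pi]  # keep the entries summing to 1
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    raise RuntimeError("no convergence; the chain may be periodic or reducible")


# Hypothetical 3-vertex stochastic matrix (every row sums to 1).
P = [[0.0, 0.25, 0.75],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
pi = pagerank_power_method(P)
# pi approaches (4/9, 1/9, 4/9), the exact solution of pi^T = pi^T P for this P.
```

A different scaling of the eigenvector, for example unit Euclidean norm, changes the rank values but not the rank order.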
4 Directed hypergraph Laplacian
In this section, we give two novel definitions of the directed hypergraph Laplacian: the un-normalized directed hypergraph Laplacian and the symmetric normalized directed hypergraph Laplacian. Let $S$ be the diagonal matrix containing all elements of the PageRank vector $\pi$ of the directed hypergraph in its diagonal entries. The un-normalized directed hypergraph Laplacian can be defined as follows
$$L = S - \frac{S D_{vTail}^{-1} H_{Tail} W D_{eHead}^{-1} H_{Head}^{T} + \left( D_{vTail}^{-1} H_{Tail} W D_{eHead}^{-1} H_{Head}^{T} \right)^{T} S}{2}$$
The symmetric normalized directed hypergraph Laplacian can be defined as follows
$$L_{sym} = I - \frac{S^{1/2} D_{vTail}^{-1} H_{Tail} W D_{eHead}^{-1} H_{Head}^{T} S^{-1/2} + S^{-1/2} \left( D_{vTail}^{-1} H_{Tail} W D_{eHead}^{-1} H_{Head}^{T} \right)^{T} S^{1/2}}{2}$$
From these two definitions, we can develop the directed hypergraph Laplacian Eigenmaps algorithms, the spectral directed hypergraph clustering algorithms, and the directed hypergraph Laplacian based semi-supervised learning algorithms. These are our future works.
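The two Laplacians can be sketched as follows. The sketch assumes that, with $P$ the transition probability matrix, the definitions read $L = S - (SP + P^{T}S)/2$ and $L_{sym} = I - (S^{1/2} P S^{-1/2} + S^{-1/2} P^{T} S^{1/2})/2$, and that every entry of $\pi$ is positive. The function name and the small example are hypothetical.

```python
import math


def directed_hypergraph_laplacians(P, pi):
    """Return (L, L_sym) from a transition matrix P and a PageRank vector pi."""
    n = len(P)
    s = [math.sqrt(x) for x in pi]  # diagonal of S^{1/2}; needs pi > 0
    # (S P)[i][j] = pi[i] * P[i][j] and (P^T S)[i][j] = P[j][i] * pi[j].
    L = [[(pi[i] if i == j else 0.0)
          - (pi[i] * P[i][j] + P[j][i] * pi[j]) / 2
          for j in range(n)] for i in range(n)]
    # (S^{1/2} P S^{-1/2})[i][j] = s[i] * P[i][j] / s[j], and its transpose
    # gives (S^{-1/2} P^T S^{1/2})[i][j] = P[j][i] * s[j] / s[i].
    L_sym = [[(1.0 if i == j else 0.0)
              - (s[i] * P[i][j] / s[j] + P[j][i] * s[j] / s[i]) / 2
              for j in range(n)] for i in range(n)]
    return L, L_sym


# Hypothetical 3-vertex example; pi solves pi^T = pi^T P for this P.
P = [[0.0, 0.25, 0.75],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
pi = [4 / 9, 1 / 9, 4 / 9]
L, L_sym = directed_hypergraph_laplacians(P, pi)
```

Under these assumptions both matrices are symmetric, and each row of `L` sums to zero whenever `pi` satisfies the PageRank equation, which gives a quick sanity check.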
5 Experimental results
Datasets
Due to the lack of World Wide Web directed hypergraph datasets, we use the metabolic network dataset that is available from [5]. This metabolic network is itself a directed hypergraph, so we do not need to transform it into one. The metabolic network contains 72 metabolites (i.e. nodes of the directed hypergraph) and 95 reactions (i.e. hyper-arcs of the directed hypergraph). However, in the experiment, we use only the 50 metabolites that are in the head of some hyper-arc AND in the tail of some hyper-arc. In other words, we avoid the case where the tail degree OR the head degree of a metabolite is zero. Moreover, we use only the 75 reactions whose tail degree AND head degree are both non-zero. In other words, we require that $e_{Tail} \neq \emptyset$ AND $e_{Head} \neq \emptyset$. Finally, we have the metabolic network (i.e. the directed hypergraph dataset) that has 50 metabolites (i.e. nodes of the dataset) and 75 reactions (i.e. hyper-arcs of the dataset).
Experiments
In the experiment, we first construct the $H_{Tail}$ and $H_{Head}$ matrices. Then we construct the $D_{vTail}$, $D_{vHead}$, $D_{eTail}$, and $D_{eHead}$ matrices. Finally, we compute the transition probability matrix $P$ of the random walk on the directed hypergraph. In order to obtain the PageRank vector of the directed hypergraph, we compute the left dominant eigenvector of the matrix $P$. Table 1 and Table 2 show the 10 highest rank values in the directed hypergraph and the names of the 10 metabolites that have these highest ranks.

Rank order    Rank value
1             0.6366
2             0.2640
3             0.2321
4             0.2180
5             0.2087
6             0.2039
7             0.2006
8             0.1941
9             0.1798
10            0.1701
Table 1: The rank values of 10 highest ranks

Rank order    Name of metabolite
1             H
2             Nicotinamide-adenine-dinucleotide-reduced
3             ADP
4             Phosphate
5             ATP
6             Nicotinamide-adenine-dinucleotide-phosphate
7             H
8             Pyruvate
9             Nicotinamide-adenine-dinucleotide
10            Coenzyme-A
Table 2: The names of 10 metabolites that have highest ranks

6 Conclusion
In this paper, we develop the PageRank algorithm for the directed hypergraph. We successfully apply it to the metabolic network, which is itself a directed hypergraph. This work is novel not only in the web mining field but also in the bioinformatics field. In the future, we will develop directed hypergraph Laplacian based semi-supervised learning in order to solve the spam detection problem, since two directed hypergraph Laplacians were defined in this paper. Moreover, we will also try to develop the directed hypergraph p-Laplacian semi-supervised learning method. This method is worth investigating because of its difficulty and its close connection to the field of PDEs on directed hypergraphs.
Acknowledgement
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2018-20-07.
References
[1] Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual web search engine." Computer networks and ISDN systems
[2] A course on the web graph. Vol. 89. American Mathematical Soc., 2008.
[3] Ausiello, Giorgio, Paolo G. Franciosa, and Daniele Frigioni. "Directed hypergraphs: Problems, algorithmic results, and a novel decremental approach." Italian conference on theoretical computer science. Springer, Berlin, Heidelberg, 2001.
[4] Ducournau, Aurélien, and Alain Bretto. "Random walks in directed hypergraphs and application to semi-supervised image segmentation."