Link Analysis for Communities Detection on Facebook
Mohamed Adnane Mellah, Abdelmalek Amine, Reda Mohamed Hamou, A.V. Senthil Kumar
GeCoDe Laboratory, Department of Computer Science, Tahar Moulay University of Saida, Saida, Algeria
Director, Department of MCA, Hindustan College of Arts and Science, Bharathiar University, Coimbatore–28, Tamil Nadu, India
Email: *[email protected]; [email protected]; [email protected]; [email protected]

Abstract
Social networks have become part of the daily life of millions of users and offer a wide range of interests and practices. The main characteristic of a social network is its ability to gather different individuals around a common point of view or collective beliefs. Among current social networking sites, Facebook is the most popular and has the highest number of users. However, the existence of communities (groups) on Facebook is a critical question; thus, many researchers focus on potential communities by using techniques such as data mining and web mining. In this work, we present four approaches based on link analysis techniques to detect prospective groups and their members.
Keywords: Social Network Analysis, Link Analysis, Facebook, Data Mining, Groups.

INTRODUCTION
Social networks such as Facebook and Twitter connect social actors (individuals or organizations) through social interactions. A social network describes a dynamic social structure by a set of nodes and links. The analysis of social networks, based mainly on graph theory and sociological analysis, aims to study different aspects of these networks; the main aspects are community detection, the identification of influential actors, and the study and prediction of the evolution of networks. The discovery of communities [1, 2, 3, 4, 5, 6] is an important problem in social network analysis, where the goal is to identify the groups (communities), their members, and the members that belong to several communities. Researchers use different methods to detect communities in social networks; the majority of methods assume that communities are separated, so that each member constitutes a node carrying a single label. In the real world, however, a member can be interested in various topics; for instance, a student can belong to more than one community. Assigning multiple labels to the same node is therefore a better representation of the properties of a social network. The most common definition of a community is the following: 'A community is a part of a graph where the nodes are strongly related together compared to the other nodes of the same graph.' Numerous approaches to detect communities in a social network have been proposed in the past. In this work, we present four approaches for the detection of communities on Facebook, based on the link analysis algorithms PAGERANK, HITS, PHITS and SALSA. In practice, social communities are manually managed by their administrators, and any user can join a community by creating a link to its page. If the rank attributed by these algorithms to a node is high, it is probable that the node constitutes a page that determines a community.
In a second step, we verify that this node has no outgoing links; in that case, it constitutes a group page, and the incoming links towards this page come from the members of the community, who share common characteristics and interests. Note that Facebook does not permit the administrators of communities to send invitations and thus to create outgoing links from community pages; for this reason, we verify the absence of outgoing links from these nodes. The rest of this paper is organized as follows: Section 2 presents related works. Section 3 deals with the algorithms and theory of link analysis. Section 4 presents the research methodology. Section 5 presents an experimental analysis. Finally, Section 6 concludes this paper.

Related Works
In the domain of web community recognition, many studies are available. In Gibson et al. [7], the hyperlink is used as the basis for reasoning; the major contribution in this area is Kleinberg's HITS algorithm, which defines the notions of authorities and hubs, structuring a community around a given topic. Imafuji and Kitsuregawa [8] suggest that a page belongs to a community if it is referenced primarily from the interior of the community rather than from its exterior, and use a maximum flow algorithm to isolate the nodes belonging to the same community. Based on the algorithm proposed by Flake et al. [9], Dourisboure et al. [10] identified web communities as dense subgraphs containing a bipartite subgraph. The bipartite graph represents, on the one hand, the interests of the community (authorities, according to HITS) and, on the other hand, those who cite the community (hubs). This method highlights the potential sharing of similar interests by several communities of actors, and vice versa. These approaches provide an advanced analysis of the links between the different pages structuring a thematic community, but they do not bring users together by their interests or activities: sharing hyperlinks is no longer necessarily the basis of community activity in the collaborative Web of social exchanges (content evaluation by users, tagging, etc.).

Link Analysis Algorithms

From its origins in bibliometric analysis, the analysis of referral patterns (link analysis) has come to play an important role in modern information retrieval. Link analysis algorithms [11, 12, 13, 15, 16, 17] have been successfully applied to Web hyperlink data to identify sources of authoritative information, and to citation data to identify the most important items. Currently, alongside conventional classification techniques, link analysis underlies several search engines on the Internet.
An important feature of the World Wide Web is its dynamic nature: references can be modified so that a page becomes inaccessible or simply cannot be found by the search engine. If link analysis is to provide a notion of robustness in such a context, it is natural to ask whether robustness means being stable to perturbations of the link structure. Indeed, a completely unstable search engine that changes its results every day would cause a lot of confusion; it is for this reason that several algorithms and strategies have emerged.
Let us consider a set of Web documents referencing each other; this collection can be seen as a directed graph. Link analysis algorithms construct the adjacency matrix that reflects this graph, based on the citation model that is used. For example, a link from document i to document j is represented by the value 1 of the element W_ij. The most interesting pages can then be extracted by computing the eigenvectors of the system. Following Kleinberg's definitions, these pages can be divided into two categories:
Hubs: pages containing little relevant information, but many hyperlinks.
Authorities: pages with few links, but a lot of relevant information.
The main idea of the INDEGREE algorithm [19] is very simple: a page is considered popular if it has more incoming links than other pages. In a graph G, the authority of each node i is A_i = |B(i)|, where B(i) is the set of nodes pointing to i. Studies have shown, however, that the INDEGREE algorithm is not sophisticated enough to capture the authority of a node.
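As a minimal sketch (the edge-list representation below is our own illustration, not part of the original formulation), the INDEGREE score A_i = |B(i)| can be computed as follows:

```python
# INDEGREE baseline: the authority of node i is simply |B(i)|,
# the number of nodes that point to it.
def indegree_scores(edges, nodes):
    """edges: iterable of (i, j) pairs meaning i -> j."""
    scores = {n: 0 for n in nodes}
    for _, j in edges:
        scores[j] += 1
    return scores
```

For example, with edges a -> c, b -> c and c -> d, node c receives the highest score because two nodes point to it.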
The PageRank algorithm proposed by L. Page and S. Brin [13] assigns a reputation score to each page found on the Internet. The algorithm quantifies the reputation of a page by counting the number of hyperlinks that point to it: a page with many incoming links is considered very popular and therefore enjoys a high reputation. A hyperlink from page i to page j is considered as a vote of i for page j: for each Web page i referenced by Google, a local reputation score vector c_i is calculated, where c_{i,j} = 0 if there is no link from i to j and c_{i,j} = 1/L_i if at least one link exists (L_i is the number of links on page i). The PageRank R_i of a page i is the sum of the PageRank values, weighted by the inverse of the number of links, of the pages that point towards i:

R_i = Σ_{j ∈ B(i)} R_j / L_j    (1)

The formula for calculating the PageRank of a page is recursive; the PageRank is therefore approximated by an iterative method. We initialize the algorithm with a non-zero constant PageRank value for each page, and at each iteration we re-compute the PageRank of each page using the formula. Iterations are repeated until the PageRank values converge.
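The iterative scheme above can be sketched as a short power iteration. Note two assumptions of this sketch that are not stated in the text: the damping factor d comes from Page and Brin's report [13], and every node is assumed to have at least one outgoing link (otherwise some probability mass leaks).

```python
# Minimal PageRank power iteration over an explicit edge list (a sketch).
# d is the standard damping factor from Page and Brin's report; the text's
# formula (1) corresponds to d = 1.
def pagerank(edges, nodes, d=0.85, iters=50):
    nodes = list(nodes)
    out = {n: 0 for n in nodes}
    for i, _ in edges:
        out[i] += 1                      # L_i: number of outgoing links of i
    r = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for i, j in edges:
            nxt[j] += d * r[i] / out[i]  # i votes for j with weight R_i / L_i
        r = nxt
    return r
```

On the small cycle a -> c, b -> c, c -> a, node c ends up with the highest score because both a and b vote for it.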
The HITS algorithm proposed by Kleinberg [15] identifies the best hubs and authorities in a hyperlinked collection. This algorithm exploits the structure of the web graph: each document is seen as a node of a directed graph, and any link between two documents is interpreted as an edge between the two nodes. Given a specific query σ, the algorithm first creates a subgraph Sσ; it then calculates hub and authority weights for each node of Sσ. The principle used by the HITS algorithm is the following: a document has a high authority weight if it is pointed to by many documents with high hub weights, and a document has a high hub weight if it points to many documents with high authority weights. More specifically, starting from a hyperlinked set of documents, the HITS algorithm builds the directed graph associated with the collection. Ideally, the collection S must satisfy the following properties: (i) S is relatively small; (ii) S contains many relevant pages; (iii) S contains most of the best authorities. The graph is represented by an adjacency matrix W of size n × n, where n is the number of documents; the element W_ij takes the value 1 if there is an edge from node i to node j in the directed graph, and 0 otherwise. Generally, the third condition is not satisfied, and the collection S should be extended by exploring a number of links of the graph (Kleinberg, 1998). The algorithm then calculates the mutual reinforcement between hubs and authorities by iterating the following update rules:

a_j = Σ_{i → j} h_i,    h_i = Σ_{i → j} a_j

where 'i → j' means that document i points to document j.

'The Stochastic Approach for Link Structure Analysis' (SALSA) is an algorithm based on the theory of Markov chains, proposed by Lempel and Moran [16]. The algorithm uses the properties of a random walk performed on a collection of hyperlinked documents.
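The HITS update rules above can be sketched as a short mutual-reinforcement iteration (our own illustration over an explicit edge list, with the usual normalization after each step):

```python
# HITS: iterate a_j = sum of hub scores pointing to j,
#       h_i = sum of authority scores i points to, then normalize.
def hits(edges, nodes, iters=50):
    a = {n: 1.0 for n in nodes}
    h = {n: 1.0 for n in nodes}
    for _ in range(iters):
        a = {n: sum(h[i] for i, j in edges if j == n) for n in nodes}
        h = {n: sum(a[j] for i, j in edges if i == n) for n in nodes}
        na = sum(v * v for v in a.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in h.values()) ** 0.5 or 1.0
        a = {k: v / na for k, v in a.items()}
        h = {k: v / nh for k, v in h.items()}
    return a, h
```

With edges h1 -> p, h2 -> p and h1 -> q, page p becomes the strongest authority (two hubs point to it) and h1 the strongest hub (it points to both authorities).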
Similar to Kleinberg's algorithm, SALSA starts by building a basic collection ('base set') from the link graph. SALSA is based on the intuition that an 'authoritative' page must be visible from many pages of the data set; through a random walk on the graph, the algorithm isolates the authorities with relatively high probability. The theory of random walks is combined with the notion of hubs and authorities, which leads to the analysis of two different Markov chains: the chain of hubs visited and the chain of authorities visited. This gives, for each page, two distinct weights: that of the hubs and that of the authorities. To generate the state transitions of each of these chains, two edges of the graph must be traversed: the first forward (following an outgoing link) and the second backward (following an incoming link), or vice versa. Authority weights are defined as the stationary distribution of the chain that first explores a random backward link and then a forward link, while hub weights are defined as the stationary distribution of the chain that first explores a random forward link and then a backward link. More precisely, starting from a collection of hyperlinked documents, we build the directed graph G. Let Back(i) = {k : k → i} be the set of nodes that point to i, i.e. the nodes that can be reached from i by following a link backward, and let Forw(i) = {k : i → k} be the set of nodes that can be reached from i by following a link forward. |Back(i)| is the number of nodes that point to i, and |Forw(i)| is the number of nodes to which i points. We can now define two stochastic matrices, which contain the transition probabilities of the Markov chains for hubs and authorities, respectively. The matrix for the hubs, H:

h_{i,j} = Σ_{k ∈ Forw(i) ∩ Forw(j)} (1 / |Forw(i)|) · (1 / |Back(k)|)

The matrix for the authorities, A:

a_{i,j} = Σ_{k ∈ Back(i) ∩ Back(j)} (1 / |Back(i)|) · (1 / |Forw(k)|)

An element a_{i,j} > 0 means that at least one node k points to both nodes i and j; node j is then reachable from node i in two steps, the first going backward along the link k → i and the second following the link k → j.
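The authority chain above can be sketched directly as a random walk that steps one link backward and then one link forward; iterating it approximates the stationary distribution, i.e. the SALSA authority weights. This is our own illustration (only nodes with at least one incoming link participate in the authority chain):

```python
# SALSA authority chain: from node i, step backward to a random k in Back(i),
# then forward to a random j in Forw(k); iterate to the stationary distribution.
def salsa_authorities(edges, iters=200):
    back, forw = {}, {}
    for i, j in edges:
        forw.setdefault(i, []).append(j)
        back.setdefault(j, []).append(i)
    auth = {n: 1.0 / len(back) for n in back}   # nodes with incoming links
    for _ in range(iters):
        nxt = {n: 0.0 for n in auth}
        for i in auth:
            for k in back[i]:                   # backward step (k -> i)
                for j in forw[k]:               # forward step (k -> j)
                    nxt[j] += auth[i] / (len(back[i]) * len(forw[k]))
        auth = nxt
    return auth
```

Within a connected component, the stationary authority weight of a node is proportional to its in-degree, which is the known behaviour of SALSA.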
Other approaches for determining hubs and authorities have also been tested. Cohn and Chang proposed a statistical algorithm to determine these two categories [17]. The model the authors construct attempts to explain two types of variables, the documents d and their citations c, by a small number of common latent variables z, called aspects or factors. These common variables can be considered as subjects or community pages. The model is described statistically as follows: a document d ∈ D is generated with probability P(d); a factor (subject) z ∈ Z corresponding to d is selected with probability P(z | d); and, given this factor, citations c ∈ C are generated with probability P(c | z). The probability of each (document, citation) pair (d, c) is then:

P(d, c) = P(d) Σ_{z ∈ Z} P(z | d) P(c | z)

Considering the matrix A of (document, citation) pairs, where the entry A[i, j] is non-zero if document i has a link to document j, the problem is to find the values of P(d), P(z | d) and P(c | z) that maximize the likelihood L(A) of the observed data. To solve this problem, the authors propose to use the EM algorithm of Dempster [18]. This fully probabilistic model has the advantage of providing more information than the model used by the HITS algorithm. An analogy can nevertheless be made by considering the authorities on a given subject via the conditional probability P(c | z), which indicates how strongly a document c is cited from a community z. Other information can also be extracted from the model, such as the probability P(z | c), which tells us the community to which a given document c belongs, or the characteristic documents of a community, obtained from the product P(z | c) · P(c | z). Nevertheless, this algorithm requires knowing in advance the number of factors z to take into account. In addition, the EM algorithm can get stuck in a local maximum, compromising convergence to the global maximum corresponding to the solution of the problem.
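A toy version of this EM fit can be sketched as follows. This is our own simplified illustration of the aspect model (function name, data layout and the smoothing constant are ours, not Cohn and Chang's): the E-step computes the responsibility of each factor z for each observed pair (d, c), and the M-step re-estimates P(z | d) and P(c | z) from those responsibilities.

```python
import random

# Toy EM for the citation model P(d, c) = P(d) * sum_z P(z|d) P(c|z).
# A: dict mapping (document, citation) pairs to counts; n_z: number of
# factors, which must be fixed in advance (a limitation noted in the text).
def phits_em(A, n_z, iters=50, seed=0):
    rng = random.Random(seed)
    docs = sorted({d for d, _ in A})
    cits = sorted({c for _, c in A})
    p_z_d = {d: [rng.random() for _ in range(n_z)] for d in docs}
    p_c_z = {z: {c: rng.random() for c in cits} for z in range(n_z)}
    for d in docs:                                   # normalize the random init
        s = sum(p_z_d[d]); p_z_d[d] = [v / s for v in p_z_d[d]]
    for z in range(n_z):
        s = sum(p_c_z[z].values())
        p_c_z[z] = {c: v / s for c, v in p_c_z[z].items()}
    for _ in range(iters):
        acc_zd = {d: [1e-12] * n_z for d in docs}    # smoothed accumulators
        acc_cz = {z: {c: 1e-12 for c in cits} for z in range(n_z)}
        for (d, c), w in A.items():
            post = [p_z_d[d][z] * p_c_z[z][c] for z in range(n_z)]
            s = sum(post)
            for z in range(n_z):
                r = w * post[z] / s                  # E-step: P(z | d, c)
                acc_zd[d][z] += r                    # M-step accumulation
                acc_cz[z][c] += r
        for d in docs:
            s = sum(acc_zd[d]); p_z_d[d] = [v / s for v in acc_zd[d]]
        for z in range(n_z):
            s = sum(acc_cz[z].values())
            p_c_z[z] = {c: v / s for c, v in acc_cz[z].items()}
    return p_z_d, p_c_z
```

On block-structured data (two groups of documents citing two disjoint targets), the fitted conditional P(c | d) = Σ_z P(z | d) P(c | z) concentrates on each document's own citations.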
Another set of algorithms that try to eliminate some of the drawbacks of HITS is proposed in [11]. The 'Hub-Averaging-Kleinberg' algorithm is a combination of HITS and SALSA that tries to reduce the TKC (Tightly-Knit Community) effect. The calculation of authority scores is the same as in HITS, but the hub score of a node is the average of the authority scores of the nodes it points to. The principle of the algorithm is that a page is a good hub (authority) if it links to (is referenced by) good authorities (hubs), and hub (authority) scores are calculated by considering only the authority (hub) scores that are greater than or equal to the average score.
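The hub-averaging variant can be sketched by changing one line of the HITS iteration (our own illustration; normalization here is by the sum, which does not affect the ranking):

```python
# Hub-Averaging variant: authorities as in HITS, but a node's hub score is
# the *average* (not the sum) of the authority scores it points to, so a hub
# is penalized for also linking to weak authorities.
def hub_averaging(edges, nodes, iters=50):
    a = {n: 1.0 for n in nodes}
    h = {n: 1.0 for n in nodes}
    for _ in range(iters):
        a = {n: sum(h[i] for i, j in edges if j == n) for n in nodes}
        new_h = {}
        for n in nodes:
            targets = [a[j] for i, j in edges if i == n]
            new_h[n] = sum(targets) / len(targets) if targets else 0.0
        h = new_h
        sa = sum(a.values()) or 1.0
        a = {k: v / sa for k, v in a.items()}
        sh = sum(h.values()) or 1.0
        h = {k: v / sh for k, v in h.items()}
    return a, h
```

With edges h1 -> p, h2 -> p and h1 -> q, plain HITS ranks h1 as the better hub, but hub-averaging ranks h2 higher, since h1's extra link to the weaker authority q drags its average down.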
Research Methodology

In this section, we present the process of data collection and the ranking of the nodes using link analysis algorithms to detect communities and their members. Figure 1 represents the architecture of our work. It consists of four components: (i) the extraction of profiles with their links; (ii) the ranking of profiles according to their importance; (iii) the verification that the profiles with a higher rank have no outgoing links (a community page must not have outgoing links towards other nodes, because the administrator cannot send friendship requests from the community page); and (iv) the detection of communities and their members.
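Components (iii) and (iv) can be sketched as a simple filter (our own illustration; the function name, score threshold and node labels are hypothetical): among the highly ranked nodes, those with no outgoing links are taken as candidate community pages, and the nodes pointing to them as the community members.

```python
# Candidate group pages: highly ranked nodes with zero outgoing links;
# their members are the nodes whose links point to them.
def detect_groups(edges, scores, threshold):
    out_deg = {}
    for i, _ in edges:
        out_deg[i] = out_deg.get(i, 0) + 1
    groups = {}
    for node, score in scores.items():
        if score >= threshold and out_deg.get(node, 0) == 0:
            groups[node] = [i for i, j in edges if j == node]  # members
    return groups
```

A highly ranked node that does send links (e.g. an influential ordinary profile) is correctly rejected, since a community page cannot create outgoing friendship links.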
Fig. 1 Approaches for detecting groups
The data set used in this work was obtained from Facebook. Facebook launched its API in May 2007 to attract those interested in the development of web applications. This API is available in numerous programming languages, and it gives developers access to a vast quantity of information on user profiles. During this work, we collected 1200 Facebook profiles at random, together with their links and friends, in order to define the links. All data are presented in the form of objects and connections:

Objects: persons, events, pictures, pages, groups, messages, etc.
Connections: friendship, shared content, likes, etc.

Facebook allows us to access these objects and then to follow the connections to reach other objects. A query is constructed as a URL, and answers are returned in XML format. In simple terms, Facebook has only one entry point:

http://graph.facebook.com

Traversing the social graph is then just as simple; it is done with an object identifier and a connection definition:

http://graph.facebook.com/identifiant/connexion

The identifier is used to uniquely define an object in Facebook. Facebook members can choose a string to create an alias identifiable by humans; for example, a user simply chooses his name, "jean-M.cornier":

http://graph.facebook.com/jean-M.cornier

Connections: we can then start from this identifier and the object it refers to, and access other objects through connections. Here are some examples. To access the list of videos posted by a person:

http://graph.facebook.com/jean-M.cornier/videos

To access the list of photos posted by a person:

http://graph.facebook.com/jean-M.cornier/pictures
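The identifier/connection pattern above can be captured by a small helper (our own sketch; the function name and constant are illustrative, and the "jean-M.cornier" alias is the text's own example):

```python
from urllib.parse import quote

GRAPH = "http://graph.facebook.com"

# Build a Graph API request URL from an object identifier and an
# optional connection name, following the pattern /identifier/connection.
def graph_url(identifier, connection=None):
    url = f"{GRAPH}/{quote(identifier)}"
    return f"{url}/{connection}" if connection else url
```

For example, graph_url("jean-M.cornier", "videos") reproduces the videos query shown above.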
The links of each node are extracted through the Facebook API. Following the characteristics of social network structures, we organize the content as a matrix of links. This matrix is then used for ranking: the links serve as the input of the link analysis algorithms PAGERANK, HITS, PHITS and SALSA. The experimental performance of each approach is discussed in the next section.
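Building the link matrix from the extracted profiles can be sketched as follows (our own illustration; profile names are hypothetical):

```python
# Turn the extracted friendship links into the adjacency (link) matrix W,
# where W[i][j] = 1 if profile i links to profile j.
def link_matrix(profiles, links):
    index = {p: k for k, p in enumerate(profiles)}
    n = len(profiles)
    W = [[0] * n for _ in range(n)]
    for i, j in links:
        W[index[i]][index[j]] = 1
    return W
```

The resulting matrix is exactly the W used by the link analysis algorithms of Section 3.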
In order to evaluate the prediction performance, we designed a series of experiments with the four algorithms. Figure 2 shows the experimental design of this work. The results of this evaluation are shown in Table 1.
Fig. 2 Groups detection
A social network is seen as a dynamic structure represented by nodes and links. The nodes generally designate individuals or organizations, and they are connected by social interactions. The visualization method we used is based on Java 3D, which offers a general view with the possibility of expanding a node to reach the details of its profile. Figure 3 shows the data visualization.
Fig. 3 Data visualization
Concerning the algorithms HITS, PHITS and SALSA, we take the authority values as the values of the important nodes, and we consider the role of the hubs to be that of indicating the authorities. The results in Table 1 are for the nodes with the highest ranks.
Table 1: Performance evaluation summary of groups detection

PageRank: RI | Groups and their members
HITS: Hubs | Authorities | Groups and their members
SALSA: Hubs | Authorities | Groups and their members
PHITS: Hubs | Authorities | Groups and their members
Fig. 4 Groups and their members detected by PageRank
Fig. 5 Groups and their members detected by HITS
Fig. 6 Groups and their members detected by SALSA
Fig. 7 Groups and their members detected by PHITS

Concerning the PAGERANK algorithm, the existence of communities that are dense in hubs and authorities is meaningless for the algorithm, because it is not based on mutual reinforcement to calculate authority weights. We remark that PAGERANK attributes an important weight to the isolated node with the maximum degree. Generally, we observe that PAGERANK favours isolated nodes with a high degree: the hubs that point towards an isolated node transfer all their weight directly to this node, augmenting its weight. PAGERANK detects the overlapping between communities; the ability of the algorithm to detect overlapping communities is due to the recursive leaps that PAGERANK performs. These recursive leaps are probably what enables the algorithm to be the best among the algorithms used in this work. We also observe that the results of PHITS are low, which shows the weakness of this algorithm. Finally, the results show that HITS and SALSA have nearly the same performance, with a slight advantage for HITS.
We conducted an experimental analysis of link analysis ranking by evaluating the ability of each algorithm to classify user profiles on Facebook and to detect the existence of communities in a graph. We observed that PAGERANK is the most efficient. We plan to apply these algorithms to detect communities in other social networks such as Twitter, Plurk and Blogger, and subsequently to develop approaches that detect the behaviour of members of different groups.

REFERENCES

[1] Tang L., and Liu H. 2010. Community Detection and Mining in Social Media. Morgan & Claypool Publishers.
[2] Newman M. 2006. Modularity and community structure in networks. PNAS.
[3] The Proceedings of CIKM 2010.
[4] Craswell N., and Szummer M. 2007. Random walks on the click graph. Proceedings of the 30th Annual International ACM SIGIR Conference, pp. 239–246.
[5] Palla G., Derényi I., Farkas I., and Vicsek T. 2005. Uncovering the overlapping community structure of complex networks in nature and society. Nature.
[6] Physica A: Statistical Mechanics and Its Applications.
[7] HYPERTEXT'98: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space—Structure in Hypermedia Systems, ACM, New York, pp. 225–234.
[8] Imafuji N., and Kitsuregawa M. 2002. Effects of maximum flow algorithm on identifying web community. WIDM'02: Proceedings of the 4th International Workshop on Web Information and Data Management, ACM, New York, pp. 43–48.
[9] Flake G.W., Lawrence S., and Giles C.L. 2000. Efficient identification of web communities. KDD'00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, pp. 150–160.
[10] Dourisboure Y., Geraci F., and Pellegrini M. 2007. Extraction and classification of dense communities in the web. WWW'07: Proceedings of the 16th International Conference on World Wide Web, ACM, New York, pp. 461–470.
[11] Borodin A., Roberts G.O., Rosenthal J.S., and Tsaparas P. 2001. Finding authorities and hubs from link structures on the world wide web. Proceedings of the International World Wide Web Conference (WWW), pp. 415–429.
[12] Asano Y., Tezuka Y., and Nishizeki T. 2008. Improvements of HITS algorithms for spam links. IEICE Transactions on Information and Systems, E91-D(2), pp. 200–208.
[13] Page L., Brin S., Motwani R., and Winograd T. 1998. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project.
[14] Chikhi N.F., Rothenburger B., and Aussenac-Gilles N. 2008. A new algorithm for community identification in linked data. Knowledge-Based Intelligent Information and Engineering Systems, LNCS 5177, Springer, pp. 641–649.
[15] Kleinberg J.M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604–632.
[16] Lempel R., and Moran S. 2000. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33: 387–401.
[17] Cohn D., and Chang H. 2000. Learning to probabilistically identify authoritative documents. Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, pp. 167–174.
[18] Dempster A.P., Laird N.M., and Rubin D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39: 1–38.
[19] Upstill T., Craswell N., and Hawking D. 2003. Predicting fame and fortune: PageRank or indegree?