Parallelization in Extracting Fresh Information from Online Social Network
Rui Guo, Hongzhi Wang, Mengwen Chen, Jianzhong Li, Hong Gao
Harbin Institute of Technology
Abstract
Online Social Network (OSN) is one of the hottest services of the past years. It records the lives of its users and provides great potential for journalists, sociologists and business analysts. Crawling data from social networks is a basic step for social network information analysis and processing. As the network becomes huge and information on the network updates faster than web pages, crawling is more difficult because of the limitations of bandwidth, politeness etiquette and computation power. To extract fresh information from social networks efficiently and effectively, this paper presents a novel crawling method and discusses a parallelization architecture for social network crawling. To discover the features of social networks, we gather data from a real social network, analyze it and build a model that describes the discipline of users' behavior. With the modeled behavior, we propose methods to predict users' behavior. According to the prediction, we schedule our crawler more reasonably and extract more fresh information with parallelization technologies. Experimental results demonstrate that our strategies obtain information from OSN efficiently and effectively.
Keywords:
Online Social Network, Crawler, Freshness
1. Introduction
Social Network Service is one of the hottest services of the last few years. It has a tremendous group of users. For instance, Facebook has 874 million active users [1] and Twitter reaches 500 million users. It is estimated that at least 2.3 billion tweets were published on Twitter during a 7-month
period, more than 300 million tweets per month [2]. The Yahoo! Firehose has reached 750K ratings and 150K comments per day [3]. There are a few social media datasets such as Spinn3r [4]. About 30 million articles (50 GB of data), including 20,000 news sources and millions of blogs, were added to Spinn3r per day [2]. As people access OSN frequently, advertisements can be broadcast according to users' behavior. [5] studies various Super Bowl ads by applying data mining techniques to Twitter messages. Using Twitter, [6] detects earthquakes and [7] studies influenza spread.

For crawling, one of the most important factors is freshness. Denev, Mazeika, Spaniol and Weikum [8] designed a web crawling framework named SHARC. It pays more attention to the relationship between freshness and the time period. Moreover, Olston and Pandey proposed a crawling strategy optimized for freshness [9]; it is concerned more with the current time point. Even though OSN crawling is related to web crawling, crawling OSN for fresh information differs from web crawling in the following points and brings new technical challenges.

1. New messages are published more frequently. Everyone can conveniently register, post, comment and repost messages at twitter.com, while a server and special skills are required to maintain a website. As a result, the hottest topic may change within a few hours on the OSN. With this feature, freshness metrics for web crawlers cannot be applied to OSN crawlers perfectly.

2. The messages on OSN are shorter than web pages. The former are often made up of a few sentences (e.g. Twitter limits its messages to 140 characters) while the latter often contain titles and long contents.

3. The OSN is closely related to users' daily life. Ordinary people post messages in the day and not at night. Hence we must consider people's work and rest time when crawling OSN messages. As a comparison, this feature is not considered in web crawling.

4. The OSN network is more complex. The relationships between close users are shown explicitly. We can trace friend relationships and forwarded messages more easily than on the web. On the other hand, web pages are usually open to everyone while some OSN messages are only available to friends, such as Facebook messages.

With these features, new techniques for crawling fresh OSN information are in demand. What's more, in a real crawler system many machines run the crawler program at the same time, leading to a series of problems such as task scheduling and workload balance. So it is necessary to design a parallel crawler architecture.

With the consideration that the goal of crawling OSN information is to gather new information, this paper aims to crawl as many new messages as possible with the limited resources. For crawling, the limitations on resources include bandwidth, computation power and politeness etiquette. For instance, for twitter.com, we may be permitted to get at most 200 tweets with one Twitter API call, and the restriction on method calls is 350 calls per hour for one authorized developer account [10]. Actually, the API restriction is the bottleneck of most OSN crawlers. To balance the limited resources and the freshness requirements of the crawler, we classify the users according to their behaviors and model their post-updating behavior respectively.
With these models, the post-updating time of different users is predicted, and the crawler accesses the posts of a user only when the corresponding posts are updated. As a result, the latest information is collected with limited resources.

Combining the steps discussed above, we propose Crawl based on User Visiting Model (CUVIM in brief) based on the observations and classifications of users' behaviors. In this paper, we focus on the messages in OSN. Considering the relationship updating between users as a special type of message, relationship-updating information can also be crawled with the techniques in this paper.

According to their different post-updating behaviors, we classify OSN users into 4 kinds: the inactive account, the unstable changing account, the reasonable constant account, and the authority account. This is the first contribution of this paper.

We design different updating-time prediction models and accordingly develop efficient crawling strategies. This is the second contribution of this paper. Concretely, for the inactive accounts and the unstable changing accounts, which do not change very frequently over a long period, the changes can be described by the Poisson process and the changing rate can be predicted by statistical methods. Thus we build the Poisson model and adopt a web crawling strategy to crawl OSN data. For reasonable constant accounts and authority accounts, who post messages frequently, we observe that the frequency of new messages is related to the users' daily life. According to this observation, we can crawl many fresh and useful messages in the day while almost no new messages appear at night. We build the Hash model to visit those active users and crawl their information efficiently.

As the third contribution, extensive experiments are performed to verify the effectiveness and efficiency of the techniques in this paper. We crawled the last 2,000 messages of 88,799 randomly selected users. The result shows that 80,616,076 messages are collected. From the experimental results, the Poisson process model collects 12.14% more messages than a Round-and-Robin (RR) style method. The Hash model collects about 50% more messages than the RR style method. The parallelization method limits the workload difference of the Poisson process model to less than 13.27% of that of a random method. We also tested our parallel crawler architectures. The results show that the speed-up of the architectures is linear while the workload difference between the machines is almost negligible.

This paper is organized as follows. Section 2 reviews related work in the crawler field; Section 3 introduces our observations and discoveries from the data; Section 4 introduces our crawling algorithms; Section 5 introduces the parallelization technology in the crawler system; Section 6 shows the experimental results; Section 7 draws the conclusions and proposes further research problems.
2. Related Work
Only a few methods have been proposed to crawl OSN data. [5] describes a Twitter crawler developed in Java. They pay more attention to the implementation details of the crawler and the data analysis. Instead, we focus on the crawling method and develop algorithms to gather more information about the specific OSN users. TwitterEcho is an open source Twitter crawler developed by Boanjak et al. [11]. It applies a centralized distributed architecture. Cloud computing is also used for OSN crawlers [12]: Noordhuis et al. collect Twitter data and rank Twitter users through the PageRank algorithm. Another attempt is to crawl in parallel. Chau et al. implemented a parallel eBay crawler in Java and visited 11,716,588 users in 23 days [13]. These three methods aim to obtain more computing resources, while we focus on a more reasonable crawling sequence with the given resources.

Whitelist accounts were once available on Twitter. Kwak et al. successfully crawled the entire Twitter site, including 41.7 million user profiles and 106 million tweets, through the Twitter API [14]. However, whitelist accounts are no longer available now. It is the same for [12]. As the API is rate-limited now, we propose algorithms to improve the crawling efficiency.

Another line of work related to OSN message collection is web crawlers. Generally, page changing follows the Poisson process model. The Poisson process model assumes that the changing rate λ_i is the same in every time unit ∆, and the changes are modeled by the following formula:

P[changes in ∆ is k] = e^(−λ_i ∆) (λ_i ∆)^k / k!

This equation can predict the possible changes of web pages, but it can be applied only to the inactive OSN users, since they usually change at a steady rate. When considering the active OSN users who post messages frequently, the change rate is not equal all the time. The discipline of OSN messages follows the users' daily life: a user may post many messages in the day and much fewer at night, so the changing rate is not the same for day and night.

Many web crawlers have been proposed. Representative measures for web crawling are sharpness [8] and freshness [9]. Those strategies define sharpness or freshness for the crawl, and schedule the crawling to achieve those targets. Differently, we choose the total number of new OSN messages as our target, and schedule according to the OSN users' behavior. There are other mature web crawling strategies. J. Cho, H. Garcia-Molina and L. Page improve the crawling efficiency through URL ordering [15]. They define several importance metrics, including ordering schemes and performance evaluation measures, to obtain more important URLs first. J. Cho and H. Garcia-Molina also propose a strategy for estimating the change frequency of pages to make the web crawler work better [16], by identifying scenarios and then developing frequency estimators. C. Castillo, M. Marin, A. Rodriguez and R. Baeza-Yates combine breadth-first ordering with largest-sites-first to crawl pages fast and simply [17]. J. Cho and U. Schonfeld improve the crawler with a high personalized PageRank coverage guarantee [18].
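To make the Poisson process model above concrete, the following minimal sketch computes the probability of exactly k changes and the expected number of changes in an interval; it is an illustration of the standard formula, not code from any of the cited systems.

```python
import math

def poisson_change_prob(rate, interval, k):
    """Probability that a page (or an inactive OSN user) changes exactly k
    times within `interval` time units, given change rate `rate` per unit."""
    lam = rate * interval
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_changes(rate, interval):
    """Expected number of changes in the interval (the Poisson mean)."""
    return rate * interval

# Example: an inactive user posting 0.1 messages per day, observed for 30 days.
print(poisson_change_prob(0.1, 30, 0))  # probability of no new message
print(expected_changes(0.1, 30))        # expected number of new messages
```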
3. OSN Data Analysis And Classification

At first, to design a proper crawler for OSN, we discuss the features of OSN data and the classification of OSN users by their behavior in this section. Considering the four features of the OSN listed in Section 1, we propose novel methods. We first define audience and channel as follows to study the OSN relationships, where both A and B are users.

Definition 1. (Audience) Audience is a one-way relationship on OSN. A is B's audience means that A can check B's OSN messages.

Definition 2. (Channel) Channel is a one-way relationship on OSN. A is B's channel means that A's messages will be checked by B.

To study the message-updating behavior of OSN users, we crawl and study OSN messages. For effective crawling, we choose several top OSN users from the influence ranking list of Sina Weibo (http://weibo.com), which is a famous OSN with millions of users. They are added to the crawling list as the seeds, and some of their channels are randomly selected. The channels behave more actively than the audiences, thus we can avoid invalid accounts in the list. The channels are added to the crawling list. Then they are treated as the new seeds and their channels are accessed. We end the iterations when we get enough users in the crawling list. With a seeds, k channels chosen per seed and n-hop channels traversed from each seed, the crawling list contains Σ_{i=0}^{n} a·k^i = a(k^(n+1) − 1)/(k − 1) users.
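A minimal sketch of this seed-expansion procedure follows; get_channels stands in for whatever OSN API call actually returns a user's channels, and the sampling size k and hop limit are parameters of the run, not values fixed by the paper.

```python
import random

def build_crawl_list(seeds, get_channels, hops, k, limit):
    """Expand the crawl list by repeatedly sampling k channels of the current
    frontier, for at most `hops` hops or until `limit` users are collected.
    `get_channels(user)` is a placeholder for the real OSN API call."""
    crawl_list = list(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        next_frontier = []
        for user in frontier:
            channels = get_channels(user)
            for c in random.sample(channels, min(k, len(channels))):
                if c not in crawl_list:
                    crawl_list.append(c)
                    next_frontier.append(c)
            if len(crawl_list) >= limit:
                return crawl_list[:limit]
        frontier = next_frontier
    return crawl_list
```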
From this formula, the smaller n is, the more representative the users in the list are: at the beginning the seed users are famous, and later more and more ordinary people are added into the list. Since it is impossible to predict the posting frequency without any information, initially we crawled the data for all users in a Round-and-Robin style. We crawled 10,000 users' data for two months and got 1,853,085 messages in total. According to the experimental results, we have the following observations.

1. Users are quite different in posting frequency. The users in the crawling list have at least one audience. It means that those accounts are valid and are, or were once, active. However, during the experiment period, about 1/2 of the users posted less than one message per day, and 1/5 of the users posted more than 10 messages per day. This observation means that the crawl frequency for different users should be different.

Figure 1: The Updating Ratio of an Inactive Account
2. The frequency of new messages may change over time. An extreme user posted 18 messages on the first day but did not post anything in the following three days.

3. The frequency of posting new messages is related to the users' daily life. Experimental results show that an actress posts about half of her messages late at night while a manager posts most messages in the afternoon.

4. Some accounts are maintained by professional clerks or robots, such as a newspaper's official account. Those accounts have more audiences, change more frequently and have more influence than the personal users.

With the above observations, we classify all users into 4 types by their behavior. To illustrate the features of these four kinds of users, we show the experimental results of one user belonging to each type in Figures 1 to 4. Each figure shows the total number of new messages in each 15 minutes of a day. The horizontal axis is the time from 00:00 to 24:00, and the vertical axis is the total number of new messages posted, with 15 minutes as the unit. In Figures 1, 3 and 4, the line in the figure shows the number of all messages in the 2 months. In Figure 2, each line shows one day.

1. Inactive account. The users of this type post nothing over a long period or have few channels and audiences. From Figure 1, it can be observed that the figure of an inactive account has at most a few points. It means that the user hardly posts any messages in a day, but may post a few messages within a few weeks. Observed over a long period, those users behave like web pages: an inactive user may post three messages a month, just as a page may change three times that month, and the number of possible changes is roughly the same in each equal time unit ∆ when ∆ is large enough. Thus we can describe the behavior of this type by a Poisson process.

Figure 2: The Updating Ratio of an Unstable Changing Account
2. Unstable changing account. Such users' behavior is unstable and cannot be predicted. Users in this class are often very active for a short period and remain silent later, for example, the one who posted 18 messages and then did not post anything in the following 3 days. It is hard to design an effective crawling strategy for those people. Figure 2 shows the behavior of an unstable changing account. The user posted 2 messages on Monday, 13 messages on Tuesday and nothing in the following three days. It is the most irregular one among all the figures. It may be explained by the user taking a sudden trip without connection to the OSN, or by busy work in those days that leaves little attention for OSN. Such users may become reasonable constant users when they return or finish the work. There is no effective strategy to crawl this kind of users, for we cannot predict their behavior and thus cannot schedule the crawler well. We treat those users as inactive accounts to save crawling resources, and move them into the reasonable constant users once we find that they have posted many messages every day in the recent week.

3. Reasonable constant user. Most valid accounts belong to this type. For example, users who work far from a computer or mobile phone have to post messages after work, and users who work with computers post during the work day in the office. The frequency of new messages is obviously influenced by the users' daily life, and new messages often occur frequently in the afternoon and at night when the user takes a rest. Hence we can predict their behavior from historic data. Figure 3 shows the behavior of a reasonable constant user. Such users love the OSN very much and post messages very regularly. This kind of figure often has two peaks, in the afternoon and at night. The curve to the left of each peak rises and the one after the peak falls.

Figure 3: The Updating Ratio of a Reasonable Constant Account

4. Authority account. Such accounts are maintained by several clerks or just robots. The content of those accounts is carefully updated, far more frequently than that of ordinary people, and is reviewed by more users. For example, The New York Times updates news quickly on Twitter and has a large group of audiences. Figure 4 shows the behavior of an authority account. This kind of figure is higher than that of usual users and has more peaks than that of a reasonable constant user. From the result, there is one peak in each hour, and the peaks are almost of the same height. The reason is that the clerks or robots may be asked to post messages once an hour.

Figure 4: The Updating Ratio of an Authority Account
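The classification itself can be sketched as a simple rule over per-user statistics. The thresholds below (messages per day, audience count, day-to-day variation) are illustrative assumptions, not values prescribed in this paper.

```python
def classify_user(daily_counts, audiences,
                  inactive_per_day=0.2, authority_audiences=100000, burst_factor=5.0):
    """Assign a user to one of the four behavior classes from the list of
    daily message counts observed so far. All thresholds are assumptions."""
    per_day = sum(daily_counts) / len(daily_counts)
    if per_day < inactive_per_day:
        return "inactive"
    if audiences >= authority_audiences:
        return "authority"
    # A day-to-day swing far above the mean suggests an unstable account.
    if max(daily_counts) > burst_factor * per_day:
        return "unstable"
    return "reasonable_constant"

# Example: 18 messages on one day, then near silence, over a 10-day window.
print(classify_user([18, 0, 0, 0, 2, 0, 1, 0, 0, 0], audiences=120))  # -> unstable
```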
4. User Behavior Model And Crawl Strategy
According to the behavior of the users, we build two models for effective crawling, the Poisson process model and the Hash model.

The Poisson process model is built for inactive users, who behave similarly to web pages, and effective web crawling strategies such as SHARC [8], trade-offs [19] and sampling [20] can be applied to those users.

The Hash model records the principle of very active users' behavior. For those users, we consider how the frequency changes within a day, and the change rate λ_i is not the same all day long, hence the Poisson process cannot describe it. One effective way is to record the historic change rate and use it to predict the parameters for crawling.

4.1. Poisson Process Model

According to the observation of inactive users, we build a Poisson process model for their posting frequency, since those users behave like web pages, changing at a steady rate over a comparatively long period (e.g. a month). We assume that the change rate λ_i can be estimated from historic data and the numbers of audiences and channels, just as the change rate of pages is estimated from types, directory depths and URL names.

However, the web crawling metrics need to be modified to fit OSN. SHARC [8] defines the blur for web pages; it describes the difference between the new and old pages of the same URL. Yet this cannot be applied to OSN. The blur captures the possibility that a page changes, but for OSN the question is the expected number of new messages, rather than whether a new message is posted or not. Thus the blur cannot describe the messages. To describe the freshness of messages in OSN, we define the potentiality of the users, and the crawling target is to minimize the total potentiality. We define the potentiality for OSN as follows.

Definition 3. (Potentiality) Let u_i be an OSN user crawled at time t_i. The potentiality of the user is the expected number of new messages between t_i and the query time t, averaged over the observation interval [0, n∆]:

P(u_i, t_i, n, ∆) = (1/(n∆)) ∫_0^{n∆} λ_i |t − t_i| dt = λ_i ω(t_i, n, ∆) / (n∆),

where ω(t_i, n, ∆) = t_i^2 − t_i·n∆ + (n∆)^2/2. Let U = (u_1, ..., u_n) be OSN users crawled at the corresponding times T = (t_1, ..., t_n). The potentiality of the crawl is defined as the sum of the potentialities of the individual users,

P(U, T, n, ∆) = (1/(n∆)) Σ_{i=1}^{n} λ_i ω(t_i, n, ∆).

ω(t_i, n, ∆) denotes the crawling penalty. It depends on the crawling time t_i and the length of the crawling interval n∆, but not on the user. We obtain Theorem 1.

Theorem 1. (Properties of the Schedule Penalty) Doubling the crawling delay leads to a fourfold crawling penalty and double potentiality:

ω(2i∆, n, 2∆) = 4·ω(i∆, n, ∆),   P(U, 2T, n, 2∆) = 2·P(U, T, n, ∆).

Since the penalty ω(t_i, n, ∆) is a parabola whose minimum lies in the middle of the observation interval, the total potentiality is minimized by crawling the most active users in the middle of the interval and the least active ones at the two ends. For example, if six users u_0 to u_5 are sorted by ascending change rate, we should crawl them in the order u_0, u_2, u_4, u_5, u_3, u_1, and thus we get the minimum potentiality.

Algorithm 1 Crawl Schedule with Poisson Model
input: sorted users (u_0, ..., u_{n−1})
output: crawl schedule (u^D_0, ..., u^D_{n−1})
for i ← 0 to n − 1 do
    if i is even then u^D_{i/2} = u_i
    else u^D_{n−(i+1)/2} = u_i

Algorithm 1 depicts the Poisson process model schedule for inactive users in OSN. All the users are known in advance. We sort the users by change rate and scan the list only once to schedule the crawler, so the time and space complexity are both O(n).
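A compact sketch of Algorithm 1 follows, assuming the users arrive already sorted by ascending change rate; it only reorders labels, so any user representation works.

```python
def poisson_schedule(sorted_users):
    """Reorder users sorted by ascending change rate so that the most active
    ones end up in the middle of the crawl interval (Algorithm 1 sketch)."""
    n = len(sorted_users)
    schedule = [None] * n
    for i, user in enumerate(sorted_users):
        if i % 2 == 0:
            schedule[i // 2] = user            # low-rate users fill the front half
        else:
            schedule[n - (i + 1) // 2] = user  # high-rate users fill the back half, reversed
    return schedule

# Example with six users labeled by ascending change rate.
print(poisson_schedule(["u0", "u1", "u2", "u3", "u4", "u5"]))
# -> ['u0', 'u2', 'u4', 'u5', 'u3', 'u1']
```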
4.2. Hash Model

According to the observation of reasonable constant users and authority accounts, we build the Hash model. The changing rate λ_i of those users is stable when observed over a long time (e.g. a week or longer). However, new OSN messages are posted so frequently that those users need to be crawled very frequently, and if we visit them according to the averaged λ_i of the day, we waste much resource.

For example, suppose we visit the user in Figure 3 according to the Poisson process model. In the Poisson process, the number of possible changes in each time unit is the same, so the time span between two extractions should be the same. If we start crawling at 00:00 and crawl twice a day, the crawls happen at 00:00 and 12:00. However, crawling at 03:00 and 19:00 seems to be the best strategy. Thus the Poisson process model does not fit those active users perfectly.

The number of an active user's new messages changes frequently and comparatively randomly, hence it is hard to find a precise and suitable model. On the other hand, the number of such users is not large enough to fit a statistical model such as a Gaussian model to their behavior. One effective way to predict the users' behavior is to record the historic data.

With such considerations, we define the Hash model to better obtain the messages of the active users, including the reasonable constant users and the authority accounts. This model is built for users who need to be crawled frequently, at least twice a day. It uses a hash table to record the number of new messages in a short recent period, where earlier data has less weight. According to this historic data, we can calculate the expected number of new messages posted in a given time span, so we can schedule the crawler better.

For example, we maintain an array of 24 entries (a_1, ..., a_24) to record the number of new messages posted by the user in each hour. Each a_i is 0 at the beginning and is updated as a'_i = a_i × 0.5 + n_i × 0.5, where n_i is the number of messages posted by the user in the i-th hour of the day. The data of k days before thus has relative weight 0.5^k; as a result, the weight for today is 0.5 and 0.25 for yesterday.

If a user does not post any messages for a few days, the values in the hash table decrease very fast. To avoid this phenomenon, a longest crawling span threshold s is required, which means we crawl the user at least once every s hours.

Although the hash-based method can be used for active user crawling, it is not suitable for some special cases. On special dates, such as the weekend, people behave differently from workdays. The experimental results of 100 users over 2 weeks show that 1,496 messages were posted on Tuesday while only 874 messages on Sunday. Thus we can predict users' weekend behavior from the data of the last weekend. Similar predictions can be applied for public holidays, such as the national day: the hash model should consider the data of near and similar holidays, or even the last year's vacation data.

Algorithm 2 depicts the Hash model for one user.
Algorithm 2 Crawl Schedule with Hash Model
input: a_1, ..., a_k, n_1, ..., n_k
output: crawl time list L
lastCrawlTime = 0, sum_0 = 0 − remainingMessage
for i ← 1 to k do a_i = a_i × 0.5 + n_i × 0.5
for i ← 1 to k do sum_i = sum_{i−1} + a_i
for i ← 1 to k do
    if (sum_i − sum_lastCrawlTime > c) or (i − lastCrawlTime > s)
        then L.add(i); lastCrawlTime = i
remainingMessage = sum_k − sum_lastCrawlTime

The length of the hash table is k, we crawl c messages each time, and the user posted n_i messages yesterday in time span i. remainingMessage is the number of messages that were not crawled the day before, sum_i is the sum of a_1 to a_i, and sum_0 is set to minus remainingMessage, so the remaining messages that were not crawled yesterday are counted as well. The longest crawling span threshold is s. We input a_1, ..., a_k and n_1, ..., n_k, and the algorithm outputs the crawl time list L. First, we reset lastCrawlTime and sum_0, update the values of the hash table and calculate sum_1, ..., sum_k. Second, we scan the sums. If there are enough messages to crawl (sum_i − sum_lastCrawlTime > c) or the crawl time span exceeds the threshold (i − lastCrawlTime > s), then we add the time point i to the crawl time list L and update lastCrawlTime. The time and space complexity are both linear in the table length k.
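A runnable sketch of this schedule under the assumptions above: k hourly slots, per-crawl capacity c and the span threshold s mirror the pseudocode, and the backlog from the previous day is passed in explicitly.

```python
def hash_model_schedule(a, n_yesterday, c, s, remaining=0.0):
    """Plan today's crawl times for one active user (sketch of Algorithm 2).
    a            -- decayed hourly history from previous days (length k)
    n_yesterday  -- messages actually posted in each hourly slot yesterday
    c            -- messages fetched per crawl (e.g. the API page size)
    s            -- longest allowed span between two crawls, in slots
    remaining    -- backlog of messages not collected the day before
    Returns (updated history, crawl-time list, new backlog)."""
    k = len(a)
    a = [0.5 * ai + 0.5 * ni for ai, ni in zip(a, n_yesterday)]
    sums = [-remaining]                  # sum_0: the backlog counts toward the threshold
    for ai in a:
        sums.append(sums[-1] + ai)
    crawl_times, last = [], 0
    for i in range(1, k + 1):
        if sums[i] - sums[last] > c or i - last > s:
            crawl_times.append(i)
            last = i
    return a, crawl_times, sums[k] - sums[last]

# Example: a user mostly active from 15:00 to 22:00, capacity 100, span limit 12 hours.
history = [0.0] * 24
yesterday = [0] * 14 + [30, 40, 50, 60, 40, 30, 20, 10] + [0, 0]
print(hash_model_schedule(history, yesterday, c=100, s=12))
```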
5. Parallelization
As information updates very fast on OSN, effective applications of OSN data require collecting messages from OSN at a high speed to keep the crawled data fresh. However, two factors prevent high-speed crawling. One is that the API quota for data access is seriously limited by the OSN platform (350 calls per hour on Twitter, and 150 calls on Sina Weibo). The other is that the computation required to extract information from the crawled data is large, and so is the size of the social network. It is therefore natural to design parallel algorithms to obtain fresh information from OSN. With a number of IPs and machines, we are able to run the crawler on several computers at the same time to increase the throughput of the system. For parallel processing, task scheduling is a crucial problem to solve. Here, task scheduling means assigning the crawling tasks of the various users to machines so that the loads of the machines stay balanced.

In this section, we discuss the parallelization strategy for the crawler techniques presented before. We propose load balancing methods to keep the crawler loads balanced across multiple machines. In the following part of this section, we first discuss the parallel optimization of the Poisson process model (in Section 5.1). As for the Hash model, it focuses on scheduling the crawling of one user, rather than a large group of users as the Poisson process model does. It is not necessary to crawl one user with multiple machines, because the resource required to crawl one user is very small; we can easily parallelize such a process by using a few machines while every machine crawls different users. Second, we propose two crawler system architectures, a centralized and a distributed one, to make the crawler work in parallel (in Section 5.2). Both the Poisson process model and the Hash model can be applied to the two system architectures.
5.1. Parallel Optimization of the Poisson Process Model

When the crawler is developed based on the Poisson process model as in Section 4.1, the figure of the message frequency along the crawl sequence looks like organ pipes, as Figure 5 shows. The benefit of such a model is that the parameters are simple and the computation resource required for the parameters is small.

Figure 5: Organ Pipes in Poisson Process Model

The height in Figure 5 is the changing rate λ in the Poisson process; for the OSN, the height means the user's post frequency. By analyzing such a model, we have the following observations.

1. For the inactive and unimportant OSN users, we can collect their information at a low speed, e.g. once a month.

2. The λ, or post frequency, can be easily predicted according to the historic data of the target user.

3. Such a prediction does not cost much computing resource: what we need to do is to count how many messages the user posted and how long the time period of this posting behavior is.

Thus we can improve our crawler efficiency without much calculation. The experiment shows it is 5.56% to 12.14% better than a Round-and-Robin style.

To make the best of the parallelization, we attempt to schedule the load of every machine as balanced as possible. We propose two methods to achieve this goal, the Round-and-Robin method and the Set-Division method.

5.1.1. Round-and-Robin Method

An intuitive idea to allocate the tasks is to assign the crawl task of each user to the machines in a Round-and-Robin style. Let the set of all crawl tasks be S = {u_1, u_2, ..., u_k}, each element of which is a crawl task for one user, and let the number of machines be denoted by n. To make these computers work in parallel, we want to divide the sequence S into n parts, and then make each of the n machines pick up one part and crawl the users in this part. The Round-and-Robin method divides the sequence as follows: machine i crawls the user at position j of the sequence (j ≤ k) if j mod n = i. Thus we get n crawl lists, each of which is for one machine.

Figure 6: Round-and-Robin Method to Parallel Poisson Process Model into Two Parts

Figure 6 describes an example of the algorithm. At the beginning, the to-crawl sequence of 6 users is u^D_0, ..., u^D_5, and we divide the users into two parts: Part_1 (u^D_0, u^D_2, u^D_4) and Part_2 (u^D_1, u^D_3, u^D_5). For a user u^D_i in Part_1, i mod 2 is 0, and for a user u^D_i in Part_2, i mod 2 is 1. We can conduct two machines, one crawling Part_1 and the other Part_2. Thus we can crawl u^D_0, ..., u^D_5 in parallel.

Assume that in the user sequence u_0, u_1, ..., u_{n−1} sorted by post frequency, the post frequency difference between two adjoining users, f_{i+1} − f_i, is a constant ∆, where f_i is the post frequency of user u_i. In fact, the more users we count, the more stable f_{i+1} − f_i is. Then the workload difference between Part_1 and Part_2 is

Σ_{Part_1} − Σ_{Part_2} = 0 if n mod 4 = 0; f_0 if n mod 4 = 1; −∆ if n mod 4 = 2; f_0 − ∆ if n mod 4 = 3.   (1)
Proof. We prove Equation 1 in four cases according to n mod 4. For the convenience of the discussion, we further divide Part_1 and Part_2 into two segments each: Part_1A and Part_2A hold the users that the two parts take from the ascending (front) half of the organ-pipe sequence, and Part_1B and Part_2B hold the users taken from the descending (back) half. We use f_i for the post frequency of user u_i and f^D_i for the post frequency of user u^D_i produced by Algorithm 1, so that

f^D_i = f_{2i} for i < ⌈n/2⌉, and f^D_i = f_{2(n−i)−1} for i ≥ ⌈n/2⌉.

In the first case, n mod 4 = 0. The front half contributes W_Part_1A = f_0 + f_4 + ... and W_Part_2A = f_2 + f_6 + ..., each with n/4 items, while the back half contributes W_Part_1B = f_{n−1} + f_{n−5} + ... and W_Part_2B = f_{n−3} + f_{n−7} + ..., again with n/4 items each. Pairing the terms and using f_{i+1} − f_i = ∆,

W_Part_1 − W_Part_2 = (W_Part_1A − W_Part_2A) + (W_Part_1B − W_Part_2B) = (n/4)(−2∆) + (n/4)(2∆) = 0.

Hence Σ_{Part_1} − Σ_{Part_2} = 0 when n mod 4 = 0. The computations for the remaining cases are analogous: pairing the terms of the four segments and collecting the leftover terms gives Σ_{Part_1} − Σ_{Part_2} = f_0 when n mod 4 = 1, −∆ when n mod 4 = 2, and f_0 − ∆ when n mod 4 = 3.

The workload difference between the two parts therefore depends only on the difference ∆ between the post frequencies of adjoining users and the lowest post frequency f_0.

The Round-and-Robin algorithm has two advantages. First, as n increases, the difference in workloads between the two machines does not increase, so the workload balance is still assured for large data sizes. Second, the workload difference is negligible for a crawling computer: in a real crawler, each machine crawls more than one thousand users a day, and the difference of one user is not influential. The Round-and-Robin method only requires scanning the posting frequencies of the users once to assign position i to part i mod k, hence the space and time complexity of this algorithm are both O(n), where n is the number of users in the sequence.

Figure 7: Round-and-Robin Method to Parallel Poisson Process Model into Two Imbalanced Parts

In some extreme cases, where there are a few highly active users, load balance becomes difficult. Figure 7 shows such a case. The user u_0 is very active while the other users are quite silent on OSN, and the difference between Part_1 and Part_2 is larger than in Figure 6. Hence the loads of the two crawling sequences divided by the Round-and-Robin method are difficult to balance. However, this is hardly met in real crawling for two reasons. First, from the experimental results, the most active user among the 88,799 users posted 2,051 messages in a period of 4 years. Compared with the workload of a machine, millions of messages each day, two thousand messages do not influence the workload of a crawling machine; therefore even the most active users would hardly damage the load balance. Second, in practice, the messages of many more users are crawled than in the experiments; the more we collect, the more balanced the Round-and-Robin method is.

Figure 8: Round-and-Robin Method to Parallel Poisson Process Model into Four Parts

In cases where we need to divide the Round-and-Robin model into multiple parts, we can do the division recursively: first we divide the model into two parts, and then we apply the same method to divide the two parts into four parts, and then eight or more parts. Figure 8 describes such a division. At the beginning, there is just one part; after Step 1, there are two parts, and after Step 2, there are four. Thus we can crawl the original users with four machines at the same time. This method can only divide the model into a power of 2 parts. In fact, since the number of machines is often a power of 2, the method is practical for such cases.
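The following sketch mirrors the division of Figure 6 and Figure 8: it splits an already ordered crawl schedule into parts by position parity and repeats the split until a power-of-two number of machines is reached.

```python
def round_robin_split(schedule, parts=2):
    """Assign position j of the schedule to part j mod `parts`."""
    return [schedule[i::parts] for i in range(parts)]

def recursive_split(schedule, machines):
    """Halve the schedule repeatedly until `machines` parts exist
    (machines is assumed to be a power of two, as in Figure 8)."""
    groups = [schedule]
    while len(groups) < machines:
        groups = [half for g in groups for half in round_robin_split(g, 2)]
    return groups

# Example: the six-user schedule of Figure 6 split over two machines.
print(round_robin_split(["u0", "u2", "u4", "u5", "u3", "u1"]))
# -> [['u0', 'u4', 'u3'], ['u2', 'u5', 'u1']]
```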
As for the approximation quality and the time complexity, we prove that this algorithm is a PTAS as follows.

Theorem 2. The algorithm that parallelizes the Poisson process model into multiple parts is a PTAS.

Proof. First, we prove that after the i-th partition step, the size of the largest share is at most (1 + ε')^i · n/2^i and the size of the smallest share is at least (1 − ε')^i · n/2^i. We prove this statement by induction. As the basic step, for the first partition, according to the approximation ratio of the FPTAS algorithm for SSP [21], the size of the smaller part is larger than (1 − ε') · n/2. Accordingly, the size of the larger part is smaller than n − (1 − ε') · n/2 = (1 + ε') · n/2. The inductive assumption is that after the (i − 1)-th step the size of the largest share is at most (1 + ε')^{i−1} · n/2^{i−1} and the size of the smallest share is at least (1 − ε')^{i−1} · n/2^{i−1}. According to the approximation ratio of the FPTAS for SSP, the smallest share produced in step i is larger than (1 − ε') times half of the smallest share of the previous step, which is (1 − ε')^i · n/2^i; for the same reason, the largest share produced in step i is smaller than (1 + ε') times half of the largest share of the previous step, which is (1 + ε')^i · n/2^i. Then the statement is proven.

Second, we prove that the time complexity of the algorithm is polynomial in n and k. The algorithm has ⌈log k⌉ steps. The time complexity of step i is polynomial in n and k, since the number of shares is 2^i ≤ k and each partition runs in time polynomial in n/2^i and 1/ε'. Since there are at most ⌈log k⌉ steps, the total time complexity is polynomial in n and k.

Finally, we prove that the ratio of the sizes of the largest share and the smallest share is at most 1 + ε for a given ε (in the best case this ratio equals 1, so 1 + ε is the ratio bound). Since after step i the largest share is at most (1 + ε')^i · n/2^i and the smallest share is at least (1 − ε')^i · n/2^i, the ratio of the sizes of the largest and smallest final shares is at most ((1 + ε')/(1 − ε'))^{⌈log k⌉}. We choose ε' such that (1 + ε')/(1 − ε') = R with R = 2^{log(1+ε)/⌈log k⌉}, i.e. ε' = (R − 1)/(R + 1). Then ((1 + ε')/(1 − ε'))^{⌈log k⌉} = R^{⌈log k⌉} = 2^{log(1+ε)} = 1 + ε. This shows that the ratio of the sizes of the largest and the smallest share is at most 1 + ε. Thus the ratio bound of the algorithm is 1 + ε.

5.1.2. Set-Division Method

As discussed in Section 5.1.1, the Round-and-Robin strategy divides the previous part into several smaller parts, and each smaller part is as equal in size as possible.
The Set-Division method instead separates the previous part into a small number of smaller parts, where the smaller parts may contain different numbers of users.

If we treat the user post frequencies as a set of integers, then the job schedule problem (JS for brief) is defined as follows: given a set of integers S = {u_1, u_2, ..., u_n}, divide S into k sets S_1, S_2, ..., S_k such that the largest difference between the sums of any two sets, max_{1≤i<j≤k} |ΣS_i − ΣS_j|, is minimized.

JS is an NP-hard problem.

Proof. We reduce the Number Partitioning Problem (NPP) [22] to the special case of JS with k = 2. For an NPP instance with a given set of integers S = {a_1, a_2, ..., a_n}, we construct a JS instance with the same input set S and k = 2. The solution of this JS instance, S_1 and S_2, minimizes |ΣS_1 − ΣS_2| and is thus a solution of the NPP instance. Since NPP is NP-hard [23] and the reduction can clearly be accomplished in polynomial time, JS is NP-hard.

The problem can be converted to the Subset Sum Problem (SSP): given a set of integers S = {u_1, u_2, ..., u_n}, select m numbers from S so that their sum is close to but no more than a given constant c [24]. The SSP is NP-hard [23], but it can be solved in pseudo-polynomial time. [24] gives such an algorithm; its time complexity is O(max{(n − log c)·c, c·log c}) and its space complexity is O(n + c).
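For illustration, here is a small dynamic-programming sketch of the two-machine case: it searches for a subset of (integer) post frequencies whose sum is as close as possible to, but not above, half of the total. It is a plain subset-sum pass, not the exact algorithm of [24].

```python
def set_division_two_parts(freqs):
    """Split integer post frequencies into two sets with sums as close as possible."""
    target = sum(freqs) // 2
    # reachable[s] holds one subset (as an index list) whose frequencies sum exactly to s.
    reachable = {0: []}
    for idx, f in enumerate(freqs):
        for s, subset in list(reachable.items()):
            if s + f <= target and s + f not in reachable:
                reachable[s + f] = subset + [idx]
    best = max(reachable)                       # largest reachable sum not above half
    part1 = set(reachable[best])
    part2 = [i for i in range(len(freqs)) if i not in part1]
    return sorted(part1), part2

# Example: one very active user among quiet ones, a case that is hard for Round-and-Robin.
print(set_division_two_parts([90, 3, 2, 2, 1, 1, 1]))
# -> ([1, 2, 3, 4, 5, 6], [0])
```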
5.2. Parallel Crawler System Architectures

In this subsection, we propose two parallel system architectures, a centralized one and a distributed one. In the centralized architecture, the clients are connected to a server, but all of them work independently; in the distributed architecture, all the computers work independently. Either the Poisson process model or the Hash model can be applied to both architectures, and the two models can also be applied at the same time.

The major difference between the two architectures is whether the system has a central server. In the centralized system, the server maintains the crawling sequences of all users and assigns the crawling jobs to clients; the clients are only responsible for the crawling jobs sent by the server. In the distributed system, every machine i is in charge of a crawling task set S_i and determines the assignment of the tasks in S_i to machines. Note that for each task u in S_i, machine i just determines which machine will perform u; u does not have to be performed on machine i. We introduce these two architectures in Section 5.2.1 and Section 5.2.2, respectively.

5.2.1. The Centralized Architecture

In the centralized crawler system, a central server is required to maintain the target crawling user sequence and schedule all clients' work. The central server runs the Poisson process model or the Hash model and updates the model parameters. It decides 1) which user to crawl next, and 2) which machine will crawl that user. All other client machines have a to-crawl user list, and each client crawls information for the users in its list one by one. To make the best use of the API quota, when we want to collect a user's information, we first find the client machine whose to-do list is the shortest and make it do the job. Hence the to-do lists of the machines are kept as equally long as possible, and so are the workloads and the consumed API quotas. If there is more than one machine with a shortest to-do list, then we randomly select one of them to do the corresponding job.

In a word, the tasks of the server are:
1) Build the crawling model: Poisson process model, Hash model and so on.
2) Maintain the crawling sequence: decide which user to crawl next.
3) Send target user ids to clients: let the client whose workload is currently minimal crawl the target user.
4) Receive data from clients: store the data that the clients collected.

And the tasks of a client are:
1) Crawl the target user: this is the job sent by the server.
2) Send the data to the server, so the server can manage the storage.

Figure 9: The Centralized Crawler Architecture

Figure 9 describes such a centralized system. It is convenient for a centralized system to do job scheduling; however, it may not make the best use of all machines. As the task of the server is to compute and update the model parameters, and that of the clients is to crawl, the limit of the server is computing resource while that of the clients is API quota and bandwidth. Therefore, it is possible that the server has run out of its computing resource while the clients still have enough API quota and bandwidth; as a result, the centralized system may not make the best use of all machines. On the other hand, the centralized system has a high requirement on the bandwidth between the server and the clients. If we use machines outside the laboratory as clients, e.g. on PlanetLab, it is hard to guarantee the bandwidth and the clients may lose connection. To avoid these two disadvantages, we propose a distributed architecture in the following subsection.
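Before turning to the distributed design, here is a minimal sketch of the server-side assignment policy just described: each new crawl job goes to the client with the shortest to-do list, with ties broken at random. The Client class is a stand-in for the real worker interface.

```python
import random

class Client:
    """Stand-in for a crawling client; `todo` is its pending user queue."""
    def __init__(self, name):
        self.name = name
        self.todo = []

def assign_job(clients, user_id):
    """Send `user_id` to the client whose to-do list is currently shortest."""
    shortest = min(len(c.todo) for c in clients)
    candidates = [c for c in clients if len(c.todo) == shortest]
    chosen = random.choice(candidates)   # random tie-breaking among equally short queues
    chosen.todo.append(user_id)
    return chosen.name

clients = [Client("c1"), Client("c2"), Client("c3")]
for uid in ["u7", "u12", "u3", "u44"]:
    print(uid, "->", assign_job(clients, uid))
```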
5.2.2. The Distributed Architecture

In a distributed system, there is no central server; every machine acts as both a server and a client at the same time.

In the centralized architecture, the central server has a long list of OSN users to be crawled. The server has access to the users' data and manages the crawling strategy according to the users' behavior and the crawling model. As a comparison, the client machines do not have any user list; they are given the target users' names or ids, crawl accordingly, and then return the users' data to the server. In the distributed system, however, every machine has its own user list. All of the machines manage their crawling independently: they are able to build different models and set their crawling strategy according to their own data. In a real crawling system, every machine runs two kinds of processes, the Management Process and the Crawling Process. The management process is responsible for the crawling strategy. It visits the users' data that has been collected before, runs the crawling model (Poisson process model, Hash model or any other model), decides which user to crawl next, and sends the target user's name or id to the crawling processes. The crawling process only sees the target user's name or id sent by the management process; its only duty is to crawl the target user and then send the user's data back to the management process. Because there is much more crawling work than management work, the number of crawling processes is much larger than that of management processes. In other words, the management and crawling processes serve like the server and clients in the centralized system.

Figure 10: The Distributed Crawler Architecture

Figure 10 describes the distributed architecture of a crawling system. The curves in the figure denote the Management Process, while the fold lines denote the Crawling Process. Every machine under this architecture works independently, and the machines can use different crawling strategies, either the Poisson process model or the Hash model. They also have different target user lists. Each machine decides the crawling sequence of its users and also does the crawling jobs assigned to it. A number of processes run on each computer at the same time: 1) a computing process that maintains the model and adjusts the algorithm parameters, and 2) a few crawling processes that collect information from the online social network. Because CPUs nowadays always have multiple cores, we can let the processes, no matter computing or crawling ones, run on different CPU cores to improve the performance of the crawling system.

As for the workload, in the centralized system our purpose is to make the workload as balanced as possible, as we discussed in Section 5.1.1. The centralized system is usually built in the laboratory, because it requires a high-speed network connection between the server and the clients. In such cases, the client machines, or the clusters in the laboratory, usually have the same performance, and to make the best use of identical machines we just need to make the loads as equal as possible. The distributed system, however, is usually built outside the laboratory, for it has lower requirements on machines and network. Under such circumstances, it is necessary to manage the workload according to the machines' performance. In a distributed system, since the crawling work of every machine is independent, the workload is independent as well. To balance the workload between the machines, one can 1) switch the crawling algorithm, or 2) add or remove users from the user lists of the computers. For example, if there are a lot of inactive users to collect and we want to crawl them only once, we choose the Poisson process model; otherwise we choose the Hash model. And if a computer runs out of its API quota, we can remove some users from its user list and add them to another computer whose API quota is sufficient.

However, the distributed architecture brings a problem: the data is stored distributively on many machines, which means we have to do more work to collect all the data from every machine. In a word, the centralized system works better inside the laboratory, and the distributed system is more suitable outside the laboratory.

6. Experimental Evaluation

To verify the effectiveness and efficiency of our strategies, we perform experiments in this section. The experiments are performed on PCs with 2.10 GHz Intel CPUs and 4 GB RAM. We crawl Weibo.com, which is one of the hottest OSNs in China: Weibo had 503 million registered users in December 2012 and about 100 million messages are posted every day [25]. We crawled the last 2,000 messages of 88,799 randomly selected users. The result shows that 80,616,076 messages were collected.

To test the effectiveness of the Poisson process model, we ran the Poisson process model twice. First, we crawled the 10,000 users over a period of 2 months from 2012.11.01 to 2013.01.01, and accessed every user once to get the last 100 messages. From the observations, the Poisson process model crawled 421,722 messages, while the Round-and-Robin style crawled 376,053 messages.
Thus the Pois-son process model is 12.14% better.Second, we crawled the 88,799 users in a period of 4 years from 2009.08.31to 2013.09.01, and accessed every user once. From observations, the Poissonprocess model crawled 7,369,498 messages, while the Round-and-Robin stylecrawled 7,147,965 messages. Thus the Poisson process model is 3.10% better.3.10% is lower than 12.14%. In fact, when the experimental period lastslonger and the total messages number grows larger, there will be more mes-sages to collect, and every time we can access more messages. And if we cancollect more than 100 messages every time, the experimental result will bethe same. Therefore, the longer the experimental period, the more similarthe results of various methods will be. We crawl the 10,000 users by hash model. We assume the crawling limitis 100 messages at one crawling, the longest crawling span threshold s is 30days, and the weight for the last one day (in the example, it is 0.5) rangesfrom 0.4 to 0.9 step by 0.1. The result is shown in Table 1. For comparisons,we also conduct the experiment for RR, and crawl one user 2 to 5 timesduring the experiment period. The result is described in Table 2.The data in Table 1 shows that with the weight increases, the total crawltime increases and more messages will be crawled, while the crawling ef-ficiency decreases. Thus the weight can be considered as a parameter toadjust the crawling efficiency and the limited crawling resource such as theAPI restriction. The data in Table 2 shows that the crawling efficiency forthe RR style method is reasonable stable when the crawling frequency is low. We show the experimental result of the parallelization methods discussedin Section 5 as follow: 29 able 1: The Hash Model Experiment Results Weight Message Number Total Crawl Time Avg Table 2: The RR Style Method Experiment Results First, we calculated the post frequencies (how many messages users postin a day) of 1000 to 7000 randomly selected user step by 1000, and triedboth Round-and-Robin method and randomly select method to divide thefrequencies into two parts. Table 3 shows the result.From Table 3, we can find that the Round-and-Robin method is muchbetter than the Random method. In the worst experimental case (728.22 and83.47), the workload difference of Round-and-Robin method is only 11.46%of that of random method.Second, we divide the 1000, 2000, 4000 and 8000 post frequencies into 16parts recursively using both Round-and-Robin method and random method.Table 4 shows the result.From Table 4, we can find that the Round-and-Robin method is muchbetter than the Random method. In the worst experimental case (2557.92and 339.55), the workload difference of Round-and-Robin method is only13.27% of that of random method. 30 able 3: The Round-and-Robin Parallelization Results 1 User Num. Fre. Tot. Random Method Diff. RR Diff.10000 49943.17 1995.68 145.1120000 96473.30 3413.03 37.5430000 140484.31 5828.93 37.0340000 187524.68 5421.92 85.7250000 237834.48 728.22 83.4760000 289173.84 8205.22 430.9070000 335352.75 5651.32 343.31 Table 4: The Round-and-Robin Parallelization Results 2 User Num. Fre. Tot. Random Method Diff. RR Diff.10000 49943.17 2557.92 339.5520000 96473.30 3096.89 276.6340000 187524.68 3264.10 185.3980000 381714.54 6585.32 643.81 To test the efficiency of the centralization architecture, we use 1, 2, 4, 8and 16 machines respectively and crawl 2,000 users with one machine duringa period of one year (from 2012.09.01 to 2013.09.1). Table 5 and Figure 11describe the experimental result. 
In Table 5, 'Machine Num.' is the total number of machines in the experiment, 'Tot. Messages' is the total number of messages crawled, and 'Workload Diff.' is the difference between the maximum and minimum workloads of the machines. In Figure 11, the x-axis is the number of experimental machines and the y-axis is the total number of messages crawled. From Table 5 and Figure 11, we can see that the speed-up ratio is almost linear.

Table 5: The Centralized Architecture Experimental Result
Machine Num.   Tot. Messages   Workload Diff.
1              22474           0
2              42495           627
4              83271           4079
8              172473          3530
16             344540          5008

Figure 11: The Centralized Architecture Experimental Result

To test the efficiency of the distributed architecture, we use 1, 2, 4, 8 and 16 machines respectively and crawl 2,000 users per machine during a period of one year (from 2012.09.01 to 2013.09.01). Table 6 and Figure 12 describe the experimental result. The rows in Table 6 and the axes in Figure 12 have the same meanings as those in Table 5 and Figure 11. From the table, we can see that the speed-up ratio is almost linear.

Table 6: The Distributed Architecture Experimental Result
Machine Num.   Tot. Messages   Workload Diff.
1              20916           0
2              43886           1404
4              87679           3575
8              177534          4271
16             348800          5269

Figure 12: The Distributed Architecture Experimental Result

Comparing the distributed architecture with the centralized architecture, we find that the workload difference of the distributed one is larger than that of the centralized one. For the centralized system, the minimal slot for managing the workload is one crawling operation, while for the distributed system it is one target user. We may crawl one user several times, so the slot of the centralized system is smaller and the workload is more balanced.

7. Conclusion And Future Work

To use the information in OSN effectively, it is necessary to obtain fresh OSN information. However, an OSN crawler is quite different from web crawlers, for the change rate is faster and the requirements on the latest messages are much more intensive. Therefore, traditional web crawling methods cannot be applied to OSN information crawling. To obtain fresh information from OSN under resource constraints, we classify users according to their behaviors, model their behaviors and propose crawling algorithms according to the models. Experimental results show that the Hash model is about 50% better than the Round-and-Robin method, and the Poisson process model is 12.14% better than the RR method with randomly selected users. The parallelization method effectively controls the workload difference of the machines in the crawler system. What's more, the centralized and distributed architectures both show a linear speed-up with the number of machines.

There are some possible future research directions. One is to crawl with different weights for different users, for the celebrities are more influential; how to define the weights for users to crawl optimally remains a challenge. Another direction is to obtain the hottest messages as early as possible; a study of information transmission is required so that we can predict the hot spots in OSN. As for the crawler system development, the centralized and distributed architectures may be tested with more machines and more users, even all the OSN users if possible.

References

[1] Facebook, , 2013.
[2] S. G. Set., http://snap.stanford.edu/data/, 2013.
[3] J. Leskovec, Social media analytics: tracking, modeling and predicting the flow of information through networks, in: Proceedings of the 20th International Conference Companion on World Wide Web, ACM, 2011, pp. 277–278.
[4] Spinn3r, , 2013.
[5] C. Byun, H. Lee, Y. Kim, Automated twitter data collecting tool for data mining in social network, in: Proceedings of the 2012 ACM Research in Applied Computation Symposium, ACM, 2012, pp. 76–79.
[6] T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes twitter users: real-time event detection by social sensors, in: Proceedings of the 19th International Conference on World Wide Web, ACM, 2010, pp. 851–860.
[7] E. Aramaki, S. Maskawa, M. Morita, Twitter catches the flu: detecting influenza epidemics using twitter, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 1568–1576.
[8] D. Denev, A. Mazeika, M. Spaniol, G. Weikum, SHARC: framework for quality-conscious web archiving, Proceedings of the VLDB Endowment 2 (2009) 586–597.
[9] C. Olston, S. Pandey, Recrawl scheduling based on information longevity, in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 437–446.
[10] Twitter, https://dev.twitter.com/docs/rate-limiting, 2013.
[11] M. Boanjak, E. Oliveira, J. Martins, E. Mendes Rodrigues, L. Sarmento, TwitterEcho: a distributed focused crawler to support open research with twitter data, in: Proceedings of the 21st International Conference Companion on World Wide Web, ACM, 2012, pp. 1233–1240.
[12] P. Noordhuis, M. Heijkoop, A. Lazovik, Mining twitter in the cloud: a case study, in: Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, IEEE, 2010, pp. 107–114.
[13] D. H. Chau, S. Pandit, S. Wang, C. Faloutsos, Parallel crawling for online social networks, in: Proceedings of the 16th International Conference on World Wide Web, ACM, 2007, pp. 1283–1284.
[14] G. Dziczkowski, L. Bougueroua, K. Wegrzyn-Wolska, Social network - an autonomous system designed for radio recommendation, in: Computational Aspects of Social Networks, 2009. CASON '09. International Conference on, IEEE, 2009, pp. 57–64.
[15] J. Cho, H. Garcia-Molina, L. Page, Efficient crawling through URL ordering, Computer Networks and ISDN Systems 30 (1998) 161–172.
[16] J. Cho, H. Garcia-Molina, Estimating frequency of change, ACM Transactions on Internet Technology (TOIT) 3 (2003) 256–290.
[17] C. Castillo, M. Marin, A. Rodriguez, R. Baeza-Yates, Scheduling algorithms for web crawling, in: WebMedia and LA-Web, 2004. Proceedings, IEEE, 2004, pp. 10–17.
[18] J. Cho, U. Schonfeld, RankMass crawler: a crawler with high personalized pagerank coverage guarantee, in: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, 2007, pp. 375–386.
[19] R. Baeza-Yates, A. Gionis, F. P. Junqueira, V. Murdock, V. Plachouras, F. Silvestri, Design trade-offs for search engine caching, ACM Transactions on the Web (TWEB) 2 (2008) 20.
[20] J. Cho, A. Ntoulas, Effective change detection using sampling, in: Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment, 2002, pp. 514–525.
[21] C. E. Leiserson, R. L. Rivest, C. Stein, T. H. Cormen, Introduction to Algorithms, The MIT Press, 2001.
[22] S. Mertens, The easiest hard problem: number partitioning, Computational Complexity and Statistical Physics 125 (2006) 125–140.
[23] M. R. Garey, D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, 1979.
[24] N. Y. Soma, P. Toth, An exact algorithm for the subset sum problem, European Journal of Operational Research 136 (2002) 57–66.
[25] Wikipedia, http://en.wikipedia.org/wiki/Sina_Weibo.