Sampling from Social Networks with Attributes
Claudia Wagner, Philipp Singer, Fariba Karimi, Jürgen Pfeffer, Markus Strohmaier
Claudia Wagner ∗, GESIS & U. of Koblenz-Landau
Philipp Singer ∗, GESIS & U. of Koblenz-Landau
Fariba Karimi, GESIS & U. of Koblenz-Landau
Jürgen Pfeffer, Technical University of Munich
Markus Strohmaier, GESIS & U. of Koblenz-Landau
ABSTRACT
Sampling from large networks represents a fundamental challenge for social network research. In this paper, we explore the sensitivity of different sampling techniques (node sampling, edge sampling, random walk sampling, and snowball sampling) on social networks with attributes. We consider the special case of networks (i) where we have one attribute with two values (e.g., male and female in the case of gender), (ii) where the size of the two groups is unequal (e.g., a male majority and a female minority), and (iii) where nodes with the same or different attribute value attract or repel each other (i.e., homophilic or heterophilic behavior). We evaluate the different sampling techniques with respect to conserving the position of nodes and the visibility of groups in such networks. Experiments are conducted both on synthetic and empirical social networks. Our results provide evidence that different network sampling techniques are highly sensitive with regard to capturing the expected centrality of nodes, and that their accuracy depends on relative group size differences and on the level of homophily that can be observed in the network. We conclude that uninformed sampling from social networks with attributes can thus significantly impair the ability of researchers to draw valid conclusions about the centrality of nodes and the visibility or invisibility of groups in social networks.
Keywords: social networks; sampling methods; sampling bias; homophily
1. INTRODUCTION
Sampling from large networks represents a fundamental problem for social network research. In order to draw valid conclusions from network samples, understanding how accurately samples reflect the position of nodes in the original network is essential. Previous research has studied the robustness of network samples from different angles, for example by examining the accuracy of network measures such as degree or betweenness centrality. A range of network properties has been found to be sensitive to the choice of sampling methods [4, 6, 11, 13, 15, 16, 18, 30].
∗ Both authors contributed equally to this work.
© WWW 2017, April 3–7, 2017, Perth, Australia. ACM 978-1-4503-4913-0/17/04. http://dx.doi.org/10.1145/3038912.3052665
Motivation and problem.
In this paper, we focus on the specific problem of sampling nodes and edges from a social network with attributes, i.e., a network where nodes are colored. For example, the color of nodes might be determined by gender, ethnicity, or age. We consider the special case of networks (i) where one binary attribute can be observed (e.g., a male and a female group of nodes), (ii) where the size of the two groups is unequal (e.g., a male majority and a female minority), and (iii) where nodes with the same or different attribute value attract or repel each other, i.e., homophilic [26] or heterophilic networks [3]. While the general impact of sampling on network characteristics has been studied thoroughly in the past [4, 6, 11, 13, 15, 16, 30], the role of attributes in combination with fundamental social mechanisms such as homophily [21, 27] has received only little attention so far [19]. In fact, little is known about whether or how different sampling techniques are able to conserve the ranking of nodes or the visibility of groups from the original network. Accurately capturing network characteristics of groups of nodes in sampled data, however, is crucial not only for researchers interested in directly studying these groups (e.g., gender or sociological studies), but also for researchers interested in analyzing the structure of the complete network, since attributes of actors can impact the overall network structure [5, 21, 27].
Research questions.
In this paper, we thus ask: How sensitive are different sampling techniques with respect to conserving the ranking of nodes and the visibility of groups in synthetic and empirical social networks with (i) different minority and majority group proportions, and (ii) various levels of homophily?
Methods and materials.
We evaluate different sampling techniques (node sampling, edge sampling, random walk sampling, and snowball sampling) with respect to reflecting the ranking of nodes and the visibility of groups in network samples (see Figure 1). Instead of putting the focus on the whole population as in previous work, we specifically focus on sub-populations (or groups); we call the larger group majority and the smaller group minority. Our work is guided by the intuition that an ideal sample would accurately preserve the original degree centrality ranking of nodes, and therefore preserve the relative importance between nodes and groups. That means an ideal sample would not systematically rank nodes of one group higher and nodes of the other group lower than expected. This would be considered a biased sample or sampling error.
Figure 1: Illustration. This example shows a heterophilic and a homophilic network with a red minority and a blue majority group. We illustrate that sampling methods may differ in their ability to preserve the visibility of the minority group when ranking sampled nodes by their degree centrality. (Panels: node sample, edge sample, random walk, and snowball, each with the resulting degree centrality ranking.)
We construct synthetic social networks and vary the structural mechanisms guiding the growth of the network (i.e., homophily, preferential attachment, and group sizes) to study the extent to which they impact the accuracy of samples. We additionally showcase observed artifacts on empirical networks. Based on the obtained insights, we provide indicators of why samples might have issues with capturing expected group characteristics.
Contributions and Findings. (i) We propose a method to measure the robustness of samples from networks with a binary attribute. (ii) Using synthetic and empirical networks, we provide evidence that different network sampling techniques have issues with capturing the expected centrality of nodes and the visibility of minority / majority groups in social networks. (iii) We discuss network characteristics that lead to observed discrepancies and quantify the impact of relative group size differences and homophily on sampling errors.
2. BACKGROUND AND RELATED WORK
Network analysis has long been plagued by issues of measurement error, usually in the form of missing data. Understanding the robustness of basic network measures is extremely important in order to assess the validity of network research. Prior research explored the impact of missing data on various network measures, but mainly focused on small sociometric networks [6, 11], small bipartite collaboration graphs [15], and random networks [4, 15].
Smith and Moody [28] extended this line of research and analyzed four classes of network measures on 12 relatively small ( <
3. METHODS
In this work, we are interested in studying the accuracy of samples drawn from networks with unequally sized groups and various levels of homophily. We (i) describe the sampling techniques used and (ii) explain how we assess the accuracy of a sample.
Our goal is to sample K nodes from the overall set of N nodes in a network. As pointed out in [18], we can split sampling algorithms into three groups: methods based on randomly selecting nodes, randomly selecting edges, and exploration techniques simulating random walks or virus propagation to find a representative sample of nodes. We focus on the following techniques from these groups:
Random node sampling.
This is the most basic sampling technique, where a random subset of K nodes is selected. The sampled network then contains these K nodes and all links between them. Random node sampling is used, e.g., when a sample of individuals is first selected and then their contact behavior is observed. Numerous surveys and data collections use this method, e.g., measuring contact patterns among high school students using wearable sensors [20].
Random edge sampling.
This strategy randomly samples edges from the network and filters the complete network by sampled edges. To be consistent with the other sampling strategies, we successively sample edges until K nodes are selected. The sampled network then contains these K nodes and the sampled links, but not those links between selected nodes that have not been sampled. Random edge sampling is commonly used to construct a social graph from information about contacts; e.g., phone calls are sampled and a graph of callers and receivers is constructed [12].
Snowball sampling.
In snowball sampling, we randomly sample one starting node and add all its neighbors as well as the neighbors' neighbors to the set of sampled nodes, i.e., two-step snowball sampling. We repeat this until we have gathered K nodes for the sample. If a full iteration does not catch K nodes, we repeat the process with a new randomly selected starting node. The sampled network then contains these K nodes and all the links connecting them. Traditionally, snowball sampling is used when the population under study is not easily accessible (e.g., to study homeless people or illegal immigrants). Indeed, the promise of snowball sampling is to access hard-to-reach populations [1].
Random walk (RW) sampling.
This strategy samples nodes by walking through the network. The walker starts at a random node in the network and in each step randomly chooses one outgoing link and traverses it. All visited nodes are added to the sample until K nodes have been added. A teleport probability can be set for teleporting to another random node in the network instead of traversing a link in this iteration; we use 0.15 throughout this work. The sampled network then contains these K nodes and all links between them. This sampling technique is usually used in online social networks such as Facebook or Twitter, in which retrieving information about the whole population is overwhelming and computationally costly, but we can access and navigate the original network.

The ubiquity of sampled network data makes the understanding of the robustness of network measures crucial. Here, we focus on the most basic and widely used centrality measure: degree centrality [10]. The degree centrality of a node is defined as the fraction of nodes it is connected to.
Previous work explored the robustness of centrality measures in samples of networks without taking heterogeneous attributes of nodes into account. Therefore, simple rank correlation (see e.g., [6, 16, 28, 30]) and overlap measures (see e.g., [4]) have been used to assess how well a sample captures the ranking of nodes according to various network measures. In this work, we are interested in assessing how well a sample captures, on average, the overall position of nodes in the original network for each group of nodes separately. That means we aim to reveal whether the positions of nodes in both groups are equally well captured, in a way that the relative group and node importance are preserved.
If we computed the overall rank correlation (or overlap) between the two lists and ignored the group memberships, then the ranking of majority nodes would contribute more to the correlation coefficient (or overlap). A naive group-specific measure would be to compute a separate rank correlation (or overlap) for each group. However, this measure would only allow us to assess how well the relative importance of nodes within each group in the original network is preserved in the sample, but the relation between nodes across groups would be neglected.
Therefore, simple rank correlation or overlap measures cannot be used to assess whether the relevance of nodes and groups is accurately captured in a sample.
In this work we define an ideal sample as a sample that allows us to accurately reconstruct the original degree centrality ranking of nodes and therefore preserves the relative importance between nodes and groups. That means an ideal sample does not systematically rank nodes of one group higher and nodes of the other group lower than expected. To assess the accuracy of the relative importance of nodes and groups, we propose the following two evaluation measures. Both evaluation measures focus on the top k or top k percent of the data, since (i) users focus on the first few results in ranked lists and (ii) the distributions of degree centralities are usually heavy-tailed. Therefore, the contribution of disorders in the long tail (unpopular nodes) would dominate disorders in the head (popular nodes) if we did not limit our analysis to the head [32].
Top k bias.
To assess the accuracy of group visibility in a sample, we compare the fraction of minority nodes in the top k nodes of a sample with its fraction in the top k nodes of the complete network.

$$\mathrm{bias}_{topk} = \mathrm{expected}_{topk} - \mathrm{observed}_{topk} \tag{1}$$

Here, $\mathrm{observed}_{topk}$ refers to the fraction of minority nodes that we observe in the top k nodes of the sample, while $\mathrm{expected}_{topk}$ refers to the fraction of minority nodes in the top k nodes of the original network. As sample size grows, the observed fraction in the sample approaches the expected fraction.

Figure 2: Degree distribution of synthetic networks. The average degree distribution of majority (80% of nodes) and minority (20% of nodes) in a synthetically generated preferential attachment network with various levels of homophily. One can see that the degree distributions are almost equal if homophily does not play a role (h = 0.5). In heterophilic networks (h < 0.5) the group-specific differences are much more pronounced than in homophilic networks (h > 0.5).

Normalized Cumulative Group Relevance (nCGR).
The top k ratio is a binary measure that does not take the importance of individual nodes into account. That means we cannot measure how much lower the ranking of a node is in the sample compared to its ranking in the complete network. To overcome this limitation, we first compute the relevance for each node i by ranking nodes based on their centrality in the original network. The relevance of node i is defined as the inverse rank that belongs to node i, normalized by the rank sum of all nodes (N) in the original network:

$$rel_i = \frac{\mathrm{inv\_rank}_i}{\sum_{j=1}^{N} \mathrm{rank}_j} \tag{2}$$

The relevance shrinks linearly with the position of nodes in the list, but different weighting is possible. We compute for each group g its cumulative group relevance (CGR) at rank k in the original ranked list and compare it with the cumulative relevance at rank k in the sample:

$$CGR_{topk} = \sum_{j=1}^{k} rel_{j \in g} \tag{3}$$

$$nCGR_{topk} = \frac{CGR_{topk}(\mathrm{sample}) + \epsilon}{CGR_{topk}(\mathrm{original}) + \epsilon} \tag{4}$$

The $nCGR_{topk}$ measures the extent to which the relevance of a group in the sample is above or below what we would expect from the original network with respect to the top k nodes. If, e.g., this normalized cumulative group relevance for the minority is 2, then the minority is twice as relevant in the sample as in the original network (for some top k). If it is 0.5, then the group is half as relevant in the sample as in the original network. If it is 1, then the group has equal relevance in the original network and the sample. We analyze the log of the normalized cumulative relevance since otherwise the measure is bounded by zero; thus, the ideal log nCGR is zero. To avoid division by zero and the logarithm of zero, we add a small $\epsilon$.
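The two evaluation measures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the inverse rank is taken to be N - rank + 1 (one reading of "shrinks linearly"), the sample's cumulative relevance reuses the relevance values computed on the original ranking, ties in degree are broken by node identifier, and the default `eps` is a placeholder value. All function names are hypothetical.

```python
import math

def rank_by_degree(adj):
    """Rank nodes by degree, highest first (ties broken by node id, an assumption)."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))

def top_k_minority_fraction(ranking, group, k, minority="minority"):
    """Fraction of minority nodes among the first k entries of a ranking."""
    top = ranking[:k]
    return sum(group[u] == minority for u in top) / len(top)

def bias_top_k(original_ranking, sample_ranking, group, k, minority="minority"):
    """Eq. 1: expected fraction (original network) minus observed fraction (sample)."""
    expected = top_k_minority_fraction(original_ranking, group, k, minority)
    observed = top_k_minority_fraction(sample_ranking, group, k, minority)
    return expected - observed

def relevance(original_ranking):
    """Eq. 2, assuming inverse rank = N - rank + 1 and rank sum = N(N+1)/2."""
    n = len(original_ranking)
    rank_sum = n * (n + 1) / 2
    return {u: (n - pos) / rank_sum for pos, u in enumerate(original_ranking)}

def cgr_top_k(ranking, rel, group, g, k):
    """Eq. 3: summed relevance of the members of group g within the top k."""
    return sum(rel[u] for u in ranking[:k] if group[u] == g)

def log_ncgr_top_k(original_ranking, sample_ranking, group, g, k, eps=1e-6):
    """Log of Eq. 4; eps is a placeholder smoothing constant."""
    rel = relevance(original_ranking)
    in_sample = cgr_top_k(sample_ranking, rel, group, g, k)
    in_original = cgr_top_k(original_ranking, rel, group, g, k)
    return math.log((in_sample + eps) / (in_original + eps))
```

A value of `log_ncgr_top_k` above zero means the group is more relevant in the sample's top k than in the original network's top k; a value below zero means it is less relevant.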
4. SIMULATION EXPERIMENTS
We construct synthetic networks and explore the effect of homophily and group size on the accuracy of samples in a controlled environment. First, we describe the network model which we use to create synthetic network data and second, we discuss the accuracy of centrality measures in samples drawn from these networks using different sampling methods.
Preferential attachment (the tendency of nodes to connect to popular nodes) [2, 33] and homophily (the tendency of nodes to connect to similar nodes) [21, 27] have been extensively observed in many real-world social networks [7, 9, 23, 31] and information networks [22, 24]. Homophily implies the existence of at least one fixed or mutable attribute (e.g., gender, ethnicity, education status). Based on these attributes, similarities between nodes can be defined.
We use an existing preferential attachment growth model with a homophily parameter that can be tuned and thus allows us to create networks with different levels of homophily and heterophily (see [8, 14] for details). The homophily parameter h ranges from 0 to 1, h ∈ [0, 1]. […] h, because they share the same attribute value and thus have the same distance to other groups with different attribute values. We generate all synthetic networks with 10,000 nodes and a fixed minority ratio of 20% (except when noted otherwise). An incoming node connects to 10 nodes based on a specific homophily parameter and popularity (see [14]).
Figure 2 shows the degree distribution of both groups of nodes in networks that only vary in their degree of homophily. One can see that if we have two groups of unequal size and the network is heterophilic (h < 0.5) […] h = 0 . […] h = 1 . […]

Figure 3: Accuracy of group visibility in top-100. Panels: (a) Node sampling, (b) Edge sampling, (c) RW sampling, (d) Snowball sampling. The y-axis visualizes the average percentage of minority nodes that show up in the top 100 nodes ranked by degree centrality computed on samples of different size. The x-axis depicts the respective sampling size and the last point 1.0 refers to the original network (see Eq. 1). The lines refer to different homophily parameters that were used to generate the original network. Different subplots refer to different sampling techniques. Overall, each point depicts the mean of 100 simulation runs based on 10 random network generation steps each having 10 sample steps; error bars mostly fall within the markers. One can see that in samples drawn from extreme and moderate heterophilic networks ( . ≤ h ≤ . ), the visibility of the minority is underestimated in small samples compared to what one would expect from the original network where sample size = 1.0.

If we compare the degree distribution of the two groups in a moderate heterophilic network with h = 0.25 and a moderate homophilic network with h = 0.75, we see that the differences between the degree distributions are more pronounced in the heterophilic case. This asymmetric effect can be explained by the interplay between group size differences and homophily. The majority benefits from moderate homophily (e.g., h = 0.75) more than from high homophily (e.g., h = 0 . […] h = 0.0) than from moderate heterophily (e.g., h = 0 . […] for the minority to gain popularity, it is better if they do not have to compete with the majority, while the majority benefits from a competitive environment. In the next section, we will analyze how these group-specific differences in the degree distributions relate to sample biases.
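To make the growth process described at the start of this section concrete, the following sketch grows a network in which each incoming node attaches to existing nodes with probability proportional to homophily times popularity. The details are assumptions, since the text defers to [8, 14] for the exact model: the homophily weight is taken to be h for same-group pairs and 1 - h for cross-group pairs, popularity is degree + 1 (so that isolated nodes can still be chosen), and group membership is drawn independently per node. The function name and parameters are hypothetical.

```python
import random

def homophilic_pa_network(n, m, minority_fraction, h, rng):
    """Grow an undirected network of n nodes; each new node links to up to m others.

    Returns (adj, group): adjacency sets and a node -> group label mapping.
    Assumed attachment weight: (h if same group else 1 - h) * (degree + 1).
    """
    adj, group = {}, {}
    for new in range(n):
        group[new] = "minority" if rng.random() < minority_fraction else "majority"
        adj[new] = set()
        candidates = list(range(new))  # all previously added nodes
        weights = [
            (h if group[u] == group[new] else 1.0 - h) * (len(adj[u]) + 1)
            for u in candidates
        ]
        for _ in range(min(m, len(candidates))):
            if sum(weights) <= 0:  # no admissible target left for this node
                break
            idx = rng.choices(range(len(candidates)), weights=weights)[0]
            target = candidates[idx]
            adj[new].add(target)
            adj[target].add(new)
            weights[idx] = 0.0  # never link to the same target twice
    return adj, group

# Example: a small, moderately heterophilic network with a 20% minority.
adj, group = homophilic_pa_network(500, 10, 0.2, 0.25, random.Random(42))
```

The loop is quadratic in the number of nodes, which is acceptable for small illustrations but slow at the 10,000-node scale used in the experiments.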
To assess sample bias, we generate synthetic networks, draw samples of varying size from them using different sampling techniques, and assess the average visibility and relevance of different groups in samples. We repeat the random network generation process 10 times and draw 10 samples from each network; thus, in our evaluation, we report mean and standard error over 100 samples.
Figure 3 shows the visibility of the minority group in the top 100 nodes in samples of different size which have been created via different sampling methods. For example, in Figure 3 (a), the point for the green line at an x-value of 0.10 indicates that the top 100 ranked nodes based on degree centrality in a 10% sample from a moderate heterophilic network with h = 0.25 contain on average around 40% minority nodes. We can compare this observed percentage with the expected percentage from the original network (100% sample). In this case, we would expect to see close to 80% of minority nodes in the top 100 nodes, indicating that the minority is underrepresented in small samples drawn from moderate heterophilic networks with unbalanced group sizes using node sampling.
Results show that especially node and snowball sampling reduce the visibility of minority groups in the top k list if samples are drawn from extreme and moderate heterophilic networks. For node sampling, this is not surprising since all nodes have equal probability to be picked and therefore the probability of picking a node from a given group is proportional to that group's size. Snowball samples aggregate the 2-hop neighbourhood of randomly selected seed nodes, which are likely majority nodes. Since most majority nodes are unpopular (skewed degree distribution), the probability of picking a majority node that has only a few minority nodes as neighbours is high. Thus, we underestimate the visibility of the minority group in the top k. Figure 4 shows that in the heterophilic network, the bias of node and snowball samples decreases linearly with decreasing group size difference. Note that group sizes are balanced if the minority ratio is 0.5. We further find that RW samples are very robust against relative size differences between groups in homophilic and heterophilic networks.
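The four sampling strategies compared above (defined in Section 3) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes an undirected network without self-loops stored as a dict of adjacency sets with sortable node identifiers, and K no larger than the number of nodes. Two details are assumptions of the sketch: edge sampling skips an edge that would push the sample beyond K nodes, and snowball sampling truncates the last wave at random once K nodes are reached.

```python
import random

def induced_edges(adj, nodes):
    """All links of the original network whose endpoints are both in `nodes`."""
    nodes = set(nodes)
    return {frozenset((u, v)) for u in nodes for v in adj[u] if v in nodes}

def node_sample(adj, k, rng):
    """K uniformly chosen nodes plus all links between them."""
    nodes = set(rng.sample(sorted(adj), k))
    return nodes, induced_edges(adj, nodes)

def edge_sample(adj, k, rng):
    """Draw edges in random order until K nodes are covered; keep only drawn edges."""
    edges = sorted(tuple(sorted(e)) for e in induced_edges(adj, adj))
    rng.shuffle(edges)
    nodes, kept = set(), set()
    for u, v in edges:
        if len(nodes | {u, v}) > k:  # assumption: skip edges that would overshoot K
            continue
        nodes |= {u, v}
        kept.add(frozenset((u, v)))
        if len(nodes) == k:
            break
    return nodes, kept

def snowball_sample(adj, k, rng, steps=2):
    """Seed, its neighbours and their neighbours; restart from a new seed if needed."""
    all_nodes, sampled = sorted(adj), set()
    while len(sampled) < k:
        seed = rng.choice(all_nodes)
        wave, seen = [seed], {seed}
        for _ in range(steps + 1):
            for u in wave:
                if len(sampled) < k:
                    sampled.add(u)
            wave = sorted({v for u in wave for v in adj[u]} - seen)
            rng.shuffle(wave)  # random truncation of the final wave
            seen.update(wave)
    return sampled, induced_edges(adj, sampled)

def random_walk_sample(adj, k, rng, teleport=0.15):
    """Random walk with teleportation; visited nodes plus all links between them."""
    all_nodes = sorted(adj)
    current = rng.choice(all_nodes)
    visited = {current}
    while len(visited) < k:
        if rng.random() < teleport or not adj[current]:
            current = rng.choice(all_nodes)  # jump to a random node
        else:
            current = rng.choice(sorted(adj[current]))
        visited.add(current)
    return visited, induced_edges(adj, visited)
```

Each function returns the sampled node set and the sampled edge set, so a degree ranking can be computed on the sample in the same way as on the original network. Note that only edge sampling omits links between sampled nodes that were not themselves drawn.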
In Figure 5 we show to what extent the original relevance of each group is preserved in the sample. We find that in most cases the relevance of the minority is underestimated. Only in moderate homophilic networks is the minority over-represented. However, one needs to note that the extent to which the relevance of the minority is overestimated in moderate homophilic networks (h = 0.75, 4th row) is lower than the extent to which the relevance of the majority is overestimated in moderate heterophilic networks (h = 0.25). The sampling error is always higher in heterophilic networks than in homophilic networks if the same sampling technique and group size differences are considered.

Figure 4: Relative group size differences. Panels: (a) Heterophilic (h = 0.25), (b) Neutral (h = 0.5), (c) Homophilic (h = 0. […]). The y-axis shows the relevance of the minority group in the top 100 nodes of the sample network compared to the original network. The x-axis shows the relative size of the minority group. The sample size is 10% of the original network. One can see that in samples drawn from heterophilic networks, the relevance of the minority is always underestimated; especially node and snowball sampling fail when group size differences are large in heterophilic networks. In homophilic networks the relevance of the minority is overestimated if the fraction of the minority group is very low. Node and edge sampling produce the most biased samples in this condition. Overall, we see that the more balanced the group sizes are (0.5 means that 50% of the nodes belong to the minority), the more accurate the sample and the more similar the performance of different sampling techniques. RW sampling performs best in all conditions and sampling errors are always higher in heterophilic networks than in homophilic ones.
Regression analysis.
To compare the impact of different factors on the sampling bias, we fit eight simple linear regression models, one model for each sampling technique and each error measure (top k minority bias $\mathrm{bias}_{topk}$ and the absolute sum of the normalized cumulative group relevance $nCGR_{topk}$ of the minority and the majority group). Each model was fitted to 3,200 observations (samples drawn from synthetically generated networks). Table 1 shows that across all sampling methods, perhaps not surprisingly, smaller samples lead to higher sampling errors and larger top k lists lead to higher errors because the size of the network is constant. Interestingly, we see that only for node and snowball samples, the sampling error increases if group size differences and the influence of the attribute on the edge formation behavior (i.e., the homophily parameter is closer to 0 or 1) increase. If only one of these factors changes, no significant effects on the sampling error can be observed, except for snowball samples. The bias of snowball samples also increases significantly if only homophily increases, because in extreme homophilic networks a snowball sample can only contain nodes of one group, even if groups are of equal size. One can see that the sampling error of RW and edge samples cannot be explained by group size differences and homophily, which confirms our observation that these methods are rather robust against these factors.
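One plausible way to write down each of the eight models, reading the predictors off the description above and the row labels of Table 1, is the following. The exact specification is an assumption; in particular, measuring attribute influence as the distance of h from the neutral value, $a = |h - 0.5|$, is only one possible operationalization of "the homophily parameter is closer to 0 or 1".

$$\mathrm{error} = \beta_0 + \beta_1\, a + \beta_2\, d + \beta_3\,(a \cdot d) + \beta_4\, s + \beta_5\, k + \xi$$

Here $\mathrm{error}$ is one of the two error measures, $a$ is the attribute influence, $d$ is the group size difference, $s$ is the sample size, $k$ is the length of the top k list, and $\xi$ is the residual. Under this reading, the interaction coefficient $\beta_3$ captures the case where group size difference and attribute influence increase together, while $\beta_1$ and $\beta_2$ capture a change in only one of the two factors.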
5. EMPIRICAL EXPERIMENTS
Next, we analyse two empirical networks and explore the accuracy of samples drawn from these networks. We describe the statistical properties of these networks and contrast empirical findings with the findings obtained from simulation.
Dataset.
We study publicly available data obtained from the most popular Slovakian social network "Pokec" [29]. We added all friendship relations as undirected edges. The network contains 1 , , 640 nodes (users) and 22 , , 602 edges (friendship relations). The average degree of nodes is 27 . […] , , 314 nodes connected by 14 , , 771 edges. For coloring nodes as minority and majority, we take the 80th percentile of the overall age distribution, and color all nodes with an age higher than this percentile as belonging to the minority (old users), and all below as belonging to the majority. This results in an age cut-off of 31 years, meaning that the minority (18.8% of all nodes) captures the oldest users in the network. Overall, around 92% of all edges in the network are between nodes of the same color, i.e., between two minority or two majority nodes. This exceeds the expectation of around 81.3% if edges formed totally at random. From that we can assert that the Pokec social network is moderately homophilic with respect to the defined age groups. Figure 6 shows the degree distribution of young and old users. One can see that the most popular users are part of the majority.
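The comparison between the observed share of same-color edges and a random baseline can be sketched as follows. The text does not state how its random expectation was computed, so the baseline here is an assumption: each edge endpoint is drawn independently in proportion to the share of edge endpoints held by each group. Function names are hypothetical and the toy data is for illustration only.

```python
from collections import Counter

def same_group_edge_fraction(edges, group):
    """Observed fraction of edges whose two endpoints have the same color."""
    same = sum(group[u] == group[v] for u, v in edges)
    return same / len(edges)

def expected_same_group_fraction(edges, group):
    """Assumed baseline: endpoints drawn independently by each group's endpoint share."""
    endpoints = Counter()
    for u, v in edges:
        endpoints[group[u]] += 1
        endpoints[group[v]] += 1
    total = sum(endpoints.values())
    return sum((count / total) ** 2 for count in endpoints.values())

# Toy example (not the Pokec data): three majority and two minority nodes.
toy_group = {1: "majority", 2: "majority", 3: "majority", 4: "minority", 5: "minority"}
toy_edges = [(1, 2), (2, 3), (3, 4), (1, 5)]
observed = same_group_edge_fraction(toy_edges, toy_group)
baseline = expected_same_group_fraction(toy_edges, toy_group)
```

An observed fraction above the baseline points towards homophily and a fraction below it towards heterophily; in the toy example the observed value is lower than the baseline.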
Results.
Figure 7 shows that the visibility of the minority and the relevance of both groups are very well preserved in all samples. This is in line with what our model suggests for very homophilic networks (see Figure 7). Interestingly, random walk sampling produces the most accurate sample, which is also suggested by our model, especially for large relative group size differences (see Figure 4(c)).
Dataset.
We use a network of claimed sexual contacts between Brazilian escorts (prostitutes) and sex buyers [25]. The network consists of 16,730 nodes (6,624 sex workers and 10,106 sex buyers) and 50,632 edges between them. The minority of nodes, with a share of around 40%, are sex workers, while the majority are sex buyers. The network is fully bipartite, meaning that sex workers only connect with sex buyers to capture sexual contacts. Consequently, all edges within the network are between nodes of different color and thus, the network is 100% heterophilic. The degree distributions of minorities and majorities show that minorities are more popular than majorities (see Figure 6). This is not surprising because the network is an example of an extreme heterophilic network, since the majority nodes are attracted by the minority nodes and the other way around.

https://snap.stanford.edu/data/soc-pokec.html

Figure 5: Normalized Cumulative Group Relevance. Each column depicts a different sampling technique, while each row refers to a different world for which the homophily level of the original network varies. The axes are aligned within each row, but not within each column, since the extent of error varies depending on the world. Again, each point refers to an average evaluation over all iterations. One can see that in extreme heterophilic networks (first row) the relevance of the majority is overestimated in small and also in larger samples, while the relevance of the minority is slightly underestimated, especially in small samples. In extreme homophilic networks (last row), it is the other way around; however, the extent to which the relevance of the minority is overestimated is smaller than the extent to which the relevance of the majority is overestimated in the extreme heterophilic case. Overall, random walk sampling produces the most accurate samples, followed by edge sampling. In samples based on node and snowball sampling, the relevance of the minority is usually underestimated, except in moderate homophilic networks (4th row).

Figure 6: Degree distribution of empirical social networks (Pokec and Sexworker). In the homophilic Pokec social network, nodes with the highest degree tend to belong to the majority (young users). For the heterophilic sexworker network, the most popular nodes belong to the minority (women) since the majority (men) is attracted by the minority and the other way around.
Results.
Figure 7 shows that the minority (escorts) are very visible in the top 100 nodes ranked by degree centrality, also in samples of small size. Node-based samples are the most inaccurate samples, since they underestimate the visibility and relevance of the minority most. Edge-based samples capture the visibility of the minority in the original network best if the original network is extremely heterophilic. Our model suggests that no large differences in the performance of different sampling techniques (as suggested by Figure 4) will exist because group size differences are rather small (40:60); but edge and RW sampling will produce more accurate samples than node and snowball sampling. Further, we can expect that all samples will underestimate the relevance of the minority. These expectations are confirmed empirically (cf. Figure 7, bottom row).
6. DISCUSSION
If homophily (or heterophily) is the driving force behindthe formation of edges in social networks with unbalancedattribute distributions, then the attribute and the degree ofnodes become statistical dependent, i.e., P ( attribute | degree ) (cid:54) = P ( attribute ) P ( degree ) and P ( degree | attribute ) (cid:54) = P ( degree ) P ( attribute ). Our workshows that if a statistical dependency between the networkstructure and the attribute of interest exists, all samplingmethods introduce bias w.r.t. capturing the importance ofnodes compared to when no relationship exists. However,not all sampling techniques are equally prone to group sizedifferences and attribute influence on edge formation behav-ior which lead to statistical dependency between the networkstructure and the attribute of interest. While sampling er-rors in node and snowball samples clearly increase if groupsize differences and attribute influence are increased, randomwalk and edge sampling are more robust against these factors.This can be explained by the fact that e.g., random walkand edge sampling favor high degree nodes and aim to pre-serve the degree distribution of nodes. Therefore, systematicdifferences in the degree of nodes in different groups can, tosome extent, be captured. The sampling error in snowballsamples also increases, if only the influence of attributes onthe edge selection behavior increases (see Table 1). Thisindicates, that even if group sizes are balanced, homophilyor heterophily may cause problems in snowball samples.Interestingly, the overestimation of the importance of a ma-jority in heterophilic networks is more pronounced than theoverestimation of the importance of minorities in homophilicnetworks. This can be explained by an asymmetry in thedifferences in degree distributions. 
In heterophilic networks, the difference between minority and majority degree distributions is larger than in a comparable homophilic network (same group sizes and similar impact of group membership on the formation of edges). Our observations from two real-world social networks confirm our simulation results and show that in heterophilic networks, the relevance of majority nodes is overestimated, while in homophilic networks, it is slightly underestimated.

Table 1: Coefficients of eight linear regression models, one for each sampling technique and sampling error measure. Each model was fitted to 3,200 observations (samples drawn from synthetically generated networks). The interaction term between group size difference and attribute influence is significant in node and snowball samples, but not in RW and edge samples. This indicates that the sampling error increases in node and snowball samples if the group size difference and the influence of attributes on the edge formation behavior are both increased. Edge and RW samples are rather robust against these factors. We compute the sampling error for lists of different length k and control for the effect of k in the model. The larger k, the higher the error. We also observe on average larger sampling errors for smaller sample sizes. Note: ∗∗ p < . ; ∗∗∗ p < . .

Columns (nCGR topk, bias topk for each technique): node sampling, snowball sampling, RW sampling, edge sampling
Intercept: 0.4064 ∗∗∗ ∗∗ ∗∗∗ ∗∗ -0.0905 0.3491 0.0365 0.1153 -0.0506
grp. size diff: 0.0245 -0.0602 -0.4272 -0.1106 0.0873 0.0277 0.0395 -0.0403
attr. infl. : grp. size diff: 1.6851 0.5817 ∗∗ ∗∗∗ ∗∗∗ ∗∗∗ -0.1233 ∗∗∗ -1.1096 ∗∗∗ -0.1278 ∗∗∗ -0.5488 ∗∗∗ -0.0796 ∗∗∗ -0.5131 ∗∗∗ -0.0659 ∗∗∗
top k: ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗
R²: 0.281 0.135 0.418 0.428 0.225 0.152 0.275 0.135

Figure 7: Normalized Cumulative Group Relevance in empirical networks. Panels: (a) Pokec node sample, (b) Pokec edge sample, (c) Pokec RW sample, (d) Pokec snowball sample, (e) Sexworker node sample, (f) Sexworker edge sample, (g) Sexworker RW sample, (h) Sexworker snowball sample. The first row refers to the Pokec network, the second to the Sexworker network. In samples drawn from the Pokec network, visibility and relevance of the minority correspond to what one would expect from the original network. Only for small top k is the minority slightly more visible than expected if samples are generated via node, snowball, or edge sampling. In samples drawn from the Sexworker network, we see that the minority (escorts) is very visible in the top 100 nodes ranked by degree centrality. Edge-based samples capture visibility of the minority in the original network best. The relevance of the majority is overestimated, as also suggested by our model. Edge sampling produces the most accurate samples.

One limitation of our network generation model is that it is restricted to two groups and assumes that all nodes in a group are equally active and behave equally homophilic or heterophilic. In real-world social networks, more groups as well as group-specific and individual behavioral differences can be present. Future research is necessary to study the effect of group-specific activity differences and asymmetric homophilic behavior, and to explore the presence of multiple groups. Furthermore, we focus on one specific network measure and on undirected networks, which warrants further exploration of the accuracy of various network measures in samples drawn from directed networks. Our work can be extended to more than one binary attribute by simply defining a similarity function that takes several attributes into account.
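One possible form of such a similarity function is sketched below in Python. This is an illustration under our own assumptions and not part of the model evaluated in this paper: `attribute_similarity` uses simple matching over binary attribute vectors, and `link_weight` is a hypothetical attachment weight that scales the degree of the target node by a homophily parameter `h` between 0 and 1.

```python
def attribute_similarity(a, b):
    """Share of attributes on which two nodes agree (simple matching)."""
    if len(a) != len(b) or not a:
        raise ValueError("attribute vectors must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def link_weight(a, b, degree_b, h):
    """Hypothetical attachment weight for linking a node with attributes `a`
    to a target with attributes `b` and degree `degree_b`.

    The homophily term interpolates between h (all attributes equal)
    and 1 - h (no attribute equal); it is an assumption of this sketch.
    """
    s = attribute_similarity(a, b)
    return (h * s + (1 - h) * (1 - s)) * degree_b


# Two binary attributes per node (hypothetical example values).
alice = (1, 0)
bob = (1, 1)
print(attribute_similarity(alice, bob))        # agree on one of two attributes
print(link_weight(alice, bob, degree_b=4, h=0.8))
```

With a single binary attribute the similarity is either 0 or 1, so the weight reduces to `h * degree_b` for nodes of the same group and `(1 - h) * degree_b` for nodes of different groups.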
7. CONCLUSIONS
In summary, our work shows that the combination of two factors leads to sampling error in social networks with attributes: (i) group size differences and (ii) homophily. If unequally sized groups are present, random walk sampling always leads to the most accurate samples, independent of the level of homophily. The sampling error is always larger if samples are drawn from heterophilic networks with unequally sized groups than from homophilic ones. In heterophilic networks with unbalanced groups, random walk and edge sampling perform similarly well, while in homophilic networks edge sampling produces more biased samples than random walk sampling. This can be explained by the fact that in homophilic networks edge sampling overestimates the importance of minority nodes, since minority nodes with high degree are more likely to be selected. Edge samples only include the sampled edges, but not all other edges between the selected nodes. Therefore, the difference in degree between minority and majority nodes can be skewed. Most sampling techniques produce accurate samples if the groups are of equal size. Only snowball samples can also be biased if homophily is a driving force behind the edge formation of nodes that belong to two equally sized groups.

Since researchers often do not have information about group size differences and homophily in the original network, random walk sampling is a robust choice. However, researchers cannot always choose their sampling method freely. Therefore, our results provide important guidance in estimating which groups will be over- or underestimated in samples drawn from social networks with unequally sized groups and various levels of homophily. It is our hope that the research presented in this paper motivates more research into sampling from social networks with attributes.
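The remark above that edge samples contain only the sampled edges, and not all edges among the selected nodes, can be illustrated with a minimal Python sketch. The function names and the toy graph are hypothetical, and uniform edge sampling without replacement is an assumption of this sketch rather than a description of the exact procedure used in our experiments.

```python
import random
from collections import Counter


def edge_sample(edges, n_edges, seed=0):
    """Draw n_edges edges uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(edges, n_edges)


def induced_edges(edges, nodes):
    """All original edges whose two endpoints both lie in `nodes`."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


def endpoint_probability(edges):
    """Probability that a node is an endpoint of one uniformly drawn edge.

    For a simple undirected graph this equals degree / number of edges,
    which is why edge sampling favors high-degree nodes.
    """
    degree = Counter(n for edge in edges for n in edge)
    return {n: d / len(edges) for n, d in degree.items()}


# Hypothetical toy graph (undirected, no self-loops or duplicate edges).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]

sample = edge_sample(edges, n_edges=3)
sampled_nodes = {n for edge in sample for n in edge}
induced = induced_edges(edges, sampled_nodes)

# Edges between sampled nodes that the edge sample itself does not contain.
missing = len(induced) - len(sample)
probs = endpoint_probability(edges)
print(sample, missing, probs)
```

Because the edge sample is always a subset of the induced subgraph on the same nodes, node degrees measured in the sample can only be equal to or lower than in the induced subgraph, which is one way in which degree differences between groups can become skewed.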
Acknowledgements.
We want to thank Robert West for valuable discussions and input to this work.

REFERENCES
[1] R. Atkinson and J. Flint. Accessing hidden and hard-to-reach populations: Snowball research strategies. Social Research Update, 33(1):1–4, 2001.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] P. S. Bearman, J. Moody, and K. Stovel. Chains of affection: The structure of adolescent romantic and sexual networks. American Journal of Sociology, 110(1):44–91, 2004.
[4] S. P. Borgatti, K. Carley, and D. Krackhardt. Robustness of centrality measures under conditions of imperfect data. Social Networks, 28(1):124–136, 2006.
[5] M. B. Brewer. In-group bias in the minimal intergroup situation: A cognitive-motivational analysis. Psychological Bulletin, 86(2):307–324, 1979.
[6] E. Costenbader and T. W. Valente. The stability of centrality measures when networks are sampled. Social Networks, 25(4):283–307, Oct. 2003.
[7] D. Crandall, D. Cosley, D. Huttenlocher, J. Kleinberg, and S. Suri. Feedback effects between similarity and social influence in online communities. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 160–168, New York, NY, USA, 2008. ACM.
[8] M. L. de Almeida, G. A. Mendes, G. M. Viswanathan, and L. R. da Silva. Scale-free homophilic network. The European Physical Journal B, 86(2):1–6, 2013.
[9] A. T. Fiore and J. S. Donath. Homophily in online dating: when do you like someone like yourself? In CHI '05 Extended Abstracts on Human Factors in Computing Systems, pages 1371–1374. ACM, 2005.
[10] L. C. Freeman. Centrality in social networks: Conceptual clarification. Social Networks, 1(3):215–239, 1979.
[11] J. Galaskiewicz. Estimating point centrality using different network sampling techniques. Social Networks, 13(4):347–386, Dec. 1991.
[12] C. A. Hidalgo and C. Rodriguez-Sickert. The dynamics of a mobile phone network. Physica A: Statistical Mechanics and its Applications, 387(12):3017–3024, 2008.
[13] M. Huisman. Imputation of missing network data: some simple procedures. Social Structure, 10(1):1–29, 2009.
[14] F. Karimi, M. Génois, C. Wagner, P. Singer, and M. Strohmaier. Visibility of minorities in social networks. arXiv:1702.00150, 2017.
[15] G. Kossinets. Effects of missing data in social networks. Social Networks, 28:247–268, 2006.
[16] J. Lee and J. Pfeffer. Estimating centrality statistics for complete and sampled networks: Some approaches and complications. In , pages 1686–1695, 2015.
[17] S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of sampled networks. Physical Review E, 73(1):016102, 2006.
[18] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 631–636. ACM, 2006.
[19] J.-Y. Li and M.-Y. Yeh. On sampling type distribution from heterogeneous social networks. In Proceedings of the 15th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part II, PAKDD'11, pages 111–122, Berlin, Heidelberg, 2011. Springer-Verlag.
[20] R. Mastrandrea, J. Fournet, and A. Barrat. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE, 10(9):e0136497, Sept. 2015.
[21] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
[22] F. Menczer. Growing and navigating the small world web by local content. Proceedings of the National Academy of Sciences, 99(22):14014–14019, 2002.
[23] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: inferring user profiles in online social networks. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 251–260. ACM, 2010.
[24] S. Redner. How popular is your paper? An empirical study of the citation distribution. European Physical Journal B, 4(2):131–134, 1998.
[25] L. E. C. Rocha, F. Liljeros, and P. Holme. Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLoS Computational Biology, 7(3), Mar. 2011.
[26] W. Shrum, N. H. Cheek Jr., and S. MacD. Hunter. Friendship in school: Gender and racial homophily. Sociology of Education, pages 227–239, 1988.
[27] Ö. Şimşek and D. Jensen. Navigating networks by using homophily and degree. Proceedings of the National Academy of Sciences, 105(35):12758–12762, 2008.
[28] J. A. Smith and J. Moody. Structural effects of network sampling coverage I: Nodes missing at random. Social Networks, 35(4):652–668, 2013.
[29] L. Takac and M. Zabovsky. Data analysis in public social networks. In International Scientific Conference and International Workshop Present Day Trends of Innovations, pages 1–6, 2012.
[30] D. J. Wang, X. Shi, D. A. McFarland, and J. Leskovec. Measurement error in network data: A re-classification. Social Networks, 34(4):396–409, 2012.
[31] D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and search in social networks. Science, 296:1302–1305, 2002.
[32] W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Transactions on Information Systems, 28(4):1–38, Nov. 2010.
[33] G. U. Yule. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S.