Estimation for network snowball sampling: Preventing pandemics
EEstimation for network snowball sampling:Preventing pandemics
Steve ThompsonDepartment of Statistics and Actuarial ScienceSimon Fraser University, Burnaby, BC, [email protected] 11, 2020
Abstract
Snowball designs are the most natural of the network samplingdesigns. They have many desirable properties for sampling hidden andhard-to reach populations. They have been under-used in recent yearsbecause simple design-based estimators and confidence intervals have notbeen available for them. The needed estimation methods are supplied inthis paper. Snowball sampling methods and accurate estimators withthem are needed for sampling of the people exposed to the animals fromwhich new coronavirus outbreaks originate, and to sample the animalpopulations to which they are exposed. Accurate estimates are neededto evaluate the effectiveness of interventions to reduce the risk to thepeople exposed to the animals. In this way the frequencies of majoroutbreaks and pandemics can be reduced. Snowball designs are neededin studies of sexual and opioid networks through which HIV can spreadexplosively, so that prevention intervention methods can be developed,accurately assessed, and effectively distributed.
Keywords:
Snowball sampling, Network sampling, Adaptive sampling,Coronavirus pandemics, animal surveys, HIV, Design-based inference
In network sampling, social links are followed from one person to another inselecting the sample of people from a subpopulation that might otherwise behard to reach. The most natural of the network designs is snowball sampling,in which any number of the links out from each person may be followed. So ifa person in the sample has two partners or contacts, either zero, one, or two ofthose contacts are added to the sample, depending the selection probabilitiesof the links. For another person in the sample with ten links to contacts, anynumber from zero to ten links may be followed. In turn, links out from the1 a r X i v : . [ s t a t . M E ] S e p nowball network tree sample Wave
Figure 1: A sample from a regular snowball design. Repeat selections of thesame person are not allowed. This results in a network sample with tree struc-ture. 2 nowball network non−tree sample
Wave
Figure 2: A sample from a snowball design allowing repeat selections of thesame person, but only by a different recruiter. This results in a sample withnetwork structure more general than a tree.3ew people in the sample are followed at the next wave. In this way thesample may grow fast or slowly, depending on the link-tracing probabilitiesand also depending on the numbers and configuration of the links in the socialnetwork of the target population.In some snowball surveys, a respondent is not allowed to recruit a personwho has already been recruited into the sample. This results in a networksample with tree structure (Figure 1). In other cases it may be natural toallow re-recruitments. Usually the survey protocol will not allow re-cruitmentof a person by the same person that recruited them earlier, but will allowre-recruitment of a sample member by a different recruiter. This results insamples with network structure more general than a treee (Figure 2). Notethat most people in the sample have roles as both recruitees and recruiters.The nature of snowball sampling brings different people into the samplewith different probabilities. To estimate the characteristics of the targetpopulation, such as mean number of partners or proportion of people with acharacteristic such as a risk-related behavior, the different inclusionprobabilities for different sample members should be taken into account. Asample mean or sample proportion gives each person equal weight and soresults in a estimate that is biased in comparison to the actual populationmean or proportion. Unbiased estimators in this design-based sense have beenavailable for only simple forms of snowball sampling, such as a completeone-wave snowball sample where the inclusion probabilities for the initialsample are known. The problem with more general snowball samples, withmany waves and link-tracing probabilities less than one, is that the inclusionprobabilities are not known. In fact they can not be exactly calculated fromthe sample data because they depend on the population network outside thesample as well as within it, as well as depending on the sampling design.This inference problem has kept snowball sampling designs from beingmore widely used than they should be. In this paper we present newestimators for key population characteristics from snowball samples. Themethod is based on estimating the unit inclusion probabilities from the sampledata in interaction with the network sampling design. The method isdesign-based and data-based. It does not assume a statistical form of modelfor the population network.Snowball designs are currently being used to collect data on people withexposure to animals, in order to understand and prevent the emergence ofnovel species of the coronavirus family that jump from animals to humans andlead to emergence of epidemics or pandemics every few years. Snowballdesigns are used in some studies of key populations at high risk for HIV, suchas sex workers, clients of sex workers, and networks of people who mis-useopioids and other drugs that are in some cases injected. Contact-tracingdesigns of people exposed to sexually transmitted diseases or exposed tosomeone with the new disease covid-19, where the purpose is primarily to findcases and make interventions, rather than inference, also are of snowball form.Because the of the importance of the snowball studies of in relation toemergence of outbreaks of new coronavirus species in humans, and the fact4hat those studies are recent and not so well known, that example will bedescribed at some length next. Subsequent sections will describe the methodproposed here and evaluate it in relation to alternatives.Snowball sampling designs have many important uses in the real world. Inthis paper we will focus on three. One concerns the prevention of futurepandemics caused by emergence of new species in the coronavirus family. Thesecond, related to the first, involves spatial adaptive designs for animal specieswith uneven distributions. The third involves studies of sexual and opioidnetworks with high risk of HIV, the human immunodeficiency virus.
Virus species of the coronavirus family are adept at transfering from onemammal host species to another, adapting to the new host environmentthrough genetic mutations and recombinations ([Latinne et al., 2020]). Thesecan include transfers from bats, civets, raccoon dogs, pangolins and otherspecies to humans ([Graham and Baric, 2010], [Tang et al., 2020],[Andersen et al., 2020], [Montgomery and Macdonald, 2020]). Bat species areconsidered likely to be the main reservoir for viruses in the the coronavirusfamily. These transfers to humans are moderately common. Most of these dieout as the person infected recovered or dies, or they infect just a few peopleclose to that person before dying out ([Zhou et al., 2020]). But a smallproportion of these transfer viruses catch on and become major outbreaks.Since the start of the 21st Century there have been three major outbreaksof coronavirus species in humans. These are SARS ( the species SARS-CoV-1)in 2002, the Middle East Respiratory Syndrome (MERS), in 2012, andcovid-19 (caused by the coronavirus species SARS-CoV-2), in 2019, and at thetime of this writing a worldwide pandemic. More such outbreaks andpandemics are widely expected by virologists and epidemiologists unless someinterventions can make conditions less favorable to them.[Li et al., 2019] used snowball sampling to obtain a sample across threeprovinces in the south of China of 1596 people at-risk through their exposureto animals. Eight study regions were selected in areas known to have diversecoronaviruses in bat populations roosting close to human dwellings. The studytargeted “human populations that are highly exposed to bats and otherwildlife, including people who visit or work around bat caves, work in local liveanimal markets, raise animals, or are involved in wildlife trade (e.g., wildanimal harvest, trade, transportation, and preparation), as identified byprevious exploratory ethnographic interviews.”Sample members were given a questionnaire about risk-related andprotective behaviors and their experiences with flu-like and SARS-likesymptoms, and were given antibody tests for evidence of past infection withbat coronavirus. The blood tests identified 9 individuals (0.6%) who werepositive for bat coronaviruses. 73 (5%) reported symptoms fever with coughand shortness of breath or difficulty breathing within the last 12 months, and227 (14%) reported fever with muscle aches, cough, or sore throat symptoms5ithin the last 12 months.When asked about protective measures, among the 502 participants whopurchased animals from wet markets in the past 12 months 194, (39%) hadtaken measures such as washing hands, purchasing live animals less often , orpurchasing meat at supermarkets instead of live animal markets. Only 7 (1%)had considered wearing a mask while visiting the markets, with the samenumber having considered wearing gloves.Each reported percentage is the unweighted sample mean (divided by 100).Much more accurate estimates for each of those quantities can be calculatedusing the method proposed in this paper.The need for accurate estimates becomes acute when we considerevaluating the effectiveness of an intervention. Suppose an interventionprogram is introduced to a community to increase the wearing of gloves whenworking in the markets. We need to compare wearing gloves in the communitybefore and after the intervention, or with the rate in a community without theintervention program. If the estimates are highly biased or erratic in theirstatistical properties, then the comparisons may be in error and lead tochoosing the less effective intervention for roll-out on a larger scale.A number of studies have sampled bats from their natural habitaats, testedthem for coronavirus species, and through phylogenetic analyses establishedtheir relationship to species that have infected or can infect humans([Ge et al., 2013], [Li et al., 2005], [Wang et al., 2006],[Wang and Anderson, 2019], [Olival et al., 2017b], [Olival et al., 2017a],[Hu et al., 2017], [Corman et al., 2015]).[Huong et al., 2020] sampled the animals, rather than the humans, along asupply chain in the wildlife trade in Vietnam. The animals on their way tomarket and restaurants were tested for infection with bat coronavirus species.They found that the infection rate increased along the supply chain. Forexample, field rats for food, in the possession of the trappers who catch themwild in the field, had a 21% infection rate. For rats in the markets, theinfection rate was 32%. At the restaurants that served wild animal dishes, theinfection rate in the rats was 56%. The presumed reason for the amplificationof infection rate along the supply chain was overcrowding in the cages theanimals were transported and held in, so that infection rapidly spread betweenthe animals.If the infection rate of the animals is higher at a point in the supply chain,then the risk to the people who work with them will be higher there.The way to prevent future outbreaks and pandemics of emerging species ofcoronavirus in humans is to decrease the frequency of occurrences of transfersof coronavirises from animals to people. Most of those transfer strains die outbefore producing any significant outbreak, but some small proportion of themwill catch on to produce serious outbreaks as well as new pandemics. If thereare less frequent transfers, there should be correspondingly less frequentoutbreaks and pandemics.So far in the 21st Century the serious outbreaks have come 7 and 10 yearsapart, so the next one could be anticipated to arrive not far in the future. If6hrough interventions the frequency of transfers could be reduced to one-thirdits current value, then the next outbreak instead of 7 or 10 years, might be 21or 30 years in the future. The anticipated outbreaks before then would havebeen prevented.Potential interventions suggested by the currently available studies wouldinclude programs to increase the use of protective measures like washing handsand use of gloves and masks in working with animals. To reduce theamplification of infection rates along the supply chain, safer and more humanetransfer cages would need to be designed and required.Sampling both of the people exposed to animals and of the animals theyare potentially exposed to will be required to accurately assess theeffectiveness of different intervention strategies and select thereby the bestinterventions to make. The key missing tool up to this point has been anaccurate, robust estimation method for snowball sampling designs. Thepurpose of this paper is to present such a method.
In addition to snowball sampling of the people exposed to animals, we need tosample the animals they are exposed to, in order to estimate the prevalence ofcoronavirus infection in those animals, and to obtain the genetic sequences ofthose virus. In some cases the sampling of the animals will be straightforward,using conventional sampling design features such as random sampling withoutreplacement and stratification. This may be the the case in markets andwildlife farms. If we need to sample wild animals living nearby the people,such as bats with natural roosting sites in trees, adaptive spatial designs couldbe useful.The adaptive designs work by starting with an initial conventional sample,such as a random sample of spatial units. Whenever a significant number ofanimals is found in a sample unit, its neighboring units are added to thesample. If any of those has significance abundance, its neighbors in turn areadded, and so on ([Thompson, 1990]). In ([Thompson, 2006], which was firstabout network designs for estimating prevalence of risk behaviors for HIV, it isshown that any of the spatially adaptive design can be re-framed as a networksampling problem (see also [Thompson, 2011]. In ([Thompson, 2006] this wasdone for a spatially uneven population of wintering waterfowl, and proved veryeffective there. In the same way the network sampling estimators proposed inthe present paper can be used for adaptive spatial sampling designs and havesome advantages in terms over previous estimators for adaptive samplingdesigns in terms of simplicity and flexibility.The re-framing of an adaptive spatial design as network sampling works asfollows. In the spatial setting supposse for simplicity the sampling units aresquare plots covering the study region. Convert each square plot to a circlerepresenting a node in a network. If plot i has animals in it, then if we observe i it will lead us to any of its neighbors. So for a neighbor j draw an arrow from i to j . If unit j has animals in it, then if it was observed first it would lead us7o i , so draw an arrow from j to i . With arrows in both directions, the linkbetween i and j is symmetric, so it can also be drawn as a simple line. Nowsuppose unit j has no animal in it. Then it will not lead us to its neighbor i .So there is an arrow from i to j but not from j to i . So the link is directionalhere, and we get a network that is partially directional. The estimationmethod work fine with the directional links in the network. This re-framing isillustrated in the figures of [Thompson, 2006] and [Thompson, 2011]. With the HIV epidemic the key populations at risk include people who injectdrugs non-medically, people who sell or buy sex, or trade it for drugs ortangible goods, and men who have sex with men. To study these people andtry to alleviate the epidemic, ethnographers and health outreach workers usedlink-tracing methods. In the 1980s and 1990s these were usually in the form ofsnowball designs.Snowball sampling methods were used in [Potterat et al., 1999] to obtaindata on the entire network and people of a population at risk for heterosexualand drug-related spread of HIV. In the study, investigators endeavoured tofind the individuals at both ends of every relevant relationship. Because of therelative geographic isolation of the high-risk population, they were able toobtain essentially the entire population and its network structure. Thispopulation data set has been invaluable to studies of network sampling andestimation methodologies. It is used in this paper for the simulations toevaluate the effectiveness of the proposed estimators for snowball networkdesigns in comparison with other estimation methods.Snowball sampling designs, the most freely-branching of the networkdesigns, are useful for studying the sexual and injection-related networks inwhich HIV spreads. [Peters et al., 2016] and [Campbell et al., 2017] report ona study in which contact tracing was used after an HIV outbreak associatedwith opioid misuse in Indiana had spread rapidly. The traced networktogether with phylogenetic data from sequencing of virus strains was used todetermine where the outbreak started and how it spread.Snowball designs are generally the preferred network sampling method forbringing interventions to benefit a population. Contact tracing for sexuallytransmitted diseases has long used snowball sampling. An individual who testspositive for the disease is asked to identify all their recent sexual partners.Investigators attempt to find each of those partners, inform them of theirpotential exposure, test then for the infection and, if the test is positive, totreat them to cure the infection and then in turn try to trace all their partnersand do the same. In the current coronavirus pandemic, contact tracing ofindividuals who test positive involves finding all their recent contacts and atminimum to advise those individuals to self-isolate for a period. And in themore thorough versions, the contacts when found are also tested and, if a testis positive, that person’s contacts are traced as well.For HIV there is currently no cure or vaccine available. There are effective8nterventions such as increasing use of condoms and targeting condom use tothe early period in any new relationship. In [Thompson, 2017] it is shown thatdistributing such an intervention program to some proportion of thepopulation using a snowball network design is highly effective in comparison tothe same intervention distributed to the same proportion of the population byother methods.
Early uses of network sampling for hard-to-reach populations typically usedsnowball sampling methods. Ethnographers studying drug users and otherhidden populations wanted to know as many of the relationships in thecommunity so tried to follow every referral to social partners. The data wassummarized by unweighted sample means and proportions [[Spreen, 1992],[Heckathorn, 1997], and [Thompson and Collins, 2002]]. Design-basedestimates of population values from relatively simple network sampling designswere obtained by [Birnbaum and Sirken, 1965] and [Frank, 1977],[Frank and Snijders, 1994], and [Birnbaum and Sirken, 1965].[Snijders, 1992] reviewed the literature on snowball sampling for inferenceabout population values. The estimates at that time were limited to the moresimple snowball designs, such as one-wave designs with symmetric. Heconcluded that snowball sampling is more suited for inference about the linksthan for inference about the nodes. [Heckathorn, 2011][Handcock and Gile, 2011] look at different uses of the term “snowballsampling”.Model-based methods can work well for snowball designs but, in additionto requiring a stochastic network model for the population, may be limited toospecific forms of snowball designs, such as ones in which all links are tracedout to a certain number of waves ([Chow and Thompson, 2003a],[Chow and Thompson, 2003b], [Thompson and Frank, 2000]).[Pattison et al., 2013] assume an exponential random graph model anddescribe a conditional estimation strategy for estimating parameters of themodel, with attention to computational efficiency so that large samples can behandled. The focus is on identifying structural characteristics of thepopulation network, such as stars and triangles of various types.. Values ofnodes, such as whether an animal is infected or the person handling it wearsgloves, are not usually included in this type of model. The method appears towork well for its intended purposes. [Handcock and Gile, 2007] provide aninformative discussion of design characteristics in relation to model-basedinferences for network samples. [Handcock and Gile, 2010] describe a Bayesianmodel based approach for inference from snowball samples. The method workswell but is limited to specific types of snowball designs, such as those whereevery like is followed out to some wave. The main limitation of themodel-based estimation methods for snowball sampling may be to scale-up,because the Markov Chain Monte Carlo method required in the computationsget slow as sample size increases. [Atkinson and Flint, 2001] review the9ethodology for snowball sampling and conclude that snowball sampling ishighly effective for finding members of a hidden population but that statisticalinference from snowball samples was at the time of their writing problematical.[Thompson, 2006] gives a design-based strategy for snowball networkdesigns with estimation based on one or another simple, if inefficient, initialestimator. That estimate is improved using the Rap-Blackwell method, whichfinds the conditional expectation of the initial estimator given the minimalsufficient statistics. In the network designs considered the minimal sufficientstatistics is the set of distinct units in the sample, not counting repeatselections and not distinguishing order of selection. The estimates obtainedare exactly unbiased and are very efficient, having low variance because of theRao-Blackwell improvement. However, the method requires control over theinitial design that may not be achieved in practice with hard-to-reachpopulations, and because of the Markov Chain Monte Carlo needed to do theRao-Blackwell computations, may be limited in terms of scale up to largesample sizes.In contrast to snowball sampling, the methodology ofRespondent-Driven-Sampling ([Heckathorn, 1997]), limited the number ofcontacts a respondent could refer to a small number, typically 3. Estimatorsfor these designs based on random walk theory and assumptions of Markovtransitions in the sampling between values of attribute variables ofrespondents were given in [Salganik and Heckathorn, 2004],[Heckathorn, 2007], and [Volz and Heckathorn, 2008]. In a random walk designonly one link is traced from the currently selected person, and the sampling isdone with replacement. If a random walk is run in a network consisting of asingle connected component, the long term frequency of inclusion of node i isproportional to d i . Variations on these early approaches, still relying on theassumption of a random walk design or Markov transitions include[Gile, 2011], [Baraff et al., 2016], and [Rohe et al., 2019].Among network sampling designs, a snowball design is at the opposite endof the spectrum from a random walk design. A freely-branching snowballdesign allows unlimited branching, up to however many links are available,and is usually carried out without-replacement. For these reasons theestimators based on the random walk assumption have never beenrecommended for snowball designs, nor should they be.The idea of the new estimation method is very simple. We run a samplingprocess similar to the actual survey design on the sample network data anduse the inclusion frequencies in the sampling process to estimate the inclusionprobability of each sample unit in the real-world sampling. With the estimatedinclusion probabilities, well-established sampling inference methods can beused for population estimates and confidence intervals. Two approaches to there-sampling are repeated re-samples and a Markov chain resampling process.For the simulations we use the resampling process because it iscomputationally much faster.The estimation method and why it works are described more exactly in theMethods section. 10 Methods
In traditional survey sampling with unequal probabilities of inclusion fordifferent people, typical estimators divide an observed value y i for the i thperson by the inclusion probability π i that person. A variable of interest y i can be binary, for example 1 if the person tests positive for a virus and 0otherwise, or can be more generally quantitative, such as viral load. Theinverse-weighting gives an unbiased or low-bias estimate of the populationproportion or mean of that variable. In network surveys the inclusionprobabilities are unknown so they need to be estimated.The estimators described in this report first estimate the inclusionprobability of each person in the sample by selecting many resamples from thenetwork sample data using a design that adheres in key features to the actualsurvey design used to find the sample. In particular, the resampling design is alink-tracing design done without-replacement and with branching, as was theoriginal design. The frequency f i with which an individual is included in theresamples is used as an estimate of that person’s inclusion probability π i .What we do is select T resamples S , S , ..., S T from the sample networkdata. There are two approaches to selecting the sequence of resamples. In therepeated-samples approach each resample is selected independently from seedsand progresses step-by-step to target resample size independently of everyother resample, so we get a collection of independent resamples. in thesampling-process approach each resample S t is selected from the resample S t − just before it by randomly tracing a few links out and randomlyremoving a few nodes from the previous resample and using a small rate ofre-seeding so we do not get locked out of any component by chance. It is thisMarkov resampling process approach that we use for the simulations in thispaper because it is so computationally efficient.For an individual i in the original sample, there is a sequence of zeros andones Z i , Z i , ..., Z iT , where Z t is 1 if that person is included in resample S t and is 0 if the person is not included in that resample. The inclusion frequencyfor person i is f i = 1 T ( Z i + Z i + ... + Z iT ) (1)Individuals centrally located in sample components tend to have highvalues of f i . That is because there are more paths, and paths of higherprobability, leading the sample to those individuals. Also, individuals in largercomponents tend to have larger f i than individuals in smaller components, sothat the method is estimating inclusion probability of an individual relative toall other sample units, not just those in the same component or local area orthe sample This is because of the self-allocation of the branching design, evenin the absence of re-seeding, to areas of the social network having more linksand connected paths. 11he estimator of the mean of a characteristic y in the hidden population isthen ˆ µ f = (cid:80) ( y i /f i ) (cid:80) (1 /f i ) (2)where each sum is over all the people in the sample. If the actual inclusionprobabilities π i were known and replaced the f i in Equation 2 we would havethe generalized unequal probability estimator ˆ µ π of [Brewer, 1963]. A simplevariance estimator to go with ˆ µ f is (cid:100) v ar (ˆ µ ) = 1 n ( n − (cid:88) s (cid:18) ny i /f i (cid:80) s (1 /f i ) − ˆ µ f (cid:19) (3)An approximate 1 − α confidence interval is then calculated as ˆ µ ± z (cid:112)(cid:100) v ar (ˆ µ ),with z the 1 − α/ The new estimators are evaluated and compared with the current estimatesusing the network data on the hidden population at risk for HIV enumeratedin the Colorado Springs study on the heterosexual transmission of HIV[[Potterat et al., 1999]], also known as the Project 90 study. The study was sothorough in finding every linked person that it provides the most relevantnetwork data set that can be considered as an entire at-risk population for thepurpose of comparing sampling designs and estimators. In the simulations, thepopulation size is 5492, the number of people in the data set. From thispopulation a sample of 1200 people is selected, using a snowball design. Forresampling the target resample size is 400. For each of 1000 samples of size1200 each, 10,000 resamples were selected, each resample of size about 400.The population and the simulation methods are described in more detail inthe Methods section.The most commonly used estimator with network surveys in currentpractice is the VH estimator. The other variations in use such as SH arerelated to it and are based on the same assumptions plus the additionalassumption of a first-order Markov process in transitions between nodeattribute values during the sampling. The SH estimator is used mainly forbinary attribute variables. For categorical variables it has the property thatthe proportion estimates do not add to one without extra adjusting of onekind or another, and it is not well suited to continuous variables.
Among the most important quantities to estimate in relation to spread of HIVare the means and proportions of link-related variables. Two of widespreadinterest are mean degree and concurrency. Mean degree is the average numberof partners per person in the population. The most common definition ofconcurrency is the proportion of people in the population who have two or12
B SB Re−recruit M ean s qua r ed e rr o r Mean number of partners
Y−BARVHNew
Figure 3: Mean squared error for three estimators of mean number of partners(mean degree), with a regular snowball design (left) and with a snowball designallowing re-recruitment of a person by different recruiters (right). The threeestimators are the sample mean (Y-BAR, white), weighting by reciprocal ofself-reported degree (VH, yellow), and the new estimator (NEW, black). SmallMSE is good, so the new estimator is performing much the best.13
B SB Re−recruit M ean s qua r ed e rr o r . . . . . Concurrency
Y−BARVHNew
Figure 4: Mean squared error for three estimators of concurrency, with a regularsnowball design (left) and with a snowball design allowing re-recruitment of aperson by different recruiters (right). The three estimators are the sample mean(Y-BAR, white), weighting by reciprocal of self-reported degree (VH, yellow),and the new estimator (NEW, black). 14ore partners. This and related definitions of concurrency and their role inspreading HIV are discussed in [Kretzschmar and Morris, 1996],[Morris and Kretzschmar, 1997], and [Admiraal et al., 2016]. A high numberfor either of these is an indication that an epidemic could spread rapidly in thepopulation once it starts there. With the reference data set we are using forthe simulations, a link indicates either a sexual relationship, a drugrelationship, or a friendship relationship. Because friendship is included, theproportion of people with at least one relationship is very close to 1.00.Therefore we use a definition of concurrency here, which might be called“ k -concurrency with k=10,” where a person is concurrent if they have morethan 10 relationships. The true proportion of this is .82. The purpose here isto compare estimation methods for the mean of a characteristic of a node thatis highly related to links. Because inclusion probabilities in network samplingdepend on the pattern of links in the population in interaction with thedesign, these variables are the most sensitive to choice of estimator.The bars in Figure 3 show the mean squared error of three estimators forestimating mean degree with the regular snowball design. We include thesample mean as an estimator because it is the one most commonly used tosummarize the results in a snowball survey. We include the estimatorˆ µ d = (cid:80) [( y i /d i ) / (1 /d i )], which uses the self-reported degree d i as in place ofthe inclusion probability π i in Brewer’s estimator, because it is widely used,though not recommended for snowball sampling. It would be the correctestimator, equal to Brewers, if the design used had been a random walkinstead of a snowball design, and if he self-reported degrees were accurate. Itwas used by [Salganik and Heckathorn, 2004] to estimate mean degree and by[Volz and Heckathorn, 2008] for estimating any kind of variable. Weabbreviate it here VH.Looking at the actual numbers represented in Figure 3, with the snowballdesign the MSEs of y-bar, VH, and the new estimator are respectively40.453589, 7.249053, and 0.264755. Dividing the MSE of the sample mean bythat of the new estimator gives thee relative efficiency of the new estimator as153. The relative efficiency of the new estimator for the VH estimator is 29.The high MSE of the sample mean and VH for snowball sampling is almostentirely due to bias. For any estimator MSE = Variance + (Bias) . For thesample mean, the bias squared is over 99% of the variance. For the VH thesquared bias is 99% of the variance. So a bar plot of the bias of the estimatorslooks almost exactly the same as the bar plot for MSE in Figure 3.The large biases result from using the wrong sampling weights in theestimators in relation to the actual inclusion probabilities π i . The newestimator estimates those inclusion probabilities using the sample network ofthe data in interaction with the actual snowball design used or a closeapproximation to it. Because the estimated inclusion probabilities f i come outclose to proportional to the actual inclusion probabilities π i , the bias isvirtually eliminated with the new estimation method.For the snowball design allowing re-recruiitments the relative efficiency ofthe new estimator to y-bar for estimating mean degree is 329, and to VH the15elative efficiency of the new estimator is 59. The high MSEs of y-bar and VHagain are mostly due to bias.For the estimate of k -concurrency, with k = 10, the MSE values with theregular snowball design for y-bar, VH, and the new estimator are 0.0107170.043369 0.000447. The relative efficiency of the new estimator to the samplemean is 24. The relative efficiency of the new estimator to the VH estimator is97. The squared bias accounts for 99% of the MSE of y-bar, and 98% of theMSE of the VH estimator.For the snowball design allowing re-recruitments the relative efficiency ofthe new estimator to y-bar for estimating k -concurrency, with k = 10, is 17,and to VH the relative efficiency of the new estimator is 70. The high MSEs ofy-bar and VH again are explained largely by the bias due to the incorrectweightings relative to the inclusion probabilities of the snowball design used inthe survey.Notice also that the magnitudes of MSE of y-bar and VH reverse inrelation to each other, with VH performing better than y-bar for mean degreeand worse than VH for k -concurrency. When the sampling weights ofestimators differ significantly from the inclusion probabilities of the survey, theestimators not only perform poorly in most cases, but the performance iserratic, depending on the exact configuration of values of variables of interestand of the sampling weights in the population. In the present study, only thenew estimator, with its inclusion probabilities estimated from the sample dataand the actual design, performed consistently well for every variable and witheach design. The Colorado Springs Study node data includes 13 individual attributevariables such as sex worker, client of sex worker, or unemployed. These arevariables 1 through 13 in Table 1. For an individual, the value is 1 if theindividual has the attribute and 0 otherwise. The object for inference for eachattribute is to estimate the proportion of people in the population having thatattribute. Most of these variables, such as sex or race or employment status,are not strongly or consistently related to inclusion probabilities. Still, fordesign-based estimators to work well it helps to have the estimated inclusionprobabilities close to the actual inclusion probabilities. For the simulations,missing values were arbitrarily set to zero so that sample sizes would be thesame for all variables.With a conventional simple random sampling design for estimating apopulation proportion, the mean squared error of the estimate is aparabola-shaped function of the actual population proportion. The actualproportion has to be between zero and one. The MSE is highest when theactual proportion is one-half and the MSE is zero when the actual proportionis zero or one. With the network designs and their unequal inclusionprobabilities the situation is more complex, but it is still the case that theactual proportion has to be between zero and one and that the MSE will be16 lll lllllll ll ll llll llllll l . . . . . . SB design, 13 attribute variables
Actual proportion M ean s qua r ed e rr o r l lll lllllll ll ll l lll llllll l l lll lllll ll ll ll l ll l llll ll l l l l Y−BARVHNew
15 212 136 12
Figure 5: With the regular snowball design, the mean squared error of threeestimators of population proportion for each of 13 individual attributes. Thethree estimators are the sample mean (Y-BAR, white), weighting by reciprocalof self-reported degree (VH, yellow), and the new estimator (NEW, black).Thenew estimator (black) is compared to the current estimator (yellow). To helpsee the pattern, the MSE for estimating the compliment of each attribute isalso shown. The compliment of sex work client, for example, is not-client. Aparabolic curve is fitted by weighted least squares to the MSEs of the new esti-mator (solid line) and the current estimator (dashed line). The new estimatorMSEs have the lowest fitted curve and the best fit.17 lll lllllll ll ll llll llllll l . . . . . . SB Re−recruit design, 13 attribute variables
Actual proportion M ean s qua r ed e rr o r l lll lllllll ll ll l lll llllll l l lll lllll ll ll ll l ll l llll ll l l l l Y−BARVHNew
15 212 136 12
Figure 6: With the snowball design allowing re-recruiitments, the mean squarederror of three estimators of population proportion for each of 13 individual at-tributes. The three estimators are the sample mean (Y-BAR, white), weight-ing by reciprocal of self-reported degree (VH, yellow), and the new estimator(NEW, black).The new estimator (black) is compared to the current estimator(yellow). To help see the pattern, the MSE for estimating the compliment ofeach attribute is also shown. The compliment of sex work client, for example,is not-client. A parabolic curve is fitted by weighted least squares to the MSEsof the new estimator (solid line) and the current estimator (dashed line). Thenew estimator MSEs have the lowest fitted curve and the best fit.18ero if the actual proportion is zero or one.To help see the pattern in the mean squared errors for estimating thepopulation proportions of the 13 attribute variables, the MSE for each variableis plotted in Figure 3 against the actual proportion of people having thatattribute in the Colorado Springs study population. For each of the 13variables, we can also estimate the proportion for its complement. Thecompliment of “client”, for example, is “not client”. The proportion for thecompliment is 1 minus the proportion with the attribute, and the MSE forestimation of the compliment is the same as the the MSE for estimation of theoriginal variable. This gives us 26 variables for each plot in Figure 3, withactual proportions ranging from near 0 to near 1. The original variables are onthe left, since the actual proportions are all less than one-half. The complimentvariables provide redundant information but clarify the pattern in the MSEs.The MSE with the new estimator (black in plots) is lower than that of thecurrent estimator in all cases except for some of the ones with actualproportion near to zero for which the MSE is very small with either estimator.While the MSEs of the new estimator fall rather close to the fitted parabola(solid line), the MSEs of the current estimator are more erratic and the fittedparabola (dashed line) is higher. The overall higher MSEs and erratic patternwith the current estimator result from the discrepancies between actualinclusion probabilities and those used in estimation.Table 1: SB: Confidence Interval Coverage, median is .94name actual E.se width coverage1 nonwhite 0.24 0.023 0.04 0.912 female 0.43 0.029 0.06 0.983 sex work 0.05 0.010 0.02 0.944 procures 0.02 0.005 0.01 0.965 sex-work client 0.09 0.016 0.03 0.956 sells drugs 0.06 0.011 0.02 0.927 makes drugs 0.01 0.004 0.01 0.808 stealing 0.02 0.007 0.01 0.949 retired 0.03 0.009 0.02 0.9210 homemaker 0.06 0.011 0.02 0.9411 disabled 0.04 0.009 0.02 0.9512 unemployed 0.16 0.017 0.03 0.9613 homeless 0.01 0.005 0.01 0.8614 no. partners 7.88 0.287 0.56 0.7015 concurrency 0.82 0.035 0.07 1.00The parabolas in the plots have form MSE = ap (1 − p ) where p is theactual proportion, which is known for each of the 13 attribute variables in thesimulation population. The coefficient a measures the height of the fittedparabola for a given estimator-design combination. Since the relationship islinear with increasing variance in the quantity p (1 − p ), the weighted least19able 2: SB re-recruit design: Confidence Interval Coverage, median is .92name actual E.se width coverage1 nonwhite 0.24 0.024 0.05 0.972 female 0.43 0.031 0.06 0.983 sex work 0.05 0.010 0.02 0.934 procures 0.02 0.005 0.01 0.815 sex-work client 0.09 0.016 0.03 0.956 sells drugs 0.06 0.011 0.02 0.957 makes drugs 0.01 0.004 0.01 0.758 stealing 0.02 0.007 0.01 0.909 retired 0.03 0.009 0.02 0.9110 homemaker 0.06 0.012 0.02 0.9211 disabled 0.04 0.010 0.02 0.9212 unemployed 0.16 0.018 0.03 0.9513 homeless 0.01 0.005 0.01 0.8014 no. partners 7.88 0.267 0.52 0.8515 concurrency 0.82 0.038 0.07 0.99squares regression estimator of the coefficient a is a simple ratio estimator.Consider that the variable y = MSE linearly increases with the variable x = p (1 − p ) and that the variance of y about this line increases with x ,approximately linearly. From weighted regression results this suggests, as anestimator for the slope a , the ratio estimator hata = (cid:80) y i / (cid:80) x i . So for anoverall comparison of the estimator y-bar with the new estimator, let a ybar and a new represent the estimated parameters of the respective lines. The ratio a ybar /a new is simply the sum of the MSEs with y-bar divided by the sum ofthe MSEs with the new estimator, because the x -values are the same witheach design, so their sums divide out. Thus the ratio of the estimatedparabola parameters a is simply the ratio of the average MSE for the twodesigns. It’s also the ratio of heights of the two parabolas at any point. Thisratio provides a measure of average relative efficiency of the the two designs.For the regular snowball design, the overall relative efficiency of the newestimator to the sample mean is a ybar /a new = 0 . / . . a ybar /a new = 0 . / . .
2. The relative efficiency of the newestimator to the VH estimator is 0.01078216 / 0.002445438 = 4.4. To see theserelative parabola heights in Figures 5 and 6, note the position of zero on thevertical axis.Confidence interval coverage probabilities for each variable for each of the15 variables are given in Tables 1 and 2. The median coverage probability ofthe intervals for the 15 characteristics estimated with each of two designs is .94.20
Discussion and conclusions
Snowball designs are the most natural of the network designs and have manydesirable properties. They self-allocate so that most of the sample size goesinto the most highly connected parts of the population. This leadsinvestigators to those areas where risk of spread of a virus is highest. Thismakes snowball designs highly effective for distributing an intervention such asa vaccine to a population.Snowball designs have been used less than they should be because a simpledesign-based estimator has not been available. The new method of this paperprovide such estimators.Snowball designs and accurate estimators to go with them are needed forpreventing new pandemics from the coronaviruses family of viruses. They areneeded for understanding sexual networks on which HIV spreads and opioidmisuse networks which spread their own harms and occasionally provideexplosive terrain for HIV. The new estimators can also be used for adaptivespatial designs for hard-to-sample animal species, by translating the spatialstructure into network form.Accurate estimation with snowball designs requires estimating theinclusion probabilities π i from the data taking the actual survey samplingdesign into account. Sample means and proportions do not provide goodestimates or data summaries for snowball samples. Estimators that weightobservations by the reciprocal of the unit’s degree also do not work well withsnowball sampling designs.The new estimates are fast to compute, scale up well, and are based on thedata and the design actually used rather than unrealistic assumptions. Theyare low-bias, accurate, and give reliable confidence intervals. Acknowledgements
This research was supported by Natural Science and Engineering ResearchCouncil of Canada (NSERC) Discovery grant RGPIN327306. I would like tothank John Potterat and Steve Muth for making the Colorado Springs studydata available and for their generous help explaining it. I would like to expressappreciation for the participants in that study who shared their personalinformation with the researchers so that it could be made available inanonymized form to the research community and contribute to a solution toHIV and and addiction epidemics and to basic understanding of socialnetworks.
References [Admiraal et al., 2016] Admiraal, R., Handcock, M. S., et al. (2016). Modelingconcurrency and selective mixing in heterosexual partnership networks with21pplications to sexually transmitted diseases.
The Annals of AppliedStatistics , 10(4):2021–2046.[Andersen et al., 2020] Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes,E. C., and Garry, R. F. (2020). The proximal origin of sars-cov-2.
Naturemedicine , 26(4):450–452.[Atkinson and Flint, 2001] Atkinson, R. and Flint, J. (2001). Accessinghidden and hard-to-reach populations: Snowball research strategies.
Socialresearch update , 33(1):1–4.[Baraff et al., 2016] Baraff, A. J., McCormick, T. H., and Raftery, A. E.(2016). Estimating uncertainty in respondent-driven sampling using a treebootstrap method.
Proceedings of the National Academy of Sciences ,113(51):14668–14673.[Birnbaum and Sirken, 1965] Birnbaum, Z. and Sirken, M. G. (1965). Designof sample surveys to estimate the prevalence of rare diseases: three unbiasedestimates, vital and health statistics, series 2.
Government Printing Office,Washington, DC .[Brewer, 1963] Brewer, K. (1963). Ratio estimation and finite populations:Some results deducible from the assumption of an underlying stochasticprocess.
Australian Journal of Statistics , 5(3):93–105.[Campbell et al., 2017] Campbell, E. M., Jia, H., Shankar, A., Hanson, D.,Luo, W., Masciotra, S., Owen, S. M., Oster, A. M., Galang, R. R., Spiller,M. W., et al. (2017). Detailed transmission network analysis of a largeopiate-driven outbreak of hiv infection in the united states.
The Journal ofinfectious diseases , 216(9):1053–1062.[Chow and Thompson, 2003a] Chow, M. and Thompson, S. K. (2003a). Abayesian approach to estimation with link-tracing sampling designs.
SurveyMethodology , 29:197–205.[Chow and Thompson, 2003b] Chow, M. and Thompson, S. K. (2003b).Estimation with link-tracing sampling designs a bayesian approach.
SurveyMethodology , 29(2):197–206.[Corman et al., 2015] Corman, V. M., Baldwin, H. J., Tateno, A. F., Zerbinati,R. M., Annan, A., Owusu, M., Nkrumah, E. E., Maganga, G. D., Oppong,S., Adu-Sarkodie, Y., et al. (2015). Evidence for an ancestral association ofhuman coronavirus 229e with bats.
Journal of virology , 89(23):11858–11870.[Frank, 1977] Frank, O. (1977). Survey sampling in graphs.
Journal ofStatistical Planning and Inference , 1(3):235–264.[Frank and Snijders, 1994] Frank, O. and Snijders, T. (1994). Estimating thesize of hidden populations using snowball sampling.
Journal of OfficialStatistics , 10:53–53. 22Ge et al., 2013] Ge, X.-Y., Li, J.-L., Yang, X.-L., Chmura, A. A., Zhu, G.,Epstein, J. H., Mazet, J. K., Hu, B., Zhang, W., Peng, C., et al. (2013).Isolation and characterization of a bat sars-like coronavirus that uses theace2 receptor.
Nature , 503(7477):535–538.[Gile, 2011] Gile, K. J. (2011). Improved inference for respondent-drivensampling data with application to hiv prevalence estimation.
Journal of theAmerican Statistical Association , 106(493):135–146.[Graham and Baric, 2010] Graham, R. L. and Baric, R. S. (2010).Recombination, reservoirs, and the modular spike: mechanisms ofcoronavirus cross-species transmission.
Journal of virology , 84(7):3134–3146.[Handcock and Gile, 2010] Handcock, M. and Gile, K. (2010). Modeling socialnetworks from sampled data.
The Annals of Applied Statistics , 4(1):5–25.[Handcock and Gile, 2007] Handcock, M. S. and Gile, K. (2007). Modelingsocial networks with sampled or missing data.
Center for statistics and thesocial sciences working paper no.75. University of Washington, Departmentof Statistics. [Handcock and Gile, 2011] Handcock, M. S. and Gile, K. J. (2011). Comment:On the concept of snowball sampling.
Sociological Methodology ,41(1):367–371.[Heckathorn, 1997] Heckathorn, D. D. (1997). Respondent-driven sampling: anew approach to the study of hidden populations.
Social problems ,44(2):174–199.[Heckathorn, 2007] Heckathorn, D. D. (2007). Extensions of respondent-drivensampling: Analyzing continuous variables and controlling for differentialrecruitment.
Sociological Methodology , 37(1):151–208.[Heckathorn, 2011] Heckathorn, D. D. (2011). Comment: Snowball versusrespondent-driven sampling.
Sociological Methodology , 41(1):355–366.PMID: 22228916.[Hu et al., 2017] Hu, B., Zeng, L.-P., Yang, X.-L., Ge, X.-Y., Zhang, W., Li,B., Xie, J.-Z., Shen, X.-R., Zhang, Y.-Z., Wang, N., et al. (2017). Discoveryof a rich gene pool of bat sars-related coronaviruses provides new insightsinto the origin of sars coronavirus.
PLoS pathogens , 13(11):e1006698.[Huong et al., 2020] Huong, N. Q., Nga, N. T. T., Long, N. V., Luu, B. D.,Latinne, A., Pruvot, M., Phuong, N. T., Quang, L. T. V., Hung, V. V., Lan,N. T., Hoa, N. T., Minh, P. Q., Diep, N. T., Tung, N., Ky, V. D., Roberton,S. I., Thuy, H. B., Long, N. V., Gilbert, M., Wicker, L., Mazet, J. A. K.,Johnson, C. K., Goldstein, T., Tremeau-Bravard, A., Ontiveros, V., Joly,D. O., Walzer, C., Fine, A. E., and Olson, S. H. (2020). Coronavirus testingindicates transmission risk increases along wildlife supply chains for humanconsumption in viet nam, 2013-2014.
PLOS ONE , 15(8):1–20.23Kretzschmar and Morris, 1996] Kretzschmar, M. and Morris, M. (1996).Measures of concurrency in networks and the spread of infectious disease.
Mathematical biosciences , 133(2):165–195.[Latinne et al., 2020] Latinne, A., Hu, B., Olival, K. J., Zhu, G., Zhang, L.,Li, H., Chmura, A. A., Field, H. E., Zambrana-Torrelio, C., Epstein, J. H.,et al. (2020). Origin and cross-species transmission of bat coronaviruses inchina. bioRxiv .[Li et al., 2019] Li, H., Mendelsohn, E., Zong, C., Zhang, W., Hagan, E.,Wang, N., Li, S., Yan, H., Huang, H., Zhu, G., et al. (2019). Human-animalinteractions and bat coronavirus spillover potential among rural residents insouthern china.
Biosafety and Health , 1(2):84–90.[Li et al., 2005] Li, W., Shi, Z., Yu, M., Ren, W., Smith, C., Epstein, J. H.,Wang, H., Crameri, G., Hu, Z., Zhang, H., et al. (2005). Bats are naturalreservoirs of sars-like coronaviruses.
Science , 310(5748):676–679.[Montgomery and Macdonald, 2020] Montgomery, R. A. and Macdonald,D. W. (2020). Covid-19, health, conservation, and shared wellbeing: Detailsmatter.
Trends in Ecology & Evolution .[Morris and Kretzschmar, 1997] Morris, M. and Kretzschmar, M. (1997).Concurrent partnerships and the spread of hiv.
Aids , 11(5):641–648.[Olival et al., 2017a] Olival, K. J., Hosseini, P. R., Zambrana-Torrelio, C.,Ross, N., Bogich, T. L., and Daszak, P. (2017a). Erratum: host and viraltraits predict zoonotic spillover from mammals.
Nature , 548(7669):612–612.[Olival et al., 2017b] Olival, K. J., Hosseini, P. R., Zambrana-Torrelio, C.,Ross, N., Bogich, T. L., and Daszak, P. (2017b). Host and viral traitspredict zoonotic spillover from mammals.
Nature , 546(7660):646–650.[Pattison et al., 2013] Pattison, P. E., Robins, G. L., Snijders, T. A., andWang, P. (2013). Conditional estimation of exponential random graphmodels from snowball sampling designs.
Journal of mathematicalpsychology , 57(6):284–296.[Peters et al., 2016] Peters, P. J., Pontones, P., Hoover, K. W., Patel, M. R.,Galang, R. R., Shields, J., Blosser, S. J., Spiller, M. W., Combs, B., Switzer,W. M., et al. (2016). Hiv infection linked to injection use of oxymorphone inindiana, 2014–2015.
New England Journal of Medicine , 375(3):229–239.[Potterat et al., 1999] Potterat, J. J., Rothenberg, R. B., and Muth, S. Q.(1999). Network structural dynamics and infectious disease propagation.
International journal of STD & AIDS , 10(3):182–185.[Rohe et al., 2019] Rohe, K. et al. (2019). A critical threshold for designeffects in network sampling.
The Annals of Statistics , 47(1):556–582.24Salganik and Heckathorn, 2004] Salganik, M. J. and Heckathorn, D. D.(2004). Sampling and estimation in hidden populations usingrespondent-driven sampling.
Sociological methodology , 34(1):193–240.[Snijders, 1992] Snijders, T. A. (1992). Estimation on the basis of snowballsamples: how to weight?
Bulletin of Sociological Methodology/Bulletin deM´ethodologie Sociologique , 36(1):59–70.[Spreen, 1992] Spreen, M. (1992). Rare populations, hidden populations, andlink-tracing designs: What and why?
Bulletin of SociologicalMethodology/Bulletin de Methodologie Sociologique , 36(1):34–58.[Tang et al., 2020] Tang, X., Wu, C., Li, X., Song, Y., Yao, X., Wu, X., Duan,Y., Zhang, H., Wang, Y., Qian, Z., et al. (2020). On the origin andcontinuing evolution of sars-cov-2.
National Science Review .[Thompson, 2011] Thompson, S. (2011). Adaptive network and spatialsampling.
Survey Methodology , 37(2):183–196.[Thompson, 1990] Thompson, S. K. (1990). Adaptive cluster sampling.
Journal of the American Statistical Association , 85(412):1050–1059.[Thompson, 2006] Thompson, S. K. (2006). Adaptive web sampling.
Biometrics , 62(4):1224–1234.[Thompson, 2017] Thompson, S. K. (2017). Adaptive and network samplingfor inference and interventions in changing populations.
Journal of SurveyStatistics and Methodology , 5(1):1–21.[Thompson and Collins, 2002] Thompson, S. K. and Collins, L. M. (2002).Adaptive sampling in research on risk-related behaviors.
Drug and AlcoholDependence , 68:57–67.[Thompson and Frank, 2000] Thompson, S. K. and Frank, O. (2000).Model-based estimation with link-tracing sampling designs.
Surveymethodology , 26(1):87–98.[Volz and Heckathorn, 2008] Volz, E. and Heckathorn, D. D. (2008).Probability based estimation theory for respondent driven sampling.
Journal of official statistics , 24(1):79.[Wang and Anderson, 2019] Wang, L.-F. and Anderson, D. E. (2019). Virusesin bats and potential spillover to animals and humans.
Current opinion invirology , 34:79–89.[Wang et al., 2006] Wang, L.-F., Shi, Z., Zhang, S., Field, H., Daszak, P., andEaton, B. T. (2006). Review of bats and sars.
Emerging infectious diseases ,12(12):1834. 25Zhou et al., 2020] Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L.,Zhang, W., Si, H.-R., Zhu, Y., Li, B., Huang, C.-L., et al. (2020). Apneumonia outbreak associated with a new coronavirus of probable batorigin.