Evaluation of Logistic Regression Applied to Respondent-Driven Samples: Simulated and Real Data

Abstract

Objective: To investigate the impact of different logistic regression estimators applied to RDS samples obtained by simulation and real data. Methods: Four simulated populations were created combining different connectivity models, levels of clusterization and infection processes. Each subject in the population received two attributes, only one of them related to the infection process. From each population, RDS samples with different sizes were obtained. Similarly, RDS samples were obtained from a real-world dataset. Three logistic regression estimators were applied to assess the association between the attributes and the infection status, and subsequently the observed coverage of each was measured. Results: The type of connectivity had more impact on the estimators’ performance than the clusterization level. In simulated datasets, unweighted logistic regression estimators emerged as the best option, although all estimators showed a fairly good performance. In the real dataset, the performance of weighted estimators presented some instabilities, making them a risky option. Conclusion: An unweighted logistic regression estimator is a reliable option to be applied to RDS samples, with similar performance to random samples and, therefore, should be the preferred option.

Sandro Sperandei, Leonardo S. Bastos, Marcelo Ribeiro-Alves, Arianne Reis, Francisco I. Bastos

Translational Health Research Institute, Western Sydney University, Australia
Institute of Scientific and Technological Communication & Information in Health, Oswaldo Cruz Foundation, Brazil
Scientific Computing Program, Oswaldo Cruz Foundation, Brazil
National Institute of Infectious Diseases Evandro Chagas, Oswaldo Cruz Foundation, Brazil
School of Health Sciences, Western Sydney University, Australia

* Corresponding author: [email protected]

Keywords: Respondent-driven sampling, logistic regression, simulation, hard-to-reach populations, statistical methods

Highlights:

• Unweighted logistic regressions are the best choice for RDS studies
• The RDS method can be applied to a broader spectrum of problems beyond hard-to-reach populations
• Weighted estimators can be heavily affected by real-world situations

INTRODUCTION

Respondent-driven sampling (RDS) is a chain-referral sampling method based on the key principle that the best recruiter for a hard-to-reach, marginalized or hidden population is a member of this very population. The method’s success in recruiting individuals from hard-to-reach populations is well accepted, and major international organizations have advocated its use, including the Centers for Disease Control and Prevention (Lansky & Mastro, 2008) and the World Health Organization (Johnston, Chen, Silva-Santisteban, & Raymond, 2013).

As an estimation method, it is based on the assumption that the size of an individual’s contact network is related to the probability of this individual being recruited to the sample. For this reason, the accepted procedure is to weight individuals by the inverse of their network size, so that individuals with smaller networks, who are therefore less likely to be recruited, receive a higher weight in prevalence estimation (Gile, Johnston, & Salganik, 2015).

The performance of RDS prevalence estimators has been assessed in many studies, using different methods, particularly simulations (e.g., Goel & Salganik, 2010; Mills, Johnson, Hickman, Jones, & Colijn, 2014), with varying results. In general, studies have shown an intermediate to high performance of RDS prevalence estimators (Mills et al., 2014; Rocha, Thorson, Lambiotte, & Liljeros, 2016; Sperandei et al., 2018). However, almost all currently proposed estimators for RDS samples aim only to estimate the prevalence of a condition in the population of interest and not to identify factors associated with that condition. In order to address this, Bastos et al. (2018) proposed a model-based estimator, called RDS-B, which can be used to estimate both prevalence and associated factors. Notwithstanding the capacity of RDS-B to estimate associated factors, the authors only used the estimator in its simplest form to estimate prevalence and did not fully address its model-based characteristics.

Several researchers who have analyzed RDS-based datasets have applied simple logistic regression estimators to assess the putative association between covariates and outcomes, irrespective of the varied study designs and the very characteristics of RDS, especially the underlying network structures (e.g., Do et al., 2018; Liu et al., 2018; Toro-Tobón, Berbesi-Fernandez, Mateu-Gelabert, Segura-Cardona, & Montoya-Vélez, 2018). Conversely, others have used some form of weighted logistic regression, adding weights obtained from reported network sizes (e.g., Hotton, Quinn, Schneider, & Voisin, 2018; Ndori-Mharadze et al., 2018; Szwarcwald et al., 2018). However, the influence of such sampling weights has not been assessed beyond what has been defined as the basic diagnostic tools to double-check either the sound or improper use of the standard RDS procedures (e.g., Gile et al., 2015).

The purpose of this paper is to address this gap in knowledge by assessing the performance of three logistic regression models in estimating true, expected relationships when applied to RDS samples generated by simulations. These estimators were then applied to real-life RDS sample data of transgender women from a large Brazilian study with a sample of 2,846 participants.

METHODS

Simulation

A total of four connected populations (N=10,000) were simulated using two random graph models, with and without the simulation of nested subpopulations. The random graphs used and the main parameters for each population were as follows:

• Erdös-Rényi without subpopulations (ER1): the simplest random graph structure, initially proposed by Erdös and Rényi (1959), where links between two members of the population were established at random, with a fixed probability (P). P was set at 0.0025;
• Erdös-Rényi with nested subpopulations (ER2): this population is similar to the previous one (ER1). However, instead of one population, five subpopulations were nested within it, with P set at 0.0125. Only ten individuals in each of the five subpopulations, chosen at random, were allowed to connect with other subpopulations;
• Barabasi-Albert without subpopulations (BA1): the scale-free model created by Barabasi and Albert (1999), also known as "rich get richer", follows a power-law distribution for connectivity. In summary, the population starts with one individual, and every new individual entering the population links with existing members with probability proportional to the connectivity degree (i.e. number of contacts) of each individual. It generates a few individuals with extremely high connectivity degrees and a majority of the population with few connections. The parameter needed is the number of links each new individual will establish when joining the population. In this simulation, such links were set to 12.5;
• Barabasi-Albert with nested subpopulations (BA2): five subpopulations with 2,000 individuals each were generated to construct this population. Subsequently, the ten individuals with the highest degree in each subpopulation were chosen to link randomly across these subpopulations.

All parameters were set in order to obtain, whatever the model, a mean connectivity degree of 20 in all populations.
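The two graph models described above can be illustrated with a minimal Python sketch. This is not the authors' code (the study used R with igraph); the function names, the small population size and the integer number of links are assumptions chosen for illustration, and the Barabasi-Albert version starts from a small fully connected core rather than from a single individual so that every newcomer can create m distinct links.

```python
import random


def erdos_renyi(n, p, rng):
    """Link every pair of individuals independently with probability p."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def barabasi_albert(n, m, rng):
    """Preferential attachment: each newcomer links to m existing members,
    chosen with probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    ends = []  # one entry per link end, so rng.choice(ends) is degree-proportional
    # Assumption: start from a fully connected core of m + 1 individuals.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            ends += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in sorted(targets):
            adj[new].add(t)
            adj[t].add(new)
            ends += [new, t]
    return adj


def mean_degree(adj):
    return sum(len(nb) for nb in adj.values()) / len(adj)


rng = random.Random(1)
er = erdos_renyi(500, 0.04, rng)    # expected mean degree is (n - 1) * p
ba = barabasi_albert(500, 10, rng)  # mean degree is close to 2 * m
print("mean degree:", round(mean_degree(er), 2), round(mean_degree(ba), 2))
print("max degree: ", max(len(v) for v in er.values()), max(len(v) for v in ba.values()))
```

Comparing the maximum degrees of the two toy networks gives a feel for the heavy tail of the preferential-attachment model, even though both networks have a similar mean degree.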

Explanatory Variables

To assess the performance of logistic estimators emulating actual associations, two binary explanatory variables were added as attributes of each individual, apart from the infected/not-infected status (see Infection Processes below). They were named E1 and E2. Each one has 50% positive and 50% negative cases, randomly distributed in the population. During the infection process, each individual in the population with a positive E1 variable has twice the chance of being infected. The purpose of this is to force a statistical association between E1 and the disease, while the variable E2 has no relationship with the disease.

Infection Processes

Four infection processes were simulated to emulate the dissemination of a particular disease in each of the populations. All processes are variations of the classical Susceptible-Infected (SI) model, where the infected individual does not recover from the disease. In the first process, individuals were selected at random and defined as "infected". The other three processes were dependent on network contacts; all three started with some randomly selected individuals defined as "infected" but, unlike the first process, from there the infection spread through the network contacts in successive waves. In each wave, all individuals connected to the infected ones had a probability of 0.005 of being infected. This infection rate was selected to avoid an out-of-control increase of the infected population (i.e. an unexpected outbreak). Each newly infected individual could infect their contacts in subsequent waves. All infected individuals kept infecting their contacts until the desired prevalence was reached. The infection prevalence was set at 30%.

Processes started with 10, 100, and 500 infected individuals, creating infections dependent on network connectivity. In the case of 10 initially infected individuals, all those infected were more closely related to the network of the initial individuals, given that each initial individual would generate, on average, an infected tree of about 300 individuals. In the case of 500 initially infected individuals, there would be a lower network connectivity dependency, with expected trees of only six individuals each. Also, the random process can be considered a particular case, where the process starts with 3,000 infected individuals (prevalence = 30% of 10,000). These processes simulate diseases that depend on interaction between susceptible and infected individuals.
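The wave-based SI process can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the ring-lattice toy network, the function names and the parameter values in the demo are hypothetical, and "twice the chance" for E1-positive individuals is modelled here as a doubled per-contact infection probability, which is only one possible reading of the text.

```python
import random


def ring_lattice(n, k):
    """Toy network (assumption): each individual is linked to the k nearest
    neighbours on each side of a ring."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0} for i in range(n)}


def random_binary_attribute(n, rng):
    """Half positive, half negative, randomly distributed (as for E1 and E2)."""
    values = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(values)
    return values


def simulate_si(adj, e1, n_seeds, beta, prevalence, rng, max_waves=100000):
    """Spread an SI infection in waves until the target prevalence is reached."""
    target = round(prevalence * len(adj))
    infected = set(rng.sample(sorted(adj), n_seeds))
    for _ in range(max_waves):
        if len(infected) >= target:
            break
        newly = set()
        for person in sorted(infected):
            for contact in sorted(adj[person]):
                if len(infected) + len(newly) >= target:
                    break  # stop as soon as the desired prevalence is reached
                if contact in infected or contact in newly:
                    continue
                # Assumption: E1-positive contacts face twice the per-contact risk.
                risk = beta * (2 if e1[contact] else 1)
                if rng.random() < risk:
                    newly.add(contact)
        infected |= newly
    return infected


rng = random.Random(3)
network = ring_lattice(400, 10)
e1 = random_binary_attribute(400, rng)
infected = simulate_si(network, e1, n_seeds=5, beta=0.05, prevalence=0.30, rng=rng)
print(len(infected), "infected out of", len(network))

# Expected size of each infection tree in the paper's setting: 3,000 infected
# individuals (30% of 10,000) divided by the number of initial cases.
for n_initial in (10, 100, 500):
    print(n_initial, "initial cases ->", 3000 // n_initial, "individuals per tree")
```

The last loop reproduces the arithmetic behind the tree sizes quoted in the text: fewer initial cases mean larger trees and therefore a stronger dependence of infection status on network position.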

Sampling Process

Benchmark samples were obtained by simple random sampling, applied to each combination of population and simulated infection pattern.

RDS samples were obtained by simulating an RDS process. All RDS sampling processes were launched using three randomly selected individuals ("seeds"). Each seed recruited one to three contacts at random from their network, with probabilities of 0.40, 0.40, and 0.20, respectively. These probabilities were based on empirical data from a study with drug users from Belo Horizonte, Brazil (unpublished data). Each recruited individual repeated the process, recruiting additional individuals from their network, and this pattern was repeated until the desired sample size was obtained. It is essential to highlight that, although the process is similar to the infection process previously described, each individual in the population recruits only one to three individuals. In contrast, in the infection process, they keep infecting other individuals until the end of the process.

No homophily-related bias was explicitly incorporated into the recruitment process, although previous studies have suggested that homophily may influence the process (Gile et al., 2015). The simulated samples were designed to reproduce a “perfect world”, following the RDS method assumptions, that is, seeds are recruited randomly, each recruiter recruits randomly among their contacts, no recruitee refuses to participate and all report their network size accurately.

In all cases, 1,000 samples of each of three sample sizes (i.e. 100, 250, and 500 individuals) were obtained from each combination of population and infection, and all three logistic estimators were applied to them.
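The recruitment procedure can be sketched in Python as below. It is a minimal illustration rather than the authors' code: the function name and the toy network are hypothetical, and two details not stated in the text are assumed, namely that nobody is recruited twice (sampling without replacement) and that recruiters are processed in the order in which they joined the sample.

```python
import random
from collections import deque


def rds_sample(adj, sample_size, rng, n_seeds=3, probs=(0.40, 0.40, 0.20)):
    """Simulate RDS recruitment on a network given as {person: set(contacts)}.

    Returns the sampled individuals in recruitment order and a map from each
    sampled individual to their recruiter (None for seeds)."""
    seeds = rng.sample(sorted(adj), n_seeds)
    recruiter = {s: None for s in seeds}
    queue = deque(seeds)
    while queue and len(recruiter) < sample_size:
        person = queue.popleft()
        # Number of recruits: 1, 2 or 3 with probabilities 0.40, 0.40, 0.20.
        n_recruits = rng.choices((1, 2, 3), weights=probs)[0]
        # Assumption: only contacts not yet in the sample can be recruited.
        eligible = [c for c in sorted(adj[person]) if c not in recruiter]
        rng.shuffle(eligible)
        for contact in eligible[:n_recruits]:
            if len(recruiter) == sample_size:
                break
            recruiter[contact] = person
            queue.append(contact)
    return list(recruiter), recruiter


# Toy, fully connected network of 60 individuals (assumption for the demo).
toy = {i: {j for j in range(60) if j != i} for i in range(60)}
sample, recruiter = rds_sample(toy, 25, random.Random(7))

# Inverse-degree weights, as used by the weighted estimators.
inv_degree = [1.0 / len(toy[person]) for person in sample]
print(len(sample), "individuals sampled;", sum(r is None for r in recruiter.values()), "seeds")
```

On sparse networks a recruitment chain can die out before the target size is reached, in which case this sketch simply returns a smaller sample; how the original simulation handled that situation is not described in the text.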

Logistic Estimators

Three variations of logistic regression estimators were applied to the above-simulated data. For each, a model with both variables and their interaction was fitted.

The first, used on both RDS and random samples, was the logistic regression estimator (Sperandei, 2014), with the frequentist likelihood estimator. It will be named here the "unweighted logistic", given that the other two estimators are weighted.

The second type of regression, called here "RDS-weighted logistic", takes into consideration the study design and the weighting of each individual, using the same form of weighting used in the RDS-I and RDS-II estimators (Heckathorn, 1997, 2002; Salganik & Heckathorn, 2004). It weights results from the simulations proportionally to the inverse of the reported degree of each individual (Volz & Heckathorn, 2008).

The third type of regression estimator, called “RDS-B” (Bastos et al., 2018), is a Bayesian version of the RDS-weighted logistic, where weakly informative priors are set on the coefficients (Gelman, Jakulin, Pittau, & Su, 2008), and the weighted likelihood, called pseudo-likelihood, is combined with the prior using Bayes’ theorem, leading to the pseudo-posterior distribution (Savitsky & Toth, 2015). Posterior means were used in order to make a comparison among estimators, and 95% credible intervals were used to represent uncertainty.

In the case of randomly selected samples, only the unweighted logistic estimator was used, defining a benchmark performance.
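To make the difference between the unweighted and the RDS-weighted logistic concrete, the sketch below fits both by Newton-Raphson on a toy dataset. It is an illustration only: the study itself used R (with the survey and arm packages), the function names and the toy degrees are hypothetical, only point estimates are computed (no design-based standard errors), and the Bayesian RDS-B estimator is not covered.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(rows, y, weights=None, iters=25):
    """Maximise the (weighted pseudo-)log-likelihood of a logistic model."""
    weights = weights or [1.0] * len(y)
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi, wi in zip(rows, y, weights):
            p = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(beta, x))))
            for a in range(k):
                grad[a] += wi * x[a] * (yi - p)
                for c in range(k):
                    hess[a][c] += wi * p * (1.0 - p) * x[a] * x[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


def design_row(e1, e2):
    """Intercept, E1, E2 and the E1xE2 interaction."""
    return [1.0, float(e1), float(e2), float(e1 * e2)]


# Toy data: 100 individuals per (E1, E2) cell; 25% infected if E1 = 0, 40% if E1 = 1.
rows, y = [], []
for e1 in (0, 1):
    for e2 in (0, 1):
        for i in range(100):
            rows.append(design_row(e1, e2))
            y.append(1 if i < (40 if e1 else 25) else 0)

# Hypothetical reported degrees: infected E1-positive individuals report 40
# contacts, everyone else 20. RDS weights are the inverse of the degree.
degree = [40 if (x[1] == 1.0 and yi == 1) else 20 for x, yi in zip(rows, y)]
rds_weights = [1.0 / d for d in degree]

unweighted = fit_logistic(rows, y)
weighted = fit_logistic(rows, y, rds_weights)
print("unweighted ORs:", [round(math.exp(b), 3) for b in unweighted[1:]])
print("weighted ORs:  ", [round(math.exp(b), 3) for b in weighted[1:]])
```

In this deliberately extreme toy example the reported degree is tied to both E1 and infection status, so down-weighting high-degree individuals changes the estimated E1 odds ratio; it is meant only to show the mechanism by which inverse-degree weights enter the estimation.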

Performance Assessment

Performance was assessed with the observed coverage metric, also known as coverage probability (Dodge, 2003). This is the proportion of times the confidence interval of each estimator contains the simulated population parameter. It means that, for the coefficient of E1, the OR confidence interval contains the value 2 simulated for each population. For this coefficient, the confidence interval also needs to exclude the value of 1, meaning a significant coefficient. The rationale for this second criterion is to prevent overly wide confidence intervals from being counted as good performance. For the coefficients of E2 and the interaction E1xE2, the OR confidence interval must contain the value of 1, meaning a non-significant coefficient, which is the simulated situation. For these two coefficients, the complementary probability (1 - coverage) will be used as an estimate of type-I error probability. Finally, a combination of E1, E2, and interaction results will be built to investigate the probability of a combined correct estimation from the model, meaning a significant E1 coefficient and non-significant coefficients for E2 and the interaction E1xE2. The word "significant" is used here in a broad sense, related to the usual 95% confidence interval, although we acknowledge that in Bayesian models these definitions are not strictly adequate.

All performances were compared to the random samples’ performance for each combination of population and infection.
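The criteria above translate directly into a few lines of code. The following Python sketch is an assumed formalisation (function names, dictionary keys and the four example intervals are made up for illustration); each replicate is represented by the OR confidence intervals of its three coefficients.

```python
def e1_correct(ci, true_or=2.0):
    """E1 counts as correct if the interval contains the simulated OR (2)
    and excludes 1, i.e. the coefficient is also 'significant'."""
    low, high = ci
    return low <= true_or <= high and not (low <= 1.0 <= high)


def null_correct(ci):
    """E2 and the interaction count as correct if the interval contains 1."""
    low, high = ci
    return low <= 1.0 <= high


def assess(replicates):
    """replicates: list of dicts with OR intervals under 'E1', 'E2', 'E1xE2'."""
    n = len(replicates)
    cover_e1 = sum(e1_correct(r["E1"]) for r in replicates) / n
    cover_e2 = sum(null_correct(r["E2"]) for r in replicates) / n
    cover_int = sum(null_correct(r["E1xE2"]) for r in replicates) / n
    combined = sum(
        e1_correct(r["E1"]) and null_correct(r["E2"]) and null_correct(r["E1xE2"])
        for r in replicates
    ) / n
    return {
        "coverage_E1": cover_e1,
        "type1_E2": 1 - cover_e2,
        "type1_interaction": 1 - cover_int,
        "combined": combined,
    }


# Four made-up replicates, for illustration only.
example = [
    {"E1": (1.2, 3.1), "E2": (0.7, 1.4), "E1xE2": (0.5, 1.9)},   # all correct
    {"E1": (0.9, 2.5), "E2": (0.8, 1.3), "E1xE2": (0.6, 1.7)},   # E1 interval includes 1
    {"E1": (2.2, 4.0), "E2": (1.1, 1.9), "E1xE2": (0.7, 1.5)},   # E1 misses 2, E2 false positive
    {"E1": (1.5, 2.6), "E2": (0.9, 1.2), "E1xE2": (1.05, 2.0)},  # interaction false positive
]
print(assess(example))
```

The second replicate shows why the extra criterion matters: its E1 interval does contain 2, but it is wide enough to include 1 as well, so it is not counted as a correct estimation.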

Real-Life Data

All four estimators were subsequently applied to the Divas Research dataset (Bastos et al., 2018), which is a large RDS-based study conducted across 12 cities in Brazil that collected data on 2,846 transgender women.

The entire dataset was combined and considered as one population, from which the expected parameters were estimated. Four variables were considered in this study to assess the performance of the estimators. HIV status (positive x negative) was considered the main outcome. The two explanatory variables considered were whether the person had acted as a sex worker at any time in their life (explanatory variable 1 – E1) and whether the person had moved from their place of birth at any time during their life (E2). E1 is expected to be related to HIV status, while E2 is not. The fourth variable was the reported number of contacts (network degree), which was used in RDS estimations. A total of 2,548 individuals were used to avoid missing information in any of the variables considered.

From this population, samples were extracted with sizes of 100, 250, and 500 individuals. First, 1,000 random samples of each sample size were used as benchmarks, similar to what was done in the simulation. Second, 1,000 samples were drawn following the RDS process. As the objective here is to observe the impact of real-world constraints and bottlenecks in the sampling procedure, these samples were extracted respecting the original RDS sampling from the dataset. Real seeds were randomly selected and the original recruitment trees were followed from each seed until the desired sample size was reached. By doing this, each sample used was a subsample of the original dataset, presenting all the characteristics found in real-life sampling.

Again, similarly to the process used in the simulation, the sample results were compared to the observed result from the population, and the number of correct estimations was counted.

The Divas study received ethics approval from the Escola Nacional de Saúde Publica (CAAE 49359415.9.0000.5240). All participants signed an informed consent form to take part in the study. The dataset was provided in a de-identified form and no additional approval was necessary for the current study.

All simulations and analyses used R software, version 3.4.4 (R Core Team, 2018) and its packages igraph (Csardi & Nepusz, 2006), survey (Lumley, 2004), and arm (Gelman & Su, 2018).
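The tree-following subsampling described in this section can be sketched as follows. This is a hypothetical Python illustration, not the authors' R code: the text does not say in which order individuals within a tree were taken or how several seeds were combined, so the sketch assumes that seeds are visited one at a time in random order and that each recorded tree is followed wave by wave (breadth-first).

```python
import random
from collections import deque


def tree_subsample(children, seeds, sample_size, rng):
    """Follow recorded recruitment trees until sample_size is reached.

    children maps each recruiter to the list of people they recruited in the
    original study; seeds is the list of original seeds."""
    order = list(seeds)
    rng.shuffle(order)  # "real seeds were randomly selected"
    sample = []
    for seed in order:
        queue = deque([seed])
        # Assumption: follow each tree wave by wave (breadth-first).
        while queue and len(sample) < sample_size:
            person = queue.popleft()
            sample.append(person)
            queue.extend(children.get(person, []))
        if len(sample) == sample_size:
            break
    return sample


# Made-up recruitment records with two seeds, for illustration only.
records = {"s1": ["a", "b"], "a": ["c"], "s2": ["d"], "d": ["e", "f"]}
subsample = tree_subsample(records, ["s1", "s2"], 5, random.Random(11))
print(subsample)
```

Because the links come from the recorded recruitment, features such as short trees or seeds that recruited nobody carry over into every subsample, which is the point of this design.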

RESULTS

Results of the simulated populations can be seen in Figure 1. Red dots represent infected individuals, while blue dots are non-infected individuals. A considerably different pattern can be noted between the two random graph models used, and an even more dramatic effect between clustered and non-clustered populations. Comparing ER and BA networks, it is clear that highly connected individuals, located on the borders of the population, have a higher chance of becoming infected in the BA model. In ER models, as the distribution of degrees does not present heavy tails, the infection is more uniformly spread. The same pattern can be observed in models with well-defined subpopulations, with one additional characteristic: the clustered nature of these models resulted in parts of the population being almost untouched by infection.

Figure 1. Populations created. Blue vertices and edges are for non-infected individuals. Red vertices and edges are for infected individuals. A: ER1 model. B: BA1 model. C: ER2 model. D: BA2 model.

Table 1 presents the main characteristics of each simulated population as well as the Divas dataset. It can be noted that all main characteristics were successfully simulated. The Barabasi-Albert models showed a discrepancy between the average and the median degree due to the asymmetric nature of the model’s degree distribution.

Table 1. Main characteristics of simulated and Divas populations.

                            Population
Characteristic              ER1             ER2             BA1             BA2             Divas
Mean Degree                 20.03           19.95           19.99           19.95           20.21
Median Degree               20.0            20.0            14.0            14.0            10.0
Min – Max Degree            4 – 37          5 – 39          10 – 541        10 – 247        2 – 100
Infection Prevalence (%)    30.2 – 31.6*    30.4 – 31.9*    30.0 – 33.0*    29.8 – 32.5*    29.98
E1 Prevalence (%)           50              50              50              50              76.4
E2 Prevalence (%)           50              50              50              50              60.9
E1 Odds Ratio               1.97 – 2.00*    1.98 – 2.05*    1.97 – 2.05*    1.98 – 2.04*    1.83
E2 Odds Ratio               0.90 – 1.01*    0.85 – 1.10*    0.76 – 1.09*    0.83 – 1.04*    1.26
E3 Odds Ratio               0.95 – 1.33*    0.94 – 1.21*    0.92 – 1.45*    0.98 – 1.40*    1.31

* Values represent the minimum and maximum range across the four types of infection

The simulated prevalence ranged from 14.6% (ER2) to 17.2% (BA1), very close to the desired value (15%). Regarding true ORs observed in the population, general logistic models fitted to the whole population (one for each population) detected significant ORs for variable E1, all between 1.95 and 2.05, after adjusting for E2 and the interaction. For variable E2, true ORs ranged from 0.81 to 1.10, all of them non-significant, as expected. Lastly, for the interaction factor (E1xE2), true ORs varied from 0.90 (ER1) to 1.20 (BA2). These results confirm the simulation process was adequate. Regarding the Divas population, a pattern towards a power-law distribution of connectivity and a clustered behaviour is expected, given the way the population was created, joining samples from twelve cities. This means that no individual will recruit outside their own city. Overall, the Divas dataset was most similar to the BA2 simulated population.

Figure 2 presents observed coverage probability results according to the network model, infection process, sample size, and estimators used for coefficient E1 alone. The most evident effect was related to the sample size: the larger the sample size, the higher the coverage. Regarding the estimators themselves, three of them had similar performances, with slightly better performance by the traditional logistic estimator applied to RDS samples. The estimator with the worst performance was the weighted-logistic estimator. However, even this estimator did not perform substantially below the logistic estimator applied to random samples (benchmark) and could be considered a satisfactory estimator. In regard to the effect of network models, it can be observed that populations without heavy tails in the distribution of degrees (ER1 and ER2) present very small differences between estimators, while heavy-tailed degree distributions (BA1 and BA2) seem to heavily affect the weighted estimators (RDS and Bayes) and favor the unweighted estimator applied to RDS samples. The presence of subpopulations (ER2 and BA2) had little to no effect on the estimators’ performance for E1 or on the analysis of the combined coefficients. Lastly, it is interesting to note that, in Barabasi-Albert model-based populations, the unweighted estimator applied to RDS samples presented a better performance when the infection was not random, even when compared to random samples.

Figure 2. Observed coverage probability results according to the combination of network models (each subgraph, as labelled), sample size (100, 250, 500) and infection process (10s, 100s, 500s, Rand).

In relation to type-I error probability, Figures 3 and 4 present the results for the E2 and interaction coefficients, respectively. Irrespective of the type of infection, sample size, network model or estimator, the type-I error probability for both coefficients was close to the expected value of 5%. Only for BA networks, under random infection, with n=500 (and to a lesser extent with n=250), was the error rate above this threshold, especially for the unweighted estimator applied to RDS samples. The error rate for the combined coefficients shows a general trend towards an additive effect, indicating a certain independence between the coefficients’ errors (Figure 5).

Figure 3. Type-I error rate for the E2 coefficient according to network models (each subgraph, as labelled), sample size (100, 250, 500) and infection process (10s, 100s, 500s, Rand).

Figure 4. Type-I error rate for the interaction coefficient according to network models (each subgraph, as labelled), sample size (100, 250, 500) and infection process (10s, 100s, 500s, Rand).

Figure 5. Type-I error rate for the E2 and the interaction coefficients according to network models (each subgraph, as labelled), sample size (100, 250, 500) and infection process (10s, 100s, 500s, Rand).

When the analyses of all three coefficients are combined, it is possible to see the general performance of the estimators in finding the “right answer” from the samples: a significant E1 coefficient with a confidence interval containing the simulated E1 effect, plus non-significant E2 and interaction coefficients. Figure 6 illustrates how results are very similar to those for the E1 coefficient, given the general stability of E2 and interaction results. The results for the random infection were the most affected, especially by the higher type-I error rate.

Figure 6. Observed coverage probability results for the combination of coefficients according to network models (each subgraph, as labelled), sample size (100, 250, 500) and infection process (10s, 100s, 500s, Rand).

A more interesting result was observed when the estimators were applied to the Divas dataset. First, the random samples behaved as expected, with a proportional increase in coverage for the E1 coefficient according to the sample size (Figure 7). Second, the unweighted estimator presented a similar behavior when applied to RDS samples compared to random samples. Third, weighted estimators presented a somewhat strange behavior, with unusually high coverage for smaller samples (compared to random) and smaller improvements with increasing size, especially the RDS-B, which showed a drop when the sample reached 500 individuals. This pattern was the same for the combination of all coefficients.

Figure 7. Observed coverage probability for E1 and all coefficients combined according to sample size and estimator.

Type-I error rates (Figure 8) were well below the expected value for the sample size of 100 and around 5% for the unweighted logistic estimator, whether applied to random or RDS samples. The weighted estimators showed a higher error rate, especially the RDS-weighted logistic estimator, which reached more than 40% with a sample size of 100. This represents a very high probability of wrong results when using this estimator.

Figure 8. Type-I error rate for the E2 and the interaction coefficients according to sample size and estimator.

DISCUSSION

The RDS method has been widely used and recommended as a sampling method to recruithard-to-reach populations, such as drug users, sex workers, transgender individuals, among others.Although its ability to find and recruit members of these “hidden” populations is uncontroversial, itsuse as an estimator method is still disputed (Sperandei et al., 2018). Moreover, the use of model-based estimators to study relationships between response and explanatory variables has been poorlyassessed, especially in regards to the basic question of when to use sampling weightings (Schonlau& Liebau, 2012). These issues notwithstanding, researchers have used traditional logistic estimatorsor some form of weighted logistic applied to RDS samples. A quick survey of the Pubmed databaseidentified 70 studies published between 2018 and 2019 applying logistic regression models to RDSamples, with 48.6% (n=34) using unweighted estimators, 44.3% (n=31) using some form ofweighting with network degrees, and 7.1% (n=5) presenting both weighted and unweighted models.This pattern highlights the evident lack of consensus in the current literature on which type ofestimator should be used.Our simulations have demonstrated not only the impact of data and populationcharacteristics but also the estimator used on results of an RDS study. Although some interactionswith other factors must be considered, it seems that weighted and unweighted estimators performedrelatively well when compared to logistic regression applied to random samples. To the best of our knowledge, only one study assessed the impact of weights used on RDSsample estimates from logistic regression models (or any other form of model estimates) using asimulation approach, and, similarly to our results, it concluded that unweighted estimators performbetter than the weighted ones (Avery et al., 2019). 
However, the lack of a clear structure in thesimulated network connections and the absence of real data to reflect real sampling problems, incomparison to "perfect" simulated samples, left many issues unaddressed. First, Avery et al.’s(2019) study used only simple logistic models, with just one explanatory variable, not consideringthe effect of interaction between explanatory variables on the result. Second, this study confoundedclustering with homophily, when they are, in fact, separate concepts (Rocha et al., 2016; Sperandeiet al., 2018). Clustering represents the phenomenon of individuals being more connected to theirsimilar ones (in one or more characteristics such as age, geography, etc.), whereas homophilyrelates to preferential recruitment, where people choose to recruit those peers with particularcharacteristics (that the recruiter also possesses), instead of recruiting randomly (Lu et al., 2012). Inthe present study, we addressed these limitations by creating populations based on theoretical graphmodels, controlling the connectivity process. From the results, comparing the two models used here,it is clear the impact of the nature of connectivity on the performance of estimators, which isreinforced by previous research on simple prevalence estimators (Rocha et al., 2016; Sperandei etal., 2018).In addition, we used an adapted concept of "coverage probability" to reflect not only theidentification of correct estimation of the E1 coefficient but also the simultaneous identification ofE2 and the E1xE2 interaction, representing the proportion of correct estimation for the completehypothesis. It represents a more restrictive criterion compared to the usual coverage because itrequires all three hypotheses (E1, E2, E1xE2) being true at the same time.owever, simulations can only approximate the characteristics of the real world, theirsuccess being dependent on previous knowledge about the population being simulated. 
Thisknowledge, in the case of hard-to-reach populations, can be very restricted. The use of real dataallows us to observe what happens when RDS is applied in the real world. In our simulatedscenarios, RDS sampling followed best practice described for the method, with random selection ofseeds, long recruitment trees, and each recruiter “selecting” randomly amongst their peers (Salganik& Heckathorn, 2004; Volz & Heckathorn, 2008). In practice, it is common to see "dead seeds"(seeds that do not recruit any peers), recruitment trees with mixed length, and true homophily, withrecruiters choosing selectively amongst their peers (Li et al., 2018). Also, time, resource, andlogistical constraints are common, and their impacts on estimation are unknown (Truong et al.,2013, Valois-Santos et al., 2020). Considering a large sample as a population, and using realrecruitment trees as RDS samples, is not a perfect approach. However, we argue that it is one of thebest possible ways of assessing RDS estimators in real life.In this dataset, the random samples acted as a benchmark to what would be expected, giventhat, for any population, random samples are considered the gold standard sampling method. Theresults show the expected increase in coverage according to the sample size. The most excitingfinding was the performance of the unweighted logistic estimator applied to RDS samples, whichshowed similar results compared to random samples, sometimes even better. The results with realdata represent a decreased performance in comparison to simulation results, showing the effects ofdifferences between theoretical sampling procedures and real ones; however, it still performs welland is a good alternative to be used with RDS samples, similarly to what Avery et al. (2019) found.On the other hand, weighted estimators presented more aberrant behavior, especially theRDS-B, which presented higher coverage with smaller samples. 
At a lower intensity, the RDS-weighted estimator also showed an unexpectedly high power with samples of size 100, but the increase with bigger sample sizes was not so considerable. This behavior, also partially observed in the performance of the unweighted logistic estimator, is probably related to the differences between simulated and real sampling procedures. Regarding the type-I error rate, the RDS-weighted estimator showed a very high value, representing a high chance of a wrong result.

Several studies have demonstrated the advantages of weighting procedures for simple prevalence estimation with RDS (Goel & Salganik, 2010; Mills et al., 2014; Sperandei et al., 2018). However, the present results demonstrate that weighting may not be the best option when it comes to regression coefficient estimates, making the unweighted estimator the preferable one instead.
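
As an illustration of the adapted coverage criterion discussed above, the following minimal Python sketch contrasts the usual per-coefficient coverage with the joint criterion. It assumes, for illustration only, that a term counts as correctly estimated when its confidence interval contains the true coefficient; the function names, the true values, and the intervals are hypothetical and are not taken from the study's code or results.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
TERMS = ("E1", "E2", "E1xE2")  # model terms named in the text


def covers(interval: Interval, truth: float) -> bool:
    """True when the interval contains the true coefficient (assumed criterion)."""
    low, high = interval
    return low <= truth <= high


def marginal_coverage(replicates: List[Dict[str, Interval]],
                      truth: Dict[str, float], term: str) -> float:
    """Usual coverage: share of replicates whose interval for one term is correct."""
    hits = sum(covers(rep[term], truth[term]) for rep in replicates)
    return hits / len(replicates)


def joint_coverage(replicates: List[Dict[str, Interval]],
                   truth: Dict[str, float]) -> float:
    """Adapted coverage: share of replicates where all three terms are correct."""
    hits = sum(all(covers(rep[t], truth[t]) for t in TERMS) for rep in replicates)
    return hits / len(replicates)


# Hypothetical true coefficients and intervals from four made-up replicates.
truth = {"E1": 1.0, "E2": 0.0, "E1xE2": 0.0}
replicates = [
    {"E1": (0.5, 1.5), "E2": (-0.4, 0.4), "E1xE2": (-0.6, 0.6)},  # all correct
    {"E1": (0.2, 0.9), "E2": (-0.3, 0.5), "E1xE2": (-0.5, 0.7)},  # E1 missed
    {"E1": (0.6, 1.4), "E2": (0.1, 0.8), "E1xE2": (-0.2, 0.9)},   # E2 missed
    {"E1": (0.7, 1.6), "E2": (-0.5, 0.3), "E1xE2": (-0.4, 0.5)},  # all correct
]

for term in TERMS:
    print(term, marginal_coverage(replicates, truth, term))
print("joint", joint_coverage(replicates, truth))
```

With these made-up intervals, each marginal coverage is at least 0.75 while the joint coverage is 0.5, which illustrates why the joint criterion is more restrictive than the usual one.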

CONCLUSION

In summary, this study demonstrated that unweighted logistic regression is the best option to be used with RDS samples, particularly when the requirements of the RDS method are respected. Even in real RDS samples, however, it performed as well as random sampling. These findings suggest that the RDS method is applicable to a broader spectrum of research designs, even where true random sampling is difficult to achieve.

REFERENCES

Avery, L., Rotondi, N., McKnight, C., Firestone, M., Smylie, J., & Rotondi, M. (2019). Unweighted regression models perform better than weighted regression techniques for respondent-driven sampling data: results from a simulation study. BMC Medical Research Methodology, (1), 202. https://doi.org/10.1186/s12874-019-0842-5

Barabasi, A., & Albert, R. (1999). Emergence of scaling in random networks. Science (New York, N.Y.), (5439), 509–512.

Bastos, F. I., Bastos, L. S., Coutinho, C., Toledo, L., Mota, J. C., Velasco-de-Castro, C. A., … Malta, M. S. (2018). HIV, HCV, HBV, and syphilis among transgender women from Brazil. Medicine, (1S Suppl 1), S16–S24. https://doi.org/10.1097/MD.0000000000009447

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Sy, 1695.

Do, T. T. T., Le, M. D., Van Nguyen, T., Tran, B. X., Le, H. T., Nguyen, H. D., … Zhang, M. W. B. (2018). Receptiveness and preferences of health-related smartphone applications among Vietnamese youth and young adults. BMC Public Health, (1), 764. https://doi.org/10.1186/s12889-018-5641-0

Dodge, Y., & Marriott, F. H. C. (2003). The Oxford dictionary of statistical terms. 6th ed. New York: Oxford University Press.

Erdös, P., & Rényi, A. (1959). On random graphs, I. Publicationes Mathematicae (Debrecen), 290–297.

Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, (4), 1360–1383. https://doi.org/10.1214/08-AOAS191

Gelman, A., & Su, Y.-S. (2018). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models.

Gile, K. J., Johnston, L. G., & Salganik, M. J. (2015). Diagnostics for respondent-driven sampling. Journal of the Royal Statistical Society: Series A (Statistics in Society), (1), 241–269. https://doi.org/10.1111/rssa.12059

Goel, S., & Salganik, M. J. (2010). Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences of the United States of America, (15), 6743–6747. https://doi.org/10.1073/pnas.1000261107

Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems, (2), 174–199. https://doi.org/10.2307/3096941

Heckathorn, D. D. (2002). Respondent-Driven Sampling II: Deriving Valid Population Estimates from Chain-Referral Samples of Hidden Populations. Social Problems, (1), 11–34.

Hotton, A., Quinn, K., Schneider, J., & Voisin, D. (2018). Exposure to community violence and substance use among Black men who have sex with men: examining the role of psychological distress and criminal justice involvement. AIDS Care, 1–9. https://doi.org/10.1080/09540121.2018.1529294

Johnston, L. G., Chen, Y.-H., Silva-Santisteban, A., & Raymond, H. F. (2013). An empirical examination of respondent driven sampling design effects among HIV risk groups from studies conducted around the world. AIDS and Behavior, (6), 2202–2210. https://doi.org/10.1007/s10461-012-0394-8

Lansky, A., & Mastro, T. D. (2008). Using respondent-driven sampling for behavioural surveillance: response to Scott. The International Journal on Drug Policy, (3), 241–243; discussion 246-7. https://doi.org/10.1016/j.drugpo.2008.03.004

Li, J., Valente, T. W., Shin, H.-S., Weeks, M., Zelenev, A., Moothi, G., … Obidoa, C. (2018). Overlooked Threats to Respondent Driven Sampling Estimators: Peer Recruitment Reality, Degree Measures, and Random Selection Assumption. AIDS and Behavior, (7), 2340–2359. https://doi.org/10.1007/s10461-017-1827-1

Liu, Y., Jiang, C., Li, S., Gu, Y., Zhou, Y., An, X., … Pan, G. (2018). Association of recent gay-related stressful events with depressive symptoms in Chinese men who have sex with men. BMC Psychiatry, (1), 217. https://doi.org/10.1186/s12888-018-1787-7

Lu, X., Bengtsson, L., Britton, T., Camitz, M., Kim, B. J., Thorson, A., & Liljeros, F. (2012). The sensitivity of respondent-driven sampling. Journal of the Royal Statistical Society: Series A (Statistics in Society), (1), 191–216. https://doi.org/10.1111/j.1467-985X.2011.00711.x

Lumley, T. (2004). Analysis of Complex Survey Samples. Journal of Statistical Software, (8). https://doi.org/10.18637/jss.v009.i08

Mills, H. L., Johnson, S., Hickman, M., Jones, N. S., & Colijn, C. (2014). Errors in reported degrees and respondent driven sampling: implications for bias. Drug and Alcohol Dependence, 120–126. https://doi.org/10.1016/j.drugalcdep.2014.06.015

Ndori-Mharadze, T., Fearon, E., Busza, J., Dirawo, J., Musemburi, S., Davey, C., … Cowan, F. (2018). Changes in engagement in HIV prevention and care services among female sex workers during intensified community mobilization in 3 sites in Zimbabwe, 2011 to 2015. Journal of the International AIDS Society, 21 Suppl 5 (Suppl Suppl 5), e25138. https://doi.org/10.1002/jia2.25138

R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Rocha, L. E. C., Thorson, A. E., Lambiotte, R., & Liljeros, F. (2016). Respondent-driven sampling bias induced by community structure and response rates in social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), n/a-n/a. https://doi.org/10.1111/rssa.12180

Salganik, M. J., & Heckathorn, D. D. (2004). Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling. Sociological Methodology, (1), 193–240. https://doi.org/10.1111/j.0081-1750.2004.00152.x

Savitsky, T. D., & Toth, D. (2015). Bayesian Estimation Under Informative Sampling.

Schonlau, M., & Liebau, E. (2012). Respondent-driven sampling. Stata Journal, (1), 72–93.

Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia Medica, (1), 12–18. https://doi.org/10.11613/BM.2014.003

Sperandei, S., Bastos, L. S., Ribeiro-Alves, M., & Bastos, F. I. (2018). Assessing respondent-driven sampling: A simulation study across different networks. Social Networks, 48–55. https://doi.org/10.1016/j.socnet.2017.05.004

Szwarcwald, C. L., Damacena, G. N., de Souza-Júnior, P. R. B., Guimarães, M. D. C., de Almeida, W. da S., de Souza Ferreira, A. P., … Brazilian FSW Group. (2018). Factors associated with HIV infection among female sex workers in Brazil. Medicine, (1S Suppl 1), S54–S61. https://doi.org/10.1097/MD.0000000000009013

Toro-Tobón, D., Berbesi-Fernandez, D., Mateu-Gelabert, P., Segura-Cardona, Á. M., & Montoya-Vélez, L. P. (2018). Prevalence of hepatitis C virus in young people who inject drugs in four Colombian cities: A cross-sectional study using Respondent Driven Sampling. The International Journal on Drug Policy, 56–64. https://doi.org/10.1016/j.drugpo.2018.07.002

Truong, H. H. M., Grasso, M., Chen, Y.-H., Kellogg, T. A., Robertson, T., Curotto, A., … McFarland, W. (2013). Balancing theory and practice in respondent-driven sampling: a case study of innovations developed to overcome recruitment challenges. PloS One, (8), e70344. https://doi.org/10.1371/journal.pone.0070344

Valois-Santos, N. T., Niquini, R. P., Sperandei, S., Bastos, L. S., Bertoni, N., Brito, A. M., & Bastos, F. I. (2020). Reassessing geographic bottlenecks in a respondent-driven sampling based multicity study in Brazil. Salud Colectiva, 16, e2524.

Volz, E., & Heckathorn, D. D. (2008). Probability Based Estimation Theory for Respondent-Driven Sampling. Journal of Official Statistics, 24