[PDF] Risk Based Arsenic Rational Sampling Design for Public and Environmental Health Management

Abstract

Groundwater contaminated with arsenic has been recognized as a global threat, which negatively impacts human health. Populations that rely on private wells for their drinking water are vulnerable to the potential arsenic-related health risks such as cancer and birth defects. Arsenic exposure through drinking water is among one of the primary arsenic exposure routes that can be effectively managed by active testing and water treatment. From the public and environmental health management perspective, it is critical to allocate the limited resources to establish an effective arsenic sampling and testing plan for health risk mitigation. We present a spatially adaptive sampling design approach based on an estimation of the spatially varying underlying contamination distribution. The method is different from traditional sampling design methods that often rely on a spatially constant or smoothly varying contamination distribution. In contrast, we propose a statistical regularization method to automatically detect spatial clusters of the underlying contamination risk from the currently available private well arsenic testing data in the USA, Iowa. This approach allows us to develop a sampling design method that is adaptive to the changes in the contamination risk across the identified clusters. We provide the spatially adaptive sample size calculation and sampling location determination at different acceptance precision and confidence levels for each cluster. The spatially adaptive sampling approach may effectively mitigate the arsenic risk from the resource management perspectives. The model presents a framework that can be widely used for other environmental contaminant monitoring and sampling for public and environmental health.

Full PDF

RRisk Based Arsenic Rational Sampling Design for Public andEnvironmental Health Management

Lihao Yin a,b , ∗ , Huiyan Sang a , ∗ , Douglas J. Schnoebelen c , ∗∗ , Brian Wels c , Don Simmons d ,Alyssa Mattson d , Michael Schueller d , Michael Pentella d and Susie Y. Dai e , ∗∗∗ a Department of Statistics, Texas A&M University, College Station, Texas, 77843, USA b Institute of Statistics and Big Data, Renmin University, Beijing, China c University of Iowa, Iowa City, IA 52242 d State Hygienic Laboratory, University of Iowa, Coralville, Iowa, 52241, USA e Department of Plant Pathology and Microbiology, Texas A&M University, College Station, TX, 77843, USA

A R T I C L E I N F O

Keywords :private wellspatially clustered function modelresource management

A B S T R A C T

Groundwater contaminated with arsenic has been recognized as a global threat, which negativelyimpacts human health. Populations that rely on private wells for their drinking water are vulnera-ble to the potential arsenic-related health risks such as cancer and birth defects. Arsenic exposurethrough drinking water is among one of the primary arsenic exposure routes that can be eﬀec-tively managed by active testing and water treatment. From the public and environmental healthmanagement perspective, it is critical to allocate the limited resources to establish an eﬀectivearsenic sampling and testing plan for health risk mitigation. We present a spatially adaptive sam-pling design approach based on an estimation of the spatially varying underlying contaminationdistribution. The method is diﬀerent from traditional sampling design methods that often relyon a spatially constant or smoothly varying contamination distribution. In contrast, we proposea statistical regularization method to automatically detect spatial clusters of the underlying con-tamination risk from the currently available private well arsenic testing data in the USA, Iowa.This approach allows us to develop a sampling design method that is adaptive to the changes inthe contamination risk across the identiﬁed clusters. We provide the spatially adaptive samplesize calculation and sampling location determination at diﬀerent acceptance precision and conﬁ-dence levels for each cluster. The spatially adaptive sampling approach may eﬀectively mitigatethe arsenic risk from the resource management perspectives. The model presents a frameworkthat can be widely used for other environmental contaminant monitoring and sampling for publicand environmental health.

1. Introduction

Arsenic (As) is ranked as the 20th most abundant element in the Earth’s crust and has been studied internationally.Groundwater contaminated with arsenic has been recognized as a global threat, negatively impacting human health[1, 2]. The primary human exposure to arsenic is drinking water with additional contributors such as food and air [3–5]. Arsenic is a potent human carcinogen, which can cause bladder, lung, and skin cancers [6]. Furthermore, arsenicand its metabolites can cross the placental barrier and create risk for adverse maternal and fetal health, leading toadverse birth outcomes [7]. The Environmental Protection Agency (EPA) federal drinking water standard established0.01 mg/L as the arsenic maximum contaminant levels (MCLs) in drinking water. In the USA, approximately 41.8million (13% of the total US population) people obtain drinking water from private wells, and the private wells arenot regulated under the current EPA regulation [8]. The recent national Water-Quality Assessment Program fromthe United States Geological Survey (USGS) reports that more than one out of ﬁve wells contain contaminants atconcentrations exceeding the EPA MCLs or USGS health-based screening levels. Among the various pollutants thatexceed the EPA maximum contaminant levels, arsenic contamination is a common ﬁnding. Because private wells arenot regulated in the US, in the Midwest region, a signiﬁcant percentage of the population depending on private wells fordrinking water is at risk due to drinking water arsenic contamination [9]. Arsenic testing in private well water represents ∗ Equally contributing authors. ∗∗ Current contact: U.S. Geological Survey, 5563 De Zavala Road, San Antonio, TX 78023 ∗∗∗

Corresponding author. Department of Plant Pathology and Microbiology, College Station, TX, 77843, USA. E-mail address: [email protected](Susie Dai PhD).

ORCID (s):

Yin et al.:

Preprint submitted to Elsevier

Page 1 of 13 a r X i v : . [ s t a t . A P ] F e b isk Based Arsenic Rational Sampling Design a fundamental mean that helps mitigate the arsenic risk in the rural population for public and environmental health.In reality, many of the private wells are not tested, which presents a signiﬁcant challenge for health risk mitigation.From the management perspective, a scientiﬁcally sound sampling plan to test a representative sample size is neededto characterize the environmental arsenic hazard with limited resources.Sampling theory can be used to guide a large number of chemical and biological analyses for environmental controland consumer safety [10]. As for arsenic testing, a systematic sampling plan is critical for risk assessment to drawscience and data-based conclusions and make the best usage of limited resources. The EPA has published guidancefor data quality objectives with regard to sampling design [11]. One of the key preparations for a sampling design isto determine the sample size and sampling error for representative sample collection.Understanding sample statistical distributions is critical when selecting a sampling method, sampling strategy, andsample size. Application of probability distribution can help develop a science-based sampling plan and estimate thechemical and biological hazards in the environment. Previously, binomial probability theory has been well studiedfor sample size determination for estimating a binomial proportion [12]. Application examples include the samplingplan in product inspection and surveillance [13], epidemiology [14], and medical diagnostics [15]. In many of theseapplications, a univariate binomial distribution is considered, that is, the underlying binomial proportion parameter isassumed to be a constant in the study. However, due to the spatial heterogeneity nature of arsenic distribution in theearth’s crust and groundwater, the traditional binomial sampling scheme based on a univariate binomial distributionmay not be suitable to survey the target private well population. There is a great need to develop new sampling schemescapable of accounting for the spatially heterogeneity nature of the arsenic distribution.In terms of arsenic contamination, quite a few statistical and mathematical models have been used to estimate andpredict arsenic concentrations in groundwater and private wells. Logistic models for binomial distributions are widelyadopted to estimate the spatial distribution of As contamination probability at both global and regional levels [16–22].For instance, a logistic linear regression model has been used to predict the high arsenic domestic well population inthe US [23]. Furthermore, boosted regression tree models (weak-learner ensemble models) and traditional logisticlinear models have been compared to estimate and predict arsenic contamination probabilities in drinking water wellsin the Central Valley, California [24]. Similar to those statistical models, predictive variables are used to predictgeogenic arsenic in drinking water wells in glacial aquifers, north-central USA [25]. Machine learning models havealso been used to predict arsenic concentrations in groundwater in Asia [26]. Nevertheless, the aforementioned modelsprimarily focus on the estimation and prediction of arsenic distributions rather than the sampling design. Moreover,most methods often rely on a rich set of predictors and training data set to guarantee model accuracy. To the best ofour knowledge, there is very limited work that combines the model-based estimation of varying arsenic distributionswith the binomial sampling design method.To close this gap in the current literature for spatial binomial distribution sampling design, the current study pro-poses a spatially adaptive sampling design approach, by estimating a spatially clustered underlying contaminationdistribution. We apply this method to determine the data locations to understand arsenic contamination in privatewells in Iowa. The method is diﬀerent from traditional spatial sampling design methods [27, 28] that often assumecontinuous process-based spatial models for relatively smooth spatial ﬁelds. In contrast, we model the underlyingcontamination risk as a spatially clustered function for a straightforward interpretation of the result. It also has theadvantage of detecting discontinuous spatial heterogeneity in the arsenic distribution and then borrowing informationwithin each identiﬁed spatially homogeneous cluster for an adaptive sampling design. The method is built upon agraph fused lasso regularization method [29], which automatically detects clusters of spatial units and estimates theunderlying spatially varying contamination distributions simultaneously. Thanks to the ﬂexibility of graphs, our spa-tial clustering model enjoys several nice properties. First, it leads to very ﬂexible cluster shapes naturally satisfyingspatial contiguity constraints. Second, the method automatically learns the number of clusters from the data, relaxingthe limitation in other clustering algorithms that require to specify the number of clusters a priori.Another unique advantage of estimating a spatially clustered contamination distribution over other contaminationdistribution estimation methods lies in its easy integration with the traditional binomial sampling theory. Within eachidentiﬁed spatial cluster, the contamination distribution can be treated as having a common binomial proportion, forwhich we propose and compare two diﬀerent sample size determination methods at diﬀerent levels of acceptanceprecision and conﬁdence. Given the sample size calculations, a remaining sample design task is to determine thesampling locations. In our study, both the candidate wells and the available tested wells are distributed highly unevenlyin the study region. To ensure the sampling design has a balanced spatial coverage, we propose a practical algorithmbased on spatial point processes to distinguish areas that have been suﬃciently-sampled and insuﬃciently-sampled, Yin et al.:

Preprint submitted to Elsevier

Page 2 of 13isk Based Arsenic Rational Sampling Design and determine new sampling locations accordingly. This new strategy, presumably more adaptive than traditionalsampling without considering heterogeneity in sampling distributions, can potentially provide more precise tools toeﬃciently allocate sample collection eﬀorts and resources.

2. Materials and Methods

For the private well samples, the data used to build the model was collected as part of the Iowa Grants-to-Counties(GTC) program. The Iowa GTC program was established in 1987 after the Iowa legislature passed the Iowa Ground-water Protection Act to protect groundwater. Arsenic testing has been included as part of the GTC program based onIowa Administrative Code [30]. A total of , samples were collected and analyzed at the University of Iowa StateHygienic Laboratory from July 1st, 2015 to June 16, 2020. As part of the GTC program, the local health departmentcollects the private well samples by conducting a home visit, and sending them to a laboratory for analysis. It shouldbe noted that the selection of the laboratory is at the county’s discretion.For all the samples analyzed at the State Hygienic Laboratory, the water sample is collected either at the tap faucetor outside the house. Samples are collected in a 4 oz. HDPE plastic bottle containing 1 mL of nitric acid as apreservative. Cooling is not required for sampling. Samples are screened for turbidity following Standard Methods2130 B using a HACH model 2100N Turbidimeter. Samples exceeding 1 nephelometric turbidity units (NTU) aredigested prior to analysis. The arsenic analysis is performed based on the Iowa State Hygienic Laboratory standardoperating procedure (SOP), similar to the EPA 200.2 method. Brieﬂy, a 50-mL aliquot is transferred from a well-mixedsample to a polypropylene digestion tube (Environmental Express hydrochloric acid (Fisher, Trace Metal Grade) are added to the tubes. Digestion is accomplished using ahot block (Environmental Express ◦ C. The sample volume is reduced to 10 mL, and thenthe sample is covered with a watch glass (Environmental Express , previously collected observations of Arsenic tests in total (Figure 1). Based onthe risk categories, we characterize the wells that contain higher than 0.01 mg/L arsenic as high risk wells, and use abinary variable to denote whether a well is at high risk. We exclude the observations whose location information isabsent. We also aggregate the repeated measurements at the same locations into one single observation, by setting thebinary value to be if there is at least one concentration measurement exceeding MCL. A visual presentation of theprivate well arsenic testing is available through the Iowa Department of Public Health website [31]. Let 𝑦 ( 𝐬 𝑖 ) denote the binary variable at a well location 𝐬 𝑖 , for 𝑖 = 1 , … , 𝑛 , coded as being if the arsenic concentrationat 𝐬 𝑖 is exceeding the EPA MCL (i.e., 0.01 mg/L), and otherwise. Here, 𝑛 is the total number of available tested wells.We propose a spatially varying binary logistic model for 𝑦 ( 𝐬 ) . Speciﬁcally, we assume 𝑃 ( 𝑦 ( 𝐬 𝑖 ) = 1 ) ∼ Bernoulli ( 𝑝 ( 𝐬 𝑖 ) ) , for 𝑖 = 1 , … , 𝑛, (1)where 𝑝 ( 𝐬 𝑖 ) is the probability of the well located at 𝐬 𝑖 being contaminated. In the logistic regression model, we modelthe probability 𝑝 ( 𝐬 𝑖 ) as 𝑝 ( 𝐬 𝑖 ) = 11 + exp{− 𝛽 ( 𝐬 𝑖 )} or equivalently, log 𝑝 ( 𝐬 𝑖 )1− 𝑝 ( 𝐬 𝑖 ) = 𝛽 ( 𝐬 𝑖 ) , where 𝛽 ( 𝐬 𝑖 ) is interpreted as the log-odds of the arsenic contamination event that 𝑦 ( 𝐬 𝑖 ) = 1 . Let 𝜷 = ( 𝛽 ( 𝐬 ) , … , 𝛽 ( 𝐬 𝑛 ) ) be the stacked regression parameters for all the observed well locations. It Yin et al.:

Preprint submitted to Elsevier

Page 3 of 13isk Based Arsenic Rational Sampling Design follows that the corresponding logistic regression log likelihood function takes the form: 𝓁 ( 𝜷 ) = − 𝑛 ∑ 𝑖 =1 log(1 + 𝑒 𝛽 ( 𝐬 𝑖 ) ) + 𝑛 ∑ 𝑖 =1 𝑦 ( 𝐬 𝑖 ) 𝛽 ( 𝐬 𝑖 ) (2)It is noted from (1) that we relax the assumption of having a constant contamination probability 𝑝 , or equivalently,contamination log-odds, 𝛽 , over the whole study region, and instead assume it is varying over space. This assumption isreasonable for a large study region like Iowa due to the anticipated spatial heterogeneity in the arsenic concentration ingroundwater and private wells. Speciﬁcally, we assume 𝑝 ( 𝐬 ) is a spatially clustered function, that is, there exists a num-ber of geographical clusters such that 𝑝 ( 𝐬 ) stays relatively homogeneous within each cluster but varies across clusters.This will facilitate the easy visualization and interpretation of the varying contamination probability across diﬀerentidentiﬁed clusters. We will show in Section 2.3 that the spatially clustered contamination probability estimation alsoleads to an eﬃcient spatially adaptive sampling design strategy.We consider a ﬂexible regularization model for pursing the clustered pattern of 𝛽 ( 𝐬 ) and 𝑝 ( 𝐬 ) . Regularizationmethods have gained large popularity in modern high dimensional statistics and machine learning methods for variousstatistical learning tasks [32]. They have proved to be eﬀective in imposing structural assumptions on model parameterssuch as sparsity, smoothness, and clustering to avoid over-ﬁtting problems. The regularization method for the Arseniccontamination model is performed in the following steps:1. Construct a spatial graph, denoted as 𝐺 = ( 𝑉 , 𝐸 ) where 𝑉 = { 𝑣 , 𝑣 , ..., 𝑣 𝑛 } is the vertex set with 𝑛 vertices and 𝐸 is the edge set. For a spatial problem, each vertex represents a spatial location. For example, in the arseniccase study, each vertex 𝑣 𝑖 represents a well location 𝐬 𝑖 , and the edge set 𝐸 reﬂects the prior assumption on theneighborhood structure of well locations based on spatial proximity. The edge set selection is an importantcomponent of the method, which we will discuss later in this section.2. Use the graph from step 1 to construct a homogeneity pursuit regularization, also called the fused lasso penaltyfunction [29, 33] , for 𝜷 as follows: 𝜌 ∑ ( 𝑖,𝑗 )∈ 𝐸 | 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 )) | . (3)3. Combine the penalty function in (3) with the logistic log-likelihood function in (2) to form a penalized objectivefunction, which we minimize to obtain an estimator of 𝜷 as follows: ̂ 𝜷 = arg min 𝜷 𝑄 ( 𝜷 ) = arg min 𝜷 {− 1 𝑛 𝓁 ( 𝜷 ) + 𝜌 ∑ ( 𝑖,𝑗 )∈ 𝐸 | 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 ) | } . (4)4. After obtaining ̂ 𝜷 , calculate the estimate of the contamination probability from ̂𝑝 ( 𝐬 𝑖 ) = 11 + exp(− ̂𝛽 ( 𝐬 𝑖 ) ) .The fused lasso regularization in step 2 is used to impose the assumption that the arsenic contamination probabilitiesat two wells are more likely to take the same value if they are connected by an edge in 𝐸 of the speciﬁed spatial graph.The objective function 𝑄 ( 𝜷 ) in (4) takes a similar form as the standard negative log-likelihood function from Bernoullidistributions for binary arsenic data, but with an added fused lasso regularization term to encourage spatial clustering of 𝜷 . As a result, when estimating the arsenic contamination probabilities from this penalized objective function 𝑄 ( 𝜷 ) , wenot only use the information from the binary arsenic testing data in the ﬁrst likelihood term, but also take into accountthe spatial information from the spatial-graph based fused lasso penalty in the second term. 𝜌 is a regularization tuningparameter determining the strength of fused lasso penalty and ultimately inﬂuencing the estimated number of clustersof wells. The solution of 𝐿 norm penalty results in an exact fusion or separation between 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 ) , that is, theedges in the graph are classiﬁed into two sets, one consists of all the non-zero elements of 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 ) correspondingto pairs of neighboring wells that have diﬀerent contamination probabilities, and the other set consists of all the zeroelements of 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 ) corresponding to pairs of neighboring wells that share the same contamination probability.As such, this regularization automatically leads to spatially clustered contamination probabilities.The choice of graph plays two important roles in the method; it not only reﬂects the prior information about thegeological topology and spatial clustering constraint of the data, but also determines the computation complexity ofthe algorithm. Some natural graph choices for spatial data include the 𝑘 nearest neighbor graphs, graphs connecting Yin et al.:

Preprint submitted to Elsevier

Page 4 of 13isk Based Arsenic Rational Sampling Design neighbors within a certain radius, and spatial Delaunay triangulation graphs (see, e.g., Li and Sang [34]). Alternatively,graphs can be constructed based on some preliminary estimates of parameters. For instance, the diﬀerences betweenthe initial estimates of parameters at any two vertices can be used as the distance metric between vertices to replace thespatial Euclidean distance when constructing graphs. In this paper, we take a hybrid approach to construct the graph;the 𝑘 nearest neighbor edge set connecting counties is determined based on the sample proportion within each county,and the 𝑘 nearest neighbor edge set within each county is determined based on the Euclidean distance.There are several advantages of using the fused lasso penalty function for cluster detection. First, this penalizationallows to detect clusters and estimate model parameters simultaneously. Second, this method guarantees to achieve aspatially contiguous clustering conﬁguration such that only adjacent locations are clustered together. Another appealingproperty of this method is that the resulting clusters have very ﬂexible shapes. We explain this point using the notionof connected components in graph theory; spatially contiguous clusters can be deﬁned as the connected componentsof a graph 𝐺 , and accordingly, a spatially contiguous partition of 𝑉 can be deﬁned as a collection of disjoint connectcomponents such that the union of vertices is 𝑉 . It is easy to show that any spatially contiguous partition with arbitrarycluster shapes can be recovered by removing a set of edges from a spatial graph [34]. In addition, the number ofclusters does not need to be ﬁxed a priori. Instead, we can determine it by a data-driven information criterion approachdescribed later in this section. Finally, besides its capability to capture piece-wise constant coeﬃcients, previoustheoretical studies proved that this penalty has a strong local adaptivity in that it is also capable of capturing piece-wise Lipschitz continuous functions [35], which implies that the method can also approximate a spatially smoothlyvarying contamination probability reasonably well.We now discuss how to solve the optimization in (4) to obtain the parameter estimation results. Note that − 𝑛 𝓁 ( 𝜷 ) is convex and diﬀerentiable with respect to 𝜷 , and ∑ ( 𝑖,𝑗 )∈ 𝐸 | 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 ) | is also convex. Therefore we propose aniterative algorithm combining the proximal gradient method [36] and the alternating direction method of multipliers(ADMM) [37] for this convex optimization problem. Speciﬁcally, given the current estimate 𝜷 ( 𝑡 ) , we let 𝐠 ( 𝑡 ) = 𝜷 ( 𝑡 ) +(1∕ 𝐿 ) 𝑛 ∇ 𝓁 ( 𝜷 ( 𝑡 ) ) , where 𝐿 is the Lipschitz constant of − 𝑛 𝓁 ( 𝜷 ) , and ∇ 𝓁 ( 𝜷 ( 𝑡 ) ) is the ﬁrst derivative of 𝓁 ( 𝜷 ) evaluated at 𝜷 ( 𝑡 ) . For the logistic regression model in (2), we can choose 𝐿 to be 𝑛 . Following the proximal gradient algorithm,we then update the value of 𝜷 by solving: 𝜷 𝑡 +1 = arg min 𝜷 ‖‖‖ 𝜷 − 𝐠 ( 𝑡 ) ‖‖‖ + 𝜌𝐿 ∑ ( 𝑖,𝑗 )∈ 𝐸 | 𝛽 ( 𝐬 𝑖 ) − 𝛽 ( 𝐬 𝑗 ) | . (5)We use the ADMM algorithm [38] to solve the optimization in (5). We will release the R code of our algorithm as asupplementary ﬁle upon acceptance of this manuscript for publication.Finally, the parameter estimation algorithm involves the selection of the tuning parameter 𝜌 . In high dimensionalstatistics, data-dependent model selection criteria, such as generalized cross-validation [39], Bayesian informationcriterion (BIC) [40] and extended Bayesian information criterion [41] have been commonly used to determine thevalue of 𝜌 . For the numerical studies in this paper, we use BIC with the form, BIC = −2 𝓁 ( ̂𝛽 ) + 𝑘 log 𝑛 , where 𝑘 is theestimated number of clusters. The “optimal" 𝜌 is selected by minimizing BIC from a candidate set. We now turn the attention to the sampling design problem for the determination of the sample size and samplelocations of wells. Recall in Section 2.2 we have obtained a spatially clustered contamination probability 𝑝 ( 𝐬 ) , that is,within each identiﬁed spatial cluster, each sample is assumed to have the same probability of being contaminated. Thisallows us to employ existing sampling design methods based on the univariate binomial distribution with a constant pwithin each cluster, while adapting to the value of 𝑝 across clusters. The method leads to a simple but eﬃcient samplingstrategy accounting for the spatial variation in 𝑝 ( 𝐬 ) .Sample size determination and conﬁdence interval construction methods for a constant-proportion binomial distri-bution have been well studied in the statistics literature. Popular methods include the Clopper-Pearson exact method,Wilson score method, Wald test, Bayesian Jeﬀreys method, and Agresti–Coull method, among others. For a reviewand comparison of diﬀerent methods, see, for example, [42] and [12]. In this work, we consider two methods, themodiﬁed Jeﬀerey and the Wilson score methods, following the recommendations by [12].Consider a univariate binomial distribution where a random sample of size 𝑛 is drawn from a large population, 𝑋 is the number of 1’s (e.g., the number of contaminated wells), and 𝑝 is the probability of a randomly selected well iscontaminated. We seek to ﬁnd the sample size, 𝑛 , such that, for a given 𝑝 and acceptance precision level 𝛿 , the expected Yin et al.:

Preprint submitted to Elsevier

Page 5 of 13isk Based Arsenic Rational Sampling Design length of the conﬁdence interval,

EL( 𝑛, 𝑝 ) ∶= E [Δ( 𝑋 )] is equal to 𝛿 , where Δ( 𝑋 ) is the length of conﬁdence interval,and the expectation is taken over the binomial distribution of 𝑋 . The modiﬁed Jeﬀerey and the Wilson score methodsare described below.1. The Wilson score test conﬁdence interval takes the form 𝑋 + 𝑧 𝛼 ∕2 ± 𝑧 𝛼 ∕2 √ 𝑧 𝛼 ∕2 + 4 𝑋 (1 − 𝑋 ∕ 𝑛 )2 ( 𝑛 + 𝑧 𝛼 ∕2 ) This method is derived from Pearson’s chi-square test, where the center of the interval is a weighted averageof sample proportion and 1/2, such that it is more suitable than the commonly used Wald method for extremeprobability or small sample sizes. The Wilson method also has the advantage of yielding an analytical formulafor the sample size as follows 𝑛 𝑊 = − 𝑧 𝛼 ∕2 [4 𝛿 − 2 𝑝 (1 − 𝑝 )] + 𝑧 𝛼 ∕2 √ [4 𝛿 − 2 𝑝 (1 − 𝑝 )] − 4 𝛿 ( 𝛿 − 1 ) 𝛿 , (6)where 𝑛 𝑊 is the required sample size for a given estimate of 𝑝 and an acceptance precision level 𝛿 .2. The modiﬁed Jeﬀreys method is derived from a Bayesian approach, which uses the non-informative Jeﬀrey’sprior Beta(1∕2 , to derive the posterior credible interval for 𝑝 , while modifying the formula at the boundaryvalues. For < 𝑋 < 𝑛 , the credible interval is [ Beta 𝛼 ∕2 ( 𝑋 + 1∕2 , 𝑛 − 𝑋 + 1∕2) , Beta 𝛼 ∕2 ( 𝑋 + 1∕2 , 𝑛 − 𝑋 + 1∕2) ] The expressions when 𝑋 takes boundary values are provided in Table 1 of [12]. The modiﬁed Jeﬀreys methodenjoys similar coverage properties as those of the Wilson score method. But it has an additional advantage ofyielding a credible interval that is equal-tailed. For modiﬁed Jeﬀreys,EL ( 𝑛 ; 𝑝 ) = 𝑛 ∑ 𝑋 =1 Δ( 𝑋 ) ( 𝑛𝑋 ) 𝑝 𝑋 (1 − 𝑝 ) 𝑛 − 𝑋 , which is a function of sample size 𝑛 depending on a given 𝑝 . The sample size can be calculated by solvingEL ( 𝑛 ; 𝑝 ) = 2 𝛿 . It follows that the required sample size using the modiﬁed Jeﬀreys method, denoted as 𝑛 𝐽 , takesthe form 𝑛 𝐽 = EL −1 (2 𝛿 ; 𝑝 ) . (7) 𝑛 𝐽 does not have a closed form and has to be solved numerically. In practice, it is often calculated by an approx-imated solution such that | EL ( 𝑛 ; 𝑝 ) − 2 𝛿 | is less than a certain tolerance.Spatial sampling design involves the determination of sample size, as well as the locations of sampling points. Onesimple and commonly used spatial sampling design is the uniform random sampling, where each location is chosenindependently and uniformly within each cluster. However, two complications arise when applying this method for theArsenic study. First, the number of all available candidate wells are not uniformly distributed in space. Second, a largenumber of wells have been tested where the sampling locations were arbitrarily chosen before the formal statisticalsampling design, which results in a highly unbalanced sampling in space with some areas over-sampled and the otherareas insuﬃciently-sampled. The design for the new sample well locations needs to exclude those previously testedwells. Our goal is to sample the candidate wells with the expectation that the combined new sample wells and thepreviously tested wells are spatially uniformly distributed in each cluster except for the over-sampled areas. To achievethis goal, we utilize the connection between the uniform distribution in space and the spatial Poisson point processmodel, and adopt the thinning sampling idea from the latter. As a preliminary, we introduce the intensity function ofthe spatial point processes [43], which characterizes the probability that a point occurs in an inﬁnitesimal ball arounda given location. If there is a point process  on 𝐷 ⊂ ℝ , let 𝑁 ( 𝐵 ) denote the expected number of points within anysubset 𝐵 ⊂ 𝐷 . The intensity function 𝜆 ( 𝐬 ) at location 𝐬 ∈ 𝐷 is deﬁned as, 𝜆 ( 𝐬 ) = lim | 𝑏 ( 𝐬 ) | → 𝑁 ( 𝑏 ( 𝐬 )) | 𝑏 ( 𝐬 ) ∩ 𝐷 | Yin et al.:

Preprint submitted to Elsevier

Page 6 of 13isk Based Arsenic Rational Sampling Design

Figure 1:

Spatial distribution of the Arsenic contamination presence/absence observations. where 𝑏 ( 𝐬 ) denotes a small ball containing 𝐬 , and measure | ⋅ | denotes the area. If 𝜆 ( 𝐬 ) = 𝜆 is a constant for all 𝐬 ∈ 𝐵 ,then  is called a homogeneous point process on 𝐵 , implying the point has the same probability to occur at eachlocation in 𝐵 . Besides, the intensity function determines the expected number of points on 𝐵 by E [ 𝑁 ( 𝐵 )] = ∫ 𝐵 𝜆 ( 𝐬 ) 𝑑 𝐬 .It is known that, conditional on the number of points, the locations from a homogeneous Poisson point process areuniformly distributed on 𝐵 . Therefore, the desired sample well locations have the intensity function ̂𝜆 ( 𝐬 ) = 𝑛 𝑖 ∕ 𝑎 𝑖 for 𝐬 located in cluster 𝑖 , to render the sampled wells evenly-distributed. Here 𝑛 𝑖 and 𝑎 𝑖 denote the number of requiredsamples and the area in cluster 𝑖 respectively.The detailed sampling algorithm is described below. First, we use the nonparametric intensity estimation ap-proach via R function density.ppp in package spatstat to estimate the candidate well intensity function, denotedas ̂𝜆 𝑐𝑎𝑛𝑑𝑖 ( 𝐬 ) , and the previously tested well intensity, denoted ̂𝜆 𝑒𝑥𝑖𝑠𝑡 ( 𝐬 ) . To exclude the previously tested wells in Iowafrom new samples, we calculate the target intensity from ̂𝜆 𝑡𝑎𝑟𝑔 ( 𝐬 ) = max{ ̂𝜆 ( 𝐬 ) − ̂𝜆 𝑒𝑥𝑖𝑠𝑡 ( 𝐬 ) , . Locations that havenegative ̂𝜆 ( 𝐬 ) − ̂𝜆 𝑒𝑥𝑖𝑠𝑡 ( 𝐬 ) values correspond to the over-sampled areas where the intensity of previously tested wellsexceeds the required sampling density. We will leave them out when drawing new samples. Finally, for other areas,each candidate well will be selected with probability ̂𝜆 𝑡𝑎𝑟𝑔 ( 𝐬 )∕ ̂𝜆 𝑐𝑎𝑛𝑑𝑖 ( 𝐬 ) , where 𝐬 is the location of the candidate well.The last step is based on the assumption that ̂𝜆 𝑐𝑎𝑛𝑑𝑖 ( 𝐬 ) is large enough to bound ̂𝜆 𝑡𝑎𝑟𝑔 ( 𝐬 ) , and indeed there are adequatewells available in Iowa to meet this assumption. As a result, the algorithm guarantees that the combined new samplesand existing samples other than the over-sampled areas will be (nearly) uniformly distributed, and the expected samplesize meets the requirement in Table 1.

3. Results

After the data pre-processing steps, there remain observations at diﬀerent locations. Figure 1 shows thespatial distribution of the observations, and Figure 2 shows the spatial map of the number of observations in eachcounty. From the existing tested data, the most tested regions include northern central Iowa, a few counties in thewestern central, southwestern, and eastern central Iowa regions (Figure 2). Less than 20% of the counties have morethan 100 tests per county. There are fewer tests per county in the southern, northeastern, and northwestern regions.We show in Figure 3 the sample proportion of the contaminated wells among all the tested wells at each county, asa means to visualize a rough empirical estimate of the arsenic risk and its spatial pattern. Even though we see anuneven testing distribution, which means uneven sampling at the current testing scale, we observe that the arsenic riskcharacterization appears to be independent of the testing density (Figures 2 and 3).

Yin et al.:

Preprint submitted to Elsevier

Page 7 of 13isk Based Arsenic Rational Sampling Design

Figure 2:

The number of tested wells in each county in Iowa.

Figure 3:

This ﬁgure illustrates the sample proportion of the contaminated wells among all the observed tested wells foreach county in Iowa; Grey color indicates there is no observed data in the county.

Ayotte et al. [23] use a predictive logistic regression model to estimate arsenic presence in regions with limitedarsenic data. In that study, a total of 20450 domestic well samples are used to develop the model to estimate for thewhole conterminous US. Unique to our research, we do not aim to establish a predictive model to accurately predictthe arsenic contamination in a given region, as the risk of As has been already recognized by the state and many localhealth risk management agencies. We aim to utilize the locally clustered arsenic risks to estimate a sample size withminimum bias, which can be managed with appropriately allocated resources. In order to do that, we deﬁne the binaryexistence of arsenic in a given private well is higher than 0.01 mg/L, which is the current EPA regulation level. Inother words, we regard if the private well contains less than 0.01 mg/L arsenic, then the health risks are absent in arisk-based sampling scheme. We ﬁrst model and estimate the underlying contamination risk as a spatially clusteredfunction following the method described in Section 2.2 for the straightforward interpretation of the result and easyimplementation of the sampling design. The optimization result partitions the whole state into three risk clustersbased on the estimated arsenic presence probability (Figure 4). The three risk probabilities (p) are 0.03, 0.21, and0.33 for clusters 1, 2, and 3, respectively. The risk cluster assignment is consistent with some previous observationsand predictions. For example, cluster 1 is largely consistent with the estimations in [23]. Cluster 2 is also highlightedwith potential high As contamination in the same study. Furthermore, a targeted As study performed in Cerro GordoCounty (Northern Central Iowa) has sampled 68 wells over three years [9]. The study reveals one potential mechanismof As mobilization in the shallow aquifer. The naturally occurring sulﬁde minerals (typically pyrite) in the bedrockaquifers could be the source of As. Under the oxidizing condition, the As mobilization could happen from rocks to

Yin et al.:

Preprint submitted to Elsevier

Page 8 of 13isk Based Arsenic Rational Sampling Design

Figure 4:

Partition of the map in terms of estimated 𝑝 ; In cluster 1, ̂𝑝 = 0 . ; In cluster 2, ̂𝑝 = 0 . ; In cluster3, ̂𝑝 = 0 . ; The numbers of observations in each cluster are , and respectively. the water. Signiﬁcantly, the Cerro Gordo study has resulted in a policy change for arsenic testing and well completionlocally. Interestingly, cluster 3 at the border of Iowa and Nebraska is identiﬁed as a new As "hotspot" in this currentstudy. Notably, the cluster 3 region overlaps with the Missouri alluvial plain. The Missouri River valley contains up toaround 150 feet of highly-permeable alluvial sediments [44]. Alluvial sediments could be quite heterogeneous in theirgravel, sand, silt, and clay compositions, dependent on the locations. At the same time, those sediments could containa large percent of argillaceous materials composed of organics, clays, and silts. The presence of argillaceous materialscould assist in disseminating arsenic pyrite from the materials themselves or from ferrous hydroxides coating the sandgrains, which often contain arsenic as well. Furthermore, diverse geochemical and bacterially mitigated reactions(i.e., oxidation, reduction, adsorption, precipitation, methylation, and volatilization) can participate actively in arsenicrecycling within alluvial aquifers. As the alluvial aquifers are largely unconﬁned, the water table’s movement up anddown in the aquifer can mobilize arsenic from the argillaceous material or the ferrous hydroxide coating the sandgrains through oxidation reactions. The potential high arsenic concentration in the private well in the alluvial plain(i.e., cluster 3) could be attributed to the permeable alluvial sediments and their unique properties. Based on the estimated probability clusters, we further estimate the ideal sample size based on various acceptanceprecision and conﬁdence levels. Based on the publicly available database (Iowa Private Well Tracking System), itis estimated there are more than 300,000 private wells in Iowa. Among them, , wells are geo-coded. Thetotal geo-coded well population locations are shown in Figure 5, clearly indicating an uneven spatial distribution inIowa. Based on the regional cluster risk probability, we thus deﬁne three diﬀerent cluster regions (clusters 1, 2, and3) with diﬀerent risk cluster ranks. For clusters with a reasonable testing coverage, we have three probabilities. Forcluster 1, the estimated probability for As concentration higher than 0.01 mg/L probability is 0.03. For clusters 2 and3, the probabilities are 0.21 and 0.34, respectively. If we deﬁne the precision acceptance as 10% of the probability,the precision acceptance is 0.003 for cluster 1 ( e.g. 10% of 0.03), 0.021 for cluster 2, and 0.034 for cluster 3. Table1 provides the calculated required sample size for each cluster under three diﬀerent conﬁdence levels (90%, 95%,and 99%) using both the Wilson in (6) and Jeﬀrey methods in (6). For example, at the 95% conﬁdence interval, theestimated sample size based on the Jeﬀrey method is 12446 for cluster 1. Applying the same criteria to clusters 2 and3, the estimated sample size would be 1442 for cluster 2 and 743 for cluster 3. The sample sizes calculated by theWilson method only slightly diﬀer from those of the Jeﬀrey method. Accordingly, at the 99% conﬁdence interval, weestimate that 21456, 2492, and 1282 samples are needed for clusters 1, 2, and 3, respectively.In the existing As data set, there are 6482, 3194, and 166 tested wells already collected from clusters 1, 2 , and3, respectively. It is noted that the sample size of the tested wells within each cluster constitutes a large proportionor exceeds the required sample size calculated in Table 1. However, we recognize the current As data collection isoperated at the county level since the local environmental health jurisdiction resides in each county. County level datageneration results in an uneven spatial distribution of sampling locations for the whole state discussed in Section 3.2. Yin et al.:

Preprint submitted to Elsevier

Page 9 of 13isk Based Arsenic Rational Sampling Design

Figure 5:

Locations of the , candidate wells in Iowa, after discarding the wells in absence of their location information. Table 1

Expected number of well sampling in each cluster;Conﬁdence Level Method Cluster 1 Cluster 2 Cluster 3

Wilson 8766 1017 523Jeﬀrey 8746 1015 523

Wilson 12446 1444 743Jeﬀrey 12420 1442 743

Wilson 21497 2493 1282Jeﬀrey 21456 2492 1284

Therefore, although some areas are over-sampled, new samples still need to be collected at those places that are onlysparsely sampled previously.We follow the method presented in Section 2.3 to determine the locations of new sampling locations. We use theprivate wells in the current Iowa PWTS database as the target sampling population (Figure 5). The goal is to achievea spatially balanced sampling design that meets the required sample size, while accounting for the fact that both thecandidate wells and existing tested wells are distributed highly non-uniformly in space. To illustrate, we give anexample of the sampling scenario using the sample size calculated from the Wilson method for the

CI in Figure 6.The dense red point clouds reveal the previously over sampled areas in this Figure. Cluster 2 has the largest proportionof previously over-sampled areas. Only a relatively small number of additional wells (marked by green dots) need to besampled, mostly are located in the middle west of this region. In contrast, most areas in cluster 1 have not been sampledand tested previously, with exceptions in several counties (e.g., Buchanan, Butler, and Clinton). In cluster 3, althoughthe spatial coverage of the existing tested samples is nearly uniform, our method suggests that an additional numberof wells need to be collected to achieve the desired conﬁdence level and precision accuracy. Overall, it is noted thatthe locations of samples in Figure 6 appear to be uniformly distributed except for the previously over-sampled areas.Looking more closely, we observe that the intensity/density of samples diﬀers across the identiﬁed risk clusters, dueto adopting a spatially adaptive sampling design according to each cluster’s own contamination risk.

Yin et al.:

Preprint submitted to Elsevier

Page 10 of 13isk Based Arsenic Rational Sampling Design

Figure 6:

An example of sampling results; , and additional wells are sampled in each cluster respectively inthis example.

4. Discussions

It is commonly recognized that many conditions such as geological, geochemical, and hydrologic variables, im-pact arsenic presence in groundwater. For example, It has been observed high arsenic concentrations are often found inmore arid western US [23]. Furthermore, precipitation and recharge show signiﬁcant correlations with arsenic concen-trations in domestic wells in the conterminous US. Among various conditions, glaciated terrain, bedrock geology, soilhydrology, soil tile drainage, water table depth and climate factors can also impact arsenic concentrations in groundwa-ter. Particularly, Iowa’s groundwater resources are majorly surﬁcial aquifers and bedrock aquifers. For a long historycontacting with glaciers, many parts of Iowa soil/dirt contain glacier age materials with moderate to low permeability.The water table beneath those materials occurs at relatively shallow depths and varies from 3 to 30 feet below ground[45].The micro-environment such as pH, soil, and water bacterial activity, oxidation and reduction reactions (Redox),coexistence with other elements (e.g., iron) can also play a signiﬁcant role in arsenic concentration in groundwater.Taking account of all those macro and micro-environmental conditions is a shared challenge for all current availablepredictive models to estimate arsenic concentrations at the county, state/province, or region levels.There are several potential beneﬁts to adopt the proposed sampling design. First, the sample size estimate suggestsfuture feasible random sampling targets, given the total Iowa private well population. As the sample sizes are depen-dent on the arsenic probabilities, we present options for the same probability with diﬀerent sampling precision goals.We also recognize there are regions with too few or no data points (Figure 3, thus warrant further sampling for prob-ability estimate). Second, the method developed in this study helps pinpoint future sampling locations with adequatestatistical power. From the resource management perspective, future planning can prioritize the high-risk well sam-pling, eliminate redundant testing, and collect representative samples for risk assessment purposes. In practice, samplecollections and management are often conducted at certain administration levels. It is desired to develop a samplingdesign method that is easy and fast to implement at each administration unit. Third, this design presents future oppor-tunities to investigate practical solutions to coordinate joint eﬀorts across counties for the eﬃcient implementation ofthe sampling design method.Moving forward, this work could be further reﬁned in several ways. First, the estimator we obtained by optimizingthe regularized log-likelihood function does not come with an uncertainty measure. As such, the sample size calcula-tion is only based on a point estimate of the contamination risk. A potential solution is to consider a Bayesian versionof the method. In principle, the modiﬁed Jeﬀrey’s method for sample size calculation can be adapted to account forthe uncertainty in the estimate of the contamination probability 𝑝 , where the expected length of the conﬁdence intervalused in (7) can be taken over both the distributions of 𝑝 and the binomial random variable 𝑋 instead of 𝑋 only. Second,we assume that the clusters of wells are spatially contiguous, and the contiguity constraint is deﬁned with respect to the Yin et al.:

Preprint submitted to Elsevier

Page 11 of 13isk Based Arsenic Rational Sampling Design choice of a spatial graph. However, in practice, the spatial contiguity constraint may not dominate the clustering con-ﬁguration globally, in the sense that two or more locally contiguous clusters that are remote in space may actually havevery similar arsenic concentrations, and hence should be classiﬁed into the same cluster. The method presented in thispaper needs to be modiﬁed to handle the case with both globally dis-contiguous and locally contiguous clusters. Oneidea is to perform a two-step analysis, where in the ﬁrst step we obtain local spatial clusters from the method presentedin this paper, and in the second step, conduct another clustering analysis without any spatial constraint based on thelocal clustering results from the ﬁrst step. Third, the model can be further improved with more representative samples.As we noted, there are counties without testing data, which presents a gap for risk analysis. We expect collecting datain those regions helps build a more comprehensive evaluation of arsenic health risk at the state level. Overall, thecurrent study presents a targeted approach to save cost and time for eﬀective public health management strategy. Therational sampling design focuses on risk categories, which assures that preventive measures and mitigation practicesare implemented where most needed.

References [1] J. Podgorski, M. Berg, Global threat of arsenic in groundwater, Science 368 (2020) 845–850.[2] L. A. DeSimone, P. A. Hamilton, Quality of water from domestic wells in principal aquifers of the United States, 1991-2004, US Departmentof the Interior, US Geological Survey, 2009.[3] K. S. Almberg, M. E. Turyk, R. M. Jones, K. Rankin, S. Freels, J. M. Graber, L. T. Stayner, Arsenic in drinking water and adverse birthoutcomes in Ohio, Environmental research 157 (2017) 52–59.[4] M. Vahter, Eﬀects of arsenic on maternal and fetal health, Annual review of nutrition 29 (2009) 381–399.[5] N. Sohel, L. Å. Persson, M. Rahman, P. K. Streatﬁeld, M. Yunus, E.-C. Ekström, M. Vahter, Arsenic in drinking water and adult mortality: apopulation-based cohort study in rural Bangladesh, Epidemiology (2009) 824–830.[6] M. Argos, H. Ahsan, J. H. Graziano, Arsenic and human health: epidemiologic progress and public health implications, Reviews on environ-mental health 27 (2012) 191–195.[7] M. S. Bloom, S. Surdu, I. A. Neamtiu, E. S. Gurzau, Maternal arsenic exposure and birth outcomes: a comprehensive review of the epidemi-ologic literature focused on drinking water, International journal of hygiene and environmental health 217 (2014) 709–719.[8] N. G. Association, Groundwater use in the United States of America, 2020. URL: .[9] D. J. Schnoebelen, S. Walsh, B. Hanft, O. E. Hernandez-Murcia, C. Fields, Elevated Arsenic in Private Wells of Cerro Gordo County, Iowa:Causes and Policy Changes., Journal of Environmental Health 79 (2017).[10] P. Minkkinen, Practical applications of sampling theory, Chemometrics and intelligent laboratory systems 74 (2004) 85–94.[11] U. E. P. A. (USEPA), Guidance on choosing a sampling design for environmental data collection, 2002.[12] L. Gonçalves, M. R. de Oliveira, C. Pascoal, A. Pires, Sample size for estimating a binomial proportion: comparison of diﬀerent methods,Journal of Applied Statistics 39 (2012) 2453–2473.[13] K.-M. Lee, T. J. Herrman, S. Y. Dai, Application and validation of a statistically derived risk-based sampling plan to improve eﬃciency ofinspection and enforcement, Food Control 64 (2016) 135–141.[14] N. Sepúlveda, C. Drakeley, Sample size determination for estimating antibody seroconversion rate under stable malaria transmission intensity,Malaria journal 14 (2015) 141.[15] L. Joseph, C. Reinhold, Statistical inference for continuous variables, American Journal of Roentgenology 184 (2005) 1047–1056.[16] M. Amini, K. C. Abbaspour, M. Berg, L. Winkel, S. J. Hug, E. Hoehn, H. Yang, C. A. Johnson, Statistical modeling of global geogenic arseniccontamination in groundwater, Environmental science & technology 42 (2008) 3669–3675.[17] J. D. Ayotte, B. T. Nolan, J. R. Nuckols, K. P. Cantor, G. R. Robinson, D. Baris, L. Hayes, M. Karagas, W. Bress, D. T. Silverman, et al.,Modeling the probability of arsenic in groundwater in New England as a tool for exposure assessment, Environmental science & technology40 (2006) 3578–3585.[18] L. Winkel, M. Berg, M. Amini, S. J. Hug, C. A. Johnson, Predicting groundwater arsenic contamination in Southeast Asia from surfaceparameters, Nature Geoscience 1 (2008) 536–542.[19] J. E. Podgorski, S. A. M. A. S. Eqani, T. Khanam, R. Ullah, H. Shen, M. Berg, Extensive arsenic contamination in high-pH unconﬁned aquifersin the Indus Valley, Science advances 3 (2017) e1700935.[20] L. H. Winkel, P. T. K. Trang, V. M. Lan, C. Stengel, M. Amini, N. T. Ha, P. H. Viet, M. Berg, Arsenic pollution of groundwater in vietnamexacerbated by deep aquifer exploitation for more than a century, Proceedings of the National Academy of Sciences 108 (2011) 1246–1251.[21] L. Rodríguez-Lado, G. Sun, M. Berg, Q. Zhang, H. Xue, Q. Zheng, C. A. Johnson, Groundwater arsenic contamination throughout China,Science 341 (2013) 866–868.[22] Q. Yang, H. B. Jung, R. G. Marvinney, C. W. Culbertson, Y. Zheng, Can arsenic occurrence rates in bedrock aquifers be predicted?, Environ-mental science & technology 46 (2012) 2080–2087.[23] J. D. Ayotte, L. Medalie, S. L. Qi, L. C. Backer, B. T. Nolan, Estimating the high-arsenic domestic-well population in the conterminous UnitedStates, Environmental science & technology 51 (2017) 12443–12454.[24] J. D. Ayotte, B. T. Nolan, J. A. Gronberg, Predicting arsenic in drinking water wells of the Central Valley, California, Environmental Science& Technology 50 (2016) 7555–7563.[25] M. L. Erickson, S. M. Elliott, C. Christenson, A. L. Krall, Predicting geogenic Arsenic in Drinking Water Wells in Glacial Aquifers, North-Central USA: Accounting for Depth-Dependent Features, Water Resources Research 54 (2018) 10–172.

Yin et al.:

Preprint submitted to Elsevier

Page 12 of 13isk Based Arsenic Rational Sampling Design [26] Z. Tan, Q. Yang, Y. Zheng, Machine Learning Models of Groundwater Arsenic Spatial Distribution in Bangladesh: Inﬂuence of HoloceneSediment Depositional History, Environmental Science & Technology 54 (2020) 9454–9463.[27] Z. Zhu, M. L. Stein, Spatial sampling design for prediction with estimated parameters, Journal of agricultural, biological, and environmentalstatistics 11 (2006) 24.[28] P. J. Diggle, R. Menezes, T.-l. Su, Geostatistical inference under preferential sampling, Journal of the Royal Statistical Society: Series C(Applied Statistics) 59 (2010) 191–232.[29] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, K. Knight, Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society:Series B (Statistical Methodology) 67 (2005) 91–108.[30] I. D. of Public Health, Iowa Administrative Code 641, Chapter 24, Private Well Testing, Reconstruction, and Plugging- Grants to Counties ,2016.[31] Arsenic Testing Results in Private Well Water, Iowa Department of Public Health, accessed January 2021. URL: https://tracking.idph.iowa.gov/Environment/Private-Well-Water .[32] P. Bühlmann, S. Van De Geer, Statistics for high-dimensional data: methods, theory and applications, Springer Science & Business Media,2011.[33] R. J. Tibshirani, J. Taylor, et al., The solution path of the generalized lasso, The Annals of Statistics 39 (2011) 1335–1371.[34] F. Li, H. Sang, Spatial homogeneity pursuit of regression coeﬃcients for large datasets, Journal of the American Statistical Association 114(2019) 1050–1062.[35] O. H. Madrid Padilla, J. Sharpnack, Y. Chen, D. M. Witten, Adaptive nonparametric regression with the k-nearest neighbour fused lasso,Biometrika 107 (2020) 293–310.[36] A. Beck, M. Teboulle, Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems, IEEE trans-actions on image processing 18 (2009) 2419–2434.[37] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., Distributed optimization and statistical learning via the alternating direction methodof multipliers, Foundations and Trends® in Machine learning 3 (2011) 1–122.[38] B. Wahlberg, S. Boyd, M. Annergren, Y. Wang, An ADMM algorithm for a class of total variation regularized estimation problems, IFACProceedings Volumes 45 (2012) 83–88.[39] G. H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979)215–223.[40] G. Schwarz, Estimating the dimension of a model, The annals of statistics 6 (1978) 461–464.[41] J. Chen, Z. Chen, Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.[42] R. G. Newcombe, Two-sided conﬁdence intervals for the single proportion: comparison of seven methods, Statistics in medicine 17 (1998)857–872.[43] P. Diggle, A kernel method for smoothing point process data, Journal of the Royal Statistical Society: Series C (Applied Statistics) 34 (1985)138–147.[44] Mississippi and Missouri River Alluvial Aquifer, Missouri Department of Natural Resources, accessed January 2021. URL: https://dnr.mo.gov/geology/wrc/groundwater/education/provinces/riveralluviumprovince.htm .[45] J. C. Prior, J. L. Boekhoﬀ, M. R. Howes, R. D. Libra, P. E. VanDorpe, Iowa’s Groundwater Basics: A geological guide to the occurence, use,and vulnerability of Iowa’s aquifers (2003).

Yin et al.: