A Data Science Approach to Understanding Residential Water Contamination in Flint

Abstract

When the residents of Flint learned that lead had contaminated their water system, the local government made water-testing kits available to them free of charge. The city government published the results of these tests, creating a valuable dataset that is key to understanding the causes and extent of the lead contamination event in Flint. This is the nation's largest dataset on lead in a municipal water system. In this paper, we predict the lead contamination for each household's water supply, and we study several related aspects of Flint's water troubles, many of which generalize well beyond this one city. For example, we show that elevated lead risks can be (weakly) predicted from observable home attributes. Then we explore the factors associated with elevated lead. These risk assessments were developed in part via a crowdsourced prediction challenge at the University of Michigan. To inform Flint residents of these assessments, they have been incorporated into a web and mobile application funded by Google.org. We also explore questions of self-selection in the residential testing program, examining which factors are linked to when and how frequently residents voluntarily sample their water.


Alex Chojnacki ∗, University of Michigan
Chengyu Dai, University of Michigan
Arya Farahi, University of Michigan
Guangsha Shi, University of Michigan
Jared Webb, Brigham Young University
Daniel T. Zhang, University of Michigan
Jacob Abernethy, University of Michigan
Eric Schwartz, University of Michigan


CCS CONCEPTS

• Information systems → Data analytics; • Machine learning → Applied computing;

KEYWORDS

Water Quality; Flint Water Crisis; Risk Assessment; Machine Learning; Sampling Bias; Public Policy

∗ The six student authors are alphabetically ordered first, followed by the two faculty authors, also alphabetically ordered.

We now understand the Flint Water Crisis as a disaster with many facets: environmental, socio-economic, political, and infrastructural, among others. The dire problems affecting the city’s water started in April 2014 when, as a short-term cost-saving measure, city officials opted to switch the water supply from Lake Huron to the Flint River. Not long after the switch, residents began to notice an unpleasant odor and discoloration in the water flowing from their taps. While water testing data reported by state government officials passed regulations from the U.S. Environmental Protection Agency (EPA), data collected by outside academics from Virginia Tech suggested otherwise. This independent academic work found water lead levels dramatically higher than the threshold allowed by the EPA’s Lead and Copper Rule. It was not until September 2015, following a report by a pediatrician observing a dramatic rise in lead levels in the blood of Flint children [10], that the water crisis began to receive serious attention from government officials. In December 2015, Flint’s mayor declared a state of emergency, and agents from both the Michigan Department of Environmental Quality (DEQ) and the EPA embarked on thorough investigations. By late 2015 and early 2016, the media had elevated the Flint Water Crisis into a major national and international news story.

Eventually, the immediate cause was understood: the water from the Flint River was significantly more corrosive than local officials had thought. This, and other governmental failures, resulted in improper water treatment. Central to the problem was that, as in many U.S. cities, Flint’s water infrastructure contains tens of thousands of lead pipes. These pipes typically are treated with beneficial chemicals to develop thick layers of deposits, which protect water against contamination from heavy metals. Treated incorrectly, however, Flint’s corrosive water began to erode these protective layers and, ultimately, lead particles leached from the pipes into the city’s drinking water. Though Flint returned to Lake Huron’s water supply in October 2015, the damage was done, and pervasive lead contamination continued to be detected through 2016. While the EPA determined the water was safe to drink with a filter by mid 2016, many issues remain and citizens continue to rely on bottled water [7]. The city’s most vulnerable residents, namely children, pregnant women, and the elderly, have likely been exposed to lead in the water, and many questions about the lasting impact remain unanswered. (It is now well established that lead-contaminated water poses significant health risks, particularly for children [3].)

As Flint’s water crisis has continued to unfold, affecting as many as 35,000 homes, both city and state officials have been faced with daunting questions: what is the best way to direct scarce resources? How can bottled water and water filter technology be efficiently distributed? Where should volunteers be sent to educate residents? As the city has embarked on a highly expensive pipe removal program, where replacing a single home’s water service line can cost around $5,000, officials have asked the obvious question: which homes are most at risk for lead contamination? Flint’s recovery depends greatly on isolating which properties are most in need of attention. This question is important beyond Flint, as other cities and towns with aging infrastructure continue to address lead and other heavy metal abatement.

In the present paper, we consider the problem of estimating the risk of lead contamination in home drinking water. This work relies on a large collection of water samples taken by residents and government officials throughout the crisis. Beginning in late 2015, the State of Michigan initiated a program allowing any resident to submit a tap water sample for testing. This dataset is a publicly available collection of over 25,000 tests, and it provides a glimpse into the causes and extent of water lead contamination in Flint; it is indeed the largest dataset collected on lead in a municipal water system. We combine these measurements with several other data sources, including census data, property attributes, geographical information, and infrastructure records, and we use the combined data to answer several statistical and analytical questions. Among these are:

• To what extent can we predict elevated lead in a home’s drinking water?
• What attributes of a home are associated with lead contamination?
• How can we address the sampling bias of volunteer residential testing?

We present a number of additional results, and we conjecture that many of these observations will generalize beyond Flint.

Flint’s Water Contamination: A Birds-Eye View

Before we begin our analysis, let us give an overview of the lead testing data and a brief analysis. When a resident takes a water sample and submits it for testing, the state determines the lead content (typically by mass spectrometer) and reports the result in parts per billion (ppb). The data released by the state rounded these values down to the nearest integer. Thus when we say that a sample had “no detectable lead” we mean less than 1 ppb. It is important to note that, despite what one may infer from headlines, nearly half of all homes had no detectable lead, and around 80% of measurements from the residential testing program were below 5 ppb.

These lead levels still warranted attention according to the law. The US Congress passed what is known as the Safe Drinking Water Act in 1986, which instructed the EPA to develop regulations limiting heavy metals in drinking water. Pursuant to the act, the EPA developed what is now known as the Lead and Copper Rule (LCR), issued in 1991, requiring municipal water utilities to enforce a set of guidelines for allowable levels of lead and copper. More precisely, the LCR requires that at regular intervals a municipality must take a set of water samples from a range of properties, and that the 90th percentile lead measurement must fall below 15 ppb. As a result of these EPA requirements, throughout the paper we emphasize this 15 ppb threshold.

It is worth noting that, from the perspective of public health, this value of 15 ppb is rather arbitrary. It is very challenging to determine precisely the risks to human health from lead contamination in water, and most epidemiological work aimed at understanding adverse effects from consuming dissolved lead can provide only coarse answers [12]; public health experts typically say that “no level of lead is safe.” The current guidelines should be viewed only as a workable regulatory framework.

Figure 1: Comparing the 90th percentile of lead readings, on the sentinel data vs. the voluntary residential testing data.

Based on this law, the key quantity is the estimate of the 90th percentile of lead readings. We describe this quantity in Figure 12 for each month of 2016, drawing from both the government-run sentinel program and the larger voluntary residential testing program. Using data from the state’s sentinel program, we found that, during a period in February, only between 8 and 15 percent of homes had lead above the federal action level of 15 ppb. Lead measurements are confounded by weather and temperature, which is likely the reason behind the summer rise in lead levels. But in general it is hard to draw simple conclusions about the trend of lead contamination in Flint.

Despite the statistical issues, a result of these guidelines has been significant political attention paid to the percentage of homes that test at or below the 15 ppb threshold. This was especially true in Flint, where it was alleged that government officials manipulated data to achieve compliance. At the height of the political firestorm, the embattled Governor Rick Snyder put out a tweet (seen in Figure 2) celebrating good news about the elevated lead levels. The tweet was deleted within a day, but we were able to grab a screenshot. Our own analysis, as displayed in Figure 12, however, rejects the conclusion of Governor Snyder’s deleted tweet about the distribution of lead levels over time.

Figure 2: Governor Rick Snyder announcing improved Flint water testing results on April 22, 2016. The tweet was quickly deleted, and results then worsened over the summer. Note that the plotted points of the line do not correspond to the y-axis labels, and the x-axis is not linear in time.

Related Work. Much of the work up until this point was conducted by Marc Edwards’ team from Virginia Tech, who independently monitored water lead levels. Their efforts have helped raise awareness and reveal the severity of the problem. In addition, [4] provides an overview of the water crisis and discusses strategies for risk management in Flint. Further, there is some work analyzing trends in lead levels over time similar to those we observe [8, 9]. But to the best of our knowledge, we are the first to apply predictive modeling techniques to help with the Flint Water Crisis.

This paper incorporates a diverse range of datasets related to properties in the city of Flint. One of the main contributions of our work is acquiring and merging these datasets into a single dataset. Some of these datasets are publicly available from the state of Michigan, and others were provided by the city and other sources at our request, as noted. We detail each dataset.

(The authors recognize that some of their own work has been presented elsewhere [2].)

The vast majority of the lead water level data in Flint comes from water samples submitted voluntarily by residents. The city of Flint provides free water testing services to all of its residents, who are able to pick up testing kits from a local distribution center. Residents then collect water from their own homes and submit the samples to be analyzed by the Michigan Department of Environmental Quality. Since this program began in September 2015, over 25,000 tests have been conducted from 15,000 unique locations (as of May 2017). The results are available on the State’s website. For each sample we are given the date the sample was submitted, the lead and copper levels, and the address of the residence. In Figure 3, we show the locations and lead readings for these tests.

Figure 3: Locations of voluntary residential water tests in Flint. Color corresponds to the level of lead contamination (parts per billion). We observe that elevated lead readings are highly geographically diverse.

Measuring lead contamination is a highly noisy process, and even repeated measurements at the same source produce highly variable results. We can observe this directly in the data because a subset of homes had their water tested on multiple occasions. The correlation in (log) lead levels between first and second samples is modest (Pearson correlation coefficient 0.465 for voluntary residential testing and 0.522 for the sentinel program).

This noisy measure has an effect on the performance of our predictions, as we will see later. There are many causes for this noise, but one major source is the delicate nature of sampling a home’s water. Residents are asked to sample the first liter of water from their tap first thing in the morning, with the hope of getting water that has been stagnant in the plumbing, but a toilet flush or running the shower can significantly affect the concentration of various contaminants.
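The repeat-sample comparison above can be illustrated with a minimal sketch. It assumes each test is an (address, ISO date, lead in ppb) tuple and uses log(lead + 1) so that readings of 0 ppb stay finite; the record layout, the function names, and the toy values are hypothetical and are not taken from the Flint data.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_second_log_correlation(tests):
    """tests: iterable of (address, iso_date, lead_ppb) tuples."""
    by_home = defaultdict(list)
    for address, iso_date, lead_ppb in tests:
        by_home[address].append((iso_date, lead_ppb))

    first, second = [], []
    for samples in by_home.values():
        if len(samples) < 2:
            continue  # only homes tested at least twice contribute
        samples.sort()  # ISO dates sort chronologically as strings
        # log(lead + 1) is an assumption; it keeps 0 ppb readings finite.
        first.append(math.log1p(samples[0][1]))
        second.append(math.log1p(samples[1][1]))
    return pearson(first, second)


# Hypothetical toy records, not real Flint data.
toy_tests = [
    ("1 A ST", "2016-01-05", 0), ("1 A ST", "2016-02-10", 2),
    ("2 B ST", "2016-01-07", 12), ("2 B ST", "2016-03-01", 4),
    ("3 C ST", "2016-01-09", 30), ("3 C ST", "2016-01-20", 45),
    ("4 D ST", "2016-02-02", 1),  # tested once, so ignored
]
r = first_second_log_correlation(toy_tests)
```

A coefficient near 0.5, as reported above, would mean that the first reading explains only about a quarter of the variance in the second.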

(In Section 4, we address some of the reasons for residents testing their water more than once.)

As news of the crisis broke, Michigan DEQ initiated what is called the “sentinel program,” in which over 400 homes were selected to be tested multiple times over many months. These were homes that were considered to be especially at risk of lead contamination (many were known to have a lead service line, for example) and they were drawn from diverse neighborhoods around the city. These sites were chosen to be a representative sample, and the state received some guidance from other academics for selecting these homes. Data from the sentinel program has been made publicly available at http://michigan.gov/flintwater.

One of the challenges with determining lead contamination levels is determining which homes to test. The EPA requires water systems to select homes that are at greater risk of elevated lead in their tap water, according to the Lead and Copper Rule, but this leaves much to the discretion of officials who can seek data points in order to produce more optimistic (or pessimistic) estimates. Indeed, investigators have questioned the selection of homes in Flint; for instance, some were in a more newly developed neighborhood [9, 11].

Sentinel sites were visited for water tests a varying number of times, with some homes tested fewer than 5 times, while others were tested more than 10 times. The samples were taken at roughly weekly intervals early in 2016, and then less frequently as the year went on. While the sentinel data represents a smaller set of homes than the voluntary residential testing program, we generally assume the sentinel data to be much more reliable, as the residents in these homes are given more direct instructions, by workers and other officials, on how to correctly take a water sample. The bottles are picked up by DEQ officials and others for chemical testing.

The city provided us with detailed records of the 55,893 parcels of land in Flint. This data contains information on the property’s age, location, and value, in addition to other characteristics. This data is not publicly available online in this exact form, but a very similar dataset, known as Flint 2014 Housing Data, is freely available in ArcGIS format. We used the Google Maps API to merge noisy address data (thanks to a grant and API access from Google.org). Those samples that did not correspond to Flint parcels were discarded. After merging and discarding non-Flint parcels, 55,857 parcels remained in our dataset.

The key step was merging the parcel data with the lead testing data. We matched the address of each lead test to the address of the corresponding parcel of land in the city records. Because a parcel can contain multiple residences and residents are free to submit as many tests as they would like, we often have multiple tests that correspond to a single parcel. On the other hand, because many properties in Flint are vacant and residents are not required to submit tests, most parcels have no associated lead test. (The sentinel data omit the full addresses of the homes, but our team was able to get access to these records with help from the Michigan Governor’s office. This allowed us to link each home to the many variables describing each parcel of property.)

An important challenge in working with residential data on Flint is a striking fact: Flint has the highest rate of vacant homes of any municipality across the US [1]. Figure 4 shows the density map of vacant homes in Flint. We have two variables serving as weak signals of occupancy: whether the home has an active U.S. Postal Service account, and the 2014 Housing Condition survey. In our discussions to follow, in Section 4, we carefully consider vacancy, and characterize a household’s decision to submit a residential water test along with whether that test will have an elevated lead reading.

Figure 4: There are many abandoned homes in Flint, MI. This heatmap displays the density of (likely) unoccupied properties.
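The address-matching step described earlier in this section can be sketched as a normalize-then-join operation. The sketch below swaps in simple string normalization where the paper relies on the Google Maps API, so it is only an illustration of the join logic; the field names (`parcel_id`, `address`, `lead_ppb`), the suffix table, and the toy rows are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed abbreviations; a real pipeline would use a geocoder instead.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}


def normalize_address(raw):
    """Uppercase, drop punctuation, collapse spaces, abbreviate suffixes."""
    text = re.sub(r"[^\w\s]", "", raw.upper())
    return " ".join(SUFFIXES.get(token, token) for token in text.split())


def attach_tests_to_parcels(parcels, tests):
    """Return ({parcel_id: [lead_ppb, ...]}, [unmatched tests])."""
    parcel_by_address = {
        normalize_address(p["address"]): p["parcel_id"] for p in parcels
    }
    tests_by_parcel = defaultdict(list)
    unmatched = []
    for test in tests:
        parcel_id = parcel_by_address.get(normalize_address(test["address"]))
        if parcel_id is None:
            unmatched.append(test)  # no Flint parcel found; discarded later
        else:
            tests_by_parcel[parcel_id].append(test["lead_ppb"])
    return dict(tests_by_parcel), unmatched


# Hypothetical toy rows, not real Flint records.
toy_parcels = [
    {"parcel_id": "P1", "address": "123 Main Street"},
    {"parcel_id": "P2", "address": "45 Oak Ave."},
    {"parcel_id": "P3", "address": "9 Elm Rd"},
]
toy_lead_tests = [
    {"address": "123 main st.", "lead_ppb": 3},
    {"address": "123 MAIN STREET", "lead_ppb": 20},
    {"address": "45 Oak Avenue", "lead_ppb": 0},
    {"address": "77 Nowhere Blvd", "lead_ppb": 5},
]
matched, unmatched = attach_tests_to_parcels(toy_parcels, toy_lead_tests)
```

In the toy data, parcel P1 receives two tests and P3 receives none, mirroring the one-to-many and one-to-zero cases noted in the text.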

Water service lines are the pipes that connect each property in Flint to the water distribution system, often called the “water main”. A home’s water service line is typically composed of two different segments: public and private. The public service line is the pipe connecting the water main to the property “curb box”, which is an underground device owned by the municipality that contains a shutoff valve. The private service line connects the curb box through the front lawn and runs into the home’s water meter.

Service lines can be made out of any number of materials, including lead, copper, galvanized steel, plastic, and other metal alloys. Unfortunately, there is not a definitive record of the service line material for every home. Initially, the City of Flint struggled to produce any service line records. Eventually they discovered a set of 45,000 3” × 5” index cards and a set of municipal maps from the water department with handwritten annotations [13]. The information in these maps was painstakingly digitized by a group of students at the University of Michigan, Flint, GIS center. This project was spearheaded by Dr. Marty Kaufman, the faculty director of the center. It was noted that the city records are not always accurate and reliable. For more details about the service lines see [15].

The previous data tell us much about the physical properties of the homes in Flint, but they do not tell us much about the people that live in them. Census data provide a richer understanding of the affected populations. The census conducted by the U.S. Census Bureau has precise, parcel-level demographic data, but this data is not made available until many years after it is gathered to protect citizens’ privacy. The American Community Survey (ACS), however, is a survey conducted by the U.S. Census Bureau that supplements their census data with demographic and economic data. The results are provided at the level of census block groups.

Using the American Fact Finder website, we acquired data about race, age, family structure, languages spoken, household income and rent values for each block in Flint city limits. The parcel data includes census tract, block group, and block information for each parcel, so these block-level census data were merged with the other parcel-level data.

In the present section, we present our predictive models of water lead levels, allowing us to understand the factors related to high lead risk and to provide predictions for homes that had not yet been tested. In the previous section, we discussed the challenges associated with lead testing data, particularly due to the noisy nature of the sampling process. But a closer look at lead level data from Flint provides a much more nuanced picture. A number of home features correlate quite strongly with elevated lead, and we note one example that should not come as a great surprise: the age of the property. In Figure 5 we report average log(lead levels + 1) grouped by the year of construction for these homes, and the downward trend is quite stark.

Good lead risk predictions can inform public health policy in Flint. They can also provide insight into what factors are producing contaminated water. In this section we discuss classification models that predict whether a water sample submission will test above the EPA action level of 15 ppb.
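As an illustration of the grouped average behind Figure 5, the following sketch computes the mean of log(lead + 1) per construction period. Binning by decade is an assumption made here to keep the toy output small; the function name and the sample values are hypothetical.

```python
import math
from collections import defaultdict


def mean_log_lead_by_decade(samples):
    """samples: iterable of (year_built, lead_ppb).

    Returns {decade: mean of log(lead_ppb + 1)} ordered by decade.
    """
    totals = defaultdict(lambda: [0.0, 0])  # decade -> [sum, count]
    for year_built, lead_ppb in samples:
        decade = (year_built // 10) * 10
        totals[decade][0] += math.log1p(lead_ppb)
        totals[decade][1] += 1
    return {d: s / n for d, (s, n) in sorted(totals.items())}


# Hypothetical toy values, not real Flint measurements.
toy_samples = [(1912, 20), (1918, 6), (1955, 2), (1958, 0), (1987, 0)]
by_decade = mean_log_lead_by_decade(toy_samples)
```

Averaging on the log scale keeps a handful of very high readings from dominating each group, which is presumably why the figure uses it.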

To create our training data, we join the residential volunteer data with the merged parcel data so that each sample has a corresponding parcel. Note that not every home in Flint has submitted a water sample to be tested. Similarly, several homes have submitted many samples, and these will have a row in the training data for each individual test.

For each row in our dataset, there are 71 features, coming from the parcel dataset, service line dataset, and census dataset. One-hot encoding is performed on all categorical features. The target variable is the binary classification of homes with water tests above 15 ppb and below 15 ppb.

(The replacement of lead and galvanized service lines became a top priority for the City of Flint in February 2016. By May 2017, over $100 million in State and Federal funds had been appropriated for Flint service line replacement, managed by the Flint Fast Action and Sustainability Team.)

(American Fact Finder: https://factfinder.census.gov)

Figure 5: When averaged over many parcels, lead levels display a number of very clear trends. Homes built in the 1950s and later display significantly lower lead levels than homes built in the early 1900s.

Table 1: Grid search best parameters for XGBoost

Number of Trees             512
Training Subsample Ratio    0.9
Tree Column Sample Ratio    0.6
Max Depth                   3
γ α λ

The calibration module from scikit-learn was used to calibrate the predicted probabilities of the classifier. We constructed various models using the scikit-learn libraries and the XGBoost Python package [6]. Tree-based methods, such as random forests, performed the best, with the XGBoost gradient boosted tree classifier achieving the best prediction result. The cross validation score after 250 runs for the classifier was 0 . ± . 01. A typical ROC curve is shown in Figure 6. The XGBoost parameters are found in Table 1.

The learning curve is shown in Figure 7. The convergence in the learning curve indicates that the model has been saturated with data. The initial steep decline in the training score indicates inherent bias in the model without sufficient data, but it declines with appropriate numbers of samples.

Figure 6: ROC curve of a typical train/test split.

Figure 7: The learning curve was produced using 10-fold cross validation and the scikit-learn model selection module. The convergence and small gap in the curve indicate that adding more data is unlikely to improve predictions.

We also implemented various regression models, directly modeling the continuous non-negative value of lead levels (ppb). However, compared to modeling the binary variable using the 15 ppb threshold, these consistently produced inferior results. For example, a collection of typical XGBoost regression models had a mean squared error of 305 ± 72. When the predicted lead levels were converted into a < 15 ppb classifier, AUC scores dropped to 0 . ± . 1. This lackluster performance of the continuous regression model is likely driven by both the large range of target values and measurement error, that is, high variance in lead levels even within the same parcel. The perceived weakness of the regression models led us to focus exclusively on classification.
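To make the regression-versus-classification comparison concrete, here is a minimal sketch of scoring continuous lead predictions against the binary 15 ppb label with AUC. It uses the pairwise-ranking definition of AUC in plain Python rather than any particular library, treats "above 15 ppb" as strictly greater than 15 (an assumption), and runs on hypothetical numbers.

```python
def binary_labels(lead_ppb, threshold=15):
    """1 if the reading exceeds the action level, else 0 (strict > assumed)."""
    return [1 if value > threshold else 0 for value in lead_ppb]


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half. Quadratic in the sample size, which is fine
    for a sketch but not for large datasets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical observed readings and regression outputs (ppb).
observed = [0, 3, 22, 150, 7, 16]
predicted = [1.0, 4.0, 9.0, 60.0, 12.0, 5.0]
labels = binary_labels(observed)
auc = roc_auc(labels, predicted)
```

Because AUC depends only on the ordering of the scores, a regressor with a large squared error can still rank homes reasonably, and the reverse also holds; the two metrics quoted above are therefore not directly comparable.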

After we determined the best model for predicting the water tests, we generated a prediction on all the parcels in the city of Flint. Figure 8 summarizes the location of the 1,000 homes predicted to be most likely to have lead in their water above the EPA action level. The homes in Figure 8 have not submitted lead tests yet. These predictions serve an important purpose, as they provide a risk assessment for homes that were never tested during the peak of the crisis. The analysis provides a predicted measure of lead exposure via water during the years 2014-16 for every home in Flint, which can be used for public health studies in the years to come.

Figure 8: The 1,000 parcels with the highest probability to submit a water sample with lead above the EPA action level.
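Selecting the highest-risk untested parcels, as in Figure 8, amounts to a filtered top-k query over the predicted probabilities. A minimal sketch follows, assuming the predictions are held in a dictionary keyed by parcel identifier; the names and the toy probabilities are hypothetical.

```python
import heapq


def top_risk_untested(predicted_risk, tested_parcels, k=1000):
    """Return up to k untested parcel ids, highest predicted risk first.

    predicted_risk: {parcel_id: probability of exceeding the action level}
    tested_parcels: set of parcel ids that already submitted a test
    """
    candidates = (
        (risk, parcel_id)
        for parcel_id, risk in predicted_risk.items()
        if parcel_id not in tested_parcels
    )
    return [parcel_id for _, parcel_id in heapq.nlargest(k, candidates)]


# Hypothetical toy predictions.
toy_risk = {"A": 0.9, "B": 0.2, "C": 0.7, "D": 0.8}
toy_tested = {"A"}
top_two = top_risk_untested(toy_risk, toy_tested, k=2)
```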

Feature importance with tree ensemble methods can be determined by the number of times the individual trees in the forest split on each feature. We break down the results into the following categories.

Various measures of a property’s value were determined to be important by the model. The top two features were consistently the value of the buildings and the value of the land. Additionally, land improvements and state assessed value were important.

Demographic data from the census bureau was also important. The model divided the city along lines of age and race. Some of the less important features that still contributed were whether homes had married parents and whether or not only English is spoken in the home.

Property Age. Finally, the age of the property was one of the most important features. This was visible in Figure 5. Other values that were correlated with property age also appeared, such as the estimated age of the population and whether or not elderly people were present.
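The split-count notion of importance mentioned at the start of this subsection can be sketched on toy trees. The nested-dictionary tree layout and the feature names below are hypothetical stand-ins; gradient boosting libraries expose comparable counts through their own interfaces, which are not reproduced here.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count the feature used at each internal node of a tree."""
    if node is None or "feature" not in node:
        return  # leaf nodes carry no split
    counts[node["feature"]] += 1
    count_splits(node.get("left"), counts)
    count_splits(node.get("right"), counts)


def split_count_importance(trees):
    """Fraction of all splits in the ensemble that use each feature."""
    counts = Counter()
    for root in trees:
        count_splits(root, counts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in counts.most_common()}


# Two hypothetical depth-2 trees; leaves hold a dummy score.
tree_1 = {
    "feature": "building_value",
    "left": {"feature": "year_built",
             "left": {"leaf": 0.1}, "right": {"leaf": 0.3}},
    "right": {"leaf": 0.05},
}
tree_2 = {
    "feature": "building_value",
    "left": {"leaf": 0.2},
    "right": {"feature": "land_value",
              "left": {"leaf": 0.1}, "right": {"leaf": 0.4}},
}
split_importance = split_count_importance([tree_1, tree_2])
```

Split counts favor features with many usable thresholds, such as continuous dollar values, which is worth keeping in mind when reading the rankings above.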

We initiated a Kaggle prediction challenge to improve our prediction accuracy. This was hosted by https://inclass.kaggle.com/ and offered to people affiliated with the University of Michigan. The contest involved a de-identified dataset with over 17,000 water tests from nearly 11,000 Flint homes. Along with the lead test results, some other de-identified features of the home and lead test, including property value, vacancy status, and time of test, were provided. During two months of competition, over 150 students and post-docs from various departments at the University of Michigan participated, submitting over 500 times in the process. The 1st, 2nd, and 3rd place winners had the opportunity to present their classifiers to the Michigan Data Science Team (MDST). The result of the challenge was a small improvement to our initial classification models. The winning submission achieved this through ensembling XGBoost models with other classification models. However, the second and third winning solutions used a Random Forest model. We observed a high degree of variance between Random Forest submissions, in part due to the intrinsic uncertainty in the predictions. Moreover, we learned that the most significant improvements came through adding additional data, rather than hyperparameter tuning.

Figure 9: The lead-level prediction problem was released as a UM internal prize-drive challenge. The competition was facilitated by the MDST.

MyWater-Flint App

Related to our modeling efforts, we were involved in a project funded by Google.org to develop a mobile app and website for the city of Flint to help the community and government agencies manage the ongoing water crisis. Figure 10 shows a screenshot of the app. The app development was a collaboration between Professor Mark Allison at University of Michigan – Flint, his students, and MDST, with support from Google.org. (The authors are members of MDST, http://midas.umich.edu/mdst/.)

Figure 10: Snapshot of the MyWater-Flint website.

The MyWater-Flint App uses the predictive model and features described earlier to identify homes at high, medium, and low risk of lead contamination. The users are also able to do the following:

• access a citywide map of where lead has been found in drinking water.
• discover where service line workers have replaced infrastructure that connects homes to the water main, and where they’re currently working.
• locate the nearest distribution centers for water and water filters.
• find step-by-step instructions for water testing.
• determine the likelihood that the water in a home or another location is contaminated, among other features.

We find that of the 32,741 occupied homes, 10,998 submitted at least one water test. Investigating the predictive factors behind when and how often submissions occur can help us understand the submission behavior of residents. We study this behavior and investigate features which correlate with water test submission variables.

Despite the low cost of submitting a residential water test, a large majority of the properties in Flint have not submitted any tests. Many properties are simply vacant; these properties are discarded from the analysis in this section. One hypothesis is that residents working long hours may not have the ability to conduct and deliver the test. Another hypothesis is that some may not know where to obtain one. In order to better understand why a property might make a submission, we employ several classifiers to predict whether a property has submitted. Of these, we choose the best model according to accuracy of the classification. We then calculate the feature importances to give insight into submission behavior.

The dataset we use is the result of joining block-level census data, City of Flint parcel information, and the residential water testing dataset. Combined, the joined dataset contains 60 features and 32,741 rows, where each row represents a parcel of land in Flint. As mentioned previously, vacant parcels are discarded. Then one-hot encoding is performed on all categorical features. The target variable is binary, where 0 means no submission and 1 means at least one submission.
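The data preparation just described (one-hot encoding plus a 0/1 submission target) can be sketched in a few lines. The column names, category values, and helper names below are hypothetical; in practice a library routine would normally do the encoding.

```python
def one_hot_encode(rows, categorical):
    """Replace each categorical column with 0/1 indicator columns.

    rows: list of dicts with identical keys
    categorical: names of the columns to expand
    """
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col]:
                out[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


def submission_target(parcel_ids, tests_per_parcel):
    """1 if the parcel submitted at least one test, else 0."""
    return [1 if tests_per_parcel.get(pid, 0) >= 1 else 0 for pid in parcel_ids]


# Hypothetical toy parcels.
toy_rows = [
    {"parcel_id": "P1", "zoning": "residential", "home_value": 42000},
    {"parcel_id": "P2", "zoning": "commercial", "home_value": 90000},
    {"parcel_id": "P3", "zoning": "residential", "home_value": 15000},
]
toy_counts = {"P1": 3, "P3": 0}  # P2 never appears in the test data
features = one_hot_encode(toy_rows, ["zoning"])
target = submission_target([r["parcel_id"] for r in toy_rows], toy_counts)
```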

Figure 11: This figure shows the 10 variables that an AdaBoost classifier deemed most important according to the Gini importance metric. The y-axis shows the (normalized) total reduction of the criterion brought by that feature. Larger values indicate more important features. Note that many of these features are related to parcel value.

We use an AdaBoost classifier from the scikit-learn Python package with n_estimators and learning_rate set to 200 and 0.2, respectively. We chose the AdaBoost model for its robustness to overfitting and its consistent performance at this classification task when compared to logistic regression with L2 regularization.

After training the model, we evaluate its performance using 5-fold cross-validation. The model consistently achieved a recall of 0.65 with a standard deviation of ±0.03, meaning the model correctly identified about 65% of the true positives in the cross-validation set.
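A minimal sketch of this training and evaluation setup is shown below. It uses a synthetic dataset from scikit-learn's make_classification as a stand-in for the Flint parcel data, so the printed recall will not match the figure reported above; only the two hyperparameter values and the 5-fold recall evaluation are taken from the text, and everything else (sample size, random seeds) is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 60 features (NOT the Flint data).
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=10, random_state=0
)

# Hyperparameters as stated in the text; the default base learner is a decision stump.
model = AdaBoostClassifier(n_estimators=200, learning_rate=0.2, random_state=0)

# Recall on each of 5 cross-validation folds.
recalls = cross_val_score(model, X, y, cv=5, scoring="recall")
mean_recall, std_recall = recalls.mean(), recalls.std()
print(f"recall = {mean_recall:.2f} +/- {std_recall:.2f}")
```

Reporting the mean and standard deviation across folds mirrors the "0.65 ± 0.03" style of summary used in the text.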

Of the 60 features used in training the model, proxies for property value are consistently the most important features. We calculate feature importances with the Gini importance metric [5]. The Gini importance of a feature is computed by averaging the Gini decreases for that feature over all trees [14]. See Figure 11 for a graph comparing the 10 most important features. Table 2 describes the marginal distribution of some of the most predictive features. Parcels which submit more than one test are generally more valuable, as shown by increases in “Residential Home Value”, “HomeSEV”, and “Parcel Acres”. We do not find that old homes, which would typically be at greatest risk for lead contamination, test less than other occupied properties. However, as illustrated in Table 3, the number of submissions from a property tends to increase with its value.

We find that of the various parcel, census, and infrastructure features considered by our models, features which describe the value of the parcel are more predictive than census demographic information. However, the census data available to us are reported at the block level and may not be granular enough to inform the classifiers effectively.
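The ranking of features by Gini importance can be sketched with scikit-learn, whose tree ensembles expose impurity-based importances through the feature_importances_ attribute (for AdaBoost, an average over the base trees weighted by each tree's boosting weight). The example below is an illustration on synthetic data, with hypothetical feature names; it does not reproduce Figure 11.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in with 60 features (NOT the Flint data).
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=10, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = AdaBoostClassifier(n_estimators=200, learning_rate=0.2, random_state=0)
model.fit(X, y)

# Impurity-based (Gini) importances, one value per feature.
importances = model.feature_importances_

# Indices of the 10 most important features, largest first.
top10 = np.argsort(importances)[::-1][:10]
for idx in top10:
    print(f"{feature_names[idx]:<12s} {importances[idx]:.3f}")
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is best read as a rough guide rather than a causal statement.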

Goovaerts (2017, [9]) questioned the “generalizability” of sentinel sites and argued that sentinel sites are less representative than voluntary residential water test data. However, the residential water test data could be biased due to the voluntary nature of the data collection process. The analysis in this work shows that the important features in our water lead level prediction and water test submission models overlap heavily. One hypothesis is that water tests are less likely to be submitted from houses built before 1930, but those old houses are also more likely to be suffering from high levels of lead exposure. Thus, to investigate quantitatively whether the water lead level has improved over time, we need to carefully correct the selection bias incurred by the data collection method [9]. We approach this problem by assigning correction weights to the residential test data when we calculate the quantile water lead level.

To get the weights, we take advantage of our predictive model for water test submission in Section 4.1. The model provides the probability $p_i$ of each parcel $i$ submitting at least one water test sample. This probability is used as a proxy quantity for over-representation. Each observed sample should be given a correction weight $w$ that is inversely proportional to the (predicted) probability that it can be collected. Denote the set of collected samples in a given time period as $S$; then $w_i = \frac{\sum_{j \in S} p_j}{p_i}$.

For any water test that couldn't be matched to government parcel records, we assign the median weight, and then we normalize each month's total weight to 1. After the weighting procedure, we examine the water lead level improvement over time. We note that despite the lack of a sampling strategy, the correction doesn't change the conclusion that over the whole year of 2016, the water lead level dropped after reaching its highest level around May.
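The reweighting idea can be illustrated with a short, dependency-free sketch. The helper names (correction_weights, weighted_quantile), the toy readings, and the toy probabilities are hypothetical, and the particular weighted-quantile rule used here (smallest value whose cumulative weight reaches q) is one simple choice among several, not necessarily the one used in the paper.

```python
from statistics import median


def correction_weights(probs):
    """Weights inversely proportional to predicted submission probability.

    probs: one probability per sample, or None when the test could not be
    matched to a parcel (such samples receive the median weight).
    Returns weights normalized to sum to 1 for the period.
    """
    raw = [None if p is None else 1.0 / p for p in probs]
    fallback = median(r for r in raw if r is not None)
    filled = [fallback if r is None else r for r in raw]
    total = sum(filled)
    return [r / total for r in filled]


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative normalized weight reaches q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight / total
        if cumulative >= q:
            return value
    return pairs[-1][0]


# Hypothetical month of lead readings (ppb) and predicted submission probabilities.
lead_ppb = [0.0, 1.0, 1.0, 2.0, 3.0, 4.0, 6.0, 9.0, 15.0, 25.0, 60.0, 150.0]
p_submit = [0.6, 0.6, 0.5, 0.5, None, 0.5, 0.4, 0.4, 0.4, 0.3, 0.2, 0.2]

weights = correction_weights(p_submit)
unweighted_p90 = weighted_quantile(lead_ppb, [1.0] * len(lead_ppb), 0.9)
reweighted_p90 = weighted_quantile(lead_ppb, weights, 0.9)
```

In this toy data the high readings come from samples with low predicted submission probability, so up-weighting them moves the 90th percentile estimate upward, which is the direction of correction the text describes.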
Parcel Features
Attribute                   Number of Submissions  Q1       Median   Q3       % sample non-zero
HomeSEV                     Zero                   $7,500   $10,500  $14,000  43%
                            One                    $8,600   $11,700  $16,500  67%
                            Two or more            $9,000   $12,400  $17,600  69%
Land Value                  Zero                   $787     $1,697   $5,039   99%
                            One                    $984     $2,652   $10,403  99%
                            Two or more            $1,074   $2,793   $15,984  99%
Residential Building Value  Zero                   $17,271  $32,891  $62,430  92%
                            One                    $18,539  $36,294  $70,922  96%
                            Two or more            $19,338  $40,541  $80,478  96%

Table 2: This table gives the quartiles of the most predictive features. We find parcels with at least one submission are more valuable. Because the parcel data has some missing values, we include a column that indicates the percentage of non-zero values for the given category.

[Table 3. Demographic Features. Columns: Attribute, Number of Submissions, Year Built, Q1, Median, Q3. Attribute: Aggregate Income.]

Table 3: This table presents the quartiles of household income for parcels that submitted zero, one, or more than one sample. We also separate the homes into two groups based on property age.

Figure 12: Comparing the 90th percentile of lead readings on voluntary testing data without/with the reweighting correction procedure for the selection bias. The error bar shows the standard deviation of the estimator by bootstrapping.

Goovaerts (2017) adapted a weighted average of stratum-specific rates to estimate the effect of sampling bias and concluded that voluntary testing captures the main characteristics of Flint properties much more closely than the sentinel program [9]. Though that work uses a different approach, its findings are consistent with the findings in this paper.

After the bias correction, the 90th percentile estimate of water lead level in some months increases by a small amount, which supports our hypothesis that the selection bias mostly results from the lack of submissions from the old houses most affected by the crisis. This trend has been noticed elsewhere [9]. Modern correction techniques may be able to provide better insights, but exploring them is beyond the scope of this work.

The lead contaminating Flint’s water systems poses a serious health risk for all of the city’s residents. There are two major challenges with assessing water contamination using samples tested for lead. The first is that the observed distribution of lead levels in water is fat tailed and highly skewed: the 95th percentile of Flint’s lead readings is 28 ppb, the 99th percentile is 180 ppb, and the 99.9th percentile is over 2,100 ppb. The second challenge is that measuring lead contamination is a highly noisy process.

We collaborated with the City of Flint and the Michigan Department of Environmental Quality to acquire data and joined these data with existing public data.
We used these data to build a model that predicts which homes are more likely at risk of high lead contamination. This model is employed in predictions shown on the MyWater-Flint app and website. We identified features which are strong predictors of high lead levels and found that a number of factors, not just the composition of service lines, are important to consider in addressing the crisis. Knowing these risk factors can help policy makers and community members better allocate limited resources and prioritize action in this time of need.

Our lead predictions may also have future value. By establishing each home’s chance of having had high lead during the 2014-16 crisis, this work provides a proxy for lead exposure to be used in studies tracking health outcomes for Flint residents in years to come.

This work is ongoing and serves as a model for university-community partnerships and for data-driven public policy decision making.

ACKNOWLEDGMENTS

This work was supported by Google.org and the National Science Foundation, grants IIS 1453304 and IIS 1421391. This work would not have happened without the support of the broader Michigan Data Science Team, including Jonathan Stroud and many others. The authors recognize the support of the Michigan Institute for Data Science (MIDAS) and computational support from NVIDIA.

REFERENCES

arXiv preprint arXiv:1610.00580 (2016).
[3] J. Archbold and K. Bassil. 2014. Health Impacts of Lead in Drinking Water. Technical Report.
[4] Rachel Baum, Jamie Bartram, and Steve Hrudey. 2016. The Flint Water Crisis Confirms That US Drinking Water Needs Improved Risk Management. Environmental Science & Technology (2016).
[5] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and regression trees. CRC press.
[6] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Science of The Total Environment
Science of The Total Environment 590 (2017), 139–153.
[10] Mona Hanna-Attisha, Jenny LaChance, Richard Casey Sadler, and Allison Champney Schnepp. 2016. Elevated blood lead levels in children associated with the Flint drinking water crisis: a spatial analysis of risk and public health response. American Journal of Public Health
Environmental Health Perspectives