An R Package for generating covariance matrices for maximum-entropy sampling from precipitation chemistry data
Hessa Al-Thani · Jon Lee
March 20, 2020
Abstract
We present an open-source R package (MESgenCov v 0.1.0) for temporally fitting multivariate precipitation chemistry data and extracting a covariance matrix for use in the MESP (maximum-entropy sampling problem). We provide multiple functionalities for modeling and model assessment. The package is tightly coupled with NADP/NTN (National Atmospheric Deposition Program / National Trends Network) data from their set of 379 monitoring sites, 1978–present. The user specifies the sites, chemicals, and time period desired, fits an appropriate user-specified univariate model for each site and chemical selected, and the package produces a covariance matrix for use by MESP algorithms.
Keywords maximum-entropy sampling · covariance matrix · environmental monitoring · environmetrics · NADP · NTN
Mathematics Subject Classification (2000) · · ·

Introduction
H. Al-Thani: Department of Industrial and Operations Engineering, Univ. of Michigan, Ann Arbor, MI 48105 USA. E-mail: [email protected]
J. Lee: Department of Industrial and Operations Engineering, Univ. of Michigan, Ann Arbor, MI 48105 USA. E-mail: [email protected]

The MESP (maximum-entropy sampling problem) (see [SW87,SW00,FL00,Lee12]) has been applied to many domains where the objective is to determine a "most informative" subset $Y_S$, of pre-specified size $s = |S| > 0$, from a Gaussian random vector $Y_N$, $|N| = n > s$. Information is typically measured by (differential) entropy. Generally, we assume that $Y_N$ has a joint Gaussian distribution with mean vector $\mu$ and covariance matrix $C$. Up to constants, the entropy of $Y_S$ is the log of the determinant of the principal submatrix $C[S,S]$. So, the MESP seeks to maximize the (log) determinant of $C[S,S]$, for some $S \subseteq N$ with $|S| = s$.

The MESP is NP-hard (see [KLQ95]), and there has been considerable work on algorithms aimed at exact solutions for problems of moderate size; see [KLQ95,Lee98,AFLW99,LW03,HLW01,AL04,BL07,Ans18b,Ans18a,CFLL20]. All of this algorithmic work is based on a branch-and-bound framework introduced in [KLQ95], and the bulk of the contributions in these references is on different methods for upper bounding the optimal value. This work has been developed and validated in the context of a very small number of data sets, despite the fact that multivariate data is, of course, widely available. The reason for this shortcoming is that despite all of the raw multivariate data that is available, it is not a simple matter to turn this data into meaningful covariance matrices for Gaussian random variables.

Our goal with the R package (MESgenCov v 0.1.0) that we have developed is to provide such a link, between readily available raw environmental-monitoring data and covariance matrices suitable for the MESP, in the context of environmental monitoring.
Our work fits squarely into recent efforts to better exploit massive amounts of available data for mathematical-programming approaches to decision problems. Even if we have reliable raw data, we can only make good decisions if we have a means to prepare that data so that we can populate our mathematical-programming models in such a way as to meet required assumptions.

We note that another R package of interest is [LZWC19]: "EnviroStat provides functions for spatio-temporal modeling of environmental processes and designing monitoring networks for them based on an approach described in [LZ06]".

In §1, we discuss application of the MESP to environmental monitoring and the NADP/NTN (National Atmospheric Deposition Program / National Trends Network) data environment. In §2, we describe our methodology. In §3, we describe our R package (MESgenCov v 0.1.0). In §4, we make some concluding remarks.
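To make the log-determinant objective concrete, here is a minimal R sketch, with a made-up 3×3 covariance matrix, that solves a tiny MESP instance by brute force (exhaustive enumeration; real instances require the branch-and-bound methods cited above):

```r
# Toy MESP: among all subsets S of size s, maximize log det C[S,S].
# The matrix C below is invented for illustration; exhaustive search
# is only viable for very small n.
C <- matrix(c(2.0, 0.3, 0.1,
              0.3, 1.5, 0.2,
              0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)
n <- nrow(C); s <- 2
subsets <- combn(n, s)                       # each column is one candidate S
obj <- apply(subsets, 2, function(S)
  as.numeric(determinant(C[S, S], logarithm = TRUE)$modulus))
best <- subsets[, which.max(obj)]
best       # indices of the most informative pair
max(obj)   # its log-determinant
```

For this particular C, the most informative pair is the one whose principal submatrix has the largest determinant; the same enumeration scales to any n and s, just exponentially slowly.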
1 Application to environmental monitoring

A key area of application for the MESP has been in environmental monitoring (see [ZSL00,BLZ94,GLSZ93], for example). The idea is that precipitation is collected at many sites, and its chemistry is analyzed. This is costly, and it is a natural question as to whether a subset of the sites might yield data without much loss of information (as measured by entropy). But it is a challenge to process the raw data in such a way that multivariate normality is achieved, because only then are the model of the MESP and its related algorithms applicable.
The NADP maintains the NTN (see [NAD18]); this network measures the chemistry (i.e., ammonium, calcium, chloride, hydrogen, magnesium, nitrate, pH, potassium, sodium, and sulfate) of precipitation at 379 monitoring sites across the US, with some data available as far back as 1978; at present, 255 sites are active.

Our R package is tightly coupled with this precipitation and chemistry data. We are interested in instances of the MESP where n user-specified site/chemical pairs comprise N. Precipitation data (measured in L) are available on a daily basis, and chemical concentrations (measured in mg/L) are available on a weekly basis. These datasets are available in the packages EnvMonDataConc and EnvMonDataPre. They can be installed and loaded using the devtools library. If not already installed, the devtools library can be installed and loaded as follows:

> install.packages("devtools")
> library(devtools)

Then we can install the data packages (from GitHub) and load them:

> install_github("hessakh/EnvMonDataConc")
> install_github("hessakh/EnvMonDataPre")
> library(EnvMonDataConc)
> data("weeklyConc")
> library(EnvMonDataPre)
> data("preDaily")
A full description of the daily and weekly precipitation data appears in Appendix A, derived from http://nadp.slh.wisc.edu/data/ntn/meta/ntn-daily-Meta.pdf and http://nadp.slh.wisc.edu/data/ntn/meta/ntn-weekly-Meta.pdf, courtesy of the NADP.

Small snapshots of the data can easily be viewed. For example, we can output the first 6 rows and first 5 columns of the weekly raw data.

> weeklyConc[1:6,1:5]
  siteID              dateon             dateoff yrmonth    ph
1   AB32 2016-09-13 18:40:00 2016-09-20 15:10:00  201609 -9.00
2   AB32 2016-09-20 15:15:00 2016-09-28 16:00:00  201609 -9.00
3   AB32 2016-09-28 16:00:00 2016-10-05 16:55:00  201610  6.56
4   AB32 2016-10-05 16:55:00 2016-10-11 17:00:00  201610 -9.00

National Atmospheric Deposition Program (NRSP-3). 2019. NADP Program Office, Wisconsin State Laboratory of Hygiene, 465 Henry Mall, Madison, WI 53706.
Note that missing concentration values are coded as -9.00.

2 Methodology

For $t = 0, 1, \ldots, T-1$, let

$W(t)$ := set of weeks in month $t$,
$D(w)$ := set of days in week $w$,
$c_w$ := recorded chemical concentration (mg/L) for week $w$ ($c_w = *$ denotes an unrecorded value),
$p_d$ := recorded precipitation quantity (L) for day $d$,
$p_w$ := precipitation quantity (L) for week $w$; $p_w = \sum_{d \in D(w)} p_d$.

Then the chemical concentration (mg/L) for month $t$ is calculated as
$$
y(t) := \frac{\sum_{w \in W(t):\, c_w \neq *} p_w c_w}{\sum_{w \in W(t):\, c_w \neq *} \sum_{d \in D(w)} p_d}.
$$
It should be noted that when there is no weekly value available for the chemical quantity, we do not use the precipitation values for any of the days in such a week (so as to not artificially dilute the chemical concentration level for the month).

Next, we fit a temporal model to $\log(y(t))$, which is a rather standard method for handling heavy-tailed distributions. We note that our data has no zero concentrations, so there is no issue of "$\log(0)$".

A quick look at some graphics indicates that there are clear long-term trends; see Figure 1 (reprinted with the kind permission of the National Atmospheric Deposition Program (NRSP-3), 2019; NADP Program Office, Wisconsin State Laboratory of Hygiene, 465 Henry Mall, Madison, WI 53706), from which we can see that sulfate concentrations are generally trending downward over time.

[Figure 1 consists of four NADP/NTN maps of sulfate ion concentration (sulfate as SO4, mg/L) for the years 1985, 1995, 2005, and 2015.]

Fig. 1: Sulfate concentration over time

Again, looking at some data, we can easily see periodic trends; see Figure 2, where we can easily see a yearly periodicity.

The general model that we provide is
$$
\log(y(t)) \doteq \sum_{i=0}^{r} \beta_i t^i + \sum_{j=1}^{k} \left[ a_j \cos\left(\frac{2\pi j t}{12}\right) + b_j \sin\left(\frac{2\pi j t}{12}\right) \right], \qquad (1)
$$
with the parameters $\beta_i$, $a_j$, and $b_j$ fit by ordinary linear regression. The user can specify the degree $r$ for the polynomial part of the model, which we think of as a truncated Taylor series, aimed at capturing aperiodic trends. Periodic trends are captured via a truncated Fourier series, truncated at level $k$.

We note that [GLSZ93] used the following model to de-seasonalize and de-trend the log-transformed monthly sulfate concentration values:
$$
\log(y(t)) \doteq \beta_0 + \beta_1 t + a_1 \cos\left(\frac{2\pi t}{12}\right) + b_1 \sin\left(\frac{2\pi t}{12}\right). \qquad (2)
$$
This is just an affine model $\beta_0 + \beta_1 t$ plus a sinusoid with yearly periodicity. The simple model (2) is (1) with $r = 1$ and $k = 1$. We found that (2) did well at normalizing the errors for certain sites,
but for some sites, such as "AL10" and "IN41", (2) did not do so well. So rather than fix (2) as the model for our R package, we provide the flexibility of (1). In Figure 3, we provide an example where (1) with r = 1 and k = 3 is a much better fit for the log concentration of sulfate at site "TN00" than (2). The plot on the left shows the fit of model (2), and the plot on the right shows the fit of our model (1).

Fig. 2: Log sulfate concentration over a four-year period at a site

Fig. 3: Log sulfate concentrations at site "TN00"
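Fitting a model of the form (1) reduces to ordinary linear regression on polynomial and trigonometric terms. The following sketch, with simulated monthly data (the series logy, its coefficients, and the six-year horizon are all made up for illustration, not NADP data), fits (1) with r = 1 and k = 1:

```r
# Fit model (1) with r = 1, k = 1 by ordinary least squares on a
# simulated log-concentration series with a linear trend and a yearly cycle.
set.seed(1)
t <- 0:71                                  # six years of monthly observations
s <- 12                                    # yearly period in monthly data
logy <- 1.1 - 0.002 * t - 0.35 * cos(2 * pi * t / s) + rnorm(length(t), sd = 0.2)
dat <- data.frame(logy = logy, t = t)
fit <- lm(logy ~ I(t) + I(cos(t * (2 * pi / s))) + I(sin(t * (2 * pi / s))),
          data = dat)
coef(fit)        # estimates of beta_0, beta_1, a_1, b_1
head(resid(fit)) # residuals feed the covariance construction
```

Raising k adds harmonic pairs cos(2*pi*j*t/s), sin(2*pi*j*t/s) for j up to k, and raising r adds higher powers of t, exactly as in (1).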
To produce the covariance matrix, we need error values for each time point. Missing values in the NADP/NTN data set mean that each set of error values produced by the model may vary in size. So we have filled in missing values for each site and time point by sampling from a normal distribution, with the mean being the value predicted by the univariate model and the standard deviation being the standard error of the univariate model.
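A simplified stand-in for this imputation step (not the package's internal code; the vectors y and pred and the value of sigma are illustrative):

```r
# Fill missing observations by sampling from a normal distribution centered
# at the univariate model's predicted value, with sd equal to the model's
# standard error (sigma here plays that role).
set.seed(42)
impute_missing <- function(y, pred, sigma) {
  miss <- is.na(y)
  y[miss] <- rnorm(sum(miss), mean = pred[miss], sd = sigma)
  y
}
y    <- c(0.9, NA, 1.3, NA, 1.0)    # observed log-concentrations with gaps
pred <- c(1.0, 1.1, 1.2, 1.1, 1.0)  # fitted values from the univariate model
filled <- impute_missing(y, pred, sigma = 0.25)
filled                              # no NA values remain
```

Observed values are left untouched; only the NA slots are replaced by draws, so every site contributes an error series of the same length.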
3 The R package MESgenCov

Our R package MESgenCov can be obtained from https://github.com/hessakh/MESgenCov and installed using the devtools library.

> install_github("hessakh/MESgenCov")
> library(MESgenCov)
MESgenCov contains functions in the S3 class to create a covariance matrix from the desired subset of NADP/NTN data. The function getCov() returns a covariance matrix, a list of univariate model summaries, and a table of normality tests produced by the MVN R package (see [KGZ14]). The multivariate analysis is used to assess the validity of the covariance matrix to be used as input for the MESP. The user can make adjustments to the input of getCov() so as to obtain a covariance matrix for a multivariate Gaussian vector, which is then valid for use in the MESP.

To avoid sites with a small sample size for the specified time-frame, the function getSites() outputs a vector of the sites with the largest sample of data for a given time-frame and measured chemical (see §3.2). To find sites that are spatially "spread out" but have at least some specified sample size, the function maxDistSites() can be used to obtain a list of geographically sparse sites (see §3.2). Finally, in the case where the residuals from a univariate model do not appear to be normally distributed, the function lambertWtransform() allows the user to transform the residuals (from a univariate model) using the R package LambertW: Probabilistic Models to Analyze and Gaussianize Heavy-Tailed, Skewed Data (see [Goe16]). This can be very effective in situations where the distributions seem to have heavy tails and some skewness (see [Goe11] and §3.3).

3.1 getCov

getCov() takes a 14-column data frame as input, where each column corresponds to one of the user-specifications shown in Figure 4. The 14 specifications in the input allow the user to specify the subset of data to analyze and give the user options in displaying different parts of the analysis.
Argument             Definition
startdateStr         Date and time of when to start analyzing the data, in the format m/d/y H:M
enddateStr           Date and time of when to stop analyzing the data, in the format m/d/y H:M
comp                 String of pollutant or acidity level to be analyzed; the pollutant's name should be used as it appears in weeklyConc
use36                TRUE if the default 36 sites should be added, FALSE otherwise
siteAdd              List of strings of siteIDs that should be analyzed
outlierDatesbySite   Specify siteID string for outlier removal by observation
siteOutliers         List of sites where outliers should be analyzed
removeOutliers       List of sites where outliers should be removed
plotMulti            TRUE if multivariate analysis plots should be displayed, FALSE otherwise
sitePlot             Specify list of siteIDs to be plotted
plotAll              TRUE if plots for all sites should be displayed, FALSE otherwise
writeMat             TRUE if a .mat file of the resulting covariance matrix should be written in the working directory
r                    Integer <= 5; see univariate model
k                    Integer <= 5; see univariate model

Fig. 4: Input parameters for getCov()
A default set of inputs can be found in the stored data frame "defaultInput". Each column of "defaultInput" is an argument in the function getCov(). After storing "defaultInput" in a variable in the user's workspace, the input can be changed. For example, below we store the "defaultInput" data frame in a variable "df", and then change the end date:

> data("defaultInput")
> df <- defaultInput
> df
    startdateStr     enddateStr use36 siteAdd
1 01/01/83 00:00 12/31/86 00:00  TRUE    NULL
  outlierDatesbySite siteOutliers comp plotMulti sitePlot
1               NULL         NULL  SO4     FALSE     NULL
  plotAll writeMat r k
1   FALSE    FALSE 1 1
> df$enddateStr <- "12/31/88 00:00"
The function getCov() produces a list with the following elements:
Output           Definition
cov              Covariance matrix produced by univariate model residuals
listMod          List of univariate model summaries produced by lm()
sites            List of sites that were analyzed
mvn              Output of the MVN package
univariateTest   Univariate test output, also by the MVN package
residualData     Data frame of residuals produced by the univariate model
residualDataNA   Data frame of residuals, where missing values are left as NA
rosnerTest       Output of the Rosner test for outlier analysis produced by the EnvStats package; see [Mil13]
pred             List of predicted values produced by the univariate model for each site

Next, we demonstrate how to access these elements and certain plots after running the function getCov().

1. Multivariate and univariate normality

> df$plotMulti <- TRUE
> df$k <- 3
> g <- getCov(df)
> g$univariateTest
          Test Variable Statistic p value Normality
1 Shapiro-Wilk  AL10SO4    0.9347   0.001        NO
2 Shapiro-Wilk  IL11SO4    0.9894  0.8121       YES
3 Shapiro-Wilk  IL18SO4    0.9909  0.8854       YES
4 Shapiro-Wilk  IL19SO4    0.9184   2e-04        NO
5 Shapiro-Wilk  IL35SO4    0.8709  <0.001        NO
2. Outlier test for specific sites

df$siteOutliers <- list(c("IN41"))
df$sitePlot <- list(c("IN41"))
g <- getCov(df)
i <- match("IN41",g$sites)
g$rosnerTest[[i]]$all.stats
  i  Mean.i   SD.i   Value Obs.Num  R.i+1 lambda.i+1 Outlier
1 0 -0.0069 0.2814 -0.9359      25 3.3013     3.2680    TRUE
2 1  0.0062 0.2604 -0.7194      30 2.7862     3.2628   FALSE
3 2  0.0165 0.2471  0.7215      66 2.8533     3.2576   FALSE
By changing the input, we can remove the outliers detected by the Rosner test. Note that the plots are generated after running getCov(). Furthermore, getCov() does not need to be stored in a variable to generate the plots.
> df$outlierDatesbySite <- c("IN41",25)
> getCov(df)
3. Outlier test for all sites

df$siteOutliers <- list(g$sites)
df$removeOutliers <- list(g$sites)
g <- getCov(df)
4. Plot all sites

> df$plotAll <- TRUE
> getCov(df)
5. Output all MVN package analysis

The following output is from a call to the MVN package that produces multivariate analysis based on the Mardia method and univariate analysis based on the Shapiro-Wilk method.

> df$use36 <- FALSE
> df$siteAdd <- list(c("NY52", "TN11", "IL63"))
> df$siteOutliers <- NULL
> df$outlierDatesbySite <- NULL
> df$removeOutliers <- NULL
> g <- getCov(df)
> g$mvn$multivariateNormality
             Test Statistic p value Result
1 Mardia Skewness    14.019  0.1721    YES
2 Mardia Kurtosis   -0.1465  0.8835    YES
3             MVN
The specific call of the MVN package is

> mvn(dfRes[,-1], subset = NULL, mvnTest = "mardia",
    covariance = TRUE, tol = 1e-25, alpha = 0.5, scale = FALSE,
    desc = TRUE, transform = "none", univariateTest = "SW",
    univariatePlot = "none", multivariatePlot = "none",
    multivariateOutlierMethod = "none", bc = FALSE,
    bcType = "rounded", showOutliers = FALSE, showNewData = FALSE)
See [KGZ14] for details on the MVN package.

6. Covariance matrix

> g <- getCov(df)
> round(g$cov,digits = 4)
        NY52SO4 TN11SO4 IL63SO4
NY52SO4  0.0791  0.0047  0.0009
TN11SO4  0.0047  0.2177  0.0185
IL63SO4  0.0009  0.0185  0.0909
7. Save covariance matrix as a .mat file (to populate an instance of the MESP, for example)

This is done by simply setting the input data frame attribute writeMat to TRUE. The .mat file will be saved to the user's current working directory as covSites.mat. For processing further with Matlab, use the (Matlab) 'load' command.

> df$writeMat <- TRUE
> g <- getCov(df)
In the case that the user has already generated an output by the function getCov(), it is also possible to create the .mat file in the following manner.

> library(rmatio)
> write.mat(g$cov,filename = "covariance1.mat")
8. Univariate model summaries

> result <- getCov(df)
> sites <- result$sites
> i = match(c("NY52"),sites)
> result$listMod[i][[1]]

Call:
lm(formula = y1 ~ I(cos(t*(2*pi/s))) + I(sin(t*(2*pi/s))) +
    I(cos(t*(2*pi/s)*2)) + I(sin(t*(2*pi/s)*2)) + I(cos(t*(2*pi/s)*3)) +
    I(sin(t*(2*pi/s)*3)) + I(t), data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.5236 -0.1677 -0.0189  0.1818  0.9087

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            1.1110     0.0712   14.40  < 2e-16 ***
I(cos(t*(2*pi/s)))    -0.3541     0.0494   -9.75  2.8e-14 ***
I(sin(t*(2*pi/s)))    -0.1109     0.0498   -2.55    0.013 *
I(cos(t*(2*pi/s)*2))   0.0500     0.0494   -0.37    0.716
I(sin(t*(2*pi/s)*2))   0.0797     0.0494    2.13    0.037 *
I(cos(t*(2*pi/s)*3))   0.0391     0.0494   -0.20    0.845
I(sin(t*(2*pi/s)*3))   0.0536     0.0494   -0.52    0.604
I(t)                  -0.0010     0.0017   -0.61    0.546
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2961 on 64 degrees of freedom
Multiple R-squared: 0.6265, Adjusted R-squared: 0.5857
F-statistic: 15.34 on 7 and 64 DF, p-value: 1.327e-11
9. Output data frame of residuals

> g <- getCov(df)
> g$residualData[1:5,]
     NY52SO4     TN11SO4    IL63SO4
1  0.1959813 -0.44377735  0.2824803
2  0.6170340 -0.38791938 -0.3495080
3 -0.4510128  1.05519620 -0.1158200
4 -0.1580145 -0.01789642 -0.1297776
5 -0.3218159 -0.47833685 -0.3415119

3.2 getSites and maxDistSites

getCov() takes site lists as input. The function getSites() produces a list of sites with available data for a specified time frame. The code below produces a list of 36 sites with the most weekly data between the years 1983–1986.

> result <- getSites("01/01/83 00:00","12/31/86 00:00",36,104,"SO4","")
> result$finalList
 [1] "OH71" "NY08" "WV18" "MI53" "NH02" "OH49" "PA42" "ME09"
 [9] "IN34" "MA13" "NY52" "NY10" "WA14" "NY20" "OH17" "ME00"
[17] "TN00" "IL63" "MI99" "WI28" "IN41" "PA29" "WI36" "ME02"
[25] "MI09" "MO05" "NC03" "NJ99" "PA15" "CO19" "MN18" "WI37"
[33] "AR27" "KS31" "ME98" "MO03"
The 4th input specifies the minimum sample of weekly data required for a site to be included in the produced list, and the last input tells the function to only look at sites in the Northern region of the US. The options for region are "W", "S", and "N"; see Appendix B for the geographic split.

> NSites <- getSites("01/01/83 00:00","12/31/86 00:00",36,104,"SO4","N")
> NSites$finalList
 [1] "OH71" "NY08" "MI53" "NH02" "OH49" "PA42" "ME09" "IN34"
 [9] "MA13" "NY52" "NY10" "NY20" "OH17" "ME00" "MI99" "WI28"
[17] "IN41" "PA29" "WI36" "ME02" "MI09" "NJ99" "PA15" "MN18"
[25] "WI37" "ME98" "IL11" "IL18" "MN16" "MI26" "NE15" "VT01"
[33] "NY99" "MA01" "MA08" "MN27"
The function maxDistSites() prioritizes sites that are farther away from each other. This function takes the same arguments as input as getSites(), except for the last argument, where instead of specifying a region, the user can specify which site should be included first. If the last argument is 1, then the site with the greatest amount of data for the specified time period will be chosen; if the last argument is 2, then the site with the second greatest amount of data will be chosen; etc.

> maxdist <- maxDistSites("01/01/83 00:00","12/31/86 00:00",36,104,"SO4",1)
> maxdist$finalList
 [1] "OH71" "WA14" "TX04" "FL11" "ME00" "WY06" "MN27" "LA12"
 [9] "CA45" "OK00" "NY99" "GA41" "MI99" "AZ03" "MT05" "NC35"
[17] "MO05" "CO00" "WY99" "IN34" "KY03" "MI09" "FL03" "MA01"
[25] "OR10" "PA42" "AR27" "MN16" "TX21" "VT99" "NE15" "VA13"
[33] "CO15" "CO22" "NY52" "AR02"

3.3 lambertWtransform

We provide a function lambertWtransform() that will allow a user to transform the residuals produced by the deterministic univariate model. The LambertW package estimates the parameters that fit a Lambert W distribution on the given univariate data. Then the underlying Gaussian distribution implied by the Lambert W distribution is extracted and is used for the multivariate analysis in the function lambertWtransform(). The lambertWtransform() function takes the following as input: a data frame of residuals, and two logical inputs specifying whether to plot the multivariate qq plot and whether to produce the .mat file containing the covariance matrix with the Lambert W transformed residuals. Details on the algorithms that perform the transformation can be found in [Goe11]. Here we provide
an example where we transform the residuals of 50 sites stored in an internal dataset, named "dfRes50".

> data("dfRes50")
> loutput <- lambertWtransform(dfRes=dfRes50, plotMulti=FALSE, writeMat=FALSE)
> loutput$mvn$multivariateNormality
             Test Statistic p value Result
1 Mardia Skewness     22800  0.0004     NO
2 Mardia Kurtosis     0.418  0.6763    YES
3             MVN
This function produces a list of four outputs:

1. loutput$mvn contains the results of applying the multivariate analysis by the MVN package
2. loutput$cov contains the covariance matrix produced by the transformed residuals
3. loutput$newResiduals contains the data frame of Lambert W transformed residuals
4. loutput$univariateTest contains the univariate tests produced by the MVN function for the transformed residuals

> data("dfRes50")
> loutput <- lambertWtransform(dfRes=dfRes50, plotMulti=FALSE, writeMat=FALSE)
> loutput$mvn
> loutput$cov
> loutput$newResiduals
> loutput$univariateTest
Next, we present an example where we use maxDistSites() to get a list of 50 sites that is geographically sparse and has at least 200 weeks of data between 1986 and 1994. From this list of sites, a covariance matrix and its corresponding multivariate normality test are generated and compared to the Lambert W transformed output.

> maxd <- maxDistSites("01/01/86 00:00","12/31/94 00:00",50,200,"SO4",1)
> df <- defaultInput
> df$siteAdd <- list(maxd$finalList)
> df$startdateStr <- maxd$startDate
> df$use36 <- FALSE
> df$comp <- maxd$comp
> df$enddateStr <- maxd$endDate
> df$writeMat <- TRUE
> output <- getCov(df)
> output$mvn$multivariateNormality
             Test Statistic p value Result
1 Mardia Skewness     22962 2.6e-05     NO
2 Mardia Kurtosis    0.2408  0.8097    YES
3             MVN
4 Concluding remarks

We are currently working on enhancements to MESgenCov. Ultimately, we would like to make it easy to use data sets from other application domains, and to make it easier for a user to use other univariate models than the one we provide. Finally, we hope to eventually have a seamless integration with algorithms for the MESP.
Acknowledgments
The authors are very grateful to Dr. Martin Shafer and Robert Larson for helping us gain access to the NADP/NTN data in a convenient form.
Financial and Ethical disclosures
J. Lee was funded by the Air Force Office of Scientific Research (Complex Networks program), FA9550-19-1-0175. H. Al-Thani was funded by the Qatar National Research Fund (Graduate Sponsorship Research Award), GSRA4-2-0526-17114. The authors declare that they have no conflict of interest.
References
AFLW99. Kurt M. Anstreicher, Marcia Fampa, Jon Lee, and Joy Williams. Using continuous nonlinear relaxations to solve constrained maximum-entropy sampling problems. Mathematical Programming, 85(2, Ser. A):221–240, 1999.
AL04. Kurt M. Anstreicher and Jon Lee. A masked spectral bound for maximum-entropy sampling. In mODa 7—Advances in model-oriented design and analysis, Contrib. Statist., pages 1–12. Physica, Heidelberg, 2004.
Ans18a. Kurt M. Anstreicher. Efficient solution of maximum-entropy sampling problems. Preprint, July 2018.
Ans18b. Kurt M. Anstreicher. Maximum-entropy sampling and the boolean quadric polytope. Journal of Global Optimization, 72(4):603–618, 2018.
BL07. Samuel Burer and Jon Lee. Solving maximum-entropy sampling problems using factored masks. Mathematical Programming, 109(2-3, Ser. B):263–281, 2007.
BLZ94. Philip J. Brown, Nhu D. Le, and James V. Zidek. Multivariate spatial interpolation and exposure to air pollutants. Canad. J. Statist., 22(4):489–509, 1994.
CFLL20. Zhongzhu Chen, Marcia Fampa, Amélie Lambert, and Jon Lee. Mixing convex-optimization bounds for maximum-entropy sampling, 2020. http://arxiv.org/abs/2001.11896.
FL00. Valerii Fedorov and Jon Lee. Design of experiments in statistics. In Handbook of semidefinite programming, volume 27 of Internat. Ser. Oper. Res. Management Sci., pages 511–532. Kluwer Acad. Publ., Boston, MA, 2000.
GLSZ93. Peter Guttorp, Nhu D. Le, Paul D. Sampson, and James V. Zidek. Using entropy in the redesign of an environmental monitoring network. In Multivariate environmental statistics, volume 6 of North-Holland Ser. Statist. Probab., pages 175–202. North-Holland, Amsterdam, 1993.
Goe11. Georg M. Goerg. Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation. Annals of Applied Statistics, 5(3):2197–2230, 2011.
Goe16. Georg M. Goerg. LambertW: Probabilistic Models to Analyze and Gaussianize Heavy-Tailed, Skewed Data. Version 0.6.4, 2016. https://cran.r-project.org/web/packages/LambertW/.
HLW01. Alan J. Hoffman, Jon Lee, and Joy Williams. New upper bounds for maximum-entropy sampling. In mODa 6—Advances in model-oriented design and analysis (Puchberg/Schneeberg, 2001), Contrib. Statist., pages 143–153. Physica, Heidelberg, 2001.
KGZ14. Selcuk Korkmaz, Dincer Goksuluk, and Gokmen Zararsiz. MVN: An R package for assessing multivariate normality. The R Journal, 6(2):151–162, 2014.
KLQ95. Chun-Wa Ko, Jon Lee, and Maurice Queyranne. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691, 1995.
Lee98. Jon Lee. Constrained maximum-entropy sampling. Operations Research, 46(5):655–664, 1998.
Lee12. Jon Lee. Maximum entropy sampling. In Encyclopedia of Environmetrics, A.H. El-Shaarawi and W.W. Piegorsch, eds., 2nd edition, pages 1570–1574. Wiley, 2012.
LW03. Jon Lee and Joy Williams. A linear integer programming bound for maximum-entropy sampling. Mathematical Programming, 94(2-3, Ser. B):247–256, 2003.
LZ06. Nhu D. Le and James V. Zidek. Statistical Analysis of Environmental Space-Time Processes. Springer Series in Statistics. Springer, New York, 2006.
LZWC19. Nhu Le, Jim Zidek, Rick White, and Davor Cubranic. EnviroStat: Statistical analysis of environmental space-time processes. Version 0.4-2, 2019. https://rdrr.io/cran/EnviroStat/.
Mil13. Steven P. Millard. EnvStats: An R Package for Environmental Statistics. Springer, New York, 2013.
NAD18. NADP. National Atmospheric Deposition Program, National Trends Network, 2018. https://nadp.slh.wisc.edu/ntn/.
RC12. A.C. Rencher and W.F. Christensen. Methods of Multivariate Analysis. Wiley Series in Probability and Statistics. Wiley, 2012.
SW87. Michael C. Shewry and Henry P. Wynn. Maximum entropy sampling. Journal of Applied Statistics, 14(2):165–170, 1987.
SW00. Paola Sebastiani and Henry P. Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145–157, 2000.
ZSL00. James V. Zidek, Weimin Sun, and Nhu D. Le. Designing and integrating composite networks for monitoring multivariate Gaussian pollution fields. Journal of the Royal Statistical Society. Series C. Applied Statistics, 49(1):63–79, 2000.
Appendix A

NADP/NTN Daily Data

Column  Field      Data type  Description
1       SiteID     Char(4)    Site identifier
2       StartTime  Char(16)   Period start, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
3       EndTime    Char(16)   Period end, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
4       Amount     Integer    Precipitation depth, inches. Missing = -9, trace precipitation amount = -7
NADP/NTN Weekly Data

Column  Field    Data type  Description
1       SiteID   Char(4)    Site identifier
2       DateOn   Char(16)   Date on which the sample bucket was installed on the collector, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
3       DateOff  Char(16)   Date on which the sample bucket was removed from the collector, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
4       yrMonth  Integer    Year and month of sample midpoint, in YYYYMM format
5       ph       Decimal    Negative log of the hydrogen ion concentration as measured at the CAL, in pH units
6       Ca       Decimal    Ca concentration, mg/L
7       Mg       Decimal    Mg concentration, mg/L
8       K        Decimal    K concentration, mg/L
9       Na       Decimal    Na concentration, mg/L
10      NH4      Decimal    NH4 concentration, mg/L
11      NO3      Decimal    NO3 concentration, mg/L
12      Cl       Decimal    Cl concentration, mg/L
13      SO4      Decimal    SO4 concentration, mg/L
14      Br       Decimal    Br concentration, mg/L
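When working with these raw fields directly, the sentinel codes above need to be handled before any averaging; a small sketch (the function names are ours, not part of MESgenCov, and treating trace precipitation as zero depth is an assumption):

```r
# Recode NADP/NTN sentinel values: -9 marks a missing value in both data sets,
# and -7 marks a trace precipitation amount in the daily data.
clean_weekly_value <- function(x) replace(x, x == -9, NA)  # missing -> NA
clean_daily_amount <- function(x) {
  x[x == -9] <- NA   # missing precipitation depth
  x[x == -7] <- 0    # treat trace precipitation as zero depth (an assumption)
  x
}
clean_weekly_value(c(6.56, -9.00, 0.31))  # second entry becomes NA
clean_daily_amount(c(0.12, -7, -9))       # trace -> 0, missing -> NA
```

Without this step, the sentinel -9.00 values (visible in the weeklyConc snapshot in §1) would be averaged as if they were genuine concentrations.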