An R Package for generating covariance matrices for maximum-entropy sampling from precipitation chemistry data
Hessa Al-Thani · Jon Lee
March 20, 2020
Abstract
We present an open-source R package (MESgenCov v 0.1.0) for temporally fitting multivariate precipitation chemistry data and extracting a covariance matrix for use in the MESP (maximum-entropy sampling problem). We provide multiple functionalities for modeling and model assessment. The package is tightly coupled with NADP/NTN (National Atmospheric Deposition Program / National Trends Network) data from their set of 379 monitoring sites, 1978–present. The user specifies the sites, chemicals, and time period desired, fits an appropriate user-specified univariate model for each site and chemical selected, and the package produces a covariance matrix for use by MESP algorithms.
Keywords maximum-entropy sampling · covariance matrix · environmental monitoring · environmetrics · NADP · NTN
Mathematics Subject Classification (2000) · · ·

Introduction
H. Al-Thani: Department of Industrial and Operations Engineering, Univ. of Michigan, Ann Arbor, MI 48105 USA. E-mail: [email protected]
J. Lee: Department of Industrial and Operations Engineering, Univ. of Michigan, Ann Arbor, MI 48105 USA. E-mail: [email protected]

The MESP (maximum-entropy sampling problem) (see [SW87,SW00,FL00,Lee12]) has been applied to many domains where the objective is to determine a "most informative" subset $Y_S$, of pre-specified size $s = |S| > 0$, from a Gaussian random vector $Y_N$, $|N| = n > s$. Information is typically measured by (differential) entropy. Generally, we assume that $Y_N$ has a joint Gaussian distribution with mean vector $\mu$ and covariance matrix $C$. Up to constants, the entropy of $Y_S$ is the log of the determinant of the principal submatrix $C[S,S]$. So, the MESP seeks to maximize the (log) determinant of $C[S,S]$, for some $S \subseteq N$ with $|S| = s$.

The MESP is NP-hard (see [KLQ95]), and there has been considerable work on algorithms aimed at exact solutions for problems of moderate size; see [KLQ95,Lee98,AFLW99,LW03,HLW01,AL04,BL07,Ans18b,Ans18a,CFLL20]. All of this algorithmic work is based on a branch-and-bound framework introduced in [KLQ95], and the bulk of the contributions in these references is on different methods for upper bounding the optimal value. This work has been developed and validated in the context of a very small number of data sets, despite the fact that multivariate data is, of course, widely available. The reason for this shortcoming is that despite all of the raw multivariate data that is available, it is not a simple matter to turn this data into meaningful covariance matrices for Gaussian random variables.

Our goal with the R package (MESgenCov v 0.1.0) that we have developed is to provide such a link, between readily available raw environmental-monitoring data and covariance matrices suitable for the MESP, in the context of environmental monitoring.
Our work fits squarely into recent efforts to better exploit massive amounts of available data for mathematical-programming approaches to decision problems. Even if we have reliable raw data, we can only make good decisions if we have a means to prepare that data so that we can populate our mathematical-programming models in such a way as to meet required assumptions.

We note that another R package of interest is [LZWC19]: "EnviroStat provides functions for spatio-temporal modeling of environmental processes and designing monitoring networks for them based on an approach described in [LZ06]".

In §1, we discuss application of the MESP to environmental monitoring and the NADP/NTN (National Atmospheric Deposition Program / National Trends Network) data environment. In §2, we describe our methodology. In §3, we describe our R package (MESgenCov v 0.1.0). In §4, we make some concluding remarks.
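To make the log-determinant objective concrete, here is a minimal R sketch, with a made-up 3×3 covariance matrix, that solves a tiny MESP instance by brute force (exhaustive enumeration; real instances require the branch-and-bound methods cited above):

```r
# Toy MESP: among all subsets S of size s, maximize log det C[S,S].
# The matrix C below is invented for illustration; exhaustive search
# is only viable for very small n.
C <- matrix(c(2.0, 0.3, 0.1,
              0.3, 1.5, 0.2,
              0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)
n <- nrow(C); s <- 2
subsets <- combn(n, s)                       # each column is one candidate S
obj <- apply(subsets, 2, function(S)
  as.numeric(determinant(C[S, S], logarithm = TRUE)$modulus))
best <- subsets[, which.max(obj)]
best       # indices of the most informative pair
max(obj)   # its log-determinant
```

For this particular C, the most informative pair is the one whose principal submatrix has the largest determinant; the same enumeration scales to any n and s, just exponentially slowly.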
1 Application to environmental monitoring

A key area of application for the MESP has been in environmental monitoring (see [ZSL00,BLZ94,GLSZ93], for example). The idea is that precipitation is collected at many sites, and its chemistry is analyzed. This is costly, and it is a natural question as to whether a subset of the sites might yield data without much loss of information (as measured by entropy). But it is a challenge to process the raw data in such a way that multivariate normality is achieved, because only then are the model of the MESP and its related algorithms applicable.
The NADP maintains the NTN (see [NAD18]); this network measures the chemistry (i.e., ammonium, calcium, chloride, hydrogen, magnesium, nitrate, pH, potassium, sodium, and sulfate) of precipitation at 379 monitoring sites across the US, with some data available as far back as 1978; at present, 255 sites are active.

Our R package is tightly coupled with this precipitation and chemistry data. We are interested in instances of the MESP where n user-specified site/chemical pairs comprise N. Precipitation data (measured in L) are available on a daily basis, and chemical concentrations (measured in mg/L) are available on a weekly basis. These datasets are available in the packages EnvMonDataConc and EnvMonDataPre. They can be installed and loaded using the devtools library. If not already installed, the devtools library can be installed and loaded as follows:

> install.packages("devtools")
> library(devtools)

Then we can install the data packages (from GitHub) and load them:

> install_github("hessakh/EnvMonDataConc")
> install_github("hessakh/EnvMonDataPre")
> library(EnvMonDataConc)
> data("weeklyConc")
> library(EnvMonDataPre)
> data("preDaily")
A full description of the daily and weekly precipitation data appears in Appendix A, derived from http://nadp.slh.wisc.edu/data/ntn/meta/ntn-daily-Meta.pdf and http://nadp.slh.wisc.edu/data/ntn/meta/ntn-weekly-Meta.pdf, courtesy of the NADP.

Small snapshots of the data can easily be viewed. For example, we can output the first 6 rows and first 5 columns of the weekly raw data.

> weeklyConc[1:6,1:5]
  siteID              dateon             dateoff yrmonth    ph
1   AB32 2016-09-13 18:40:00 2016-09-20 15:10:00  201609 -9.00
2   AB32 2016-09-20 15:15:00 2016-09-28 16:00:00  201609 -9.00
3   AB32 2016-09-28 16:00:00 2016-10-05 16:55:00  201610  6.56
4   AB32 2016-10-05 16:55:00 2016-10-11 17:00:00  201610 -9.00

National Atmospheric Deposition Program (NRSP-3). 2019. NADP Program Office, Wisconsin State Laboratory of Hygiene, 465 Henry Mall, Madison, WI 53706.
Note that missing concentration values are coded as -9.00.

2 Methodology

For $t = 0, 1, \ldots, T-1$, let

$W(t)$ := set of weeks in month $t$,
$D(w)$ := set of days in week $w$,
$c_w$ := recorded chemical concentration (mg/L) for week $w$ ($c_w = *$ denotes an unrecorded value),
$p_d$ := recorded precipitation quantity (L) for day $d$,
$p_w$ := precipitation quantity (L) for week $w$; $p_w = \sum_{d \in D(w)} p_d$.

Then the chemical concentration (mg/L) for month $t$ is calculated as
$$
y(t) := \frac{\sum_{w \in W(t):\, c_w \neq *} p_w c_w}{\sum_{w \in W(t):\, c_w \neq *} \sum_{d \in D(w)} p_d}.
$$
It should be noted that when there is no weekly value available for the chemical quantity, we do not use the precipitation values for any of the days in such a week (so as to not artificially dilute the chemical concentration level for the month).

Next, we fit a temporal model to $\log(y(t))$, which is a rather standard method for handling heavy-tailed distributions. We note that our data has no zero concentrations, so there is no issue of "$\log(0)$".

A quick look at some graphics indicates that there are clear long-term trends; see Figure 1 (reprinted with the kind permission of the National Atmospheric Deposition Program (NRSP-3), 2019; NADP Program Office, Wisconsin State Laboratory of Hygiene, 465 Henry Mall, Madison, WI 53706), from which we can see that sulfate concentrations are generally trending downward over time.

[Figure 1 consists of four NADP/NTN maps of sulfate ion concentration (sulfate as SO4, mg/L) for the years 1985, 1995, 2005, and 2015.]

Fig. 1: Sulfate concentration over time

Again, looking at some data, we can easily see periodic trends; see Figure 2, where we can easily see a yearly periodicity.

The general model that we provide is
$$
\log(y(t)) \doteq \sum_{i=0}^{r} \beta_i t^i + \sum_{j=1}^{k} \left[ a_j \cos\left(\frac{2\pi j t}{12}\right) + b_j \sin\left(\frac{2\pi j t}{12}\right) \right], \qquad (1)
$$
with the parameters $\beta_i$, $a_j$, and $b_j$ fit by ordinary linear regression. The user can specify the degree $r$ for the polynomial part of the model, which we think of as a truncated Taylor series, aimed at capturing aperiodic trends. Periodic trends are captured via a truncated Fourier series, truncated at level $k$.

We note that [GLSZ93] used the following model to de-seasonalize and de-trend the log-transformed monthly sulfate concentration values:
$$
\log(y(t)) \doteq \beta_0 + \beta_1 t + a_1 \cos\left(\frac{2\pi t}{12}\right) + b_1 \sin\left(\frac{2\pi t}{12}\right). \qquad (2)
$$
This is just an affine model $\beta_0 + \beta_1 t$ plus a sinusoid with yearly periodicity. The simple model (2) is (1) with $r = 1$ and $k = 1$. We found that (2) did well at normalizing the errors for certain sites,
but for some sites, such as "AL10" and "IN41", (2) did not do so well. So rather than fix (2) as the model for our R package, we provide the flexibility of (1). In Figure 3, we provide an example where (1) with r = 1 and k = 3 is a much better fit for the log concentration of sulfate at site "TN00" than (2). The plot on the left shows the fit of model (2), and the plot on the right shows the fit of our model (1).

Fig. 2: Log sulfate concentration over a four-year period at a site

Fig. 3: Log sulfate concentrations at site "TN00"
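Fitting a model of the form (1) reduces to ordinary linear regression on polynomial and trigonometric terms. The following sketch, with simulated monthly data (the series logy, its coefficients, and the six-year horizon are all made up for illustration, not NADP data), fits (1) with r = 1 and k = 1:

```r
# Fit model (1) with r = 1, k = 1 by ordinary least squares on a
# simulated log-concentration series with a linear trend and a yearly cycle.
set.seed(1)
t <- 0:71                                  # six years of monthly observations
s <- 12                                    # yearly period in monthly data
logy <- 1.1 - 0.002 * t - 0.35 * cos(2 * pi * t / s) + rnorm(length(t), sd = 0.2)
dat <- data.frame(logy = logy, t = t)
fit <- lm(logy ~ I(t) + I(cos(t * (2 * pi / s))) + I(sin(t * (2 * pi / s))),
          data = dat)
coef(fit)        # estimates of beta_0, beta_1, a_1, b_1
head(resid(fit)) # residuals feed the covariance construction
```

Raising k adds harmonic pairs cos(2*pi*j*t/s), sin(2*pi*j*t/s) for j up to k, and raising r adds higher powers of t, exactly as in (1).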
To produce the covariance matrix, we need error values for each time point. Missing values in the NADP/NTN data set mean that each set of error values produced by the model may vary in size. So we have filled in missing values for each site and time point by sampling from a normal distribution, with the mean being the value predicted by the univariate model and the standard deviation being the standard error of the univariate model.
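A simplified stand-in for this imputation step (not the package's internal code; the vectors y and pred and the value of sigma are illustrative):

```r
# Fill missing observations by sampling from a normal distribution centered
# at the univariate model's predicted value, with sd equal to the model's
# standard error (sigma here plays that role).
set.seed(42)
impute_missing <- function(y, pred, sigma) {
  miss <- is.na(y)
  y[miss] <- rnorm(sum(miss), mean = pred[miss], sd = sigma)
  y
}
y    <- c(0.9, NA, 1.3, NA, 1.0)    # observed log-concentrations with gaps
pred <- c(1.0, 1.1, 1.2, 1.1, 1.0)  # fitted values from the univariate model
filled <- impute_missing(y, pred, sigma = 0.25)
filled                              # no NA values remain
```

Observed values are left untouched; only the NA slots are replaced by draws, so every site contributes an error series of the same length.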
3 The R package MESgenCov

Our R package MESgenCov can be obtained from https://github.com/hessakh/MESgenCov and installed using the devtools library.

> install_github("hessakh/MESgenCov")
> library(MESgenCov)
MESgenCov contains functions in the S3 class to create a covariance matrix from the desired subset of NADP/NTN data. The function getCov() returns a covariance matrix, a list of univariate model summaries, and a table of normality tests produced by the MVN R package (see [KGZ14]). The multivariate analysis is used to assess the validity of the covariance matrix to be used as input for the MESP. The user can make adjustments to the input of getCov() so as to obtain a covariance matrix for a multivariate Gaussian vector, which is then valid for use in the MESP.

To avoid sites with a small sample size for the specified time-frame, the function getSites() outputs a vector of the sites with the largest sample of data for a given time-frame and measured chemical (see §3.2). To find sites that are spatially "spread out" but have at least some specified sample size, the function maxDistSites() can be used to obtain a list of geographically sparse sites (see §3.2). Finally, in the case where the residuals from a univariate model do not appear to be normally distributed, the function lambertWtransform() allows the user to transform the residuals (from a univariate model) using the R package LambertW: Probabilistic Models to Analyze and Gaussianize Heavy-Tailed, Skewed Data (see [Goe16]). This can be very effective in situations where the distributions seem to have heavy tails and some skewness (see [Goe11] and §3.3).

3.1 getCov

getCov() takes a 14-column data frame as input, where each column corresponds to one of the user-specifications shown in Figure 4. The 14 specifications in the input allow the user to specify the subset of data to analyze and give the user options in displaying different parts of the analysis.
Argument             Definition
startdateStr         Date and time of when to start analyzing the data, in the format m/d/y H:M
enddateStr           Date and time of when to stop analyzing the data, in the format m/d/y H:M
comp                 String of pollutant or acidity level to be analyzed; the pollutant's name should be used as it appears in weeklyConc
use36                TRUE if the default 36 sites should be added, FALSE otherwise
siteAdd              List of strings of siteIDs that should be analyzed
outlierDatesbySite   Specify siteID string for outlier removal by observation
siteOutliers         List of sites where outliers should be analyzed
removeOutliers       List of sites where outliers should be removed
plotMulti            TRUE if multivariate analysis plots should be displayed, FALSE otherwise
sitePlot             Specify list of siteIDs to be plotted
plotAll              TRUE if plots for all sites should be displayed, FALSE otherwise
writeMat             TRUE if a .mat file of the resulting covariance matrix should be written in the working directory
r                    Integer <= 5; see univariate model
k                    Integer <= 5; see univariate model

Fig. 4: Input parameters for getCov()
A default set of inputs can be found in the stored data frame "defaultInput". Each column of "defaultInput" is an argument in the function getCov(). After storing "defaultInput" in a variable in the user's workspace, the input can be changed. For example, below we store the "defaultInput" data frame in a variable "df", and then change the end date:

> data("defaultInput")
> df <- defaultInput
> df
    startdateStr     enddateStr use36 siteAdd
1 01/01/83 00:00 12/31/86 00:00  TRUE    NULL
  outlierDatesbySite siteOutliers comp plotMulti sitePlot
1               NULL         NULL  SO4     FALSE     NULL
  plotAll writeMat r k
1   FALSE    FALSE 1 1
> df$enddateStr <- "12/31/88 00:00"
The function getCov() produces a list with the following elements:
Output           Definition
cov              Covariance matrix produced by univariate model residuals
listMod          List of univariate model summaries produced by lm()
sites            List of sites that were analyzed
mvn              Output of the MVN package
univariateTest   Univariate test output, also by the MVN package
residualData     Data frame of residuals produced by the univariate model
residualDataNA   Data frame of residuals, where missing values are left as NA
rosnerTest       Output of the Rosner test for outlier analysis produced by the EnvStats package; see [Mil13]
pred             List of predicted values produced by the univariate model for each site

Next, we demonstrate how to access these elements and certain plots after running the function getCov().

1. Multivariate and univariate normality

> df$plotMulti <- TRUE
> df$k <- 3
> g <- getCov(df)
> g$univariateTest
          Test Variable Statistic p value Normality
1 Shapiro-Wilk  AL10SO4    0.9347   0.001        NO
2 Shapiro-Wilk  IL11SO4    0.9894  0.8121       YES
3 Shapiro-Wilk  IL18SO4    0.9909  0.8854       YES
4 Shapiro-Wilk  IL19SO4    0.9184   2e-04        NO
5 Shapiro-Wilk  IL35SO4    0.8709  <0.001        NO
2. Outlier test for specific sites

df$siteOutliers <- list(c("IN41"))
df$sitePlot <- list(c("IN41"))
g <- getCov(df)
i <- match("IN41",g$sites)
g$rosnerTest[[i]]$all.stats
  i  Mean.i   SD.i   Value Obs.Num  R.i+1 lambda.i+1 Outlier
1 0 -0.0069 0.2814 -0.9359      25 3.3013     3.2680    TRUE
2 1  0.0062 0.2604 -0.7194      30 2.7862     3.2628   FALSE
3 2  0.0165 0.2471  0.7215      66 2.8533     3.2576   FALSE
By changing the input, we can remove the outliers detected by the Rosner test. Note that the plots are generated after running getCov(). Furthermore, getCov() does not need to be stored in a variable to generate the plots.
> df$outlierDatesbySite <- c("IN41",25)
> getCov(df)
3. Outlier test for all sites

df$siteOutliers <- list(g$sites)
df$removeOutliers <- list(g$sites)
g <- getCov(df)
4. Plot all sites

> df$plotAll <- TRUE
> getCov(df)
5. Output all MVN package analysis

The following output is from a call to the MVN package that produces multivariate analysis based on the Mardia method and univariate analysis based on the Shapiro-Wilk method.

> df$use36 <- FALSE
> df$siteAdd <- list(c("NY52", "TN11", "IL63"))
> df$siteOutliers <- NULL
> df$outlierDatesbySite <- NULL
> df$removeOutliers <- NULL
> g <- getCov(df)
> g$mvn$multivariateNormality
             Test Statistic p value Result
1 Mardia Skewness    14.019  0.1721    YES
2 Mardia Kurtosis   -0.1465  0.8835    YES
3             MVN
The specific call of the MVN package is

> mvn(dfRes[,-1], subset = NULL, mvnTest = "mardia",
    covariance = TRUE, tol = 1e-25, alpha = 0.5, scale = FALSE,
    desc = TRUE, transform = "none", univariateTest = "SW",
    univariatePlot = "none", multivariatePlot = "none",
    multivariateOutlierMethod = "none", bc = FALSE,
    bcType = "rounded", showOutliers = FALSE, showNewData = FALSE)
See [KGZ14] for details on the MVN package.

6. Covariance matrix

> g <- getCov(df)
> round(g$cov,digits = 4)
        NY52SO4 TN11SO4 IL63SO4
NY52SO4  0.0791  0.0047  0.0009
TN11SO4  0.0047  0.2177  0.0185
IL63SO4  0.0009  0.0185  0.0909
7. Save covariance matrix as a .mat file (to populate an instance of the MESP, for example)

This is done by simply setting the input data frame attribute writeMat to TRUE. The .mat file will be saved to the user's current working directory as covSites.mat. For processing further with Matlab, use the (Matlab) 'load' command.

> df$writeMat <- TRUE
> g <- getCov(df)
In the case that the user has already generated an output by the function getCov(), it is also possible to create the .mat file in the following manner.

> library(rmatio)
> write.mat(g$cov,filename = "covariance1.mat")
8. Univariate model summaries

> result <- getCov(df)
> sites <- result$sites
> i = match(c("NY52"),sites)
> result$listMod[i][[1]]

Call:
lm(formula = y1 ~ I(cos(t*(2*pi/s))) + I(sin(t*(2*pi/s))) +
    I(cos(t*(2*pi/s)*2)) + I(sin(t*(2*pi/s)*2)) + I(cos(t*(2*pi/s)*3)) +
    I(sin(t*(2*pi/s)*3)) + I(t), data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.5236 -0.1677 -0.0189  0.1818  0.9087

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            1.1110     0.0712   14.40  < 2e-16 ***
I(cos(t*(2*pi/s)))    -0.3541     0.0494   -9.75  2.8e-14 ***
I(sin(t*(2*pi/s)))    -0.1109     0.0498   -2.55    0.013 *
I(cos(t*(2*pi/s)*2))   0.0500     0.0494   -0.37    0.716
I(sin(t*(2*pi/s)*2))   0.0797     0.0494    2.13    0.037 *
I(cos(t*(2*pi/s)*3))   0.0391     0.0494   -0.20    0.845
I(sin(t*(2*pi/s)*3))   0.0536     0.0494   -0.52    0.604
I(t)                  -0.0010     0.0017   -0.61    0.546
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2961 on 64 degrees of freedom
Multiple R-squared: 0.6265, Adjusted R-squared: 0.5857
F-statistic: 15.34 on 7 and 64 DF, p-value: 1.327e-11
9. Output data frame of residuals

> g <- getCov(df)
> g$residualData[1:5,]
     NY52SO4     TN11SO4    IL63SO4
1  0.1959813 -0.44377735  0.2824803
2  0.6170340 -0.38791938 -0.3495080
3 -0.4510128  1.05519620 -0.1158200
4 -0.1580145 -0.01789642 -0.1297776
5 -0.3218159 -0.47833685 -0.3415119

3.2 getSites and maxDistSites

getCov() takes site lists as input. The function getSites() produces a list of sites with available data for a specified time frame. The code below produces a list of 36 sites with the most weekly data between the years 1983–1986.

> result <- getSites("01/01/83 00:00","12/31/86 00:00",36,104,"SO4","")
> result$finalList
 [1] "OH71" "NY08" "WV18" "MI53" "NH02" "OH49" "PA42" "ME09"
 [9] "IN34" "MA13" "NY52" "NY10" "WA14" "NY20" "OH17" "ME00"
[17] "TN00" "IL63" "MI99" "WI28" "IN41" "PA29" "WI36" "ME02"
[25] "MI09" "MO05" "NC03" "NJ99" "PA15" "CO19" "MN18" "WI37"
[33] "AR27" "KS31" "ME98" "MO03"
The 4th input specifies the minimum sample of weekly data required for a site to be included in the produced list, and the last input tells the function to only look at sites in the Northern region of the US. The options for region are "W", "S", and "N"; see Appendix B for the geographic split.

> NSites <- getSites("01/01/83 00:00","12/31/86 00:00",36,104,"SO4","N")
> NSites$finalList
 [1] "OH71" "NY08" "MI53" "NH02" "OH49" "PA42" "ME09" "IN34"
 [9] "MA13" "NY52" "NY10" "NY20" "OH17" "ME00" "MI99" "WI28"
[17] "IN41" "PA29" "WI36" "ME02" "MI09" "NJ99" "PA15" "MN18"
[25] "WI37" "ME98" "IL11" "IL18" "MN16" "MI26" "NE15" "VT01"
[33] "NY99" "MA01" "MA08" "MN27"
The function maxDistSites() prioritizes sites that are farther away from each other. This function takes the same arguments as input as getSites(), except for the last argument, where instead of specifying a region, the user can specify which site should be included first. If the last argument is 1, then the site with the greatest amount of data for the specified time period will be chosen; if the last argument is 2, then the site with the second greatest amount of data will be chosen; etc.

> maxdist <- maxDistSites("01/01/83 00:00","12/31/86 00:00",36,104,"SO4",1)
> maxdist$finalList
 [1] "OH71" "WA14" "TX04" "FL11" "ME00" "WY06" "MN27" "LA12"
 [9] "CA45" "OK00" "NY99" "GA41" "MI99" "AZ03" "MT05" "NC35"
[17] "MO05" "CO00" "WY99" "IN34" "KY03" "MI09" "FL03" "MA01"
[25] "OR10" "PA42" "AR27" "MN16" "TX21" "VT99" "NE15" "VA13"
[33] "CO15" "CO22" "NY52" "AR02"

3.3 lambertWtransform

We provide a function lambertWtransform() that will allow a user to transform the residuals produced by the deterministic univariate model. The LambertW package estimates the parameters that fit a Lambert W distribution on the given univariate data. Then the underlying Gaussian distribution implied by the Lambert W distribution is extracted and is used for the multivariate analysis in the function lambertWtransform(). The lambertWtransform() function takes the following as input: a data frame of residuals, and two logical inputs specifying whether to plot the multivariate qq plot and whether to produce the .mat file containing the covariance matrix with the Lambert W transformed residuals. Details on the algorithms that perform the transformation can be found in [Goe11]. Here we provide
an example where we transform the residuals of 50 sites stored in an internal dataset, named "dfRes50".

> data("dfRes50")
> loutput <- lambertWtransform(dfRes=dfRes50, plotMulti=FALSE, writeMat=FALSE)
> loutput$mvn$multivariateNormality
             Test Statistic p value Result
1 Mardia Skewness     22800  0.0004     NO
2 Mardia Kurtosis     0.418  0.6763    YES
3             MVN
This function produces a list of four outputs:

1. loutput$mvn contains the results of applying the multivariate analysis by the MVN package
2. loutput$cov contains the covariance matrix produced by the transformed residuals
3. loutput$newResiduals contains the data frame of Lambert W transformed residuals
4. loutput$univariateTest contains the univariate tests produced by the MVN function for the transformed residuals

> data("dfRes50")
> loutput <- lambertWtransform(dfRes=dfRes50, plotMulti=FALSE, writeMat=FALSE)
> loutput$mvn
> loutput$cov
> loutput$newResiduals
> loutput$univariateTest
Next, we present an example where we use maxDistSites() to get a list of 50 sites that is geographically sparse and has at least 200 weeks of data between 1986 and 1994. From this list of sites, a covariance matrix and its corresponding multivariate normality test are generated and compared to the Lambert W transformed output.

> maxd <- maxDistSites("01/01/86 00:00","12/31/94 00:00",50,200,"SO4",1)
> df <- defaultInput
> df$siteAdd <- list(maxd$finalList)
> df$startdateStr <- maxd$startDate
> df$use36 <- FALSE
> df$comp <- maxd$comp
> df$enddateStr <- maxd$endDate
> df$writeMat <- TRUE
> output <- getCov(df)
> output$mvn$multivariateNormality
             Test Statistic p value Result
1 Mardia Skewness     22962 2.6e-05     NO
2 Mardia Kurtosis    0.2408  0.8097    YES
3             MVN
4 Concluding remarks

We are currently working on enhancements to MESgenCov. Ultimately, we would like to make it easy to use data sets from other application domains, and to make it easier for a user to use other univariate models than the one we provide. Finally, we hope to eventually have a seamless integration with algorithms for the MESP.
Acknowledgments
The authors are very grateful to Dr. Martin Shafer and Robert Larson for helping us gain access to the NADP/NTN data in a convenient form.
Financial and Ethical disclosures
J. Lee was funded by the Air Force Office of Scientific Research (Complex Networks program), FA9550-19-1-0175. H. Al-Thani was funded by the Qatar National Research Fund (Graduate Sponsorship Research Award), GSRA4-2-0526-17114. The authors declare that they have no conflict of interest.
References
AFLW99. Kurt M. Anstreicher, Marcia Fampa, Jon Lee, and Joy Williams. Using continuous nonlinear relaxations to solve constrained maximum-entropy sampling problems. Mathematical Programming, 85(2, Ser. A):221–240, 1999.
AL04. Kurt M. Anstreicher and Jon Lee. A masked spectral bound for maximum-entropy sampling. In mODa 7—Advances in model-oriented design and analysis, Contrib. Statist., pages 1–12. Physica, Heidelberg, 2004.
Ans18a. Kurt M. Anstreicher. Efficient solution of maximum-entropy sampling problems. Preprint, July 2018.
Ans18b. Kurt M. Anstreicher. Maximum-entropy sampling and the boolean quadric polytope. Journal of Global Optimization, 72(4):603–618, 2018.
BL07. Samuel Burer and Jon Lee. Solving maximum-entropy sampling problems using factored masks. Mathematical Programming, 109(2-3, Ser. B):263–281, 2007.
BLZ94. Philip J. Brown, Nhu D. Le, and James V. Zidek. Multivariate spatial interpolation and exposure to air pollutants. Canad. J. Statist., 22(4):489–509, 1994.
CFLL20. Zhongzhu Chen, Marcia Fampa, Amélie Lambert, and Jon Lee. Mixing convex-optimization bounds for maximum-entropy sampling, 2020. http://arxiv.org/abs/2001.11896.
FL00. Valerii Fedorov and Jon Lee. Design of experiments in statistics. In Handbook of semidefinite programming, volume 27 of Internat. Ser. Oper. Res. Management Sci., pages 511–532. Kluwer Acad. Publ., Boston, MA, 2000.
GLSZ93. Peter Guttorp, Nhu D. Le, Paul D. Sampson, and James V. Zidek. Using entropy in the redesign of an environmental monitoring network. In Multivariate environmental statistics, volume 6 of North-Holland Ser. Statist. Probab., pages 175–202. North-Holland, Amsterdam, 1993.
Goe11. Georg M. Goerg. Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation. Annals of Applied Statistics, 5(3):2197–2230, 2011.
Goe16. Georg M. Goerg. LambertW: Probabilistic Models to Analyze and Gaussianize Heavy-Tailed, Skewed Data. Version 0.6.4, 2016. https://cran.r-project.org/web/packages/LambertW/.
HLW01. Alan J. Hoffman, Jon Lee, and Joy Williams. New upper bounds for maximum-entropy sampling. In mODa 6—Advances in model-oriented design and analysis (Puchberg/Schneeberg, 2001), Contrib. Statist., pages 143–153. Physica, Heidelberg, 2001.
KGZ14. Selcuk Korkmaz, Dincer Goksuluk, and Gokmen Zararsiz. MVN: An R package for assessing multivariate normality. The R Journal, 6(2):151–162, 2014.
KLQ95. Chun-Wa Ko, Jon Lee, and Maurice Queyranne. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691, 1995.
Lee98. Jon Lee. Constrained maximum-entropy sampling. Operations Research, 46(5):655–664, 1998.
Lee12. Jon Lee. Maximum entropy sampling. In Encyclopedia of Environmetrics, A.H. El-Shaarawi and W.W. Piegorsch, eds., 2nd edition, pages 1570–1574. Wiley, 2012.
LW03. Jon Lee and Joy Williams. A linear integer programming bound for maximum-entropy sampling. Mathematical Programming, 94(2-3, Ser. B):247–256, 2003.
LZ06. Nhu D. Le and James V. Zidek. Statistical Analysis of Environmental Space-Time Processes. Springer Series in Statistics. Springer, New York, 2006.
LZWC19. Nhu Le, Jim Zidek, Rick White, and Davor Cubranic. EnviroStat: Statistical analysis of environmental space-time processes. Version 0.4-2, 2019. https://rdrr.io/cran/EnviroStat/.
Mil13. Steven P. Millard. EnvStats: An R Package for Environmental Statistics. Springer, New York, 2013.
NAD18. NADP. National Atmospheric Deposition Program, National Trends Network, 2018. https://nadp.slh.wisc.edu/ntn/.
RC12. A.C. Rencher and W.F. Christensen. Methods of Multivariate Analysis. Wiley Series in Probability and Statistics. Wiley, 2012.
SW87. Michael C. Shewry and Henry P. Wynn. Maximum entropy sampling. Journal of Applied Statistics, 14(2):165–170, 1987.
SW00. Paola Sebastiani and Henry P. Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145–157, 2000.
ZSL00. James V. Zidek, Weimin Sun, and Nhu D. Le. Designing and integrating composite networks for monitoring multivariate Gaussian pollution fields. Journal of the Royal Statistical Society. Series C. Applied Statistics, 49(1):63–79, 2000.
Appendix A

NADP/NTN Daily Data

Column  Field      Data type  Description
1       SiteID     Char(4)    Site identifier
2       StartTime  Char(16)   Period start, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
3       EndTime    Char(16)   Period end, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
4       Amount     Integer    Precipitation depth, inches. Missing = -9, trace precipitation amount = -7
NADP/NTN Weekly Data

Column  Field    Data type  Description
1       SiteID   Char(4)    Site identifier
2       DateOn   Char(16)   Date on which the sample bucket was installed on the collector, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
3       DateOff  Char(16)   Date on which the sample bucket was removed from the collector, reported in Greenwich Mean Time (GMT), YYYY-MM-DD hh:mm format
4       yrMonth  Integer    Year and month of sample midpoint, in YYYYMM format
5       ph       Decimal    Negative log of the hydrogen ion concentration as measured at the CAL, in pH units
6       Ca       Decimal    Ca concentration, mg/L
7       Mg       Decimal    Mg concentration, mg/L
8       K        Decimal    K concentration, mg/L
9       Na       Decimal    Na concentration, mg/L
10      NH4      Decimal    NH4 concentration, mg/L
11      NO3      Decimal    NO3 concentration, mg/L
12      Cl       Decimal    Cl concentration, mg/L
13      SO4      Decimal    SO4 concentration, mg/L
14      Br       Decimal    Br concentration, mg/L
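When working with these raw fields directly, the sentinel codes above need to be handled before any averaging; a small sketch (the function names are ours, not part of MESgenCov, and treating trace precipitation as zero depth is an assumption):

```r
# Recode NADP/NTN sentinel values: -9 marks a missing value in both data sets,
# and -7 marks a trace precipitation amount in the daily data.
clean_weekly_value <- function(x) replace(x, x == -9, NA)  # missing -> NA
clean_daily_amount <- function(x) {
  x[x == -9] <- NA   # missing precipitation depth
  x[x == -7] <- 0    # treat trace precipitation as zero depth (an assumption)
  x
}
clean_weekly_value(c(6.56, -9.00, 0.31))  # second entry becomes NA
clean_daily_amount(c(0.12, -7, -9))       # trace -> 0, missing -> NA
```

Without this step, the sentinel -9.00 values (visible in the weeklyConc snapshot in §1) would be averaged as if they were genuine concentrations.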