[PDF] A joint bayesian space-time model to integrate spatially misaligned air pollution data in R-INLA

Abstract

In air pollution studies, dispersion models provide estimates of concentration at grid level covering the entire spatial domain, and are then calibrated against measurements from monitoring stations. However, these different data sources are misaligned in space and time. If misalignment is not considered, it can bias the predictions. We aim at demonstrating how the combination of multiple data sources, such as dispersion model outputs, ground observations and covariates, leads to more accurate predictions of air pollution at grid level. We consider nitrogen dioxide (NO2) concentration in Greater London and surroundings for the years 2007-2011, and combine two different dispersion models. Different sets of spatial and temporal effects are included in order to obtain the best predictive capability. Our proposed model is framed in between calibration and Bayesian melding techniques for data fusion red. Unlike other examples, we jointly model the response (concentration level at monitoring stations) and the dispersion model outputs on different scales, accounting for the different sources of uncertainty. Our spatio-temporal model allows us to reconstruct the latent fields of each model component, and to predict daily pollution concentrations. We compare the predictive capability of our proposed model with other established methods to account for misalignment (e.g. bilinear interpolation), showing that in our case study the joint model is a better alternative.

Full PDF

AA J

OINT B AYESIAN S PACE -T IME M ODEL TO I NTEGRATE S PATIALLY M ISALIGNED A IR P OLLUTION D ATA IN

R-INLA

A P

REPRINT

C. Forlani ∗ MRC Centre for Environment and HealthDepartment of Epidemiology and BiostatisticsSchool of Public HealthImperial College LondonLondon, UK

S. Bhatt

Department of Infectious Disease EpidemiologySchool of Public HealthImperial College LondonLondon, UK

M. Cameletti

Department of ManagementEconomics and Quantitative MethodsUniversità degli Studi di BergamoBergamo, Italy

E. Krainski

Department of StatisticsUniversidade Federal do ParanáParaná, Brazil

M. Blangiardo

MRC Centre for Environment and HealthDepartment of Epidemiology and BiostatisticsSchool of Public HealthImperial College LondonLondon, UKJune 17, 2020 A BSTRACT

In air pollution studies, dispersion models provide estimates of concentration at grid level coveringthe entire spatial domain, and are then calibrated against measurements from monitoring stations.However, these different data sources are misaligned in space and time. If misalignment is notconsidered, it can bias the predictions. We aim at demonstrating how the combination of multipledata sources, such as dispersion model outputs, ground observations and covariates, leads to moreaccurate predictions of air pollution at grid level. We consider nitrogen dioxide (NO ) concentrationin Greater London and surroundings for the years 2007-2011, and combine two different dispersionmodels. Different sets of spatial and temporal effects are included in order to obtain the best predictivecapability. Our proposed model is framed in between calibration and Bayesian melding techniquesfor data fusion red. Unlike other examples, we jointly model the response (concentration level atmonitoring stations) and the dispersion model outputs on different scales, accounting for the differentsources of uncertainty. Our spatio-temporal model allows us to reconstruct the latent ﬁelds of eachmodel component, and to predict daily pollution concentrations. We compare the predictive capabilityof our proposed model with other established methods to account for misalignment (e.g. bilinearinterpolation), showing that in our case study the joint model is a better alternative. K eywords data integration · coregionalization model · geostatistical model · NO · SPDE ∗ Correspondence to: [email protected] a r X i v : . [ s t a t . A P ] J un PREPRINT - J

UNE

17, 2020

Air pollution is a major concern for policy makers worldwide (European Commission, 2018; EPA, 2016; WHO, 2006),and there is extensive evidence of its negative effects, in particular on respiratory and cardiovascular diseases (Dominici et al. , 2010; COMEAP, 2015; Atkinson et al. , 2015; Lipfert, 2017). Obtaining an accurate estimate of air pollutionconcentration is key for evaluating compliance with regulatory standards set by national and international environmentalagencies and to reduce exposure misclassiﬁcation in epidemiological studies (Berrocal et al. , 2012; Shaddick et al. ,2017; Keller and Peng, 2019).Air pollution data come from different sources, each presenting some limitations: ground measurements from monitoringnetwork stations, usually affected by sparse spatial resolution; estimates from Land Use Regression models (LUR),which rely on the availability of accurate and dense monitor observations; satellite remote sensing data, sometimespoorly correlated with ground pollution level; simulations from deterministic models (e.g. chemical transport modelsor dispersion models), that can present prediction quality concerns despite the complete spatial coverage and hightemporal resolution (Shaddick and Wakeﬁeld, 2002; Lee et al. , 2011; Gelfand et al. , 2012; Shaddick et al. , 2015; Hoek et al. , 2008; Johnson et al. , 2010; Chang, 2016; Shaddick et al. , 2017).Several ‘hybrid’ appproaches have been proposed to combine these data sources to draw from their strengths and toovercome their limitations, but not all of them address the discrepancy in the spatial resolution of the different datasources, which is known as misalignment or change of support problem (COSP). In the context of COSP, we refer to upscaling methods when the target resolution is lower than the data resolution (e.g.point-to-area), and to downscaling when the target resolution is higher (e.g. area- or grid-to-point).Model-based solutions for data assimilation (also referred to as data fusion or data blending ) in environmentalapplications allow us to account for all sources of uncertainty while addressing COSP. These are usually set within ahierarchical Bayesian framework. Popular approaches include Bayesian melding and calibration techniques (Chang,2016).Bayesian melding assumes that both measurements and modelled data are error-prone realizations of an underlyinglatent true pollution ﬁeld, and they both inform the posterior distribution of the latent process. Among the proposedmelding strategies applied to misaligned air pollution data we ﬁnd, for instance, the downscaling spatial Bayesianmelding model by Raftery and Fuentes (2005), the upscaling spatial Bayesian melding model by Wikle and Berliner(2005), and the upscaling spatio-temporal fusion model by McMillan et al. (2010).Calibration techniques assume that the model-based estimates (e.g. from dispersion models) are used in a regressionframework as predictors against the monitoring site measurements. In this way the computational cost is reducedcompared to melding, as the models only need to be ﬁtted at the monitoring sites locations (Berrocal et al. , 2012;Chang, 2016). Some examples are the block-averaging upscaling calibration fusion model by Sahu et al. (2010),and the spatio-temporal downscaling calibration models by Berrocal et al. (2010, 2012), which can be considered ageneralization of a Bayesian universal kriging model (Berrocal, 2019). In this paper, we are framed in the context of data integration to improve air pollution predictions at a ﬁne grid. Wecombine monitoring measurements and numerical model outputs coming from two dispersion models, the PollutionClimate Mapping (PCM) from DEFRA (DEFRA, 2018; Ricardo Energy & Environment, 2017) model and the AirQuality Uniﬁed Model (AQUM) from the Met Ofﬁce (Savage et al. , 2013; Met Ofﬁce, 2018), and account for theirassociated errors.These deterministic models have previously been used for similar purposes: Lee and Sarran (2015) provide an exampleof point-to-area upscaling from the PCM model grid to local authority areas for epidemiological applications andMukhopadhyay and Sahu (2017) combine the AQUM and the monitoring observations to accurately predict NO concentration in UK.However, usually in the literature only one extra data source at a time is considered (Wikle and Berliner, 2005; Rafteryand Fuentes, 2005; McMillan et al. , 2010; Sahu et al. , 2010; Berrocal et al. , 2010, 2012; Zidek et al. , 2012; Huang et al. , 2015; Lee and Sarran, 2015; Pannullo et al. , 2016; Mukhopadhyay and Sahu, 2017; Lee et al. , 2017; Huang et al. , 2017; Moraga et al. , 2017). We show that, when more are available, these can all be put together to get betterpredictions while accounting for the bias which affects deterministic data.2 PREPRINT - J

UNE

17, 2020Our approach is similar to the coregionalization model proposed by Schmidt and Gelfand (2003) to model CO, NO,and NO which allows us to calibrate the deterministic models against the monitor observations through a coefﬁcientsimilarly to calibration techniques. However, here we treat the three sources of information on NO as coming from thesame true underlying spatio-temporal process (i.e. the true air pollution concentration ﬁeld) as in Bayesian melding.The pure application of this kind of models is computationally prohibitive for the high resolution output data we have athand. This issue is solved by representing the spatially continuous ﬁelds as solutions to a Stochastic Partial DifferentialEquation (SPDE) to handle this in a computationally efﬁcient way (Lindgren et al. , 2011; Krainski et al. , 2018).Additionally, our model reconstructs the continuous latent spatial and temporal ﬁelds allowing us to account for all thesources of uncertainty: ﬁrst, the one associated with the estimates from the numerical models, which is not providedas they are deterministic models, and second, the measurement error associated with ground observations. This ismost useful in the perspective of using the predictions from the air pollution model as a measure of exposure in anepidemiological model, where the uncertainty could be fed forward (see for example Lee et al. et al. R-INLA package (Rue, 2018).Other authors have implemented solutions for spatially misaligned air pollution data in

R-INLA , however their ap-proaches differ from ours under several points of view. In particular, Moraga et al. (2017) show an example ofarea-to-point misalignment addressed via block averaging, in a spatial-only context, without accounting for the uncer-tainty associated with the raster data. Cameletti et al. (2019) implement a spatial upscaler from point to area comparingtwo different averaging methods. Kiﬂe et al. (2017) compare additive and coupled spatio-temporal processes formultivariate data in a biological context (prevalence of vectors for arboviruses), where the data are not misaligned, anddo not include any explanatory covariate.To the best of our knowledge, this is the ﬁrst time a spatio-temporal model for spatially misaligned point-referenced datais implemented through the INLA-SPDE approach considering more than one deterministic model output at differentspatial and temporal resolutions.We compare and contrast several models and through a cross-validation method we evaluate which one produces themost accurate predictions of NO concentration in Greater London and surroundings for the period 2007 to 2011. Wecompare our method with two approaches in which the alignment is done through bilinear interpolation or kriging, hencenot accounting for the measurement error associated with the misaligned covariates. The ﬁrst is a simple hierarchicalmodel that includes linear effects for the covariates and structured spatio-temporal residuals. The second is the recentlyproposed data integration model from Mukhopadhyay and Sahu (2017), which allows for non-stationarity in the residualspatial process.The remainder of the article is organised as follows: section 2 presents the study area and data; section 3 describes themethods used in the analysis, starting with the model speciﬁcation followed by a description of the competitor models;section 4 reports the results of the application of such methods to our air pollution data; ﬁnally, section 5 contains theconclusions and a short discussion, pointing towards further developments. The study focuses on NO , as it is one of the pollutants regulated by national and international directives, and it istrafﬁc driven, hence characterised by high spatio-temporal variability.We used daily averages of hourly observations from different monitoring networks including the AEA and theAuthomatic Urban and Rural Network (AURN) from the DEFRA’s UK Air Quality Archive, and the London AirQuality Network (LAQN) in Greater London and surroundings, managed by the King’s College London EnvironmentalResearch Group (ERG). The combined database was built as part of the Spatio/Temporal Exposure Assessment Methods(STEAM) project (King’s College London ERG, 2016).We also considered the outputs of two deterministic models: (i) annual 1km × × time series: (i) the daily average is computed only for the days where at least 18 hourly observations, i.e. the 75%, are3 PREPRINT - J

UNE

17, 2020Figure 1: NO data locations on the study domain, on the shape of England as reference.present; (ii) we eliminated non-positive daily averages which do not allow for the logarithmic transformation requiredin the analysis (negative observations are due to measurement error); (iii) monitors where the resulting daily NO isavailable for less than 1370 days, not necessarily consecutive, have been excluded (this is because an active monitordoes not necessarily record NO measurements).The monitors are split into 6 groups maximizing similarity criteria between groups, and a 6-fold cross validation isperformed.For each monitor we have information about the site type classiﬁcation, that we aggregated into 3 categories: rural,urban, road-kerb side.We deﬁne our study area as that including all the selected monitors, containing 495 grid cells for AQUM and 44,117grid cells for PCM.The locations of the air pollution data sources described above are displayed in Figure 1.4 PREPRINT - J

UNE

17, 2020Table 1: Descriptive statistics of NO concentration ( µg/m ) for the three datasources Data Min 1st Qu. Median Mean 3rd Qu. MaxPCM 2007 7.276 9.948 11.289 13.039 14.055 66.036PCM 2008 5.615 9.107 10.439 12.064 13.198 56.123PCM 2009 7.097 10.105 11.333 12.844 13.915 53.290PCM 2010 5.613 10.396 11.843 13.404 14.705 62.190PCM 2011 5.437 9.905 11.317 12.545 13.575 55.400AQUM January 0.215 11.354 19.926 21.943 30.150 73.654AQUM February 0.154 13.912 22.491 25.052 33.535 97.435AQUM March 0.291 10.165 16.760 19.180 25.969 84.589AQUM April 0.059 9.982 15.339 17.248 22.457 81.498AQUM May 0.008 6.847 10.340 12.187 15.622 77.623AQUM June 0.000 6.493 9.568 11.088 13.974 63.689AQUM July 0.409 5.855 8.388 9.842 12.244 67.893AQUM August 0.000 6.392 9.210 10.591 13.176 58.456AQUM September 0.054 7.545 11.795 13.899 18.095 68.396AQUM October 0.067 10.521 16.622 18.402 24.576 70.455AQUM November 0.531 11.714 18.694 20.942 28.109 97.137AQUM December 0.000 12.890 23.150 25.040 34.480 121.320Monitors RUR 0.484 5.357 8.995 11.405 14.734 114.375Monitors URB 0.679 11.938 19.091 22.286 29.049 166.792Monitors RKS 0.245 20.208 30.524 34.903 44.422 231.292

The AQUM model includes chemistry, physical and aerosol models, meteorological conﬁguration based on the MetOfﬁce’s North Atlantic and European Model (NAE) and emission data (Savage et al. , 2013); the PCM model inputincludes emission inventory, energy projections, road trafﬁc counts, road transport activity and meteorological hourlydata from Waddington weather station (Ricardo Energy & Environment, 2017).Although data were available, we decided against the inclusion of meteorological variables in our models, as they arealready an input for both the numerical models considered in the analysis.Consistently with the selection criteria, all the 6 training and validation sets have similar distribution of daily NO concentration by site type (see Appendix A, Fig. A.1). As expected, in all sets the road-kerb side monitors have highermean and maximum levels of NO .In particular, 17 road-kerb side sites overcome the limits set by the WHO and the European Commission for the annualaverage of 40 µg/m , for at least 4 of the 5 years under study (see Appendix A, Fig. A.2). Of these, the monitor inLambeth-Brixton Road (LB4) is also well above the threshold of 18 µg/m not to be exceeded more than 18 timesannually, for every year, even though a decreasing trend can be observed (from 865 hourly exceedances in 2007, to 62in 2011), and other 6 monitors exceeded this threshold between 2007 and 2008.Table 1 reports summary statistics for PCM data by year, AQUM data by month and monitor observations by site type. In this section we ﬁrst present some analysis on the AQUM and PCM data, then we introduce the joint model, andﬁnally the models that we use for comparison. Note that we will represent vector/matrices in bold typeface.

In order to quantify the relevance of the temporal component for PCM ( i = 1 ) and the spatial component for AQUM( i = 2 ), we ran three models for each data source separately: (i) one with spatial-only or temporal-only effectrespectively, (ii) one with additive spatial and temporal effects and (iii) one with a spatio-temporal interaction.Let’s deﬁne y i as the vector of air pollution concentration on the logarithmic scale across space and time for the i -thnumerical model. This is assumed to be normally distributed with mean η i and variance σ (cid:15) i : y i ∼ M V N ( η i , σ (cid:15) i I ) PREPRINT - J

UNE

17, 2020Each element of the linear predictor η i (for a time point t and location s identiﬁed by UTM coordinates) for models (i),(ii), (iii) is speciﬁed as follows: η ( s ) = α + z ( s ) η ( t ) = α + z ( t ) (i) η ( s , t ) = α + z ( s ) + z ( t ) η ( s , t ) = α + z ( s ) + z ( t ) (ii) η ( s , t ) = α + z ( s , t ) η ( s , t ) = α + z ( s , t ) (iii)where z i ( s ) is the realisation at location s of the spatial process z i with Matérn covariance function z i ∼ M V N ( , σ z i Σ ) , z i ( t ) ∼ N ( z i ( t − , σ z i ) is a temporal process modelled as a random walk and z i ∼ M V N ( , σ z i Σ t ⊗ Σ s ) is a separable space-time interaction with Matérn covariance function and temporal de-pendence modelled as a random walk.Based on the deviance information criterion (DIC), the results show that AQUM spatial variation is relevant but there isno need for a space-time interaction so model (ii) is selected for AQUM, while the PCM temporal variation is negligibleso model (i) is selected for PCM (see Appendix B for details). Hence, we will use this speciﬁcation in the joint modelpresented in the next section. Following Kiﬂe et al. (2017) we implement an additive space-time model for data observed at different points in space,which share a spatial and a temporal component.Previous similar applications consider measurements of more than one variable at the same locations, but this is not arequirement in the INLA-SPDE approach.Our model is joint in the sense that we specify one likelihood for the response and one for each of the misalignedcovariates, and they contain common components which are estimated using all the data. Even though in

R-INLA theproblem is computationally treated similarly to a multivariate situation, this is not our case as we ultimately considersolely the monitor observations as response variable.We make the assumption that the same temporal dynamics govern AQUM and monitor observations, and likewise thesame spatial dynamics govern PCM, AQUM and monitor observations.Our hierarchical model has three levels: in the ﬁrst we deﬁne the likelihoods, in the second the random effect components,while the third level includes the prior distributions for the model parameters and hyperparameters.The joint model presented below is implemented via INLA, a computationally efﬁcient alternative to Markov chainMonte Carlo (MCMC) methods that works speciﬁcally on hierarchical Gaussian Markov Random Fields (GMRF).Details on how this is done in

R-INLA can be found in Appendix E.

Let y i ( s , t ) denote the PCM ( i = 1 ) and AQUM ( i = 2 ) data and the observed NO concentration ( i = 3 ) at the generictime point t and site s , on the logarithmic scale. These are assumed to be normally distributed, with mean η i ( s , t ) andmeasurement error variance σ (cid:15) i : y ( s , t ) ∼ N ( η ( s ) , σ (cid:15) ) (PCM) y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) (AQUM) y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) (Ground observations)Based on the results from section 3.1 we model the PCM data with an intercept and a spatial component and theAQUM data with an intercept and additive spatial and temporal components. These are shared between the three linearpredictors, which are the following: η ( s ) = α + z ( s ) (1) η ( s , t ) = α + λ , z ( s ) + z ( t ) (2)6 PREPRINT - J

UNE

17, 2020 η ( s , t ) = α + β k s + λ , z ( s ) + λ , z ( t ) + z ( t, k s ) (3)where α i are the intercepts, λ i,j are the scaling parameters for the shared components from η i to η j , β k s is the ﬁxedeffect for the site type as categorical variable ( k s = 0 : rural (reference), k s = 1 : urban and k s = 2 : road-kerb side),and z and z are the shared random effects. The linear predictor for the ground observations η also contains aninteraction term z which allows for a different residual temporal trend for each site type.Note that even though PCM is assumed to be governed only by a spatial effect, its output does vary in both space andtime, so the deterministic model output y has space and time indices (here the locations are the centroids of the 44117PCM grid cell, and the time points are the years), while its latent ﬁeld z has only a spatial index.For AQUM, the space and time indices of y correspond to the centroids of the 495 AQUM grid cell and the 1826 daysrespectively.Finally, y is measured at the 126 monitors on 1826 days. In Equation (3), z ∼ M V N ( , σ z Σ ) is the common spatial latent ﬁeld, with Σ being the correlation matrix deﬁnedby the Matérn stationary and isotropic covariance function (see Appendix D). It is important to note that z is thenrescaled for AQUM and monitor observations through λ , (eq. 2) and λ , (eq. 3).In the same equation, z ( t ) is the t -th element of the temporal latent ﬁeld z , and is modelled as a random walk: z ( t ) ∼ N ( z ( t − , σ z ) . Similarly to z , z is rescaled for the monitor observations through λ , (eq. 3).Finally, z is the residual temporal trend assumed to be different for each site type (rural, urban, road-kerb side), andmodelled as ﬁrst order autoregressive z ( t, k s ) ∼ N ( ρz ( t − , k s ) , σ z ) . In other words, we assume conditionallyindependent replications of the same latent ﬁeld for each site type, with shared hyperparameters (Martins et al. , 2013). The priors on the model parameters are speciﬁed as follows.According to Fuglstad et al. (2019), we choose a penalised complexity prior (Simpson et al. , 2017) for range andvariance of the latent spatial ﬁeld z such that P ( r < r ) = 0 . and P ( σ z > σ ) = 0 . , where r = 1 / of thedomain size and σ = 100 (see Appendix D).For the standard deviation of the random walk we assume a penalised complexity prior such that the probability that σ z is greater than the empirical standard deviation of the AQUM data is 1%, i.e. P ( σ z > SD ( AQU M )) = 0 . .For the time-sitetype interaction we assume the default vague prior deﬁned on the log-precision: log (1 /σ z ) ∼ logGamma (1 , e − ; for the autoregressive parameter we assume ρ ∼ N (0 . , . using information from previousmodelling exercise.The precisions of response variable, AQUM data and PCM data are assigned the default vague prior log (1 /σ (cid:15) i ) ∼ logGamma (1 , e − , i = 1 , , .On the scaling coefﬁcients we put a Normal prior centred on a positive values around 1 with a large variance to ensureminimal information: λ , ∼ N (1 . , , λ , ∼ N (1 . , , λ , ∼ N (0 . , .Finally on the coefﬁcients of the ﬁxed effects α i and β k s we assume the default weak Normal prior distribution N (0 , . We compared our model to three different competitors: (i) a joint model that includes only one misaligned covariate(either AQUM or PCM), (ii) a simple hierarchical model that includes a covariate aligned at the monitoring sites throughbilinear interpolation (Akima, 1978) or kriging, and (iii) a complex hierarchical model which allows for non-stationarityafter interpolating the misaligned covariates via bilinear interpolation (Mukhopadhyay and Sahu, 2017).The aim of this comparison is to evaluate if the inclusion of more than one extra data source actually improves themodel predictive capability and can counterbalance the need for complex random effect structures, and if there is a gainin moving from a simple interpolation to a modelling framework.We describe the three comparators in the rest of this section.7

PREPRINT - J

UNE

17, 2020

The joint model that includes PCM only is speciﬁed as follows: y ( s , t ) ∼ N ( η ( s ) , σ (cid:15) ) (PCM)and y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) (Ground observations)with η ( s ) = α + z ( s ) (PCM) η ( s , t ) = α + β k s + λ , z ( s ) + z ( t, k s ) (Ground observations)Similarly, the joint model that includes AQUM only is deﬁned as: y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) (AQUM)and y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) (Ground observations)with η ( s , t ) = α + z ( s ) + z ( t ) (AQUM) η ( s , t ) = α + β k s + λ , z ( s ) + λ , z ( t ) + z ( t, k s ) (Ground observations)For these models we considered either ﬁxed to 1 or varying calibration coefﬁcients λ i,j , and different priors. The ﬁnalchoice of priors is the one reported in Section 3.2.3. We implement two models that use interpolation techniques to obtain values of AQUM and PCM at the monitoringstations. The ﬁrst is a naive bilinear interpolation, the second can be considered as Bayesian kriging, as we predictAQUM and PCM at the monitoring stations from the models described in section 3.1.In both cases, after aligning the AQUM ( X ) and PCM ( X ) values, we consider a linear effect on the covariates,a spatially structured residual z , a temporally structured residual z and the site-type-speciﬁc temporal effect z speciﬁed as in Section 3.2. We also keep the ﬁxed effects for the site type β k s as in the joint model.We specify a normal likelihood y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) and the linear predictor as follows: η ( s , t ) = β + β X ( s , t ) + β X ( s , t ) + β k s + z ( s ) + z ( t ) + z ( t, k s ) Mukhopadhyay and Sahu (2017) developed a site-type-speciﬁc regression on the AQUM data using our same classiﬁca-tion for the site type. The key feature of their model is the speciﬁcation of a non-stationary spatio-temporal process,which leads to a better predictive performance compared to the stationary Gaussian process (GP) in their application.To obtain a like-for-like comparison, we also include a site-type-speciﬁc regression on the PCM data and implementboth the stationary and the non-stationary versions of this model.Both AQUM ( X ) and PCM ( X ) are interpolated at the monitoring site locations through bilinear interpolation.The hierarchical model speciﬁcation in this case is: y ( s , t ) ∼ N ( η ( s , t ) , σ (cid:15) ) η ( s , t ) = µ ( s , t ) + ν ( s , t ) with µ ( s , t ) = γ + γ X ( s , t ) + (cid:80) k =1 δ k ( s )( γ k + γ k X ( s , t )) for the model with AQUM only, and µ ( s , t ) = γ + γ X ( s , t ) + γ X ( s , t ) + (cid:80) k =1 δ k ( s )( γ k + γ k X ( s , t ) + γ k X ( s , t )) for the model with AQUM and PCM. 8 PREPRINT - J

UNE

17, 2020Here k = 0 indicates rural site type (baseline), k = 1 urban and k = 2 road-kerbside, δ k ( s ) is an indicator functionequal to 1 if site s is of type k and 0 otherwise, γ , γ and γ are the baseline intercept and slopes for X and X ,while γ k , γ k and γ k are site-type-speciﬁc adjustments to the baseline intercept and slopes.For the spatio-temporal process ν we ﬁrst assume a stationary time-independent GP with zero mean and exponentialcorrelation function (note that ν t ( s ) = ν ( s , t ) ): ν t ∼ N ( , σ ν H ν ( φ )) , where H ν ( φ ) = corr ( ν t ( s ) , ν t ( s (cid:48) )) = exp ( −|| s − s (cid:48) || φ ) .Then we specify a non-stationary covariance structure as in Sahu and Mukhopadhyay (2015): given a GP ν ∗ t deﬁned ona set of m = 25 knot locations ν ∗ t ∼ M V N ( , σ ν H ν ∗ ( φ )) , the Gaussian predictive process (GPP) at a new location s is deﬁned as ˜ ν t ( s ) = E [ ν t ( s ) | ν ∗ t ] . From multivariate Gaussian theory it follows that ˜ ν t = C ∗ H − ν ∗ ( φ ) ν ∗ t with C ∗ being the cross-correlation function between ν t and ν ∗ t .The non-stationarity and the anisotropy are given by the fact that corr (˜ ν t ( s ) , ˜ ν t ( s (cid:48) )) = c ∗ ( s ) T H − ν ∗ ( φ ) c ∗ ( s (cid:48) ) whichdepends on both s and s (cid:48) and not only on the separation vector or the distance between locations.To introduce temporal dependence we specify a ﬁrst order autoregressive model for ν ∗ t .The choice of knot locations, model speciﬁcation and prior distributions is based on Mukhopadhyay and Sahu (2017).This is justiﬁed by the fact that we use the same data sources on a subregion of their study area. We compared the predictive capability through proper scoring rules (Gneiting and Raftery, 2007) such as the cross-validated logarithmic score (logScore), the Continuous Ranked Probability Score (CRPS) and the Root Mean SquaredError (RMSE), and also the Predictive Model Choice Criterion (PMCC) proposed by Gelfand and Ghosh (1998).Furthermore, we reported the correlation between the observed and the predicted values for the validation sites (COR),the Mean Absolute Percentage Error (MAPE), and the 95% coverage (COV), deﬁned as the percentage of times that theobserved value falls within the 95% credibility interval of the sampled posterior marginal.To measure the predictive capability we need to report the ﬁtted values at the validation sites on the scale of the outcome,as cannot compare the observed values with the ﬁtted values because they are not accounting for the measurementerror, but only for the uncertainty associated with the model parameters. In order to do so, we draw 50 values from themarginal posterior of the measurement error p ( σ (cid:15) | y ) ﬁrst, then draw η from its conditional posterior p ( η | σ (cid:15) , y ) using the simulated values of σ (cid:15) (Gelman et al. , 2013). These values are used as mean and variance of a Normaldistribution, from which we sampled values at each site (sample size = 100).The following are the formulas for the different measures of predictive capability used in the paper. For simplicity weapply a slight change of notation here: y jt indicates the observed value at monitor j ( m is the number of validationmonitors) and day t ( t = 1 , . . . , T , T =1826), and ˆ y jt is the corresponding predicted value obtained as mean of thevector of Q = 100 × sampled values ˆ y jt = ˆ y , . . . , ˆ y Q ∼ F jt , F jt being the empirical distribution function of ˆ y jt . RM SE = (cid:118)(cid:117)(cid:117)(cid:116) mT m (cid:88) j =1 T (cid:88) t =1 ( y jt − ˆ y jt ) M AP E = 1 mT m (cid:88) j =1 T (cid:88) t =1 | y jt − ˆ y jt | y jt · P M CC = m (cid:88) j =1 T (cid:88) t =1 ( y jt − ˆ y jt ) + m (cid:88) j =1 T (cid:88) t =1 V AR ( ˆ y jt ) CRP S = 1 mT m (cid:88) j =1 T (cid:88) t =1 CRP S ( F jt , y jt ) , with CRP S ( F jt , y jt ) = 1 Q Q (cid:88) q =1 | ˆ y q − y jt | − Q Q (cid:88) q =1 Q (cid:88) r =1 | ˆ y q − ˆ y r | PREPRINT - J

UNE

17, 2020For each model under comparison, the predictive capability measures presented above are computed pooling togetherthe 6 validation sets. It can be calculated by day, by site, by site type or across all sites to obtain speciﬁc and globalmeasures, with lowest measures indicating the best predictive performance.

From the joint model, we extract daily predictions of NO concentration on a regular grid that covers the study area.For the grid we choose an intermediate spatial resolution between PCM and AQUM data to limit the computationalburden while retaining spatial variability.In order to provide the predictions in a reasonable time, we extract samples from the joint posterior marginals andestimate the linear predictor at each time-location for the 1826 days on the regular grid (Thomas et al. , 2019).We compute the predictions from the model that includes all monitors as training set.We extract samples from the posterior marginals of the model components in order to reconstruct the linear predictor ateach time-location for the 1826 days on the regular grid, as: η ( s , t ) = α + β k s + λ , z ( s ) + λ , z ( t ) + z ( t, k s ) In particular, following the tutorial by Bakka (2017), we obtain samples from the posterior of the intercept α and β k s for each site type, samples from the posterior of λ , z at the mesh nodes and reproject it on the prediction grid,samples from the posterior of λ , z at each time point (days), and samples from the posterior of z for each day andsite type.Note that in order to predict at the grid locations we need to know the value of site type classiﬁcation for each gridpoint. With this aim we built a function which assigns each location to road-kerb side, urban or rural depending onthe distance from any road as well as using the Corine land cover for the year 2012 for the UK, Jersey and Guernseyshapeﬁle from the Centre for Ecology and Hydrology (Cole et al. , 2015). See Appendix F for more details.For each sample we then sum up the samples from the ﬁxed effects and random effects to reconstruct the linear predictor,then the prediction is given by average across all samples. In this section we present the results of the model comparison, with particular focus on the advantages of the proposedjoint model, and the daily predictions that we obtained from the best model.

In order to show whether the inclusion of more than one extra data source actually improves the model predictivecapability, we compare our proposed joint model with the corresponding models that include only AQUM or PCM.For AQUM we assume a spatio-temporal effect or temporal-only effect when PCM is included.We also compare our joint model with other well established data integration techniques, the simple interpolationmodels described in section 3.3.2 and the more complex ones described in section 3.3.3.Besides providing information about all the sources of uncertainty, all the joint models have better performance than themodels where the misaligned data are interpolated, even allowing for non-stationarity (see table 2).However, the AQUM data do not seem to provide much information, in fact the model that includes only AQUM hasa far worse performance than the one only including PCM. In addition, allowing for a spatial effect on AQUM doesnot improve the prediction, for the model where PCM is also included. This can be explained by the fact that thetime-sitetype interaction z replaces the role of AQUM in capturing the temporal trend when we remove AQUM fromthe model and the temporal information is still provided by the numerous monitoring stations, while there is no otherstructured spatial component that compensates for PCM when it is removed.Furthermore, as we focus here on spatial prediction rather than temporal forecasting, removing AQUM is less of aburden on the model performance in terms of predictive capability.Nevertheless the model including both AQUM and PCM has the best performance in terms of PMCC and CRPS and wewill report the results from this model in the next section.Note that the predictive capability measures of the models in section 3.3.3 cannot be compared with the others due tothe different model structure. Only the PMCC and the 95% coverage are comparable and reported in table 2.10 PREPRINT - J

UNE

17, 2020Table 2: Model comparison in terms of predictive capability

Predictive capabilityModel ** PMCC CRPS RMSE MAPE CORR COV

AQUM ( s, t ) joint 18277 0.0523 0.5725 16.71% 65.83% 78.15% P CM ( s ) joint 14018 0.0372 0.4615 13.54% 76.77% 86.87% AQU M ( s, t ) + P CM ( s ) joint 13621 0.0338 AQUM + P CM bilinear interpolation 82970 0.2560 0.6911 17.57% 67.14% 68.55%

AQUM + P CM kriging estimates 35017 0.2220 0.4964 14.58% 73.13% 75.27%

AQUM , non-stationary(Mukhopadhyay and Sahu, 2017) 79542 * AQUM + P CM , non-stationary(Mukhopadhyay and Sahu, 2017) 75506 * AQUM + P CM , stationary(Mukhopadhyay and Sahu, 2017) 75510 * * As provided by spT.Gibbs function in R package spAir. ** ( s ) indicates spatial-only random effects; ( t ) indicates temporal-only random effects; ( s, t ) indicatesadditive spatial and temporal random effects. When not speciﬁed, a linear effect is assumed as describedin section 3.3. Table 3: Summary of model parameters and hyperparametersmean SD 0.025q median 0.975q α α α β URB -0.1716 0.0047 -0.1808 -0.1716 -0.1624 β RKS σ (cid:15) σ (cid:15) σ (cid:15) σ z σ z σ z r z (km) 177.8 0.227 177.2 177.77 256.0 ρ z λ , λ , λ , We report the results for the joint model that includes spatial and temporal effects on AQUM and spatial effect on PCM,re-ran using all monitors as training data.Looking at the summary reported in table 3 we see that, as expected, there is an increase in the NO concentrationgoing from rural to road-kerb side locations, but not for urban. For the spatial latent ﬁeld z , the estimated empiricalrange, i.e. the distance after which the spatial correlation function drops to 0.13 (Lindgren and Rue, 2015), is 177 Km,corresponding to approximately 50% of the maximum extension of the spatial domain.11 PREPRINT - J

UNE

17, 2020 lambda13z1z1 lambda12z1 −0.50.00.51.01.5 log(NO2) ( µ g/m ) Figure 2: Posterior mean of the latent spatial ﬁeld z and of the rescaled spatial ﬁelds λ , z and λ , z .The scaling parameters λ i,j are all different from 1, meaning the spatial ﬁeld for PCM needs to be rescaled for AQUM( λ , = 1 . ) and for the monitor observations ( λ , = 1 . ), and the temporal latent ﬁeld for AQUM is also calibratedagainst the monitor observations with λ , = 0 . .The intercepts α i represent the overall mean of PCM, AQUM and ground observations respectively.The spatial latent ﬁeld z (Fig. 2) shared between the PCM data, the AQUM data and the monitor observations showsthe trafﬁc-driven characteristics of NO as we can recognize higher values in correspondence of motorways and majorcity centers. The rescaled ﬁelds are reported as well and for λ , z the magnifying effect of the scaling parameter λ , = 1 . is particularly visible.Figure 3 shows the temporal latent ﬁeld z shared between the AQUM data and the monitor observations which capturesthe seasonality of NO , and the rescaled ﬁeld λ , z which is shrinked by the scaling parameter λ , = 0 . .The latent ﬁelds z and z are both centred in zero as the large scale component of PCM and AQUM is captured bytheir intercepts α and α .Finally, the time-sitetype interaction z in Fig. 4 shows that there is some residual site-type-speciﬁc temporal variability,especially for urban and road-kerb side monitors, which is not captured by the main temporal component z . We selected four NO pollution episodes reported by the LondonAir website (King’s College London, 2018) andcompared the predictions for these four days with four randomly selected summer Sundays across the study period,12 PREPRINT - J

UNE

17, 2020 z (black) and λ z (red) − posterior mean Year µ g / m − . − . . . . . Figure 3: Posterior mean of the latent temporal ﬁeld z (black line) and of the rescaled temporal ﬁeld λ , z (red line). RUR

Year µ g / m − . − . . . . URB

Year µ g / m − . − . . . . RKS

Year µ g / m − . − . . . . Figure 4: Time-sitetype interaction z . Posterior mean in red, 95% CI in black dashed lines.13 PREPRINT - J

UNE

17, 2020 easting no r t h i ng log(NO2) ( µ g/m ) Figure 5: Daily predictions for four days in which an air pollution event was registered (top row) and four days withreported low air pollution concentration (bottom row).where we expect to see low levels of NO . The predictions show the expected behaviour, with high predictedconcentrations during the pollution episodes and low concentrations during the selected Sundays (Fig. 5).A layer with the roads classiﬁed as motorways is plotted on top of each map, showing correspondence between thehighest predicted levels of NO and the major roads. This is expected because NO is a highly trafﬁc-driven pollutant.A peak of NO concentration can also be observed in the area of Heathrow airport, on the left of Greater London, whichis characterised by the highest levels also on low concentration days. We implemented a hierarchical Bayesian model to estimate air pollution concentration, combining misaligned datasources with a joint approach. This approach can be considered in between Bayesian melding and calibration, and it isthe ﬁrst attempt at implementing such methods on spatio-temporal air pollution data in

R-INLA .The proposed model includes information on the site type as well as output from two different numerical modelscharacterised by spatial and temporal variability and accounting for trafﬁc, chemistry, land use and meteorologicalcovariates. Our method is transferable to any available data sources, however the interpretation of the results maychange according to their intrinsic characteristics, in particular referring to the information included in the deterministicor LUR models.We show that including more than one covariate at different spatial and temporal resolution increases model predictivecapability. However, removing AQUM has proven not to be detrimental, but this could be justiﬁed with the fact that weare not doing temporal forecasting.Overall we prove that using as much spatial and temporal information as possible is more beneﬁcial than increasing thecomplexity of the random effect structure.A time-site type interaction was added to the model to account for residual temporal variability observed when lookingat the site type-speciﬁc residuals.The advantages of our method are manyfolds: ﬁrst, reconstructing the entire latent ﬁeld in a Bayesian approach providesus with the marginal posterior distribution for all the uncertainty parameters, allowing us to correctly quantify theuncertainty associated with our predictions and the deterministic models, that is not possible to obtain with otherdownscalers and non-model-based solutions; second, unlike the spatio-temporal downscaler proposed by Berrocal et al. PREPRINT - J

UNE

17, 2020(2012), our model reconstructs the latent ﬁelds of the misaligned covariates as a whole, rather than locally. For the samereason, in order to obtain daily predictions at new locations there is no need to calculate the value of the misalignedcovariates at the prediction locations, as the model already estimates the whole latent ﬁeld.Our analysis presents some limitations related on one side to the computational requirements of INLA due to the highnumber of parameters, and on the other side to the generalizability of the results, as the models are quite data-sensitive.In particular, we have very few rural sites even though we extended the study domain outside Greater London, suggestingthe presence of preferential sampling that we did not account for. Furthermore, we made assumptions of stationarityand isotropy which may not hold when extending the spatial domain to bigger areas.As a next step we will extend the joint model to a multivariate version including other pollutants, such as PM or O .In the future, the predicted air pollution concentration with associated measure of uncertainty could be used as exposurein an epidemiological model, allowing for uncertainty propagation. Acknowledgements

The authors wish to thank the King’s College London Environmental Research Group and Paul Agnew from the MetOfﬁce for providing the data, Ben Barratt (King’s College London) and Monica Pirani (Imperial College London) forthe valuable comments, and Haakon Bakka for the priceless explanations on INLA functions. This work was supportedby the Medical Research Council [grant number MR/M025195/1]. Chiara Forlani was funded by Imperial CollegeLondon President’s PhD scholarship. Michela Cameletti has been supported by the PRIN EphaStat Project (ProjectNo. 20154X8K23, https://sites.google.com/site/ephastat/ ) provided by the Italian Ministry for Education,University and Research.

References

Akima H, 1978. A Method of Bivariate Interpolation and Smooth Surface Fitting for Irregularly Distributed Data Points.

ACM Transactions on Mathematical Software (2): 148–159.Atkinson RW, Mills IC, Walton HA, Anderson HR, 2015. Fine particle components and health—a systematic reviewand meta-analysis of epidemiological time series studies of daily mortality and hospital admissions. Journal ofExposure Science and Environmental Epidemiology : 208–214.Bakka H, 2017. Predictions with iid effects - binomial example: Building up the samples model com-ponent by model component (https://haakonbakkagit.github.io/btopic112.html arXiv e-prints (2): arXiv:1803.03765v2.Berrocal VJ, 2019. Data assimilation. In Gelfand AE, Fuentes M, Hoeting JA, Smith RL (eds.), Handbook of Environ-mental and Ecological Statistics , chapter 7, CRC Press, 854.Berrocal VJ, Gelfand AE, Holland DM, 2010. A spatio-temporal downscaler for output from numerical models.

Journalof Agricultural, Biological, and Environmental Statistics (2): 176–197.Berrocal VJ, Gelfand AE, Holland DM, 2012. Space-Time Data fusion Under Error in Computer Model Output: AnApplication to Modeling Air Quality. Biometrics (3): 837–848.Blangiardo M, Cameletti M, 2015. Spatial and Spatial-temporal Bayesian Models with R-INLA . Wiley.Cameletti M, Gómez-Rubio V, Blangiardo M, 2019. Bayesian modelling for spatially misaligned health and air pollutiondata through the INLA-SPDE approach.

Spatial Statistics : 100353.Chang HH, 2016. Data Assimilation for Environmental Pollution Fields. In Lawson AB, Banerjee S, Haining RP, UgarteMD (eds.), Handbook of Spatial Epidemiology

Epidemiology PREPRINT - J

UNE

17, 2020Esch T, Heldens W, Hirner A, Keil M, Marconcini M, Roth A, Zeidler J, Dech S, Strano E, 2017. Breaking new groundin mapping human settlements from space – The Global Urban Footprint.

ISPRS Journal of Photogrammetry andRemote Sensing (June): 30–42.European Commission, 2018. Air Quality Standards (http://ec.europa.eu/environment/air/quality/standards.htm).Fuglstad GA, Simpson DP, Lindgren FK, Rue H, 2019. Constructing Priors that Penalize the Complexity of GaussianRandom Fields.

Journal of the American Statistical Association (525): 445–452.Gelfand AE, Ghosh SK, 1998. Model choice: A minimum posterior predictive loss approach.

Biometrika (1): 1–11.Gelfand AE, Sahu SK, Holland DM, 2012. On the effect of preferential sampling in spatial prediction. Environmetrics (7): 565–578.Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB, 2013. Bayesian Data Analysis . CRC Press, 3edition, 675 pp.Gneiting T, Raftery AE, 2007. Strictly Proper Scoring Rules, Prediction, and Estimation.

Journal of the AmericanStatistical Association (477): 359–378.Gómez-Rubio V, 2019.

Bayesian inference with INLA .Hoek G, Beelen R, de Hoogh K, Vienneau D, Gulliver J, Fischer PH, Briggs DJ, 2008. A review of land-use regressionmodels to assess spatial variation of outdoor air pollution.Huang G, Lee D, Scott E, 2017. Multivariate space-time modelling of multiple air pollutants and their health effectsaccounting for exposure uncertainty.

Statistics in Medicine (November): 1–15.Huang G, Lee D, Scott M, 2015. An integrated Bayesian model for estimating the long-term health effects of airpollution by fusing modelled and measured pollution data: A case study of nitrogen dioxide concentrations inScotland.

Spatial and Spatio-temporal Epidemiology : 63–74.Johnson M, Isakov V, Touma JS, Mukerjee S, Özkaynak H, 2010. Evaluation of land-use regression models used topredict air quality concentrations in an urban area.

Atmospheric Environment (30): 3660–3668.Keller JP, Peng RD, 2019. Error in estimating area-level air pollution exposures for epidemiology. Environmetrics (8).Kiﬂe YW, Hens N, Faes C, 2017. Cross-covariance functions for additive and coupled joint spatiotemporal SPDEmodels in R-INLA. Environmental and Ecological Statistics AdvancedSpatial Modeling with Stochastic Partial Differential Equations Using R and INLA . CRC Press, 300 pp.Lee D, Ferguson C, Marian Scott E, 2011. Constructing representative air quality indicators with measures of uncertainty.

Journal of the Royal Statistical Society. Series A: Statistics in Society (1): 109–126.Lee D, Mukhopadhyay S, Rushworth A, Sahu SK, 2017. A rigorous statistical framework for spatio-temporal pollutionprediction and estimation of its long-term impact on health.

Biostatistics (2): 370–385.Lee D, Sarran C, 2015. Controlling for unmeasured confounding and spatial misalignment in long-term air pollutionand health studies. Environmetrics (7): 477–487.Lindgren FK, Rue H, 2015. Bayesian Spatial Modelling with R-INLA. Journal of Statistical Software (19): 1–25.Lindgren FK, Rue H, Lindström J, 2011. An explicit link between gaussian ﬁelds and gaussian markov randomﬁelds: The stochastic partial differential equation approach. Journal of the Royal Statistical Society. Series B(Methodological) (4): 423–498.Lipfert FW, 2017. A critical review of the ESCAPE project for estimating long-term health effects of air pollution. Environment International : 87–96.Martins TG, Simpson DP, Lindgren FK, Rue H, 2013. Bayesian computing with INLA: New features. ComputationalStatistics and Data Analysis : 68–83.McMillan NJ, Holland DM, Morara M, Feng J, 2010. Combining numerical model output and particulate data usingBayesian space–time modeling. Environmetrics PREPRINT - J

UNE

17, 2020Moraga P, Cramb SM, Mengersen KL, Pagano M, 2017. A geostatistical model for combined analysis of point-leveland area-level data using INLA and SPDE.

Spatial Statistics : 27–41.Mukhopadhyay S, Sahu SK, 2017. A Bayesian spatiotemporal model to estimate long-term exposure to outdoor airpollution at coarser administrative geographies in England and Wales. Journal of the Royal Statistical Society. SeriesA: Statistics in Society

Spatial and Spatio-temporal Epidemiology :53–62.Raftery AE, Fuentes M, 2005. Model Evaluation and Spatial Interpolation by Bayesian Combination of Observationswith Outputs from Numerical Models. Source: Biometrics Annual Review of Statistics and Its Application : 395–421.Sahu SK, Gelfand AE, Holland DM, 2010. Fusing point and areal level space-time data with application to wetdeposition. Journal of the Royal Statistical Society. Series C: Applied Statistics (1): 77–103.Sahu SK, Mukhopadhyay S, 2015. On generating a ﬂexible class of anisotropic spatial models using Gaussian predictiveprocesses. Technical Report May, University of Southampton.Savage NH, Agnew P, Davis LS, Ord Nez C, Thorpe R, Johnson CE, O ’connor FM, Dalvi M, 2013. Air qualitymodelling using the Met Ofﬁce Uniﬁed Model (AQUM OS24-26): model description and initial evaluation. Geosci.Model Dev : 353–372.Schmidt AM, Gelfand AE, 2003. A Bayesian coregionalization approach for multivariate pollutant data. Journal ofGeophysical Research (D24): n/a–n/a.Shaddick G, Thomas ML, Green A, Brauer M, van Donkelaar A, Burnett R, Chang HH, Cohen A, Dingenen RV, DoraC, Gumy S, Liu Y, Martin RV, Waller LA, West J, Zidek JV, Prüss-Ustün A, 2017. Data integration model for airquality: A hierarchical approach to the global estimation of exposures to ambient air pollution.

Journal of the RoyalStatistical Society. Series C: Applied Statistics : 231–253.Shaddick G, Wakeﬁeld JC, 2002. Modelling multivariate pollutant data at multiple sites.

Applied Statistics (Dlm):351–372.Shaddick G, Zidek JV, Liu Y, 2015. Mitigating the effects of preferentially selected monitoring sites for environmentalpolicy and health risk analysis. Spatial and Spatio-temporal Epidemiology : 44–52.Simpson DP, Rue H, Riebler A, Martins TG, Sørbye SH, 2017. Penalising model component complexity: A principled,practical approach to constructing priors. Journal of Statistical Science (1): 1–28.Thomas ML, Shaddick G, Simpson D, de Hoogh K, Zidek JV, 2019. Data integration for high-resolution, continental-scale estimation of air pollution concentrations. arXiv e-prints (2): arXiv:1907.00093v2.WHO, 2006. Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide Global update2005. Technical report.Wikle CK, Berliner LM, 2005. Combining Information Across Spatial Scales. Technometrics (1): 80–91.Zidek JV, Le ND, Liu Z, 2012. Combining data and simulated data for space-time ﬁelds: Application to ozone. Environmental and Ecological Statistics (1): 37–56. A Further ﬁgures

Additional ﬁgures mentioned in the paper are reported here.17

PREPRINT - J

UNE

17, 2020

Estimation set N O c on c en t r a t i on Site type

RURURBRKS050100150200 1 2 3 4 5 6

Validation set N O c on c en t r a t i on Site type

RURURBRKS

Figure A.1: Boxplot of NO concentration by site type for each estimation and validation set. B Comparison of separate models for AQUM and PCM data

Table B.1 shows the deviance information criterion (DIC) and the logarithmic score for the models implementedseparately for AQUM and PCM.We need to include a spatial component for AQUM as this signiﬁcantly improves the model performance.According to the reported measures, the spatial-only model for PCM is the worst in terms of goodness of ﬁt. However,the temporal component seems negligible and when including a space-time interaction the range becomes unreasonable(5230 Km). For these reasons, and to limit the computational burden, we decided to keep only the spatial componentwhen modelling PCM, also because the temporal information is provided by the monitors and the AQUM data at amuch higher resolution (daily instead of annual).

C INLA-SPDE

In this section we deﬁne the class of models on which we can perform Bayesian inference using INLA and brieﬂyintroduce how the inference is computed, following the notation in Blangiardo and Cameletti (2015).Let us consider a set of data y = ( y , . . . , y n ) with distribution characterized by a parameter µ , usually the mean,deﬁned as a function of an additive linear predictor g ( µ ) = η : g ( µ i ) = η i = α + M (cid:88) m =1 β m x mi + L (cid:88) l =1 f l ( z li ) PREPRINT - J

UNE

17, 2020Figure A.2: Annual averages of NO hourly concentration for monitoring sites that exceeded the WHO annual thresholdof 40 µg/m at least once in the study period.Table B.1: Comparison of performance for separate modelsof AQUM and PCMModel * DIC logScore(i)

AQU M ( t ) AQU M ( s + t ) (iii) AQU M ( s ∗ t ) P CM ( s ) -373298 -0.85 (ii) P CM ( s + t ) -403560 -0.91(iii) P CM ( s ∗ t ) -456714 -1.04 * ( t ) indicates temporal-only random effect; ( s ) indicates spatial-only random effect; ( s + t ) indicates additive spatial and temporal effects; ( s ∗ t ) indicates space-time interaction.19 PREPRINT - J

UNE

17, 2020Deﬁning the vector of parameters θ = ( α, β , f ) T and the vector of hyperparameters ψ = ( ψ , . . . , ψ K ) , the likelihoodis given by p ( y | θ , ψ ) = n (cid:89) i =1 p ( y i | θ i , ψ ) We assume the latent ﬁeld θ to be multivariate Normal with precision matrix Q and conditionally independent, i.e. aGaussian Markov Random Field (GMRF): θ ∼ M V N ( , Q − ( ψ )) The Markov property ensures the sparsity of the precision matrix Q .The aim of Bayesian inference is to obtain the posterior marginal distributions for all the model parameters p ( θ i | y ) = (cid:82) p ( θ i , ψ | y ) d ψ = (cid:82) p ( θ i | ψ , y ) p ( ψ | y ) d ψ and hyperparameters p ( ψ k | y ) = (cid:82) p ( ψ | y ) d ψ − k .Therefore we ﬁrst need to compute (i) the joint posterior marginal of the hyperparameters p ( ψ | y ) and (ii) the posteriorconditional distributions p ( θ i | ψ , y ) .Within such class of models, which includes a wide range of possible model speciﬁcations, we can compute thesedistributions through a Laplace approximation: p ( ψ | y ) = p ( θ , ψ | y ) p ( θ | ψ , y ) = p ( y | θ , ψ ) p ( θ , ψ ) p ( y ) 1 p ( θ | ψ , y ) ∝ p ( y | θ , ψ ) p ( θ , ψ ) p ( θ | ψ , y ) ≈ p ( y | θ , ψ ) p ( θ , ψ )˜ p ( θ | ψ , y ) (cid:12)(cid:12)(cid:12)(cid:12) θ = θ ∗ ( ψ ) =: ˜ p ( ψ | y ) where ˜ p ( θ | ψ , y ) is the Gaussian Laplace approximation of p ( θ | ψ , y ) around its mode θ ∗ ( ψ ) .For p ( θ i | ψ , y ) we consider a partition θ = ( θ i , θ − i ) : p ( θ i | ψ , y ) = p (( θ i , θ − i ) | ψ , y ) p ( θ − i | θ i , ψ , y ) = p ( θ , ψ | y ) p ( ψ | y ) 1 p ( θ − i | theta i , ψ , y ) ∝ p ( θ , ψ | y ) p ( θ − i | theta i , ψ , y ) ≈ p ( θ , ψ | y )˜ p ( θ − i | θ i , ψ , y ) (cid:12)(cid:12)(cid:12)(cid:12) θ − i = θ ∗− i ( θ i , ψ ) =: ˜ p ( θ i | ψ , y ) where ˜ p ( θ i | ψ , y ) is the Gaussian Laplace approximation of p ( θ i | ψ , y ) around its mode θ ∗− i ( θ i , ψ ) .In R-INLA other approximation strategies are implemented and can be chosen to speed up the computation.In particular, p ( θ i | ψ , y ) can be directly derived from the Normal approximation ˜ p ( θ | ψ , y ) already computed in theﬁrst step (Gaussian strategy). This can produce inaccurate approximations, however when the conditional p ( θ | ψ , y ) isGaussian it is an exact approximation and there is no need to apply the Laplace method, as in our case.For point-referenced data (i.e. data observed at point locations typically referenced by coordinates), the latent continuousspatial process is a Gaussian Field (GF) with dense spatial covariance matrix that leads to computational issues. TheSPDE approach proposed by Lindgren et al. (2011) is an alternative which consists in representing a continuous spatialprocess (the GF) as a discretely indexed spatial random process (i.e. a GMRF).The continuous GF with Matérn covariance structure z ( s ) is the exact and stationary solution of the following stochasticpartial differential equation: ( κ − ∆) α/ ( τ z ( s )) = W ( s ) (1)where ∆ is the Laplacian, α is a smoothness parameter, κ is a scale parameter, τ controls the variance, s is the genericspatial location and W ( s ) is a Gaussian spatial white noise process.For the relationship between the SPDE in Eq. (1) and the Matérn parameters see Eq. (2) in Appendix D.For details on how to solve the SPDE that gives a Matérn random ﬁeld see Bakka (2018).This solution z ( s ) can be approximated through a weighted sum of basis functions ψ g deﬁned at the G vertices (nodes)of a triangulation (mesh) of the domain with zero-mean Gaussian-distributed weights ˜ z g (Lindgren et al. , 2011): z ( s ) = G (cid:88) g =1 ψ g ( s )˜ z g PREPRINT - J

UNE

17, 2020Figure C.1: Domain triangulation (mesh) with 686 vertices (nodes). The red lines are the boundaries of Greater Londonand the south-eastern coast of UK representing the study area. The blue dots are the locations of the monitoring stations.Therefore, at a discrete set of locations s = ( s , ..., s G ) , i.e. the mesh nodes, the GP z follows a multivariate Normaldistribution with zero mean and spatially structured correlation matrix.Figure C.1 shows the mesh constructed on our data.At the data locations we write z i = Az ( s ) where A is the projection matrix from the mesh nodes to the data locations.This simpliﬁes the notation as the distribution assumed at the data locations has a complicated form. D Matérn covariance function and INLA-SPDE

Let r = d ( i, j ) = || s i − s j || for two-dimensional domains. Then the Matérn class of stationary and isotropic covariancefunctions is deﬁned as COV ( r ) = σ Γ( λ )2 λ − ( κr ) λ K λ ( κr ) (2)where σ = Γ( λ )Γ( α )(4 π ) d/ κ λ τ is the marginal variance, K λ is the modiﬁed Bessel function of second kind and order λ > , λ = α − d/ is a smoothness parameter, where d is the dimention of the domain (2 in case of spatial processes).We refer the reader to Lindgren and Rue (2015) for a discussion on the choice of α implemented in R-INLA . Finally, κ is a scaling parameter related to the empirically derived range ρ = √ λ/κ .21 PREPRINT - J

UNE

17, 2020We adopt the default α = 2 , which for two-dimensional domains means smoothness parameter λ = 1 .We need a parameterization to represent the Matérn correlation structure in INLA. The easiest way would be to assign ajoint normal prior distribution to ψ = log ( τ ) and ψ = log ( κ ) , in fact from the formula of the marginal variance σ we can derive log ( τ ) = 12 log (cid:18) Γ( λ )Γ( α )(4 π ) d/ (cid:19) − log ( σ ) − λlog ( κ ) (3)and from the formula of the empirical range ρ = √ λ/κ we obtain log ( κ ) = 12 log (8 λ ) − log ( ρ ) (4)A more naturally interpretable parameterization is in terms of standard deviation σ and range ρ , such as log ( σ ) = log ( σ ) + ψ (5) log ( ρ ) = log ( ρ ) + ψ (6)Substituting (5) in (3) and (6) in (4) we get respectively: log ( τ ) = log ( τ ) − ψ − λlog ( κ ) (7) log ( κ ) = log ( κ ) − ψ (8)from which log ( τ ) = log ( τ ) − ψ − λψ (9) log ( κ ) = log ( κ ) − ψ (10)so we can express τ and κ in terms of ψ and ψ (Lindgren and Rue, 2015).More recently, Fuglstad et al. (2019) suggested an extension of the penalized complexity prior (PC prior) proposedby Simpson et al. (2017). The PC prior penalizes the Kullback-Liebler divergence between the model P and the basemodel P , i.e. the information lost when approximating P with P , and it is suggested as a way to reduce overﬁtting.Here P represents a model component, such as a GRF.Fuglstad et al. (2019) developed a joint version of the PC prior for the range and marginal variance of Matérn GRFswith ﬁxed smoothness and d < . The joint PC prior is derived from the alternative parameterization of the Matérncovariance function ( τ, κ ) and then transformed back on the range and variance scale ( ρ, σ ). The advantage of PC priorscompared to the prior described above is that it can be expressed in a user friendly form stating the upper or lower tailprobability for the generic parameter of interest φ : P ( φ > U ) = q or P ( φ < L ) = q . This intuitive interpretationmakes it easy to specify a vague, weak or informative prior by tuning the parameter.In the case of the joint PC prior for GRFs, we must choose an upper limit for the standard deviation and a lower limitfor the range: P ( σ > σ ) = q and P ( ρ < ρ ) = q . E Implementation of the joint model through the INLA-SPDE approach

Because we have misaligned data, we describe the joint model with three likelihoods and three linear predictors (section3.2, equations 1, 2, 3).To implement this in

R-INLA , we need to create a complex data structure: the matrix of observations M is deﬁned as ablock matrix with the number of columns corresponding to the number of likelihoods and each block row correspondingto the data used to estimate one of the linear predictors (Martins et al. , 2013).22 PREPRINT - J

UNE

17, 2020 M =  y (1 , ... y ( s , t )  NA NANA  y (1 , ... y ( s , t )  NANA NA  y (1 , ... y ( s , t )  The dimension of M is then ( s t + s t + s t ) × , with s = 44117 PCM grid cells, s = 495 AQUM grid cells, s = 124 monitoring stations, t = 5 years, t = 1826 days.Through the R-INLA copy function, the z and z random effects are included in the linear predictor, so each of themshares the hyperparameters across the linear predictors in eq. 1-3, but at the same time has a scaling parameter as wellfor calibration purposes (Gómez-Rubio, 2019; Rue et al. , 2017).Since the SPDE provides the approximation of the entire spatial process at the mesh nodes, there is no actual alignmentprocedure involved here. For the blocks with spatial structure (all three in our case), we just need a link between themesh nodes and the locations at which the value is known; this link is provided by a projector matrix deﬁned by built-infunctions. Because these locations are different, we need one projector matrix for the PCM data y , one for the AQUMdata y and one for the ground observations y .For each block, a R-INLA stack object is created to link the data and/or the projector matrix to the model effects includedin the linear predictor: for the y stack we will have the intercept and only one random effect represented by the spatialindex, for y we have the intercept, a temporal and a spatial index, and for y we will need the intercept, the site typecovariate, the spatial index for z , the temporal index for z and the time-site type interaction for z .All the stack objects are then put together and passed on to the inla call as data.With this data structure it is easy to include a validation set: assuming we select m sites for validation and the remaining n = s − m for estimation, we just need to set the observations corresponding to the validation locations to N A , sothat

R-INLA will assume them as unknown and will predict their values. In this case the M matrix will be: M =  y (1 , ... y ( s , t )  NA NANA  y (1 , ... y ( s , t )  NANA NA  y (1 , ... y ( n, t ) N A ( n + 1 , ... N A ( n + m, t )  We also need the projector matrix associated with the validation set and the corresponding stack object. The predictionis computed at each new location (either a validation site or on a regular grid) through the procedure described inSection 3.5.

F Retrieving site type at unknown locations

To determine the site type at unknown locations, we tested 12 different approaches on the known monitoring sites usingfour different land cover sources of information to retrieve the rural and urban classiﬁcation, and three different rules to23

PREPRINT - J

UNE

17, 2020Table F.1: Accuracy of methods to retrieve site type based on the 126 known monitors.Land use R1 R2 R3Corine 43.2% 63.2% 64.8%MODIS 39.2% 60.8% 64.0%GUF 1km 37.6% 58.4% 62.4%GUF 12m 39.2% 60.8% 63.2%determine the road-kerb side classiﬁcation based on the distance from a major or minor road (Greater London OrdnanceSurvey minor and major roads ESRI shapeﬁle - Ordnance Survey 2017).The ﬁrst source for land cover is the Corine land cover for the year 2012 for the UK, Jersey and Guernsey shapeﬁle fromthe Centre for Ecology and Hydrology (Cole et al. , 2015), the second is the MODIS land cover type 1 classiﬁcationraster at 500m resolution for year 2005 from NASA LP DAAC (2013), and the last two are the Global Urban Footprint(GUF) rasters at 1km and 12m resolution respectively (DLR, 2016; Esch et al. , 2017). In all cases, the non-urban landcover types are aggregated to rural, and we assume the land cover did not change signiﬁcantly over the study period.We applied the following three rules to all the land cover data:(R1) The site type is deﬁned as road-kerb side if the location is within 4 m from any road, otherwise urban orrural according to the land cover shapeﬁle. This rule combined with the MODIS land cover is the one applied byMukhopadhyay and Sahu (2017).(R2) The site type is deﬁned as road-kerb side if the location is within 10 m from any road, otherwise urban or ruralaccording to the landcover shapeﬁle.(R3) The site type is deﬁned as road-kerb side if the location is within 50 m from a major road or within 10 m from aminor road, otherwise urban or rural according to the landcover shapeﬁle. This rule accounts for the different width ofthe roads, assuming the road midline corresponds to the center of the street.The percentage of correctly classiﬁed monitors for each method is reported in Table F.1.The Corine shapeﬁle seems to provide more accurate information compared to the MODIS raster, and combined withthe 10/50m rule it gives the highest percentage of correct classiﬁcation (64.8%). This method was therefore applied tothe unknown locations.

G Code

The code is available on the GitHub repository https://github.com/cf416/joint_model_no2https://github.com/cf416/joint_model_no2