Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes
Forrest Paton and Paul D. McNicholas
Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario, Canada.
Abstract
Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, they are used to shed light on coastal rainfall patterns in British Columbia (BC). Specifically, this work addresses the question of how one should carry out an exploratory cluster analysis for the BC, or any similar, coastal rainfall data. An approach is developed for clustering multiple processes observed on a comparable interval, based on how similar their underlying covariance kernels are. This approach provides interesting insights into the BC data, and these insights can be framed in terms of El Niño and La Niña; however, the result is not simply one cluster representing El Niño years and another for La Niña years. From one perspective, the results show that clustering annual rainfall can potentially be used to identify extreme weather patterns.
Keywords: British Columbia; Clustering; Coastal rainfall; El Niño; Extreme weather; Gaussian processes; La Niña; Mixture model.
Introduction

In contrast to predictable yearly seasonal changes, El Niño, a well-studied teleconnection, does not occur at regular intervals. This is of particular interest for policy makers, businesses, and people who rely on calculable, foreseeable weather patterns. For example, seasonal changes can specifically alter food production plans, such as when to deploy fishing vessels or harvest crops. While El Niño is primarily categorized through warming temperatures in the eastern and equatorial Pacific Ocean, its effects can be seen around the globe through teleconnections. Teleconnections are environmental phenomena that describe correlated large-scale atmospheric changes over non-contiguous geographic regions. While some teleconnections are well established, others rely on observing statistical irregularities (Gudmundson and Babkina, 2003; Ward et al., 2014); namely, El Niño's effect on precipitation patterns. Sir Gilbert Walker, a 20th century English scientist, for example, first identified the link between Asian monsoons and Pacific coastal barometer readings (Gudmundson and Babkina, 2003). However, the study of teleconnections, such as precipitation patterns, is notoriously complex because of intricate spatial and temporal correlations. While spatial data are often modelled under the assumption that geographical points near one another share more information than those far apart, work has been done to model both nearby and remote geographical correlations. Hewitt et al. (2018) predict precipitation in Colorado by modelling both locally available data and remote Pacific Ocean sea surface temperatures. Understanding El Niño's effect on distant precipitation can shed light on these patterns. Specifically, classifying irregular precipitation patterns in the Americas can help understand El Niño's impact on local weather systems.

A question of particular interest is: which years exhibit distinct rainfall patterns? For prediction, knowing how Pacific Ocean changes affect rainfall can have significant implications (O'Gorman, 2015). A model for clustering yearly rainfall data will also give insight to research on the mechanistic properties of teleconnections. This overlaps with the statistical field of functional data clustering, where data are assumed to follow some functional form. Herein, mixture model-based clustering provides an effective approach to cluster data, and Gaussian processes provide a model for rainfall. Much recent work has been done modelling climatological data with spatial and time dependence (Armal et al., 2018; Cabral et al., 2019; Gupta et al., 2016). Specifically, the use of Gaussian processes (GPs) gives a probabilistic starting point. A GP is a stochastic process that generalizes a finite-dimensional normal distribution to function space. GPs have been used to successfully solve complex non-linear regression and classification problems (Rasmussen, 2005; Roberts et al., 2013). When multiple functions exist on the same interval, usually a compact and finite [0, T], it can be useful to classify them into a finite number of mutually exclusive groups. Here, the process is defined on the index set time, specifically the months of the year.

Gaussian Processes

Rasmussen (2005) defines a GP as "... a collection of random variables, any finite number of which have a joint Gaussian distribution". A GP f(x) creates a set of random variables evaluated at x.
In essence, a GP is a distribution over functions, i.e.,

$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big), \qquad (1)$$

where m(x) = E[f(x)] and k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))] are the mean function and covariance kernel, respectively, x = (x_1, ..., x_n)^⊤ is the function's index, and f(x) = (f_1, ..., f_n)^⊤ is the function's output, i.e., (x_i, f_i) is a point in R². A common GP, and the kernel considered in this paper, is defined with zero mean function and squared exponential (SE) covariance function

$$k(x_i, x_j) = \sigma^2 \exp\left\{-\frac{1}{2l^2}(x_i - x_j)^2\right\}, \qquad (2)$$

where σ and l are hyper-parameters that control the shape of the process; specifically, σ² controls the amount of variation in f(x) and l, the length-scale parameter, controls the correlation. The SE kernel is widely used (Rasmussen, 2005). One convenient property of the SE kernel is infinite differentiability, which is useful because the first derivative is needed for hyper-parameter estimation. Here, the term hyper-parameter refers to the set of parameters that make up the "non-parametric model". That is, the hyper-parameters only appear in the model's prior and, as shown later in (3), are integrated out of the final model (posterior).

Although the SE is the most popular, different kernel functions can be used. Because the kernel is needed to populate a multivariate normal covariance matrix, the kernel is restricted to produce positive semi-definite matrices. While the SE covariance kernel is a popular choice, modelling the shape of the covariance kernel is an open-ended problem. One systematic solution can be formed by considering a Bayesian model selection framework, as discussed in Rasmussen (2005). Duvenaud et al. (2013) also provide a solution by considering automatic kernel selection through searching over appropriate kernel structures. For the SE covariance kernel, σ² controls the height, or amplitude, of the GP and l controls the correlation between observations. This kernel is used to construct a matrix K, which will serve as the covariance matrix in a multivariate normal distribution introduced in the next section:

$$\mathbf{K} = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}.$$

While a GP is defined on the entire real line, we only observe a finite number n of realizations x = (x_1, ..., x_n)^⊤ and corresponding y = (y_1, ..., y_n)^⊤, where y ≜ f(x). The vector x is commonly called the input and represents the location of the process, i.e., observation y_i = f(x_i). The vector y is referred to as the output, and is the function evaluated at location x. This allows for a generalization to a multivariate normal distribution via f(x) ∼ N(0, K). This is possible because marginalizing a Gaussian distribution is trivial: the resulting distribution is Gaussian and we can ignore the (x, y) pairs that are unobserved or missing. Here, changing the kernel affects the shape of the function, effectively controlling the degree to which observations x_i and x_j are correlated. Formulating the problem in this way, we can see that the kernel is our prior on the function space, and the (marginal) likelihood for the GP comes after conditioning on the realized points. As shown in the next section, the hyper-parameters are often estimated to maximize the likelihood of the GP.
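To make (2) concrete, the following minimal R sketch populates K with the SE kernel and draws a single realization of the process using mvtnorm, the package cited later for the simulations; the input grid and the values σ = 3 and l = 1 are illustrative choices, not values taken from the data.

```r
library(mvtnorm)  # for rmvnorm()

# SE kernel (2): k(x_i, x_j) = sigma^2 * exp(-(x_i - x_j)^2 / (2 l^2))
se_kernel <- function(x1, x2, sigma = 1, l = 1) {
  sigma^2 * exp(-(x1 - x2)^2 / (2 * l^2))
}

# Populate the n x n covariance matrix K from the kernel
se_cov_matrix <- function(x, sigma = 1, l = 1) {
  outer(x, x, se_kernel, sigma = sigma, l = l)
}

# One realization of the GP at n = 7 evenly spaced inputs on [0, 10]
x <- seq(0, 10, length.out = 7)
K <- se_cov_matrix(x, sigma = 3, l = 1)
y <- drop(rmvnorm(1, mean = rep(0, length(x)), sigma = K))
```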
Likelihood

The previous section introduced the kernel function, how it relates to a prior on function space, and how the hyper-parameters affect the correlation between the input x and the outcome y. Now, the likelihood for a GP is introduced and strategies for choosing the hyper-parameters are illustrated. From the definition of a GP, y ∼ N(0, K), which is shown formally below. First, let f ∼ N(0, K), where K is the covariance matrix constructed from the SE kernel shown in (2).

The likelihood for a GP is conditioned on the observed values to obtain a marginal likelihood. Using φ to denote the normal density function,

$$\underbrace{p(\mathbf{f} \mid \mathbf{x}) = \phi(\mathbf{f} \mid \mathbf{0}, \mathbf{K})}_{\text{function prior}}, \qquad \underbrace{p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \phi(y_i \mid f_i)}_{\text{likelihood}}.$$

We get the marginal for y by using Bayes' rule and integrating over f:

$$p(\mathbf{y} \mid \mathbf{x}) = \int p(\mathbf{y} \mid \mathbf{f}, \mathbf{x})\, p(\mathbf{f} \mid \mathbf{x})\, d\mathbf{f}. \qquad (3)$$

Taking the natural logarithm of (3) gives

$$\log p(\mathbf{y} \mid \mathbf{x}) = -\frac{1}{2}\left\{ \mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y} + \log |\mathbf{K}| + n \log 2\pi \right\}. \qquad (4)$$

This likelihood can be broken down into three main components: a data fit term, a model complexity term, and a constant:

$$\log p(\mathbf{y} \mid \mathbf{x}) = -\frac{1}{2}\Big\{ \underbrace{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}_{\text{data fit}} + \underbrace{\log |\mathbf{K}|}_{\text{complexity}} + \underbrace{n \log 2\pi}_{\text{constant}} \Big\}. \qquad (5)$$

The data fit and complexity components share an interesting tradeoff. For small length-scale values l, the model will fit the data well and the data fit component will be small. However, points will not be considered "near" each other, resulting in a high model complexity. Conversely, if l is large (so that even distant points remain highly correlated), then the complexity will be small but the data fit term will be large (Murphy, 2012). This is because the SE kernel will converge to σ², making K nearly singular. Because GPs have these inherent penalty terms for over- and under-fitting, cross-validation is generally not used to estimate kernel hyper-parameters.

Prediction

GPs are commonly used in supervised regression tasks for their ability to non-parametrically approximate complex functions and solve functional engineering problems (Bin and Wenlai, 2013). It is often of interest to infer the function's value outside of the paired training data (x, y). To do this, a predictive distribution can be constructed. Let y* = f(x*) be the unobserved outputs to be inferred at locations x*. The joint distribution can be derived in probabilistic terms:

$$f(\mathbf{x}) \sim N(\mathbf{0}, \mathbf{K}(\mathbf{x}, \mathbf{x})), \qquad (6)$$
$$f(\mathbf{x}^*) \sim N(\mathbf{0}, \mathbf{K}(\mathbf{x}^*, \mathbf{x}^*)), \qquad (7)$$
$$\begin{bmatrix} f(\mathbf{x}) \\ f(\mathbf{x}^*) \end{bmatrix} \sim N\left( \mathbf{0}, \begin{bmatrix} \mathbf{K}(\mathbf{x}, \mathbf{x}) & \mathbf{K}(\mathbf{x}, \mathbf{x}^*) \\ \mathbf{K}(\mathbf{x}^*, \mathbf{x}) & \mathbf{K}(\mathbf{x}^*, \mathbf{x}^*) \end{bmatrix} \right). \qquad (8)$$

Note that (8) is the joint distribution of the observed pairs (x, y) and the unobserved pairs (x*, f(x*)). The expected value for f(x*) can be derived using conditional properties, which leads to

$$\hat{f}(\mathbf{x}^*) \triangleq E[f(\mathbf{x}^*) \mid \mathbf{x}, \mathbf{y}, \mathbf{x}^*] = \mathbf{K}(\mathbf{x}^*, \mathbf{x})\,[\mathbf{K}(\mathbf{x}, \mathbf{x})]^{-1}\, \mathbf{y}; \qquad (9)$$

a complete derivation is given by Rasmussen (2005). Then, (9) can be used to compute estimates for the value of the function f(x*) at location x*.
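The marginal log-likelihood (4) and the predictive mean (9) translate directly into a few lines of R. This sketch reuses se_kernel() and se_cov_matrix() from the block above; using solve(K, y) rather than forming an explicit inverse is a standard numerical choice, not something prescribed by the paper.

```r
# Log marginal likelihood (4)
gp_log_marginal <- function(y, x, sigma, l) {
  K <- se_cov_matrix(x, sigma = sigma, l = l)
  drop(-0.5 * (t(y) %*% solve(K, y) +
               determinant(K)$modulus +      # log|K|
               length(y) * log(2 * pi)))
}

# Predictive mean (9): K(x*, x) [K(x, x)]^{-1} y
gp_predict_mean <- function(x_star, x, y, sigma, l) {
  K_star <- outer(x_star, x, se_kernel, sigma = sigma, l = l)
  drop(K_star %*% solve(se_cov_matrix(x, sigma = sigma, l = l), y))
}

x_star <- seq(0, 10, length.out = 50)   # prediction grid
f_hat  <- gp_predict_mean(x_star, x, y, sigma = 3, l = 1)
```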
Model-Based Clustering

Clustering, a.k.a. unsupervised classification, is an unsupervised machine learning task which attempts to classify (unlabelled) data points into distinct groups. Commonly, clustering is defined as assigning data into groups such that data in the same cluster are more similar to each other than to data in different clusters. Initially, this definition seems intuitive; however, in practice there are some problems. Namely, grouping each data point into its own cluster would satisfy this definition. Following several others (e.g., Tiedeman, 1955; Wolfe, 1963), McNicholas (2016a) provides a definition not based on similarity: "a cluster is a unimodal component within an appropriate finite mixture model", where the word 'appropriate' requires consideration and, specifically, requires that the component densities have the necessary flexibility to fit the data (see McNicholas, 2016a, for further discussion). Whatever definition of a cluster one may prefer, many methods have been developed to tackle this problem of unsupervised learning. Model-based clustering refers to using probability distributions to model the clusters (as opposed to hierarchical clustering, k-means clustering, etc.). The finite mixture model is a popular tool for (model-based) clustering; recent reviews are provided by Bouveyron and Brunet-Saumard (2014) and McNicholas (2016b). The density of a G-component finite mixture model is

$$f(\mathbf{x} \mid \boldsymbol{\varphi}) = \sum_{g=1}^{G} \pi_g f_g(\mathbf{x} \mid \boldsymbol{\theta}_g), \qquad (10)$$

where π_g > 0, with ∑_{g=1}^{G} π_g = 1, is the g-th mixing proportion, f_g(x | θ_g) is the g-th component density, and θ = (θ_1, ..., θ_G) are the component density-specific parameters, with φ = (π, θ) and π = (π_1, ..., π_G).
The likelihood for x_1, ..., x_n in a model-based clustering paradigm, using a Gaussian mixture model, is given by

$$\mathcal{L}(\boldsymbol{\varphi}) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g), \qquad (11)$$

where φ(x_i | μ_g, Σ_g) is the density of a multivariate Gaussian distribution with mean μ_g and covariance matrix Σ_g.

The EM Algorithm

Model-based clustering requires estimating the unknown model parameters from the likelihood in (11). The expectation-maximization (EM) algorithm (Dempster et al., 1977) provides a good starting point for this problem. Each iteration of the EM algorithm starts by computing the expectation of the complete-data log-likelihood (E-step), then maximizes this conditional expectation (M-step). The E- and M-steps are iterated until some stopping rule is met. Consider Gaussian model-based clustering with complete-data likelihood denoted by L_c(φ), where φ denotes all the parameters and the complete-data comprise the observed x_1, ..., x_n together with the missing labels z_1, ..., z_n, defined so that z_i = (z_i1, ..., z_iG) with z_ig = 1 if observation i belongs to component g and z_ig = 0 otherwise. Now, for the Gaussian mixture model-based clustering paradigm, corresponding to the likelihood in (11), we have complete-data likelihood

$$\mathcal{L}_c(\boldsymbol{\varphi}) = \prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \pi_g \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) \right]^{z_{ig}}. \qquad (12)$$

In the E-step, we compute

$$\hat{z}_{ig} := E[Z_{ig} \mid \mathbf{x}_i] = \frac{\hat{\pi}_g \phi(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_g, \hat{\boldsymbol{\Sigma}}_g)}{\sum_{h=1}^{G} \hat{\pi}_h \phi(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_h, \hat{\boldsymbol{\Sigma}}_h)} \qquad (13)$$

conditional on the current parameter estimates. Next, in the M-step, the parameters are updated; this amounts to estimating the mixing proportions, mean vectors, and covariance matrices for the Gaussian mixture model:

$$\hat{\pi}_g = \frac{n_g}{n}, \qquad (14)$$
$$\hat{\boldsymbol{\mu}}_g = \frac{1}{n_g} \sum_{i=1}^{n} \hat{z}_{ig} \mathbf{x}_i, \qquad (15)$$
$$\hat{\boldsymbol{\Sigma}}_g = \frac{1}{n_g} \sum_{i=1}^{n} \hat{z}_{ig} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_g)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_g)^\top, \qquad (16)$$

where n_g = ∑_{i=1}^{n} ẑ_ig. After parameter estimation is completed, the clustering results are expressed through the probabilities ẑ_ig, i.e., ẑ_ig is the probability that x_i belongs to component g. These soft probabilities ẑ_ig ∈ [0, 1] are often converted into hard classifications via maximum a posteriori (MAP) probabilities:

$$\text{MAP}(\hat{z}_{ig}) = \begin{cases} 1 & \text{if } g = \operatorname{argmax}_h(\hat{z}_{ih}), \\ 0 & \text{otherwise}. \end{cases} \qquad (17)$$
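As a generic illustration of one EM iteration for the Gaussian mixture in (11), the R sketch below computes the E-step (13) and the M-step updates (14)-(16), with dmvnorm() from mvtnorm; the data layout and function name are assumptions, not the authors' implementation.

```r
library(mvtnorm)

# One EM iteration: E-step (13), then M-step (14)-(16).
# X is an n x p data matrix; pars is a list with elements
# pi (length G), mu (G x p matrix), and Sigma (list of G matrices).
em_step <- function(X, pars) {
  n <- nrow(X); G <- length(pars$pi)
  # E-step: responsibilities z_hat[i, g] as in (13)
  dens <- sapply(seq_len(G), function(g)
    pars$pi[g] * dmvnorm(X, mean = pars$mu[g, ], sigma = pars$Sigma[[g]]))
  z_hat <- dens / rowSums(dens)
  # M-step: updates (14)-(16)
  n_g <- colSums(z_hat)
  pars$pi <- n_g / n
  for (g in seq_len(G)) {
    pars$mu[g, ] <- colSums(z_hat[, g] * X) / n_g[g]
    Xc <- sweep(X, 2, pars$mu[g, ])                    # centre at mu_g
    pars$Sigma[[g]] <- crossprod(sqrt(z_hat[, g]) * Xc) / n_g[g]
  }
  list(pars = pars, z_hat = z_hat)
}

# Hard MAP labels (17) after convergence: apply(z_hat, 1, which.max)
```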
Clustering Gaussian Processes

We have seen that the likelihood for a GP with observed output vector y and corresponding input vector x is a multivariate Gaussian distribution. When clustering GPs, the goal is to find clusters that contain processes with similar paths. A similar path refers not only to how close two processes' values are, but also to how similar their shapes are (smooth, wiggly, etc.). Now let us define the notation used for the model: the i-th GP has output vector y_i, input vector x_i, and

$$p(\mathbf{y}_i \mid \boldsymbol{\theta}_i, \mathbf{x}_i) = \exp\left\{ -\frac{1}{2}\left( \mathbf{y}_i^\top \mathbf{K}^{-1} \mathbf{y}_i + \log |\mathbf{K}| + n \log 2\pi \right) \right\}. \qquad (18)$$

The density in (18) is the probability density function, i.e., p_g(y_i | θ_g, x_i), used as the component density for the finite mixture model

$$p(\mathbf{y}_i \mid \boldsymbol{\theta}, \mathbf{x}_i) = \sum_{g=1}^{G} \pi_g\, p_g(\mathbf{y}_i \mid \boldsymbol{\theta}_g, \mathbf{x}_i), \qquad (19)$$

where θ_g = {l_g, σ_g} denotes the hyper-parameters for the g-th cluster. Because the likelihood of a GP is a Gaussian distribution with covariance matrix K, the complete-data likelihood is given by

$$\mathcal{L}_c(\boldsymbol{\varphi}) = \prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \pi_g \phi(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}_g) \right]^{z_{ig}}, \qquad (20)$$

where K_g is the covariance matrix corresponding to cluster g and φ(y_i | 0, K_g) is the Gaussian density with mean 0 and covariance K_g. An SE covariance kernel is used as the prior on the function space, i.e.,

$$k(x_i, x_j) = \sigma^2 \exp\left\{ -\frac{1}{2l^2}(x_i - x_j)^2 \right\}. \qquad (21)$$

The goal is to recover the G pairs of kernel hyper-parameters θ_g = {l_g, σ_g} and the mixing proportions π = (π_1, ..., π_G), and thence to estimate the latent variables z_1, ..., z_n.

GP Parameter Estimation

The first step is to estimate each GP's kernel hyper-parameters. Herein, the kernel hyper-parameters for the i-th GP are denoted by Θ_i = {l_i, σ_i}. In this step, the maximized kernel hyper-parameters for each GP, l_i^max and σ_i^max, are estimated. To find these maximized hyper-parameters, a maximum likelihood solution is found using gradient ascent, starting with the log-likelihood

$$\log p(\mathbf{y}_i \mid \mathbf{x}_i, \Theta_i) = -\frac{1}{2}\left\{ \mathbf{y}_i^\top \mathbf{K}^{-1} \mathbf{y}_i + \log |\mathbf{K}| + n \log 2\pi \right\}. \qquad (22)$$

The derivative is then taken with respect to the kernel hyper-parameters:

$$\frac{\partial}{\partial \Theta_i} \log p(\mathbf{y} \mid \mathbf{x}, \Theta_i) = \frac{1}{2}\, \mathbf{y}^\top \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \Theta_i} \mathbf{K}^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\left( \mathbf{K}^{-1} \frac{\partial \mathbf{K}}{\partial \Theta_i} \right) = \frac{1}{2} \operatorname{tr}\left\{ \left( \boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}^{-1} \right) \frac{\partial \mathbf{K}}{\partial \Theta_i} \right\}, \qquad (23)$$

where α = K^{-1} y. The partial derivatives for l_i and σ_i are calculated from the first derivatives of the kernel function:

$$\frac{\partial \mathbf{K}}{\partial l_i} = \sigma_i^2 \exp\left\{ -\frac{1}{2l_i^2}(x_i - x_j)^2 \right\} (x_i - x_j)^2\, l_i^{-3}, \qquad (24)$$
$$\frac{\partial \mathbf{K}}{\partial \sigma_i} = 2\sigma_i \exp\left\{ -\frac{1}{2l_i^2}(x_i - x_j)^2 \right\}. \qquad (25)$$

After finding the gradient of the likelihood, a gradient ascent algorithm is used to find a sufficiently close solution. This algorithm repeats the following updates until a stopping rule is satisfied:

$$l_i^{(t+1)} = l_i^{(t)} + \lambda \frac{\partial}{\partial l_i} \log p(\mathbf{y}_i \mid \mathbf{x}, l_i^{(t)}), \qquad (26)$$
$$\sigma_i^{(t+1)} = \sigma_i^{(t)} + \lambda \frac{\partial}{\partial \sigma_i} \log p(\mathbf{y}_i \mid \mathbf{x}, \sigma_i^{(t)}),$$

where superscript (t) denotes iteration t. After maximizing the kernel hyper-parameters, we have Θ̂ = {Θ̂_1, Θ̂_2, ..., Θ̂_n}, where Θ̂_1 = {l_1^max, σ_1^max} denotes the maximized kernel hyper-parameters for the first GP, Θ̂_2 denotes those for the second GP, and so on.
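A minimal R sketch of the gradient ascent in (23)-(26) for one process follows, assuming the SE kernel (21); the step size λ, the iteration count, and the identity tr(AB) = ∑ A ∘ B for symmetric matrices are implementation choices, not values from the paper.

```r
# Gradient ascent for one GP's kernel hyper-parameters, following
# (23)-(26); lambda and n_iter are illustrative settings.
gp_grad_ascent <- function(y, x, l = 1, sigma = 1, lambda = 1e-3, n_iter = 500) {
  D2 <- outer(x, x, function(a, b) (a - b)^2)     # (x_i - x_j)^2
  for (t in seq_len(n_iter)) {
    K     <- sigma^2 * exp(-D2 / (2 * l^2))
    Kinv  <- solve(K)
    alpha <- Kinv %*% y                           # alpha = K^{-1} y
    A     <- alpha %*% t(alpha) - Kinv            # bracketed term in (23)
    dK_dl     <- K * D2 / l^3                     # (24)
    dK_dsigma <- 2 * K / sigma                    # (25)
    # tr(A B) = sum(A * B) since both matrices here are symmetric
    l     <- l     + lambda * 0.5 * sum(A * dK_dl)       # update (26)
    sigma <- sigma + lambda * 0.5 * sum(A * dK_dsigma)
  }
  c(l = l, sigma = sigma)
}

# Per-process maxima for rows of an output matrix Y:
# theta_max <- t(apply(Y, 1, function(yi) gp_grad_ascent(yi, x)))
```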
Modified EM Algorithm

The model seeks to cluster the processes and make inferences on the latent variables. A modified EM approach is used. First, the mixing proportions and cluster hyper-parameters are initialized randomly, i.e., we initialize π = {π_1, π_2, ..., π_G}, l = {l_1, l_2, ..., l_G}, and σ = {σ_1, σ_2, ..., σ_G}. Next, the responsibilities of each GP (GP_1, ..., GP_n) are calculated for each of the G clusters, giving n × G responsibilities

$$\hat{r}_{ig} = \frac{\pi_g \phi(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}_g)}{\sum_{h=1}^{G} \pi_h \phi(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}_h)}. \qquad (27)$$

Note that r̂_ig represents the responsibility, or conditional expected value, of the i-th process belonging to the g-th cluster, and r̂_ig ≜ ẑ_ig. After the responsibilities are calculated, the mixing proportions π_1, ..., π_G are conditionally maximized on these responsibilities. The update for the g-th mixing proportion is

$$\pi_g = \frac{m_g}{m},$$

where m_g = ∑_{i=1}^{n} r̂_ig is the total responsibility for cluster g and m = ∑_{g=1}^{G} m_g. The cluster-specific kernel hyper-parameters, i.e., l_g and σ_g, are then updated, where l_g is the length-scale parameter for cluster g and σ_g is the height parameter for cluster g:

$$\hat{l}_g = \frac{1}{m_g} \sum_{i=1}^{n} \hat{r}_{ig}\, l_i^{\max}, \qquad \hat{\sigma}_g = \frac{1}{m_g} \sum_{i=1}^{n} \hat{r}_{ig}\, \sigma_i^{\max},$$

i.e., we weight the maximized hyper-parameters, σ_i^max and l_i^max, by their respective cluster responsibilities r̂_ig.

This scheme, calculating the responsibilities and then updating the cluster parameters, is repeated until some stopping rule is met; in this case, when the change in the expected complete-data log-likelihood at iteration t becomes small, i.e., until

$$\left| Q(\boldsymbol{\varphi}^{(t)}, \boldsymbol{\varphi}^{(t-1)}) - Q(\boldsymbol{\varphi}^{(t-1)}, \boldsymbol{\varphi}^{(t-2)}) \right| < \epsilon,$$

where Q(φ^{(t)}, φ^{(t−1)}) = E[log L_c(φ^{(t)}) | y, φ^{(t−1)}] is the expectation of the complete-data log-likelihood.
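For concreteness, one pass of this modified EM might be sketched in R as follows; the list-based data layout, the function name, and theta_max (the n × 2 matrix of per-process maxima from the gradient-ascent step) are assumptions for illustration, not the authors' code.

```r
library(mvtnorm)

# One iteration of the modified EM: responsibilities (27), then the
# conditional updates of pi_g, l_g, and sigma_g. y is a list of output
# vectors (one per process) on a shared input grid x; theta_max holds
# the per-process MLEs (l_i^max, sigma_i^max) in its two columns.
modified_em_step <- function(y, x, theta_max, pi_g, l_g, sigma_g) {
  n <- length(y); G <- length(pi_g)
  r <- matrix(0, n, G)
  for (g in 1:G) {
    K_g <- outer(x, x, function(a, b)
      sigma_g[g]^2 * exp(-(a - b)^2 / (2 * l_g[g]^2)))
    r[, g] <- pi_g[g] * sapply(y, dmvnorm, mean = rep(0, length(x)), sigma = K_g)
  }
  r   <- r / rowSums(r)                              # responsibilities (27)
  m_g <- colSums(r)
  list(pi_g    = m_g / sum(m_g),                     # mixing proportion update
       l_g     = colSums(r * theta_max[, 1]) / m_g,  # weighted length-scales
       sigma_g = colSums(r * theta_max[, 2]) / m_g,  # weighted heights
       r = r)
}
```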
At each iteration of gradient ascent, the GP's likelihood gradient needs to be computed:

$$\frac{\partial}{\partial \Theta_i} \log p(\mathbf{y}_i \mid \mathbf{x}, \Theta_i) = \frac{1}{2} \operatorname{tr}\left\{ \left( \boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}^{-1} \right) \frac{\partial \mathbf{K}}{\partial \Theta_i} \right\}. \qquad (28)$$

This operation requires inverting the n × n matrix K. Inverting large matrices is notoriously computationally unstable, especially when the matrices are not full rank (or sufficiently close to it) and eigenvalues become very large or very small. One solution is to first decompose the matrix into lower-triangular form, i.e., K = LL^⊤, and then invert K via K^{-1} = (L^{-1})^⊤ L^{-1}. We use the R (R Core Team, 2018) package FastGP (Gopalan and Bornn, 2016), which uses RcppEigen (Bates and Eddelbuettel, 2013) to invert the lower-triangular matrix L.
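In base R, the same lower-triangular route to K^{-1} can be sketched with chol() and chol2inv(); FastGP carries out the analogous computation in C++ via RcppEigen, so this block only illustrates the decomposition itself.

```r
# Cholesky route to K^{-1}: K = L L^T with L lower-triangular, so
# K^{-1} = (L^{-1})^T L^{-1}. Base R's chol() returns the upper factor.
K    <- se_cov_matrix(x, sigma = 3, l = 1)   # from the earlier sketch
U    <- chol(K)                              # U = L^T
Kinv <- chol2inv(U)                          # computes (L^{-1})^T L^{-1}

# The explicit form, inverting only a triangular matrix:
L     <- t(U)
Linv  <- forwardsolve(L, diag(nrow(K)))
Kinv2 <- t(Linv) %*% Linv                    # equals Kinv up to rounding
```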
Simulation Studies
This section considers two cases of simulated data. The hyper-parameters θ = {l, σ} and the mixing proportions π vary between the simulated sets. The method developed in the previous section is then applied to recover the hyper-parameters and classify each GP into its respective group. For both simulation studies, noiseless squared exponential covariance functions are used, which in effect means that a perfectly interpolated, noiseless process is observed for each simulation.

Simulation I

The first simulation starts by generating 30 GPs. The processes are generated on the interval [0, 10] with T = 7 evenly spaced realizations, i.e., each process has seven values spread evenly on the interval. In all, 10 of the 30 GPs are generated from a multivariate normal distribution using the R package mvtnorm (Genz et al., 2009), where the covariance matrix was constructed using an SE covariance kernel with hyper-parameters l = 1 and σ = 3. The remaining 20 were generated similarly, but with a covariance matrix constructed with hyper-parameters l = 3 and σ = 3.
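The Simulation I data can be generated in a few lines with mvtnorm, mirroring the description above; the random seed is arbitrary.

```r
library(mvtnorm)

set.seed(1)                                # arbitrary seed
x <- seq(0, 10, length.out = 7)            # T = 7 evenly spaced inputs on [0, 10]

gen_gps <- function(n_proc, l, sigma) {
  K <- outer(x, x, function(a, b) sigma^2 * exp(-(a - b)^2 / (2 * l^2)))
  rmvnorm(n_proc, mean = rep(0, length(x)), sigma = K)
}

# 10 GPs with (l = 1, sigma = 3) and 20 with (l = 3, sigma = 3)
Y <- rbind(gen_gps(10, l = 1, sigma = 3),
           gen_gps(20, l = 3, sigma = 3))  # 30 x 7, one process per row
```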
After running the algorithm described in Section 3, estimates for the set of hyper-parameters and the mixing proportion were recorded (Table 1). The mixing proportion is easily identified and accurately estimated. Using the MAP classifications, the algorithm was able to correctly classify each process. Table 1 gives the mean parameter estimates and standard errors, obtained by randomly starting the algorithm 10 times, i.e., initializing the parameters from a random uniform draw, and calculating the mean and standard error from these 10 starts.

Table 1: Mean values for recovered hyper-parameters, with standard errors, for Simulation I.

The GPs in the second cluster (g = 2, blue), generated with the larger length-scale l = 3, are smoother compared to those generated from the process with length-scale l = 1 (Figure 1). The length-scale parameter l was also readily recovered in this scenario, producing estimates similar to the true hyper-parameter. The hyper-parameter σ, which, recall, controls the function's variance (in y, the function's output), is not near the true parameter value. One reason for this could be that this cluster has a comparatively small length-scale, l = 1, which models the relative correlation between the points.

Figure 1: The 30 GPs from Simulation I: a) coloured by MAP classification, red lines are Cluster 1 and blue lines are Cluster 2; b) coloured by individual GP; c) GPs from Cluster 1; d) GPs from Cluster 2.

Simulation II

The second simulation was carried out by first generating 20 GPs. Ten were generated from an SE covariance kernel with hyper-parameters l = 1 and σ = 1; the remaining 10 were generated from a covariance kernel with hyper-parameters l = 2 and σ = 2. Similarly to Simulation I, the GPs were generated by first constructing the covariance matrix and then drawing random samples using the R package mvtnorm. In all, T = 9 equally spaced observed values were recorded for each GP (Figure 2). Based on the plot in Figure 2, there seems to be no clear distinction or natural grouping of the processes. After colouring the processes by their (correct) classifications (Figure 2), there is still ambiguity about the two groups' separation.

Again, the method accurately recovers the mixing parameter and the length-scales (Table 2). However, for cluster 1, the length-scale l is slightly overestimated and the method inflates σ to account for the variance in the function's output. The parameter estimates were calculated by randomly starting the algorithm 10 times and using the mean. Notably, the processes look very similar between groups, so much so that this solution might seem unconvincing if the true group labels were unknown.

Table 2: Mean values for recovered hyper-parameters, with standard errors, for Simulation II.

Figure 2: The 20 GPs from Simulation II: a) coloured by MAP classification, red lines are Cluster 1 and blue lines are Cluster 2;
b) coloured by individual GP; c) GPs from Cluster 1; d) GPs from Cluster 2.
Data Analysis

This section looks at historical monthly precipitation data for coastal regions of British Columbia (BC), Canada. These data are recorded by the Government of Canada and collected from four weather stations: Tofino A, Vancouver International Airport, Port Hardy A, and Victoria International Airport (Figure 3). These data were derived from resources available in the public domain (Government of Canada, 2007).

Figure 3: Map of the four weather stations, three of which are located on Vancouver Island.

Total monthly precipitation was recorded from January 1991 to December 2000. The ten years are treated as independent GPs, and each station is treated separately for clustering purposes. The objective is to find a G = 2 cluster solution for each station and compare the resulting clusters; because there are only 10 processes per station, G > 2 is not considered. Because the data are scaled, only the length-scale hyper-parameter θ = {l} is estimated, holding σ = 1 constant. Note that the observed outputs y are (artificially) connected by lines for illustrative purposes (Figure 4). Figure 4 illustrates the scaled Tofino data plotted by year.

Figure 4: Scaled, seasonal-trend-removed monthly precipitation for the Tofino coastal region of BC, Canada. The points are (artificially) connected between months for illustrative purposes; there are 12 measurements per year.

Next, the maximized hyper-parameter l is fitted; results are shown in Figure 6. Because of multi-modal likelihoods in the gradient ascent approach to hyper-parameter fitting, a grid search was used to choose the optimal value. An example likelihood from this search is shown in Figure 5.
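The per-year grid search can be sketched as follows, reusing gp_log_marginal() from the earlier block; the grid range [0, 2] is taken from Figure 5, while the resolution is an assumption.

```r
# Grid search for each year's length-scale (sigma fixed at 1), used
# because the likelihood can be multi-modal in l. Y holds one scaled
# year per row; x indexes the 12 months.
l_grid <- seq(0.01, 2, by = 0.01)
l_max  <- apply(Y, 1, function(yi) {
  ll <- sapply(l_grid, function(l) gp_log_marginal(yi, x, sigma = 1, l = l))
  l_grid[which.max(ll)]
})
```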
Instead of applying the EM algorithm to estimate the mixture parameters, the relatively small number of processes for each station meant that an exhaustive search could be used. That is, each two-group combination of the ten years was considered. This amounted to maximizing 2^10 = 1024 likelihoods and selecting the model with the highest (complete-data) likelihood. Using these results, two groups emerge: one with a smaller length-scale parameter and one with a larger length-scale parameter, with group two having the larger length-scale.

Table 3: Years in group two, for each weather station, based on maximizing the complete-data log-likelihood; the remaining years belong to group one.

Weather Station    Years in Group Two
Vancouver          1995
Tofino             1995
Port Hardy         1992, 1994, 1995, 1998, 2000
Victoria           1999

Vancouver, Tofino, and Port Hardy all had the year 1995 assigned to group two (Table 3). Additionally, Port Hardy had the years 1992, 1994, 1998, and 2000 assigned to group two. The years assigned to group two in at least one weather station (1992, 1994, 1995, 1998, 1999, 2000) share an interesting characteristic related to Pacific Ocean temperatures. Specifically, El Niño and La Niña events are classified using the Oceanic Niño Index (ONI), an index that measures irregular ocean temperature changes over a three-month moving average. An El Niño (irregularly warm) event immediately preceded a La Niña (irregularly cold) event twice during the studied time period: once in 1995 and again in 1998. Both times, the year started with ocean temperatures warm enough to classify it as an El Niño period and, by the end of the calendar year, the ocean had cooled enough to be classified as La Niña (NOAA, 2019). The other years in the irregular cluster also differed in terms of regular Pacific Ocean temperatures. Generally, years clustered into the irregular group tended to correspond with falling equatorial Pacific Ocean temperatures, as shown in Figure 7A. These irregular years also coincided with years where the middle region of the Pacific Ocean (5S-5N and 170-120W) started the calendar year warmer than it ended, as illustrated in Figure 7B.
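A sketch of the exhaustive two-group search follows; it scores each hard partition by a complete-data log-likelihood in which each group's length-scale is taken as the mean of its members' grid-search values, an assumption made here for concreteness since the paper does not spell out that detail.

```r
# Exhaustive two-group search over the ten years. Y holds one scaled
# year per row, x indexes the 12 months, and l_max comes from the grid
# search above; gp_log_marginal() is the earlier likelihood sketch.
exhaustive_two_group <- function(Y, x, l_max) {
  n <- nrow(Y)
  parts <- as.matrix(expand.grid(rep(list(1:2), n)))  # all 2^n = 1024 labellings
  score <- function(z) {
    ll <- 0
    for (g in 1:2) {
      idx <- which(z == g)
      if (length(idx) == 0) return(-Inf)              # disallow empty groups
      ll <- ll + length(idx) * log(length(idx) / n) + # pi_g contribution
        sum(apply(Y[idx, , drop = FALSE], 1, gp_log_marginal,
                  x = x, sigma = 1, l = mean(l_max[idx])))
    }
    ll
  }
  lls <- apply(parts, 1, score)
  list(labels = parts[which.max(lls), ], loglik = max(lls))
}
```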
Figure 5: The length-scale parameter was chosen by running a grid search over values between 0 and 2; the Victoria weather station is shown here. Two years are plotted, with each year's respective maximum log-likelihood denoted by a black point.

Table 4: Clustering results (parameters recovered) for Tofino, Victoria, Vancouver, and Port Hardy. These estimates are the result of considering each possible two-group combination and selecting the one with the greatest (complete-data) likelihood. The clusters differed mainly with respect to their length-scale parameters l_1 and l_2.
Figure 6: Optimized length-scale hyper-parameter for the ten years of precipitation data, by weather station. The years in Cluster 2 have larger length-scale values.
Discussion

A method for clustering functional data has been introduced and used to cluster coastal rainfall data from BC. First, the hyper-parameters that make up a GP were optimized through a gradient-based maximum likelihood optimizer. Because it was computationally feasible, parameter estimates for the rainfall data were then obtained by considering every two-group combination of years and choosing the combination with the maximum likelihood. For the simulation studies, the usual EM algorithm for finite Gaussian mixture models was modified: instead of maximizing the standard covariance matrix, hyper-parameters for a kernel function that measures correlation in x were optimized, and the covariance matrix was then constructed from the optimal kernel parameters. Herein, we fix G = 2 as known; however, if one were to consider data with more processes per station, then G > 2 could be considered. When the clusters had quite different length-scale parameters l (i.e., 1 versus 3), the parameters were readily recovered. When the GPs had similar length-scale parameters, l was recovered but σ tended to shrink towards a common estimate between the two clusters. The application to the rainfall data from the coastal region of BC discovered two groups of years, one containing "regular" years and the other "irregular" years. The irregular years were years in which there was a transition from El Niño to La Niña or, more generally, cooling Pacific Ocean temperatures. These results suggest that El Niño events have some effect on kernel hyper-parameters. The data were standardized to have a zero mean function, implying that the month-to-month correlation between rainfall patterns can discriminate some El Niño events (as opposed to the magnitude of rainfall).

Figure 7: A) Equatorial (160E-80W) upper (surface to 300 m) ocean temperature average anomaly based on the 1981-2010 climatology, showing monthly temperatures relative to the 30-year average. Blue lines (1992, 1994, 1995, 1998, 1999, 2000) represent years that were clustered into group two for at least one weather station; red lines represent the remaining years. Group two years tend to coincide with cooling temperatures. B) Three-month running mean of Pacific Ocean temperature anomalies for region 3.4 (middle Pacific Ocean, 5S-5N and 170-120W). Blue lines represent years that at least one weather station clustered into group two (irregular); red lines represent the remaining years. The warm (red dashed line) and cold (blue dashed line) lines are the +/- 0.5 thresholds for the Oceanic Niño Index (ONI).

The most obvious direction for future work is to apply the approach developed herein to other rainfall data. In terms of the BC coastal rainfall data, one could carry out a clustering of all locations combined to see whether some years stand out from others. The BC data could also be treated as matrix variate data and clustered accordingly, perhaps after the fashion of Gallaugher and McNicholas (2018). In both cases, it is of interest to observe whether G > 2 is selected.
Acknowledgements
This work was supported by the Canada Research Chairs program and an E.W.R. SteacieMemorial Fellowship.
References
Armal, S., Devineni, N., and Khanbilvardi, R. (2018). Trends in extreme rainfall frequency in the contiguous United States: Attribution to climate change and climate variability modes. Journal of Climate, 369-385.
Bates, D. and Eddelbuettel, D. (2013). Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software, 52(5), 1-24.
Berrocal, V. J. (2016). Identifying trends in the spatial errors of a regional climate model via clustering. Environmetrics, 90-102.
Bin, S. and Wenlai, Y. (2013). Application of Gaussian process regression to prediction of thermal comfort index, volume 2, pp. 958-961.
Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 52-78.
Cabral, R., Ferreira, A., and Friederichs, P. (2019). Space-time trends and dependence of precipitation extremes in North-Western Germany. Environmetrics.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1-38.
Duvenaud, D., Lloyd, J., Grosse, R., Tenenbaum, J., and Ghahramani, Z. (2013). Structure discovery in nonparametric regression through compositional kernel search. Proceedings of the 30th International Conference on Machine Learning, 1166-1174.
Gallaugher, M. P. B. and McNicholas, P. D. (2018). Finite mixtures of skewed matrix variate distributions. Pattern Recognition, 80, 83-93.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., and Hothorn, T. (2009). mvtnorm: Multivariate Normal and t Distributions. R package version 1.0-10.
Gopalan, G. and Bornn, L. (2016). FastGP: Efficiently Using Gaussian Processes with Rcpp and RcppEigen. R package version 1.2.
Government of Canada (2007). Monthly Precipitation Data [dataset]. http://climate.weather.gc.ca/climate_data/monthly_data_e.html. Accessed 2018-08-05.
Gudmundson, C. and Babkina, A. M. (2003). El Niño Overview and Bibliography. Nova Science Publishers, Inc., Hauppauge, New York.
Gupta, U., Jitkajornwanich, K., Elmasri, R., and Fegaras, L. (2016). Adapting K-means clustering to identify spatial patterns in storms. Environmetrics, e2523.
McNicholas, P. D. (2016a). Mixture Model-Based Classification. Chapman & Hall/CRC Press, Boca Raton.
McNicholas, P. D. (2016b). Model-based clustering. Journal of Classification, 33(3), 331-373.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge.
NOAA (2019). Cold & Warm Episodes by Season. NOAA/National Weather Service, National Centers for Environmental Prediction. https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_v5.php. Accessed 2019-05-05.
O'Gorman, P. A. (2015). Precipitation extremes under climate change. Current Climate Change Reports.
R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Rasmussen, C. E. (2005). Gaussian Processes for Machine Learning. MIT Press, Cambridge.
Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371.
Tiedeman, D. V. (1955). On the study of types. In Symposium on Pattern Analysis, ed. S. B. Sells. Randolph Field, Texas: Air University, U.S.A.F. School of Aviation Medicine.
Ward, P. J., Jongman, B., Kummu, M., Dettinger, M. D., Sperna Weiland, F. C., and Winsemius, H. C. (2014). Strong influence of El Niño Southern Oscillation on flood risk around the world. Proceedings of the National Academy of Sciences, 111(44), 15659-15664.