Scalable computation of predictive probabilities in probit models with Gaussian process priors
Jian Cao∗, Daniele Durante† and Marc G. Genton‡

∗ Statistics Program, King Abdullah University of Science and Technology, Saudi Arabia
† Department of Decision Sciences and Bocconi Institute for Data Science and Analytics, Bocconi University, Italy
‡ Statistics Program, King Abdullah University of Science and Technology, Saudi Arabia

Abstract
Predictive models for binary data are fundamental in various fields, and the growing complexity of modern applications has motivated several flexible specifications for modeling the relationship between the observed predictors and the binary responses. A widely–implemented solution expresses the probability parameter via a probit mapping of a Gaussian process indexed by predictors. However, unlike for continuous settings, there is a lack of closed–form results for predictive distributions in binary models with Gaussian process priors. Markov chain Monte Carlo methods and approximation strategies provide common solutions to this problem, but state–of–the–art algorithms are either computationally intractable or inaccurate in moderate–to–high dimensions. In this article, we aim to cover this gap by deriving closed–form expressions for the predictive probabilities in probit Gaussian processes that rely either on cumulative distribution functions of multivariate Gaussians or on functionals of multivariate truncated normals. To evaluate such quantities we develop novel scalable solutions based on tile–low–rank Monte Carlo methods for computing multivariate Gaussian probabilities, and on variational approximations of multivariate truncated normals. Closed–form expressions for marginal likelihoods and posterior distributions of the Gaussian process are also discussed. As illustrated in empirical studies, the proposed methods scale to dimensions where state–of–the–art solutions are impractical.
Keywords:
Binary data, Gaussian process, Multivariate truncated normal, Probit model, Unified skew–normal, Variational Bayes.
1 Introduction

There is a growing demand in various fields for flexible models that are able to accurately characterize complex relations among a vector of binary responses y = (y_1, . . . , y_n)⊤ and a set of predictors X = (x_1, . . . , x_n)⊤, with y_i ∈ {0, 1} and x_i = (x_{i1}, . . . , x_{iq})⊤ ∈ R^q, for each i = 1, . . . , n. Common solutions address this goal by replacing the linear predictor Xβ = (x_1⊤β, . . . , x_n⊤β)⊤ ∈ R^n in the generalized linear model for y (Nelder and Wedderburn, 1972) with a more flexible vector f(X) = [f(x_1), . . . , f(x_n)]⊤ ∈ R^n that accounts for complex non–linear relations between the response and the predictors, thus enhancing predictive power. Notable examples of this approach within the Bayesian setting define f(X) via additive trees (Chipman et al., 2010), Bayesian p–splines (Brezger and Lang, 2006) and Gaussian processes (gp) (Rasmussen and Williams, 2006), among others. Motivated by the success of gps for classification (e.g., Neal, 1999; Opper and Winther, 2000; De Oliveira, 2005; Chu and Ghahramani, 2005; Girolami and Rogers, 2006; Rasmussen and Williams, 2006; Choudhuri et al., 2007; Riihimäki et al., 2013), we focus on deriving improved methods to evaluate the predictive probabilities for the latter class of models under the probit link. In particular, we assume that the responses y_i, i = 1, . . . , n, are conditionally independent realizations from Bernoulli variables with probability parameters Φ[f(x_i)], i = 1, . . . , n, where Φ[f(x)] is the cumulative distribution function of a standard Gaussian evaluated at f(x), whereas f(x) is assigned a gp prior with mean function m(x) = E[f(x)] and covariance kernel K(x, x′) = E{[f(x) − m(x)][f(x′) − m(x′)]}. Exploiting standard gp properties (Rasmussen and Williams, 2006) and assuming no overlap in x_1, . . . , x_n, such assumptions lead to the model

p[y | f(X)] = ∏_{i=1}^n Φ[f(x_i)]^{y_i} {1 − Φ[f(x_i)]}^{1−y_i}, with p[f(X)] = φ_n[f(X) − ξ; Ω],   (1)

where φ_n[f(X) − ξ; Ω] denotes the density of a multivariate Gaussian distribution N_n(ξ, Ω) for f(X), with mean vector ξ = [m(x_1), . . . , m(x_n)]⊤ and n × n covariance matrix Ω having entries Ω_{[ii′]} = K(x_i, x_{i′}), for every i = 1, . . . , n and i′ = 1, . . . , n. Model (1) has attracted considerable interest due to its flexibility and its direct connection with binary discrete choice models based on Gaussian latent utilities z_i = f(x_i) + ε_i, with ε_i ∼ N(0, 1), i = 1, . . . , n (Albert and Chib, 1993). In fact, p[y_i = 1 | f(x_i)] = Φ[f(x_i)] = p[z_i > 0 | f(x_i)]. In such settings, a main goal of inference is to evaluate the predictive probabilities of new responses y_{n+1}, defined as

p(y_{n+1} = 1 | y) = 1 − p(y_{n+1} = 0 | y) = ∫ Φ[f(x_{n+1})] · [∫ p[f(x_{n+1}), f(X) | y] d f(X)] d f(x_{n+1}),   (2)

where p[f(x_{n+1}), f(X) | y] is the joint posterior density of [f(x_{n+1}), f(X)] under model (1), which seemingly is not available in closed form due to the apparent absence of conjugacy between the probit likelihood and the multivariate Gaussian prior for [f(x_{n+1}), f(X)] under model (1).
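Before reviewing existing approaches, it is useful to make the generative model concrete. The short Python sketch below is our own illustration (the function name, the jitter term and the default seed are not part of the paper); it simulates responses from model (1) under the zero mean function and squared exponential kernel adopted later in Section 3.

```python
import numpy as np
from scipy.stats import norm

def simulate_probit_gp(X, alpha=30.0, jitter=1e-8, seed=0):
    """Draw (f(X), y) from model (1) with m(x) = 0 and the squared exponential
    kernel K(x, x') = exp(-alpha * ||x - x'||^2) used in Section 3."""
    rng = np.random.default_rng(seed)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    Omega = np.exp(-alpha * d2) + jitter * np.eye(len(X))      # prior covariance of f(X)
    f = rng.multivariate_normal(np.zeros(len(X)), Omega)       # f(X) ~ N_n(0, Omega)
    y = rng.binomial(1, norm.cdf(f))                           # y_i ~ Bernoulli(Phi[f(x_i)])
    return f, y
```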
This issue has motivated extensive research to compute predictive probabilities in probit models with multivariate Gaussian priors, either via Monte Carlo methods relying on samples from p[f(x_{n+1}), f(X) | y] (Neal, 1999; Albert and Chib, 1993; De Oliveira, 2005; Holmes and Held, 2006; Choudhuri et al., 2007; Pakman and Paninski, 2014; Hoffman and Gelman, 2014; Durante, 2019) or by deriving tractable approximations of p[f(x_{n+1}), f(X) | y] (Consonni and Marin, 2007; Girolami and Rogers, 2006; Chu and Ghahramani, 2005; Rasmussen and Williams, 2006; Riihimäki et al., 2013) that allow simple evaluation of (2). These methods provide state–of–the–art solutions in small–to–moderate dimensional settings, but tend to become rapidly inaccurate or computationally impractical in higher dimensions (Chopin and Ridgway, 2017; Johndrow et al., 2019; Durante, 2019; Fasano et al., 2019). This issue is inherent to probit gps where, by definition, there are p ≈ n parameters in f(X), with the sample size n being typically large in most studies.

In this article we aim to cover the above gap by providing novel closed–form expressions for the predictive probabilities in probit gps, along with improved methods to evaluate the involved quantities in high dimensions. More specifically, in Section 2.1 we first derive a closed–form expression for the marginal likelihood p(y) under model (1), and then exploit this result to show that p(y_{n+1} = 1 | y) can be expressed as the ratio between cumulative distribution functions of multivariate Gaussians with dimensions n + 1 and n, respectively. To overcome the known issues associated with the evaluation of these two quantities in high dimensions (Chopin, 2011; Botev, 2017; Cao et al., 2019, 2020), we introduce an error–reduction technique for computing ratios of Gaussian cumulative distribution functions that builds on the tile–low–rank method in Cao et al. (2020) and substantially reduces the computational time of state–of–the–art strategies such as minimax tilting methods (Botev, 2017) and Hamiltonian Monte Carlo samplers (stan) (Hoffman and Gelman, 2014), without affecting accuracy. To further improve the scalability of the methods presented in Section 2.1, we show in Section 2.2 that p(y_{n+1} = 1 | y) also has an alternative representation based on functionals of multivariate truncated normals, and we address the intractability of such variables in high dimensions by proposing a variational approximation based on univariate truncated normals, which allows accurate and efficient evaluation of predictive probabilities in high dimensions, thus scaling to settings where available strategies are unfeasible. Such results are also related to the conditional distribution of the gp given the binary responses, which we show to coincide with a unified skew–normal (sun) (Arellano-Valle and Azzalini, 2006) by adapting recent results in Durante (2019) on classical Bayesian probit regression. The magnitude of the improvements provided by the new methods presented in Sections 2.1–2.2 relative to state–of–the–art competitors is illustrated in simulations in Section 3 and in an environmental application to Saudi Arabia windspeed in Section 4. Section 5 contains concluding remarks, whereas all the proofs can be found in the Appendix.

2 Computation of Predictive Probabilities in Probit Gaussian Processes
In Sections 2.1 and 2.2 we present novel expressions for the predictive probabilities in probit Gaussian processes, along with improved methods to efficiently evaluate the involved quantities in high dimensions.
2.1 Prediction via ratios of multivariate Gaussian cumulative distribution functions

To introduce the closed–form expression for p(y_{n+1} = 1 | y) based on ratios of multivariate Gaussian cumulative distribution functions, first notice that, by leveraging known properties of Gaussian variables, the probit likelihood in (1) can be re–expressed as p[y | f(X)] = ∏_{i=1}^n Φ[f(x_i)]^{y_i} {1 − Φ[f(x_i)]}^{1−y_i} = ∏_{i=1}^n Φ[(2y_i − 1) f(x_i)] = Φ_n[D f(X); I_n], where D = diag[(2y_1 − 1), . . . , (2y_n − 1)] and Φ_n[D f(X); I_n] is the cumulative distribution function of a zero–mean n–variate Gaussian with identity covariance matrix I_n, evaluated at D f(X). Leveraging this form and adapting Lemma 7.1 in Azzalini and Capitanio (2014) to our setting, we can easily express the marginal likelihood under model (1) as

p(y) = ∫ Φ_n[D f(X); I_n] φ_n[f(X) − ξ; Ω] d f(X) = Φ_n(Dξ; I_n + DΩD⊤).   (3)

Equation (3) provides a closed–form expression that can be useful to estimate fixed parameters in the gp kernel via maximization of p(y) and, as shown in Proposition 1, is also key to obtaining closed–form expressions for p(y_{n+1} = 1 | y).
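For small n, equation (3) can be evaluated directly with off-the-shelf quasi–Monte Carlo routines for Gaussian cumulative distribution functions. A minimal Python sketch (ours, with an assumed function name; scipy's multivariate cdf internally relies on a Genz-type algorithm of the kind discussed below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(y, xi, Omega):
    """Equation (3): p(y) = Phi_n(D xi; I_n + D Omega D^T); feasible for small n only."""
    D = np.diag(2.0 * y - 1.0)                      # D = diag(2y_1 - 1, ..., 2y_n - 1)
    cov = np.eye(len(y)) + D @ Omega @ D.T          # I_n + D Omega D^T
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logcdf(D @ xi)
```

Kernel parameters such as α can then be estimated by maximizing this quantity over a grid of candidate values, as done in Sections 3–4.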
Proposition 1. Under (1), the predictive probability for a new response y_{n+1} ∈ {0, 1} with predictor x_{n+1} ∈ R^q is

p(y_{n+1} = 1 | y) = 1 − p(y_{n+1} = 0 | y) = Φ_{n+1}(D*ξ*; I_{n+1} + D*Ω*D*⊤) / Φ_n(Dξ; I_n + DΩD⊤),   (4)

where D* = diag[(2y_1 − 1), . . . , (2y_n − 1), 1], ξ* = [ξ⊤, m(x_{n+1})]⊤, and Ω* is obtained by adding one additional row and column to Ω, defined as Ω*⊤_{[n+1,·]} = Ω*_{[·,n+1]} = [K(x_{n+1}, x_1), . . . , K(x_{n+1}, x_n), K(x_{n+1}, x_{n+1})]⊤.

To prove Proposition 1, it is sufficient to notice that, by the Bayes rule, p(y_{n+1} = 1 | y) = p(y_{n+1} = 1, y)/p(y), where p(y_{n+1} = 1, y) and p(y) are the marginal likelihoods of (y_{n+1} = 1, y) and y, respectively, under model (1). Replacing such quantities with their closed–form expression in (3) leads to (4). See the Appendix for a more detailed proof, which also includes additional clarifications on equation (3).

Evaluation of (4) requires the calculation of cumulative distribution functions of multivariate Gaussians, which is known to be a challenging task in high dimensions (e.g., Genz, 1992; Chopin, 2011; Botev, 2017; Genton et al., 2018; Cao et al., 2019, 2020). Recent advances via minimax tilting methods (Botev, 2017) allow accurate evaluation of such quantities, but face an increased computational cost which makes such strategies rapidly impractical as n grows. A more scalable solution can be found in the separation–of–variables (sov) algorithm originally introduced by Genz (1992) and subsequently improved in terms of scalability by Cao et al. (2020). Such a routine decomposes the generic multivariate Gaussian probability Φ_n(a, b; Σ) as

Φ_n(a, b; Σ) = ∫_a^b φ_n(u; Σ) d u = (e_1 − d_1) ∫ (e_2 − d_2) · · · ∫ (e_n − d_n) ∫ d w = E_w[(e_1 − d_1) · · · (e_n − d_n)],   (5)

with d_i = Φ({a_i − Σ_{j=1}^{i−1} l_{ij} Φ^{−1}[d_j + w_j(e_j − d_j)]}/l_{ii}) and e_i = Φ({b_i − Σ_{j=1}^{i−1} l_{ij} Φ^{−1}[d_j + w_j(e_j − d_j)]}/l_{ii}) for each i = 1, . . . , n, where l_{ij} is the (ij)-th coefficient of the lower Cholesky factor of Σ, and w = (w_1, . . . , w_{n−1})⊤ denotes a vector with uniform entries w_j ∼ U(0, 1), j = 1, . . . , n − 1. Such a decomposition transforms the integration region into the unit hypercube, thus allowing the evaluation of Φ_n(a, b; Σ) via functionals of uniform densities. To further improve the quality of the above estimator, more recent implementations (Trinh and Genz, 2015) combine (5) with a univariate reordering preconditioner that rearranges the integration variables and produces the corresponding Cholesky factor simultaneously, at the same O(n³) cost of the Cholesky factorization. This ordering strategy processes the integration variables from left to right iteratively and, at each step, it switches the current integration variable with the one having the narrowest conditional integration limits on its right side. Positioning the integration variables with narrower integration limits to the left is shown in Trinh and Genz (2015) and Cao et al. (2020) to improve the Monte Carlo convergence rate of (5), whose integrand is evaluated R times, with R the Monte Carlo sample size, each evaluation having a cost of O(n²). Such costs allow the implementation of this strategy in settings where n is at most in the order of a few thousands. To scale further, Cao et al. (2020) introduced a tile–low–rank representation of Σ that reduces the cost of the sov algorithm by substituting the dense matrix–vector multiplication with the low–rank matrix–vector multiplication. A compatible block–reordering is also introduced in place of the univariate reordering to improve the convergence rate at the same cost as the low–rank Cholesky factorization. Specifically, the block–reordering orders integration variables at the block level based on crude estimates of the block–wise marginal probabilities. Both the block–reordering and the tile–low–rank version of the sov algorithm reach their optimal complexity of O(n^{3/2}) when the block size in the tile–low–rank representation is n^{1/2}, thus reducing the computational complexity of the classical sov algorithm by a factor of n^{1/2} and allowing implementation in tens of thousands of dimensions.

Although the above techniques can be effectively implemented to evaluate multivariate Gaussian probabilities as in (3), the calculation of ratios of such quantities as in (4) has typically high accuracy requirements. Unfortunately, as discussed in Botev (2017) and Cao et al. (2020), the estimation errors of tail multivariate Gaussian probabilities, which also include the cumulative distribution function, can be as large as the probability estimates themselves when n is in the hundreds to thousands, thus producing unreliable ratio estimates. To address this issue, we propose an error–reduction technique that avoids computing the numerator and the denominator in (4) separately, and instead combines their evaluation under the tile–low–rank representation. Indeed, as is clear from Proposition 1, the denominator in (4) is the marginalization of the numerator over the last integration variable. Therefore, keeping the general notation of the sov algorithm and leveraging equation (5), expression (4) can be re–written in the general form

Φ_{n+1}(a, b; Σ) / Φ_n(a_{−(n+1)}, b_{−(n+1)}; Σ_{−(n+1)}) = E_w[(e_1 − d_1) · · · (e_n − d_n)(e_{n+1} − d_{n+1})] / E_{w_{−n}}[(e_1 − d_1) · · · (e_n − d_n)],   (6)

where e_i and d_i are defined as in equation (5) for i = 1, . . . , n + 1, whereas a_{−(n+1)}, b_{−(n+1)} and w_{−n} are obtained by removing the (n + 1)-th entry in a and b, and the n-th entry in w, respectively. Similarly, Σ_{−(n+1)} coincides with Σ without the (n + 1)-th row and column. As is clear from (6), the quantities (e_1 − d_1), . . . , (e_n − d_n) are the same deterministic functions of w both in the numerator and in the denominator, and hence using the same set of Monte Carlo samples w in the n–dimensional hypercube for estimating the two expectations can significantly reduce the estimation error of their ratio. In particular, our proposed ratio estimator is

p̂(y_{n+1} = 1 | y) = Σ_{r=1}^R {[e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n})] · · · [e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n})] · [e_{n+1}(w^{(r)}) − d_{n+1}(w^{(r)})]} / Σ_{r=1}^R {[e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n})] · · · [e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n})]},   (7)

where the generic e_i(w^{(r)}) and d_i(w^{(r)}) denote the values of e_i and d_i in (5) evaluated at the Monte Carlo sample w^{(r)} of w. This estimator is asymptotically unbiased because the numerator and the denominator converge to E_w[(e_1 − d_1) · · · (e_n − d_n)(e_{n+1} − d_{n+1})] and E_{w_{−n}}[(e_1 − d_1) · · · (e_n − d_n)], respectively, and hence equation (7) converges to (6) in probability. Moreover, equation (7) is guaranteed to lie in (0, 1). This is not the case when the numerator and the denominator in (4) are estimated separately. Indeed, as discussed in Botev (2017) and Cao et al. (2020), when n is high the estimation errors of the two cumulative distribution functions in (4) are often as large as the estimates themselves, thus producing estimated ratios possibly outside of the range (0, 1) and with high variance.
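To make the shared–sample idea behind (6)–(7) concrete, the Python sketch below implements the plain sov recursion with a dense Cholesky factor and lower limits a = −∞ (so that every d_i = 0), reusing the first n terms of each sample for both the numerator and the denominator. It is our own simplified stand-in for Algorithm 1 below: it omits the tile–low–rank representation and the block reordering that make the actual proposal scalable, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import cholesky

def sov_ratio(b, Sigma, R=10_000, seed=0):
    """Shared-sample estimator (7) of Phi_{n+1}(b; Sigma) / Phi_n(b_{-(n+1)}; Sigma_{-(n+1)})
    with lower limits -infty, so that d_i = 0 in (5). Here len(b) = n + 1."""
    rng = np.random.default_rng(seed)
    m = len(b)                          # m = n + 1
    L = cholesky(Sigma, lower=True)     # its leading n x n block is the factor of Sigma_{-(n+1)}
    num = den = 0.0
    for _ in range(R):
        w = rng.uniform(size=m - 1)
        v = np.zeros(m - 1)             # v_j = Phi^{-1}[w_j e_j], as stored in step [d.1]
        prod = 1.0
        for i in range(m):
            e = norm.cdf((b[i] - L[i, :i] @ v[:i]) / L[i, i])
            if i < m - 1:
                v[i] = norm.ppf(np.clip(w[i] * e, 1e-16, 1.0 - 1e-16))
            prod *= e
            if i == m - 2:
                den += prod             # product of the first n terms: denominator of (7)
        num += prod                     # product of all n + 1 terms: numerator of (7)
    return num / den

# Usage for (4): b = D_star @ xi_star; Sigma = np.eye(n + 1) + D_star @ Omega_star @ D_star.T
# p_hat = sov_ratio(b, Sigma)
```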
The pseudo–code for evaluating (4) via the estimator outlined in (7) is provided in Algorithm 1.

Algorithm 1: Compute (4) via the estimator (7)
[a] Define a = −∞, b = D*ξ*, Σ = I_{n+1} + D*Ω*D*⊤, a_{−(n+1)} = −∞, b_{−(n+1)} = Dξ, Σ_{−(n+1)} = I_n + DΩD⊤, and let w^{(1)}, . . . , w^{(R)} denote uniform samples from the n–dimensional unit hypercube.
[b] Apply block reordering (Cao et al., 2020) to (a_{−(n+1)}, b_{−(n+1)}, Σ_{−(n+1)}), which produces the tile–low–rank Cholesky factor L after variable reordering, and reorder a_{−(n+1)} and b_{−(n+1)} accordingly.
[c] Compute the last row L_{[n+1,1:n+1]} of the Cholesky factor of Σ using Σ and L.
[d] Obtain the quantities required to evaluate equation (7): for r = 1, . . . , R do
  [d.1] Compute the differences e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n}), . . . , e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n}) by applying the tile–low–rank variant of equation (5) (Cao et al., 2020) to (a_{−(n+1)}, b_{−(n+1)}, L). Store also the vector v^{(r)} = [Φ^{−1}{d_1(w^{(r)}_{−n}) + w^{(r)}_1[e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n})]}, . . . , Φ^{−1}{d_n(w^{(r)}_{−n}) + w^{(r)}_n[e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n})]}]⊤.
  [d.2] Set e_{n+1}(w^{(r)}) − d_{n+1}(w^{(r)}) = Φ[(b_{n+1} − L_{[n+1,1:n]} v^{(r)})/l_{n+1,n+1}] − Φ[(a_{n+1} − L_{[n+1,1:n]} v^{(r)})/l_{n+1,n+1}].
[e] Estimate (4) via Monte Carlo as in (7) using the quantities computed in step [d].

In step [b], the block reordering produces a new variable order that is used to reorder the integration limits a and b, whereas in step [c] the inverse matrices of the diagonal blocks of L computed in step [b] are recycled to maximize efficiency. Also the quantities in step [d.1] do not need to be re–evaluated every time a new prediction is required, since they only depend on the observed training data, and hence such quantities can be pre–computed and stored separately.

2.2 Prediction via functionals of multivariate truncated normals

The methods presented in Section 2.1 allow substantial improvements in terms of accuracy and scalability in the evaluation of the predictive probabilities under probit gps. However, the cost of the tile–low–rank version of the sov algorithm might still be too expensive in high–dimensional settings with a large n. To further reduce computational times, we derive an alternative expression for p(y_{n+1} = 1 | y) relying on functionals of multivariate truncated normals, which are then approximated via mean–field variational Bayes (e.g., Blei et al., 2017) to facilitate simple evaluation of p(y_{n+1} = 1 | y) using Monte Carlo samples from univariate truncated normals.

To derive this alternative expression, we shall first notice that the joint posterior p[f(x_{n+1}), f(X) | y] in (2) can be factorized as p[f(x_{n+1}) | f(X)] · p[f(X) | y], provided that f(x_{n+1}) does not appear in the likelihood for y, which is true because there is no overlap among predictors. Exploiting the well–known properties of gps (Rasmussen and Williams, 2006), the first factor p[f(x_{n+1}) | f(X)] in the above expression can be easily derived by applying the closure under conditioning property of multivariate Gaussians, thus obtaining the univariate normal density

p[f(x_{n+1}) | f(X)] = φ(f(x_{n+1}) − [m(x_{n+1}) + Ω*_{[n+1,1:n]} Ω^{−1}(f(X) − ξ)]; K(x_{n+1}, x_{n+1}) − Ω*_{[n+1,1:n]} Ω^{−1} Ω*_{[1:n,n+1]})
 = φ(f(x_{n+1}) − [μ_{x_{n+1}} + H_{x_{n+1}} f(X)]; σ²_{x_{n+1}}),   (8)

where H_{x_{n+1}} = Ω*_{[n+1,1:n]} Ω^{−1}, μ_{x_{n+1}} = m(x_{n+1}) − H_{x_{n+1}} ξ and σ²_{x_{n+1}} = K(x_{n+1}, x_{n+1}) − Ω*_{[n+1,1:n]} Ω^{−1} Ω*_{[1:n,n+1]}, whereas the other quantities are defined as in equations (1) and (4). Adapting the recent conjugacy results for probit models with Gaussian priors in Durante (2019) to the gp setting, it is also possible to express p[f(X) | y] as the density of the unified skew–normal (sun) (Arellano-Valle and Azzalini, 2006) SUN_{n,n}[ξ, Ω, Ω̄ωD⊤s^{−1}, s^{−1}Dξ, s^{−1}(DΩD⊤ + I_n)s^{−1}], with s = [(DΩD⊤ + I_n) ⊙ I_n]^{1/2}, Ω̄ = ω^{−1}Ωω^{−1} and ω = (Ω ⊙ I_n)^{1/2}. Indeed, recalling the results discussed in Sections 1–2.1 and applying the Bayes rule, we have that p[f(X) | y] ∝ p[f(X)] · p[y | f(X)] = φ_n[f(X) − ξ; Ω] · Φ_n[D f(X); I_n], which coincides with the kernel of a sun density, as discussed in the proof of Theorem 1 in Durante (2019). Such a class of random variables introduces asymmetric shapes in Gaussian densities via a skewness–inducing mechanism driven by the cumulative distribution function of an n–variate Gaussian with a full–rank covariance matrix.
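In code, the quantities in (8) amount to a few linear solves. A small sketch (function and argument names ours), assuming Omega is the n × n training kernel matrix, k_star the vector [K(x_{n+1}, x_1), . . . , K(x_{n+1}, x_n)] and k_ss = K(x_{n+1}, x_{n+1}):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_moments(Omega, k_star, k_ss, xi, m_star=0.0):
    """H, mu and sigma^2 of p[f(x_{n+1}) | f(X)] in equation (8)."""
    H = cho_solve(cho_factor(Omega, lower=True), k_star)  # H^T = Omega^{-1} k_star
    mu = m_star - H @ xi                                  # mu_{x_{n+1}} = m(x_{n+1}) - H xi
    sigma2 = k_ss - H @ k_star                            # sigma^2_{x_{n+1}}
    return mu, H, sigma2
```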
Hence, the evaluation of p[f(X) | y] still requires the calculation of multivariate Gaussian probabilities, leading to the same issues discussed in Section 2.1; see Arellano-Valle and Azzalini (2006), Azzalini and Capitanio (2014) and Durante (2019) for an in–depth discussion of the properties of sun variables for posterior inference.

A possibility to address the above issue is to leverage the discrete–choice interpretation of the probit gp introduced in Section 1. Under this alternative representation, model (1) can be equivalently re–expressed as y_i = 1(z_i > 0), with [z_i | f(x_i)] ∼ N[f(x_i), 1], i = 1, . . . , n, and f(X) = [f(x_1), . . . , f(x_n)]⊤ ∼ N_n(ξ, Ω). Adapting the results in Holmes and Held (2006) to our gp setting, the joint posterior p[f(X), z | y] of f(X) and the augmented data z = (z_1, . . . , z_n)⊤ factorizes as p[f(X) | z] · p(z | y), with

p[f(X) | z] = φ_n[f(X) − (Ω^{−1} + I_n)^{−1}(Ω^{−1}ξ + z); (Ω^{−1} + I_n)^{−1}] = φ_n(f(X) − [μ_X + Σ_X z]; Σ_X),
p(z | y) ∝ φ_n[z − ξ; I_n + Ω] ∏_{i=1}^n 1[(2y_i − 1)z_i > 0] = φ_n[z − ξ; Σ_z] ∏_{i=1}^n 1[(2y_i − 1)z_i > 0],   (9)

where Σ_X = (Ω^{−1} + I_n)^{−1}, μ_X = Σ_X Ω^{−1} ξ and Σ_z = I_n + Ω. Therefore, the joint posterior density p[f(X) | z] · p(z | y) factorizes as the product of a Gaussian for p[f(X) | z] and a multivariate truncated normal for p(z | y), obtained via component–wise truncation of N_n(ξ, Σ_z) above or below 0, depending on whether y_i = 1 or y_i = 0, respectively, for each i = 1, . . . , n. As shown in Proposition 2, by combining equations (8)–(9) with Lemma 7.1 in Azzalini and Capitanio (2014), it is possible to obtain an alternative expression for p(y_{n+1} = 1 | y) based on functionals of multivariate truncated normals. See the Appendix for a detailed proof.
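The parameters of the factorization in (9) involve only simple matrix identities; a brief sketch (ours), which also uses Σ_X Ω^{−1} = (I_n + Ω)^{−1} to avoid explicitly inverting Ω:

```python
import numpy as np

def augmented_factorization(Omega, xi):
    """Sigma_X, mu_X and Sigma_z of the factorization p[f(X)|z] p(z|y) in equation (9)."""
    n = len(xi)
    Sigma_z = np.eye(n) + Omega                 # covariance of the truncated normal p(z|y)
    Sigma_X = np.linalg.solve(Sigma_z, Omega)   # (Omega^{-1} + I_n)^{-1} = (I_n + Omega)^{-1} Omega
    mu_X = np.linalg.solve(Sigma_z, xi)         # Sigma_X Omega^{-1} xi = (I_n + Omega)^{-1} xi
    return Sigma_X, mu_X, Sigma_z
```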
Proposition 2. Under (1), the predictive probability for a new response y_{n+1} ∈ {0, 1} with predictor x_{n+1} ∈ R^q is

p(y_{n+1} = 1 | y) = 1 − p(y_{n+1} = 0 | y) = E_{z|y}[E_{f(X)|z}(E_{f(x_{n+1})|f(X)}{Φ[f(x_{n+1})]})]
 = E_{z|y}[E_{f(X)|z}[Φ(μ_{x_{n+1}} + H_{x_{n+1}} f(X); 1 + σ²_{x_{n+1}})]]
 = E_{z|y}[Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}})],   (10)

where the different quantities in (10) are defined as in equations (8) and (9), whereas E_{z|y}[·] denotes the expectation with respect to the multivariate truncated normal density p(z | y) in (9).

Leveraging Proposition 2, it is possible to evaluate p(y_{n+1} = 1 | y) via Monte Carlo methods based on independent samples from the multivariate truncated normal with density as in (9), thus producing the estimate p̂(y_{n+1} = 1 | y) = 1 − p̂(y_{n+1} = 0 | y) = Σ_{r=1}^R Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z^{(r)}]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}})/R, where z^{(1)}, . . . , z^{(R)} denote independent and identically distributed samples from p(z | y). Unfortunately, sampling from multivariate truncated normals in settings where n is larger than a few hundreds raises the same computational issues discussed in Section 2.1, i.e., the evaluation of multivariate Gaussian cumulative distribution functions (Holmes and Held, 2006; Botev, 2017; Pakman and Paninski, 2014; Durante, 2019; Fasano et al., 2019).

To address the above issue, we propose to replace the intractable sampling density p(z | y) with a mean–field variational approximation q*(z) = ∏_{i=1}^n q*(z_i) that factorizes over its marginals q*(z_1), . . . , q*(z_n). In this way, the Monte Carlo estimate for p(y_{n+1} = 1 | y) can be obtained by sampling R times from the n independent univariate approximate densities q*(z_1), . . . , q*(z_n) instead of the exact but intractable joint density p(z | y). Recalling the classical mean–field variational Bayes framework (e.g., Blei et al., 2017), the optimal approximating density q*(z) is the one that minimizes the Kullback–Leibler (kl) divergence kl[q(z) ∥ p(z | y)] = E_{q(z)}{log[q(z)/p(z | y)]} (Kullback and Leibler, 1951) to p(z | y) among all the densities in the mean–field variational family Q = {q(z) : q(z) = ∏_{i=1}^n q(z_i)}. The solution of such a minimization problem is typically not available in closed form, but can be obtained via coordinate ascent variational inference (cavi) algorithms (Bishop, 2006; Blei et al., 2017) that iteratively minimize the kl divergence with respect to one component q(z_i) at a time, keeping the others fixed at their most recent estimate q^{(t−1)}(z_{−i}) = [q^{(t)}(z_1), . . . , q^{(t)}(z_{i−1}), q^{(t−1)}(z_{i+1}), . . . , q^{(t−1)}(z_n)]. Recalling Bishop (2006), this is accomplished via the updates

q^{(t)}(z_i) ∝ exp{E_{q^{(t−1)}(z_{−i})}[log p(z_i | z_{−i}, y)]}, for each i = 1, . . . , n,   (11)

at iteration t, until convergence. In (11), the quantity p(z_i | z_{−i}, y) denotes the full conditional density of z_i. Due to the closure under conditioning property of the multivariate truncated normal (Horrace, 2005), such a quantity can be derived explicitly from p(z | y) in (9) and coincides with the density of a univariate truncated normal. In particular, we can express p(z_i | z_{−i}, y) as

p(z_i | z_{−i}, y) ∝ φ{z_i − [ξ_i + H_{z_i}(z_{−i} − ξ_{−i})]; σ²_{z_i}} 1[(2y_i − 1)z_i > 0],   (12)

where ξ_{−i} denotes the vector ξ without the i-th entry, H_{z_i} = Σ_{z[i,−i]} Σ^{−1}_{z[−i,−i]}, and σ²_{z_i} = Σ_{z[i,i]} − Σ_{z[i,−i]} Σ^{−1}_{z[−i,−i]} Σ_{z[−i,i]}. The density in (12) has an exponential kernel which is linear in z_{−i} and, hence, replacing the expression for p(z_i | z_{−i}, y) in the cavi updates reported in (11), it follows that q^{(t)}(z_i) also has a univariate truncated normal density as in (12), with z_{−i} replaced by z̄^{(t−1)}_{−i} = [E_{q^{(t)}(z_1)}(z_1), . . . , E_{q^{(t)}(z_{i−1})}(z_{i−1}), E_{q^{(t−1)}(z_{i+1})}(z_{i+1}), . . . , E_{q^{(t−1)}(z_n)}(z_n)]⊤. Each term in z̄^{(t−1)}_{−i} is the expected value of a univariate truncated normal, which is available in closed form, thus producing a simple cavi relying on closed–form updates, as outlined in Algorithm 2.

Algorithm 2: Compute (10) via Monte Carlo based on a mean–field approximation of p(z | y)
[a] Initialize z̄^{(0)} = [0, . . . , 0]⊤.
[b] Apply the cavi algorithm to obtain the optimal mean–field approximation q*(z) = ∏_{i=1}^n q*(z_i) for p(z | y):
  for t = 1 until convergence do
    for i = 1, . . . , n do
      Set the univariate truncated normal approximating density q^{(t)}(z_i) for z_i at step t equal to q^{(t)}(z_i) ∝ φ{z_i − [ξ_i + H_{z_i}(z̄^{(t−1)}_{−i} − ξ_{−i})]; σ²_{z_i}} 1[(2y_i − 1)z_i > 0], with z̄^{(t−1)}_{−i} = [E_{q^{(t)}(z_1)}(z_1), . . . , E_{q^{(t)}(z_{i−1})}(z_{i−1}), E_{q^{(t−1)}(z_{i+1})}(z_{i+1}), . . . , E_{q^{(t−1)}(z_n)}(z_n)]⊤.
  Output: q*(z) = ∏_{i=1}^n q*(z_i), where each q*(z_i) is a univariate truncated normal.
[c] Estimate (10) via Monte Carlo as in (13).

Once the optimal univariate truncated normal approximating densities q*(z_1), . . . , q*(z_n) are available, equation (10) can be easily evaluated via Monte Carlo by letting

p̂(y_{n+1} = 1 | y) = (1/R) Σ_{r=1}^R Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z*^{(r)}]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}}),   (13)

with z*^{(r)} = [z*^{(r)}_1, . . . , z*^{(r)}_n]⊤, for r = 1, . . . , R, where each z*^{(r)}_i can be efficiently sampled from the corresponding univariate truncated normal approximating density q*(z_i), independently for i = 1, . . . , n and r = 1, . . . , R. Unlike in the multivariate case, sampling from independent univariate truncated normals can be done effectively in standard statistical software, thus avoiding issues in large n settings. As outlined in the simulation studies in Section 3, such an approximate strategy massively reduces computational times without affecting accuracy.
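The cavi updates and the final estimator lend themselves to a compact implementation. The Python sketch below is our own illustration of Algorithm 2 and of estimator (13): function names, the convergence tolerance, and the use of the precision matrix of Σ_z to read off the full–conditional parameters in (12) are our choices, not part of the paper's R implementation.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def cavi_meanfield(y, xi, Sigma_z, max_iter=100, tol=1e-10):
    """Algorithm 2, steps [a]-[b]: optimal univariate truncated normal factors q*(z_i).
    Returns location mu_i and scale sig_i of each factor; the support is (0, inf)
    when y_i = 1 and (-inf, 0) when y_i = 0."""
    n, s = len(y), 2.0 * y - 1.0
    Q = np.linalg.inv(Sigma_z)               # precision of N_n(xi, Sigma_z)
    sig = 1.0 / np.sqrt(np.diag(Q))          # conditional sds sigma_{z_i} in (12)
    mu, zbar = xi.astype(float), np.zeros(n) # z^{(0)} = 0 as in step [a]
    for _ in range(max_iter):
        old = zbar.copy()
        for i in range(n):
            r = Q[i] @ (zbar - xi) - Q[i, i] * (zbar[i] - xi[i])
            mu[i] = xi[i] - sig[i] ** 2 * r  # full-conditional location in (12) at zbar_{-i}
            h = mu[i] / sig[i]               # closed-form truncated normal mean
            zbar[i] = mu[i] + s[i] * sig[i] * np.exp(norm.logpdf(h) - norm.logcdf(s[i] * h))
        if np.max(np.abs(zbar - old)) < tol:
            break
    return mu, sig

def predict_vb(y, mu, sig, mu_X, Sigma_X, mu_star, H, sigma2_star, R=20_000, seed=0):
    """Estimator (13), drawing z*^{(r)} from the independent factors q*(z_i)."""
    rng = np.random.default_rng(seed)
    s = 2.0 * y - 1.0
    lo = np.where(s > 0, -mu / sig, -np.inf)  # standardized truncation limits
    hi = np.where(s > 0, np.inf, -mu / sig)
    Z = truncnorm.rvs(lo, hi, loc=mu, scale=sig, size=(R, len(y)), random_state=rng)
    arg = mu_star + (mu_X + Z @ Sigma_X) @ H  # Sigma_X is symmetric
    return norm.cdf(arg / np.sqrt(1.0 + sigma2_star + H @ Sigma_X @ H)).mean()
```

In this sketch, each coordinate update costs O(n) after the one-off inversion of Σ_z, so a full cavi sweep is O(n²).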
3 Simulation Studies

In this section, we study the gains in accuracy and computational scalability of the methods developed in Section 2 relative to state–of–the–art competitors, which include Monte Carlo inference under the widely–used stan implementation (see R package rstan) of the Hamiltonian no–u–turn sampler (Hoffman and Gelman, 2014), and minimax tilting methods (see R package TruncatedNormal) (Botev, 2017) to evaluate the multivariate Gaussian cumulative distribution functions involved in the predictive probability expressed in equation (4).

To evaluate performance in high–dimensional settings, we generate the binary responses on the 100 × 100 unit grid G = {x = (x_1, x_2) : x_1 ∈ (1/100, 2/100, . . . , 100/100), x_2 ∈ (1/100, 2/100, . . . , 100/100)} with equally–spaced predictors, thus obtaining n = 10,000 non–overlapping configurations. At these locations, we simulate the responses y_1, . . . , y_{10,000} from independent Bernoulli distributions with probability parameters Φ[f(x_1)], . . . , Φ[f(x_{10,000})] displayed in Figure 1, where f(X) = [f(x_1), . . . , f(x_{10,000})]⊤ is a sample from a gp with mean function m(x) = 0 and squared exponential covariance kernel K(x, x′) = exp(−α · ∥x − x′∥²), where α = 30. Such a kernel is frequently used in machine learning, thus we focus on this choice for the simulation study.

Figure 1: Simulated probabilities Φ[f(x)] on the 100 × 100 grid G in the unit square, where f(x) is a zero mean gp with squared exponential covariance kernel. White circles denote the 100 unknown locations distributed randomly (left) and on a grid (right), used for prediction.

To assess performance in estimating the predictive probabilities, we also simulate the probability parameters and the associated binary responses for 100 out–of–sample units under two scenarios. As outlined in Figure 1, the first one relies on randomly distributed locations, whereas the second focuses on a grid structure. We also compare performance in lower–dimensional problems with n ∈ {250, 625, 2,500}, obtained by selecting sub–grids of G with equally spaced configurations between 0 and 1, along with their associated probability parameters and simulated binary responses.

Table 1 summarizes the performance in terms of accuracy and computational scalability of the different methods analyzed, at varying n and under the two different scenarios considered for prediction. In reporting the results, we set a conservative computational budget of one day and display the performance measures only for those routines with a running time below this computational budget. Monte Carlo inference via stan (Hoffman and Gelman, 2014) relies on the rstan package applied to model (1) in order to obtain posterior samples from f(X), which are then used to compute the predictive probabilities at the unknown locations via ordinary kriging. Such evaluations rely on 10,000 mcmc samples after a burn–in of 10,000, setting α = 30. In evaluating the performance of minimax tilting (tn) (Botev, 2017), we compute the numerator and the denominator in (4) separately via the R package TruncatedNormal, using the default settings.

Table 1: Computational efficiency and predictive performance, at varying sample size n, of stan (Hoffman and Gelman, 2014), tn (Botev, 2017), tlr (Section 2.1) and vb (Section 2.2), when the units to be predicted are distributed either randomly [random] or on a grid [grid]. time: cost in seconds to compute 100 predictive probabilities. mse: mean squared error between the estimated predictive probabilities and the true ones. auc: area under the roc curve when predicting the out–of–sample binary responses with the estimated predictive probabilities.

Method   Performance measure    n = 250    n = 625    n = 2,500    n = 10,000
stan     time [seconds]
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
tn       time [seconds]            349       2,339          —             —
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
tlr      time [seconds]            108         402       2,509        18,709
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
vb       time [seconds]
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
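The two accuracy measures in Table 1 are straightforward to compute once the predictive probabilities are available; a short sketch (ours), using scikit-learn for the roc area:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def table1_measures(p_hat, p_true, y_out):
    """mse and auc as defined in the caption of Table 1."""
    return np.mean((p_hat - p_true) ** 2), roc_auc_score(y_out, p_hat)
```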
Since the denominator in (4) is shared among all predictions, it is evaluated only once, so that the total time of TruncatedNormal is close to that of computing one multivariate Gaussian cumulative distribution function. Equation (4) is also evaluated under the novel methods proposed in Section 2.1 (tlr) and summarized in Algorithm 1, which can be implemented via simple adaptations of the R package tlrmvnmvt (Cao et al., 2020). In implementing such a routine, we set the block size to n^{1/2}, a small truncation level for the low–rank approximation, and R = 20,000 Monte Carlo samples. For both tn and tlr, we avoid setting α equal to the true value 30, but instead estimate such a quantity by applying the R packages TruncatedNormal and tlrmvnmvt to maximize, with respect to α, the marginal likelihood in (3) on a grid of 60 equally spaced α values between 15 and 45. Results are comparable, although TruncatedNormal provides slightly less noisy estimates of (3) than tlrmvnmvt, at the cost of a substantially higher running time. The estimate of α provided by tlrmvnmvt is also used in the implementation of the novel variational strategy (vb) presented in Section 2.2 and summarized in Algorithm 2. Also in this case we consider R = 20,000 Monte Carlo samples to evaluate (10). Such values are generated from the optimal univariate truncated normal approximating densities produced by the cavi in Algorithm 2, which can be implemented via minor adaptations of the source code in the GitHub repository
Probit-PFMVB (Fasano et al., 2019). All computations were run on a 2.5 GHz Intel Core i7 CPU workstation, without multithreading.

As clarified in Table 1, the methods proposed in Sections 2.1–2.2 notably reduce the running times relative to state–of–the–art competitors, thus making prediction under probit gps computationally feasible in those high–dimensional settings that arise commonly in various applications. This is especially true for the vb solution proposed in Section 2.2, which is orders of magnitude faster than its competitors. Such a notable reduction in running times under tlr and vb is crucially obtained at almost no cost in terms of accuracy in out–of–sample prediction (see auc) and in the estimation of the predictive probabilities (see mse), when compared to competitors relying on mcmc samples from the exact posterior (stan) or on exact evaluation of multivariate Gaussian cumulative distribution functions (tn).

4 Application to Saudi Arabia Windspeed

We conclude by applying the methods developed in Sections 2.1 and 2.2 to a real–world environmental application aimed at modeling whether the local windspeed exceeds a pre–specified working threshold for energy production in a given region of interest in Saudi Arabia. Wind turbines for generating electricity typically have two windspeed thresholds, of which the lower controls when the blades of the turbine start to be in motion and the higher indicates if the turbine should be switched off to avoid strong
wind damage. Here, the binary response y_i is defined as whether the windspeed at the i-th location exceeds the lower threshold, thus allowing production of wind power, which is referred to as the working threshold of wind turbines. This important application is motivated by the growing domestic energy consumption in Saudi Arabia and by the attempt to reduce the reliance on fossil fuels, thereby leading to an increasing interest in renewable energy sources, including wind (Shaahid et al., 2014; Chen et al., 2018; Tagle et al., 2019; Giani et al., 2020). The effective exploitation of such resources and the careful management of the energy stations require careful modeling and prediction, at a fine spatial resolution, of whether the local windspeed exceeds a given threshold for energy production. As we will discuss in the following, such a fine grid of observations commonly produces a sample size around tens of thousands of units. This makes state–of–the–art algorithms for probit gps computationally unfeasible in such studies, thus motivating our scalable solutions presented in Sections 2.1 and 2.2.

Figure 2: Heatmaps representing the windspeed at 140 meters high (left) and a binary version y of this measure defining whether the local windspeed is sufficiently high for energy production (dark gray: yes; light gray: no) based on the 4 m/s threshold (right) on January 21st, 2014. The red area denotes the spatial region that is used for modeling and prediction.

The windspeed dataset considered in this article is produced by the Weather Research and Forecasting (wrf) model (Yip, 2018), which constructs the weather system via partial differential equations on the mesoscale and demands strong computation capacity to serve meteorological applications (Skamarock et al., 2008). The time resolution of our data is daily, and we use the windspeed over the region of north–west Saudi Arabia on January 21st, 2014 for modeling and out–of–sample prediction. Such a region covers the wind farm at Dumat Al Jandal, which is the first wind farm in Saudi Arabia
and currently under construction, as well as the future smart city of NEOM, a strategic component of the Saudi 2030 Vision, where wind power is expected to be a key energy source. Moreover, the windspeed on January 21st, 2014 has high variability across this region, which makes the out–of–sample prediction task more challenging.

Figure 3: For the spatial region that is used for modeling and prediction, heatmaps defining whether the local windspeed is sufficiently high for energy production (dark gray: yes; light gray: no) based on the 4 m/s threshold on January 21st, 2014. Black circles denote the 100 unknown locations distributed randomly (left) and on a grid (right), used for prediction.

As shown in Figures 2 and 3, the region under analysis is obtained by intersecting the Saudi Arabia territorial map with a longitude–latitude rectangle covering the north–west of the country. Within this region we consider a fine grid of n = 9,
036 equally–spaced locations x_i = (x_{i1}, x_{i2})⊤ = (long_i, lat_i)⊤ at which we monitor whether the windspeed is either above (y_i = 1) or below (y_i = 0) the working threshold of wind turbines, for each i = 1, . . . , 9,036. We rely on model (1) with a gp having zero mean function and squared exponential covariance kernel K(x, x′) = exp(−α · ∥x − x′∥²), where α is estimated via maximization of the marginal likelihood in (3), evaluated via the tlrmvnmvt package on a grid of candidate values. The resulting estimate of α is close to 17, which corresponds to an effective range of 0.42. This relatively short effective range is consistent with the abrupt changes of the binary responses. Recalling the results in Table 1, the calculation of the predictive probabilities is only performed under the methods presented in Sections 2.1 (tlr) and 2.2 (vb), since stan and tn would be computationally impractical in such a high–dimensional setting with n = 9,036. For both methods we set α = α̂ = 17 and consider the same settings as in the simulation study in Section 3, thus obtaining running times that are comparable to the simulation scenario with n = 10,000. In particular, tlr and vb require about 168 and 35 seconds per prediction, respectively, thus confirming the feasibility of such novel methods in high–dimensional settings. We shall also emphasize that these running times are mostly affected by the matrix pre–computation operations in Algorithms 1–2, and hence could be carefully reduced via sparse matrix representations or careful algebraic operations. Out–of–sample predictive performance measured via the auc is similarly accurate under both methods, with a slightly increased improvement provided by vb. In particular, the auc for the random scenario is 0.968 for tlr and 0.971 for vb, whereas in the grid setting such a measure is 0.881 for tlr and 0.906 for vb.

5 Conclusion

This article provides novel expressions for the predictive probabilities under probit models with gp priors, relying either on multivariate Gaussian cumulative distribution functions or on functionals of multivariate truncated normals, and proposes scalable computational strategies to evaluate such quantities in common high–dimensional settings, thus covering an important gap in the literature. As highlighted in the simulation studies in Section 3, these computational gains are notable and do not sacrifice accuracy, thereby allowing tractable prediction under probit gps in applications that were previously computationally impractical, such as the Saudi Arabia windspeed study in Section 4.

The above results open up several avenues for future research. For instance, the methods in Section 2 can be adapted to any probit model with a multivariate Gaussian prior for the linear predictor. These include classical Bayesian probit regression, multivariate probit models and general additive representations relying on basis expansions. Extensions to categorical responses under a multinomial probit gp model, or to more general priors such as unified skew–normals, can also be explored by leveraging results in Durante (2019), Fasano and Durante (2020) and Benavoli et al. (2020).

Acknowledgments
This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2018-CRG7-3742.
Appendix: Proof of Theoretical Results
To prove Propositions 1 and 2, let us first state the following lemma.
Lemma 1 (Lemma 7.1 in Azzalini and Capitanio (2014)). If U ∼ N_p(0, Σ), then E[Φ_q(H⊤U + k; Ψ)] = Φ_q(k; Ψ + H⊤ΣH), for any choice of the vector k ∈ R^q, the p × q matrix H and the q × q symmetric positive–definite matrix Ψ.

Combining the closure under conditioning property of multivariate Gaussians with the above result, whose proof can be found in Azzalini and Capitanio (2014), the proofs of Propositions 1 and 2 can be obtained via the simple derivations described below.
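As a quick numerical sanity check (ours, not part of the proofs), Lemma 1 can be verified by comparing a Monte Carlo average of Φ_q(H⊤U + k; Ψ) against the closed form:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
p, q, R = 3, 2, 2_000
A = rng.standard_normal((p, p)); Sigma = A @ A.T + p * np.eye(p)
B = rng.standard_normal((q, q)); Psi = B @ B.T + q * np.eye(q)
H, k = rng.standard_normal((p, q)), rng.standard_normal(q)

U = rng.multivariate_normal(np.zeros(p), Sigma, size=R)
phi_q = mvn(np.zeros(q), Psi)
lhs = np.mean([phi_q.cdf(H.T @ u + k) for u in U])       # E[Phi_q(H^T U + k; Psi)]
rhs = mvn(np.zeros(q), Psi + H.T @ Sigma @ H).cdf(k)     # Phi_q(k; Psi + H^T Sigma H)
print(round(lhs, 3), round(rhs, 3))                      # agree up to Monte Carlo error
```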
Proof of Proposition 1.
To prove Proposition 1, first notice that, by simple application of the Bayes rule, p(y_{n+1} = 1 | y) = p(y_{n+1} = 1, y)/p(y). Hence, it suffices to show that p(y_{n+1} = 1, y) = Φ_{n+1}(D*ξ*; I_{n+1} + D*Ω*D*⊤) and p(y) = Φ_n(Dξ; I_n + DΩD⊤). Recalling our discussion in Section 2.1, p(y) is the marginal likelihood for the observed data and can be expressed as p(y) = ∫ Φ_n[D f(X); I_n] φ_n[f(X) − ξ; Ω] d f(X) = E{Φ_n[D(f(X) − ξ) + Dξ; I_n]}, where (f(X) − ξ) ∼ N_n(0, Ω). Hence, by applying Lemma 1 to this expectation, we obtain E{Φ_n[D(f(X) − ξ) + Dξ; I_n]} = Φ_n(Dξ; I_n + DΩD⊤). Such a result also clarifies equation (3). The proof of the equation p(y_{n+1} = 1, y) = Φ_{n+1}(D*ξ*; I_{n+1} + D*Ω*D*⊤) proceeds in a similar manner, after noticing that p(y_{n+1} = 1, y) = ∫ (Φ[f(x_{n+1})] · Φ_n[D f(X); I_n]) φ_{n+1}[f*(X) − ξ*; Ω*] d f*(X) = ∫ Φ_{n+1}[D* f*(X); I_{n+1}] φ_{n+1}[f*(X) − ξ*; Ω*] d f*(X) = E{Φ_{n+1}[D*(f*(X) − ξ*) + D*ξ*; I_{n+1}]}, where f*(X) = [f(X)⊤, f(x_{n+1})]⊤ ∼ N_{n+1}(ξ*, Ω*), with ξ*, Ω* and D* defined as in Proposition 1.

Proof of Proposition 2.
Recalling the results discussed in Section 2.2, the predictive probability p(y_{n+1} = 1 | y) can be defined as E_{[z,f(X),f(x_{n+1})]|y}{Φ[f(x_{n+1})]}, with the joint conditional density p[z, f(X), f(x_{n+1}) | y] factorizing as p[f(x_{n+1}) | f(X)] · p[f(X) | z] · p(z | y). Hence, by the law of total expectation, we have that p(y_{n+1} = 1 | y) = E_{z|y}[E_{f(X)|z}(E_{f(x_{n+1})|f(X)}{Φ[f(x_{n+1})]})]. Since, by equation (8), [f(x_{n+1}) | f(X)] ∼ N(μ_{x_{n+1}} + H_{x_{n+1}} f(X), σ²_{x_{n+1}}), we can apply Lemma 1 to obtain E_{f(x_{n+1})|f(X)}{Φ[f(x_{n+1})]} = Φ(μ_{x_{n+1}} + H_{x_{n+1}} f(X); 1 + σ²_{x_{n+1}}). To conclude the proof, note that, by equation (9), also [f(X) | z] ∼ N_n(μ_X + Σ_X z, Σ_X). Therefore, applying again Lemma 1 leads to E_{f(X)|z}[Φ(μ_{x_{n+1}} + H_{x_{n+1}} f(X); 1 + σ²_{x_{n+1}})] = Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}}).

References

Albert, J. H., and Chib, S. (1993), “Bayesian analysis of binary and polychotomous response data,”
Journal of the American Statistical Association, 88, 669–679.

Arellano-Valle, R. B., and Azzalini, A. (2006), “On the unification of families of skew-normal distributions,” Scandinavian Journal of Statistics, 33, 561–574.

Azzalini, A., and Capitanio, A. (2014), The Skew-Normal and Related Families, Cambridge University Press.

Benavoli, A., Azzimonti, D., and Piga, D. (2020), “Skew Gaussian processes for classification,” arXiv preprint arXiv:2005.12987.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), “Variational inference: A review for statisticians,” Journal of the American Statistical Association, 112, 859–877.

Botev, Z. (2017), “The normal law under linear restrictions: simulation and estimation via minimax tilting,” Journal of the Royal Statistical Society: Series B, 79, 125–148.

Brezger, A., and Lang, S. (2006), “Generalized structured additive regression based on Bayesian P-splines,” Computational Statistics & Data Analysis, 50, 967–991.

Cao, J., Genton, M. G., Keyes, D. E., and Turkiyyah, G. M. (2019), “Hierarchical-block conditioning approximations for high-dimensional multivariate normal probabilities,” Statistics and Computing, 29, 585–598.

Cao, J., Genton, M. G., Keyes, D. E., and Turkiyyah, G. M. (2020), “Exploiting low rank covariance structures for computing high-dimensional normal and student-t probabilities,” arXiv preprint arXiv:2003.11183.

Chen, W., Castruccio, S., Genton, M. G., and Crippa, P. (2018), “Current and future estimates of wind energy potential over Saudi Arabia,” Journal of Geophysical Research: Atmospheres, 123, 6443–6459.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010), “BART: Bayesian additive regression trees,” The Annals of Applied Statistics, 4, 266–298.

Chopin, N. (2011), “Fast simulation of truncated Gaussian distributions,” Statistics and Computing, 21, 275–288.

Chopin, N., and Ridgway, J. (2017), “Leave Pima Indians alone: Binary regression as a benchmark for Bayesian computation,” Statistical Science, 32, 64–87.

Choudhuri, N., Ghosal, S., and Roy, A. (2007), “Nonparametric binary regression using a Gaussian process prior,” Statistical Methodology, 4, 227–243.

Chu, W., and Ghahramani, Z. (2005), “Gaussian processes for ordinal regression,” Journal of Machine Learning Research, 6, 1019–1041.

Consonni, G., and Marin, J.-M. (2007), “Mean-field variational approximate Bayesian inference for latent variable models,” Computational Statistics & Data Analysis, 52, 790–798.

De Oliveira, V. (2005), “Bayesian inference and prediction of Gaussian random fields based on censored data,” Journal of Computational and Graphical Statistics, 14, 95–115.

Durante, D. (2019), “Conjugate Bayes for probit regression via unified skew-normal distributions,” Biometrika, 106, 765–779.

Fasano, A., and Durante, D. (2020), “A class of conjugate priors for multinomial probit models which includes the multivariate normal one,” arXiv preprint arXiv:2007.06944.

Fasano, A., Durante, D., and Zanella, G. (2019), “Scalable and accurate variational Bayes for high-dimensional binary regression models,” arXiv preprint arXiv:1911.06743.

Genton, M. G., Keyes, D. E., and Turkiyyah, G. (2018), “Hierarchical decompositions for the computation of high-dimensional multivariate normal probabilities,” Journal of Computational and Graphical Statistics, 27, 268–277.

Genz, A. (1992), “Numerical computation of multivariate normal probabilities,” Journal of Computational and Graphical Statistics, 1, 141–149.

Giani, P., Tagle, F., Genton, M. G., Castruccio, S., and Crippa, P. (2020), “Closing the gap between wind energy targets and implementation for emerging countries,” Applied Energy, 269, 115085.

Girolami, M., and Rogers, S. (2006), “Variational Bayesian multinomial probit regression with Gaussian process priors,” Neural Computation, 18, 1790–1817.

Hoffman, M. D., and Gelman, A. (2014), “The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623.

Holmes, C. C., and Held, L. (2006), “Bayesian auxiliary variable models for binary and multinomial regression,” Bayesian Analysis, 1, 145–168.

Horrace, W. C. (2005), “Some results on the multivariate truncated normal distribution,” Journal of Multivariate Analysis, 94, 209–221.

Johndrow, J. E., Smith, A., Pillai, N., and Dunson, D. B. (2019), “MCMC for imbalanced categorical data,” Journal of the American Statistical Association, 114, 1394–1403.

Kullback, S., and Leibler, R. A. (1951), “On information and sufficiency,” The Annals of Mathematical Statistics, 22, 79–86.

Neal, R. (1999), “Regression and classification using Gaussian process priors,” Bayesian Statistics, 6, 475–501.

Nelder, J. A., and Wedderburn, R. W. (1972), “Generalized linear models,” Journal of the Royal Statistical Society: Series A, 135, 370–384.

Opper, M., and Winther, O. (2000), “Gaussian processes for classification: Mean-field algorithms,” Neural Computation, 12, 2655–2684.

Pakman, A., and Paninski, L. (2014), “Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians,” Journal of Computational and Graphical Statistics, 23, 518–542.

Rasmussen, C. E., and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, MIT Press.

Riihimäki, J., Jylänki, P., and Vehtari, A. (2013), “Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood,” Journal of Machine Learning Research, 14, 75–109.

Shaahid, S., Al-Hadhrami, L. M., and Rahman, M. (2014), “Potential of establishment of wind farms in western province of Saudi Arabia,” Energy Procedia, 52, 497–505.

Skamarock, W. C., Klemp, J. B., Dudhia, J., Gill, D. O., Barker, D. M., Duda, M. G., Huang, X.-Y., Wang, W., and Powers, J. G. (2008), “A description of the Advanced Research WRF version 3,” NCAR Technical Note, vol. 113, pp. 1–125.

Tagle, F., Castruccio, S., Crippa, P., and Genton, M. G. (2019), “A non-Gaussian spatio-temporal model for daily wind speeds based on a multivariate skew-t distribution,” Journal of Time Series Analysis, 40, 312–326.

Trinh, G., and Genz, A. (2015), “Bivariate conditioning approximations for multivariate normal probabilities,” Statistics and Computing, 25, 989–996.

Yip, C. M. A. (2018), “Statistical characteristics and mapping of near-surface and elevated wind resources in the Middle East,” Ph.D. thesis.