Scalable computation of predictive probabilities in probit models with Gaussian process priors
Jian Cao∗, Daniele Durante† and Marc G. Genton‡

∗ Statistics Program, King Abdullah University of Science and Technology, Saudi Arabia
† Department of Decision Sciences and Bocconi Institute for Data Science and Analytics, Bocconi University, Italy
‡ Statistics Program, King Abdullah University of Science and Technology, Saudi Arabia

Abstract
Predictive models for binary data are fundamental in various fields, and the growing complexity of modern applications has motivated several flexible specifications for modeling the relationship between the observed predictors and the binary responses. A widely–implemented solution expresses the probability parameter via a probit mapping of a Gaussian process indexed by predictors. However, unlike for continuous settings, there is a lack of closed–form results for predictive distributions in binary models with Gaussian process priors. Markov chain Monte Carlo methods and approximation strategies provide common solutions to this problem, but state–of–the–art algorithms are either computationally intractable or inaccurate in moderate–to–high dimensions. In this article, we aim to cover this gap by deriving closed–form expressions for the predictive probabilities in probit Gaussian processes that rely either on cumulative distribution functions of multivariate Gaussians or on functionals of multivariate truncated normals. To evaluate such quantities we develop novel scalable solutions based on tile–low–rank Monte Carlo methods for computing multivariate Gaussian probabilities, and on variational approximations of multivariate truncated normals. Closed–form expressions for marginal likelihoods and posterior distributions of the Gaussian process are also discussed. As illustrated in empirical studies, the proposed methods scale to dimensions where state–of–the–art solutions are impractical.
Keywords:
Binary data, Gaussian process, Multivariate truncated normal, Probit model, Unified skew–normal, Variational Bayes.
1 Introduction

There is a growing demand in various fields for flexible models that are able to accurately characterize complex relations among a vector of binary responses y = (y_1, . . . , y_n)⊤ and a set of predictors X = (x_1, . . . , x_n)⊤, with y_i ∈ {0, 1} and x_i = (x_{i1}, . . . , x_{iq})⊤ ∈ R^q, for each i = 1, . . . , n. Common solutions address this goal by replacing the linear predictor Xβ = (x_1⊤β, . . . , x_n⊤β)⊤ ∈ R^n in the generalized linear model for y (Nelder and Wedderburn, 1972) with a more flexible vector f(X) = [f(x_1), . . . , f(x_n)]⊤ ∈ R^n that accounts for complex non–linear relations between the response and the predictors, thus enhancing predictive power. Notable examples of this approach within the Bayesian setting define f(X) via additive trees (Chipman et al., 2010), Bayesian p–splines (Brezger and Lang, 2006) and Gaussian processes (gp) (Rasmussen and Williams, 2006), among others. Motivated by the success of gps for classification (e.g., Neal, 1999; Opper and Winther, 2000; De Oliveira, 2005; Chu and Ghahramani, 2005; Girolami and Rogers, 2006; Rasmussen and Williams, 2006; Choudhuri et al., 2007; Riihimäki et al., 2013), we focus on deriving improved methods to evaluate the predictive probabilities for the latter class of models under the probit link. In particular, we assume that the responses y_i, i = 1, . . . , n, are conditionally independent realizations from Bernoulli variables with probability parameters Φ[f(x_i)], i = 1, . . . , n, where Φ[f(x)] is the cumulative distribution function of a standard Gaussian evaluated at f(x), whereas f(x) is assigned a gp prior with mean function m(x) = E[f(x)] and covariance kernel K(x, x′) = E{[f(x) − m(x)][f(x′) − m(x′)]}. Exploiting standard gp properties (Rasmussen and Williams, 2006) and assuming no overlap in x_1, . . . , x_n, such assumptions lead to the model

p[y | f(X)] = ∏_{i=1}^n Φ[f(x_i)]^{y_i} {1 − Φ[f(x_i)]}^{1−y_i}, with p[f(X)] = φ_n[f(X) − ξ; Ω],   (1)

where φ_n[f(X) − ξ; Ω] denotes the density of a multivariate Gaussian distribution N_n(ξ, Ω) for f(X), with mean vector ξ = [m(x_1), . . . , m(x_n)]⊤ and n × n covariance matrix Ω having entries Ω_{[ii′]} = K(x_i, x_{i′}), for every i = 1, . . . , n and i′ = 1, . . . , n. Model (1) has attracted considerable interest due to its flexibility and its direct connection with binary discrete choice models based on Gaussian latent utilities z_i = f(x_i) + ε_i, with ε_i ∼ N(0, 1), i = 1, . . . , n (Albert and Chib, 1993). In fact, p[y_i = 1 | f(x_i)] = Φ[f(x_i)] = p[z_i > 0 | f(x_i)]. In such settings, a main goal of inference is to evaluate the predictive probabilities of new responses y_{n+1}, defined as

p(y_{n+1} = 1 | y) = 1 − p(y_{n+1} = 0 | y) = ∫ Φ[f(x_{n+1})] · [∫ p[f(x_{n+1}), f(X) | y] d f(X)] d f(x_{n+1}),   (2)

where p[f(x_{n+1}), f(X) | y] is the joint posterior density of [f(x_{n+1}), f(X)] under model (1), which seemingly is not available in closed form due to the apparent absence of conjugacy between the probit likelihood and the multivariate Gaussian prior for [f(x_{n+1}), f(X)] under model (1).
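Before reviewing existing approaches, it is useful to make the generative model concrete. The short Python sketch below is our own illustration (the function name, the jitter term and the default seed are not part of the paper); it simulates responses from model (1) under the zero mean function and squared exponential kernel adopted later in Section 3.

```python
import numpy as np
from scipy.stats import norm

def simulate_probit_gp(X, alpha=30.0, jitter=1e-8, seed=0):
    """Draw (f(X), y) from model (1) with m(x) = 0 and the squared exponential
    kernel K(x, x') = exp(-alpha * ||x - x'||^2) used in Section 3."""
    rng = np.random.default_rng(seed)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    Omega = np.exp(-alpha * d2) + jitter * np.eye(len(X))      # prior covariance of f(X)
    f = rng.multivariate_normal(np.zeros(len(X)), Omega)       # f(X) ~ N_n(0, Omega)
    y = rng.binomial(1, norm.cdf(f))                           # y_i ~ Bernoulli(Phi[f(x_i)])
    return f, y
```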
This issue has motivated extensive research to compute predictive probabilities in probit models with multivariate Gaussian priors, either via Monte Carlo methods relying on samples from p[f(x_{n+1}), f(X) | y] (Neal, 1999; Albert and Chib, 1993; De Oliveira, 2005; Holmes and Held, 2006; Choudhuri et al., 2007; Pakman and Paninski, 2014; Hoffman and Gelman, 2014; Durante, 2019) or by deriving tractable approximations of p[f(x_{n+1}), f(X) | y] (Consonni and Marin, 2007; Girolami and Rogers, 2006; Chu and Ghahramani, 2005; Rasmussen and Williams, 2006; Riihimäki et al., 2013) that allow simple evaluation of (2). These methods provide state–of–the–art solutions in small–to–moderate dimensional settings, but tend to become rapidly inaccurate or computationally impractical in higher dimensions (Chopin and Ridgway, 2017; Johndrow et al., 2019; Durante, 2019; Fasano et al., 2019). This issue is inherent to probit gps where, by definition, there are p ≈ n parameters in f(X), with the sample size n being typically large in most studies.

In this article we aim to cover the above gap by providing novel closed–form expressions for the predictive probabilities in probit gps, along with improved methods to evaluate the involved quantities in high dimensions. More specifically, in Section 2.1 we first derive a closed–form expression for the marginal likelihood p(y) under model (1), and then exploit this result to show that p(y_{n+1} = 1 | y) can be expressed as the ratio between cumulative distribution functions of multivariate Gaussians with dimensions n + 1 and n, respectively. To overcome the known issues associated with the evaluation of these two quantities in high dimensions (Chopin, 2011; Botev, 2017; Cao et al., 2019, 2020), we introduce an error–reduction technique for computing ratios of Gaussian cumulative distribution functions that builds on the tile–low–rank method in Cao et al. (2020) and substantially reduces the computational time of state–of–the–art strategies such as minimax tilting methods (Botev, 2017) and Hamiltonian Monte Carlo samplers (stan) (Hoffman and Gelman, 2014), without affecting accuracy. To further improve the scalability of the methods presented in Section 2.1, we show in Section 2.2 that p(y_{n+1} = 1 | y) also has an alternative representation based on functionals of multivariate truncated normals, and we address the intractability of such variables in high dimensions by proposing a variational approximation based on univariate truncated normals, which allows accurate and efficient evaluation of predictive probabilities in high dimensions, thus scaling to settings where available strategies are unfeasible. Such results are also related to the conditional distribution of the gp given the binary responses, which we show to coincide with a unified skew–normal (sun) (Arellano-Valle and Azzalini, 2006) by adapting recent results in Durante (2019) on classical Bayesian probit regression. The magnitude of the improvements provided by the new methods presented in Sections 2.1–2.2 relative to state–of–the–art competitors is illustrated in simulations in Section 3 and in an environmental application to Saudi Arabia windspeed in Section 4. Section 5 contains concluding remarks, whereas all the proofs can be found in the Appendix.

2 Computation of Predictive Probabilities in Probit Gaussian Processes
In Sections 2.1 and 2.2 we present novel expressions for the predictive probabilities in probit Gaussian processes, along with improved methods to efficiently evaluate the involved quantities in high dimensions.
2.1 Prediction via ratios of multivariate Gaussian cumulative distribution functions

To introduce the closed–form expression for p(y_{n+1} = 1 | y) based on ratios of multivariate Gaussian cumulative distribution functions, first notice that, by leveraging known properties of Gaussian variables, the probit likelihood in (1) can be re–expressed as p[y | f(X)] = ∏_{i=1}^n Φ[f(x_i)]^{y_i} {1 − Φ[f(x_i)]}^{1−y_i} = ∏_{i=1}^n Φ[(2y_i − 1) f(x_i)] = Φ_n[D f(X); I_n], where D = diag[(2y_1 − 1), . . . , (2y_n − 1)] and Φ_n[D f(X); I_n] is the cumulative distribution function of a zero–mean n–variate Gaussian with identity covariance matrix I_n, evaluated at D f(X). Leveraging this form and adapting Lemma 7.1 in Azzalini and Capitanio (2014) to our setting, we can easily express the marginal likelihood under model (1) as

p(y) = ∫ Φ_n[D f(X); I_n] φ_n[f(X) − ξ; Ω] d f(X) = Φ_n(Dξ; I_n + DΩD⊤).   (3)

Equation (3) provides a closed–form expression that can be useful to estimate fixed parameters in the gp kernel via maximization of p(y) and, as shown in Proposition 1, is also key to obtaining closed–form expressions for p(y_{n+1} = 1 | y).
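For small n, equation (3) can be evaluated directly with off-the-shelf quasi–Monte Carlo routines for Gaussian cumulative distribution functions. A minimal Python sketch (ours, with an assumed function name; scipy's multivariate cdf internally relies on a Genz-type algorithm of the kind discussed below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(y, xi, Omega):
    """Equation (3): p(y) = Phi_n(D xi; I_n + D Omega D^T); feasible for small n only."""
    D = np.diag(2.0 * y - 1.0)                      # D = diag(2y_1 - 1, ..., 2y_n - 1)
    cov = np.eye(len(y)) + D @ Omega @ D.T          # I_n + D Omega D^T
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logcdf(D @ xi)
```

Kernel parameters such as α can then be estimated by maximizing this quantity over a grid of candidate values, as done in Sections 3–4.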
Proposition 1. Under (1), the predictive probability for a new response y_{n+1} ∈ {0, 1} with predictor x_{n+1} ∈ R^q is

p(y_{n+1} = 1 | y) = 1 − p(y_{n+1} = 0 | y) = Φ_{n+1}(D*ξ*; I_{n+1} + D*Ω*D*⊤) / Φ_n(Dξ; I_n + DΩD⊤),   (4)

where D* = diag[(2y_1 − 1), . . . , (2y_n − 1), 1], ξ* = [ξ⊤, m(x_{n+1})]⊤, and Ω* is obtained by adding one additional row and column to Ω, defined as Ω*⊤_{[n+1,·]} = Ω*_{[·,n+1]} = [K(x_{n+1}, x_1), . . . , K(x_{n+1}, x_n), K(x_{n+1}, x_{n+1})]⊤.

To prove Proposition 1, it is sufficient to notice that, by the Bayes rule, p(y_{n+1} = 1 | y) = p(y_{n+1} = 1, y)/p(y), where p(y_{n+1} = 1, y) and p(y) are the marginal likelihoods of (y_{n+1} = 1, y) and y, respectively, under model (1). Replacing such quantities with their closed–form expression in (3) leads to (4). See the Appendix for a more detailed proof, which also includes additional clarifications on equation (3).

Evaluation of (4) requires the calculation of cumulative distribution functions of multivariate Gaussians, which is known to be a challenging task in high dimensions (e.g., Genz, 1992; Chopin, 2011; Botev, 2017; Genton et al., 2018; Cao et al., 2019, 2020). Recent advances via minimax tilting methods (Botev, 2017) allow accurate evaluation of such quantities, but face an increased computational cost which makes such strategies rapidly impractical as n grows. A more scalable solution can be found in the separation–of–variables (sov) algorithm originally introduced by Genz (1992) and subsequently improved in terms of scalability by Cao et al. (2020). Such a routine decomposes the generic multivariate Gaussian probability Φ_n(a, b; Σ) as

Φ_n(a, b; Σ) = ∫_a^b φ_n(u; Σ) d u = (e_1 − d_1) ∫ (e_2 − d_2) · · · ∫ (e_n − d_n) ∫ d w = E_w[(e_1 − d_1) · · · (e_n − d_n)],   (5)

with d_i = Φ({a_i − Σ_{j=1}^{i−1} l_{ij} Φ^{−1}[d_j + w_j(e_j − d_j)]}/l_{ii}) and e_i = Φ({b_i − Σ_{j=1}^{i−1} l_{ij} Φ^{−1}[d_j + w_j(e_j − d_j)]}/l_{ii}) for each i = 1, . . . , n, where l_{ij} is the (ij)-th coefficient of the lower Cholesky factor of Σ, and w = (w_1, . . . , w_{n−1})⊤ denotes a vector with uniform entries w_j ∼ U(0, 1), j = 1, . . . , n − 1. Such a decomposition transforms the integration region into the unit hypercube, thus allowing the evaluation of Φ_n(a, b; Σ) via functionals of uniform densities. To further improve the quality of the above estimator, more recent implementations (Trinh and Genz, 2015) combine (5) with a univariate reordering preconditioner that rearranges the integration variables and produces the corresponding Cholesky factor simultaneously, at the same O(n³) cost of the Cholesky factorization. This ordering strategy processes the integration variables from left to right iteratively and, at each step, it switches the current integration variable with the one having the narrowest conditional integration limits on its right side. Positioning the integration variables with narrower integration limits to the left is shown in Trinh and Genz (2015) and Cao et al. (2020) to improve the Monte Carlo convergence rate of (5), whose integrand is evaluated R times, with R the Monte Carlo sample size, each evaluation having a cost of O(n²). Such costs allow the implementation of this strategy in settings where n is at most in the order of a few thousands. To scale further, Cao et al. (2020) introduced a tile–low–rank representation of Σ that reduces the cost of the sov algorithm by substituting the dense matrix–vector multiplication with the low–rank matrix–vector multiplication. A compatible block–reordering is also introduced in place of the univariate reordering to improve the convergence rate at the same cost as the low–rank Cholesky factorization. Specifically, the block–reordering orders integration variables at the block level based on crude estimates of the block–wise marginal probabilities. Both the block–reordering and the tile–low–rank version of the sov algorithm reach their optimal complexity of O(n^{3/2}) when the block size in the tile–low–rank representation is n^{1/2}, thus reducing the computational complexity of the classical sov algorithm by a factor of n^{1/2} and allowing implementation in tens of thousands of dimensions.

Although the above techniques can be effectively implemented to evaluate multivariate Gaussian probabilities as in (3), the calculation of ratios of such quantities as in (4) has typically high accuracy requirements. Unfortunately, as discussed in Botev (2017) and Cao et al. (2020), the estimation errors of tail multivariate Gaussian probabilities, which also include the cumulative distribution function, can be as large as the probability estimates themselves when n is in the hundreds to thousands, thus producing unreliable ratio estimates. To address this issue, we propose an error–reduction technique that avoids computing the numerator and the denominator in (4) separately, and instead combines their evaluation under the tile–low–rank representation. Indeed, as is clear from Proposition 1, the denominator in (4) is the marginalization of the numerator over the last integration variable. Therefore, keeping the general notation of the sov algorithm and leveraging equation (5), expression (4) can be re–written in the general form

Φ_{n+1}(a, b; Σ) / Φ_n(a_{−(n+1)}, b_{−(n+1)}; Σ_{−(n+1)}) = E_w[(e_1 − d_1) · · · (e_n − d_n)(e_{n+1} − d_{n+1})] / E_{w_{−n}}[(e_1 − d_1) · · · (e_n − d_n)],   (6)

where e_i and d_i are defined as in equation (5) for i = 1, . . . , n + 1, whereas a_{−(n+1)}, b_{−(n+1)} and w_{−n} are obtained by removing the (n + 1)-th entry in a and b, and the n-th entry in w, respectively. Similarly, Σ_{−(n+1)} coincides with Σ without the (n + 1)-th row and column. As is clear from (6), the quantities (e_1 − d_1), . . . , (e_n − d_n) are the same deterministic functions of w both in the numerator and in the denominator, and hence using the same set of Monte Carlo samples w in the n–dimensional hypercube for estimating the two expectations can significantly reduce the estimation error of their ratio. In particular, our proposed ratio estimator is

p̂(y_{n+1} = 1 | y) = Σ_{r=1}^R {[e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n})] · · · [e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n})] · [e_{n+1}(w^{(r)}) − d_{n+1}(w^{(r)})]} / Σ_{r=1}^R {[e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n})] · · · [e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n})]},   (7)

where the generic e_i(w^{(r)}) and d_i(w^{(r)}) denote the values of e_i and d_i in (5) evaluated at the Monte Carlo sample w^{(r)} of w. This estimator is asymptotically unbiased because the numerator and the denominator converge to E_w[(e_1 − d_1) · · · (e_n − d_n)(e_{n+1} − d_{n+1})] and E_{w_{−n}}[(e_1 − d_1) · · · (e_n − d_n)], respectively, and hence equation (7) converges to (6) in probability. Moreover, equation (7) is guaranteed to lie in (0, 1). This is not the case when the numerator and the denominator in (4) are estimated separately. Indeed, as discussed in Botev (2017) and Cao et al. (2020), when n is high the estimation errors of the two cumulative distribution functions in (4) are often as large as the estimates themselves, thus producing estimated ratios possibly outside of the range (0, 1) and with high variance.
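To make the shared–sample idea behind (6)–(7) concrete, the Python sketch below implements the plain sov recursion with a dense Cholesky factor and lower limits a = −∞ (so that every d_i = 0), reusing the first n terms of each sample for both the numerator and the denominator. It is our own simplified stand-in for Algorithm 1 below: it omits the tile–low–rank representation and the block reordering that make the actual proposal scalable, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import cholesky

def sov_ratio(b, Sigma, R=10_000, seed=0):
    """Shared-sample estimator (7) of Phi_{n+1}(b; Sigma) / Phi_n(b_{-(n+1)}; Sigma_{-(n+1)})
    with lower limits -infty, so that d_i = 0 in (5). Here len(b) = n + 1."""
    rng = np.random.default_rng(seed)
    m = len(b)                          # m = n + 1
    L = cholesky(Sigma, lower=True)     # its leading n x n block is the factor of Sigma_{-(n+1)}
    num = den = 0.0
    for _ in range(R):
        w = rng.uniform(size=m - 1)
        v = np.zeros(m - 1)             # v_j = Phi^{-1}[w_j e_j], as stored in step [d.1]
        prod = 1.0
        for i in range(m):
            e = norm.cdf((b[i] - L[i, :i] @ v[:i]) / L[i, i])
            if i < m - 1:
                v[i] = norm.ppf(np.clip(w[i] * e, 1e-16, 1.0 - 1e-16))
            prod *= e
            if i == m - 2:
                den += prod             # product of the first n terms: denominator of (7)
        num += prod                     # product of all n + 1 terms: numerator of (7)
    return num / den

# Usage for (4): b = D_star @ xi_star; Sigma = np.eye(n + 1) + D_star @ Omega_star @ D_star.T
# p_hat = sov_ratio(b, Sigma)
```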
The pseudo–code for evaluating (4) via the estimator outlined in (7) is provided in Algorithm 1.

Algorithm 1: Compute (4) via the estimator (7)
[a] Define a = −∞, b = D*ξ*, Σ = I_{n+1} + D*Ω*D*⊤, a_{−(n+1)} = −∞, b_{−(n+1)} = Dξ, Σ_{−(n+1)} = I_n + DΩD⊤, and let w^{(1)}, . . . , w^{(R)} denote uniform samples from the n–dimensional unit hypercube.
[b] Apply block reordering (Cao et al., 2020) to (a_{−(n+1)}, b_{−(n+1)}, Σ_{−(n+1)}), which produces the tile–low–rank Cholesky factor L after variable reordering, and reorder a_{−(n+1)} and b_{−(n+1)} accordingly.
[c] Compute the last row L_{[n+1,1:n+1]} of the Cholesky factor of Σ using Σ and L.
[d] Obtain the quantities required to evaluate equation (7): for r = 1, . . . , R do
  [d.1] Compute the differences e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n}), . . . , e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n}) by applying the tile–low–rank variant of equation (5) (Cao et al., 2020) to (a_{−(n+1)}, b_{−(n+1)}, L). Store also the vector v^{(r)} = [Φ^{−1}{d_1(w^{(r)}_{−n}) + w^{(r)}_1[e_1(w^{(r)}_{−n}) − d_1(w^{(r)}_{−n})]}, . . . , Φ^{−1}{d_n(w^{(r)}_{−n}) + w^{(r)}_n[e_n(w^{(r)}_{−n}) − d_n(w^{(r)}_{−n})]}]⊤.
  [d.2] Set e_{n+1}(w^{(r)}) − d_{n+1}(w^{(r)}) = Φ[(b_{n+1} − L_{[n+1,1:n]} v^{(r)})/l_{n+1,n+1}] − Φ[(a_{n+1} − L_{[n+1,1:n]} v^{(r)})/l_{n+1,n+1}].
[e] Estimate (4) via Monte Carlo as in (7) using the quantities computed in step [d].

In step [b], the block reordering produces a new variable order that is used to reorder the integration limits a and b, whereas in step [c] the inverse matrices of the diagonal blocks of L computed in step [b] are recycled to maximize efficiency. Also the quantities in step [d.1] do not need to be re–evaluated every time a new prediction is required, since they only depend on the observed training data, and hence such quantities can be pre–computed and stored separately.

2.2 Prediction via functionals of multivariate truncated normals

The methods presented in Section 2.1 allow substantial improvements in terms of accuracy and scalability in the evaluation of the predictive probabilities under probit gps. However, the cost of the tile–low–rank version of the sov algorithm might still be too expensive in high–dimensional settings with a large n. To further reduce computational times, we derive an alternative expression for p(y_{n+1} = 1 | y) relying on functionals of multivariate truncated normals, which are then approximated via mean–field variational Bayes (e.g., Blei et al., 2017) to facilitate simple evaluation of p(y_{n+1} = 1 | y) using Monte Carlo samples from univariate truncated normals.

To derive this alternative expression, we shall first notice that the joint posterior p[f(x_{n+1}), f(X) | y] in (2) can be factorized as p[f(x_{n+1}) | f(X)] · p[f(X) | y], provided that f(x_{n+1}) does not appear in the likelihood for y, which is true because there is no overlap among predictors. Exploiting the well–known properties of gps (Rasmussen and Williams, 2006), the first factor p[f(x_{n+1}) | f(X)] in the above expression can be easily derived by applying the closure under conditioning property of multivariate Gaussians, thus obtaining the univariate normal density

p[f(x_{n+1}) | f(X)] = φ(f(x_{n+1}) − [m(x_{n+1}) + Ω*_{[n+1,1:n]} Ω^{−1}(f(X) − ξ)]; K(x_{n+1}, x_{n+1}) − Ω*_{[n+1,1:n]} Ω^{−1} Ω*_{[1:n,n+1]})
 = φ(f(x_{n+1}) − [μ_{x_{n+1}} + H_{x_{n+1}} f(X)]; σ²_{x_{n+1}}),   (8)

where H_{x_{n+1}} = Ω*_{[n+1,1:n]} Ω^{−1}, μ_{x_{n+1}} = m(x_{n+1}) − H_{x_{n+1}} ξ and σ²_{x_{n+1}} = K(x_{n+1}, x_{n+1}) − Ω*_{[n+1,1:n]} Ω^{−1} Ω*_{[1:n,n+1]}, whereas the other quantities are defined as in equations (1) and (4). Adapting the recent conjugacy results for probit models with Gaussian priors in Durante (2019) to the gp setting, it is also possible to express p[f(X) | y] as the density of the unified skew–normal (sun) (Arellano-Valle and Azzalini, 2006) SUN_{n,n}[ξ, Ω, Ω̄ωD⊤s^{−1}, s^{−1}Dξ, s^{−1}(DΩD⊤ + I_n)s^{−1}], with s = [(DΩD⊤ + I_n) ⊙ I_n]^{1/2}, Ω̄ = ω^{−1}Ωω^{−1} and ω = (Ω ⊙ I_n)^{1/2}. Indeed, recalling the results discussed in Sections 1–2.1 and applying the Bayes rule, we have that p[f(X) | y] ∝ p[f(X)] · p[y | f(X)] = φ_n[f(X) − ξ; Ω] · Φ_n[D f(X); I_n], which coincides with the kernel of a sun density, as discussed in the proof of Theorem 1 in Durante (2019). Such a class of random variables introduces asymmetric shapes in Gaussian densities via a skewness–inducing mechanism driven by the cumulative distribution function of an n–variate Gaussian with a full–rank covariance matrix.
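In code, the quantities in (8) amount to a few linear solves. A small sketch (function and argument names ours), assuming Omega is the n × n training kernel matrix, k_star the vector [K(x_{n+1}, x_1), . . . , K(x_{n+1}, x_n)] and k_ss = K(x_{n+1}, x_{n+1}):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_moments(Omega, k_star, k_ss, xi, m_star=0.0):
    """H, mu and sigma^2 of p[f(x_{n+1}) | f(X)] in equation (8)."""
    H = cho_solve(cho_factor(Omega, lower=True), k_star)  # H^T = Omega^{-1} k_star
    mu = m_star - H @ xi                                  # mu_{x_{n+1}} = m(x_{n+1}) - H xi
    sigma2 = k_ss - H @ k_star                            # sigma^2_{x_{n+1}}
    return mu, H, sigma2
```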
Hence, the evaluation of p[f(X) | y] still requires the calculation of multivariate Gaussian probabilities, leading to the same issues discussed in Section 2.1; see Arellano-Valle and Azzalini (2006), Azzalini and Capitanio (2014) and Durante (2019) for an in–depth discussion of the properties of sun variables for posterior inference.

A possibility to address the above issue is to leverage the discrete–choice interpretation of the probit gp introduced in Section 1. Under this alternative representation, model (1) can be equivalently re–expressed as y_i = 1(z_i > 0), with [z_i | f(x_i)] ∼ N[f(x_i), 1], i = 1, . . . , n, and f(X) = [f(x_1), . . . , f(x_n)]⊤ ∼ N_n(ξ, Ω). Adapting the results in Holmes and Held (2006) to our gp setting, the joint posterior p[f(X), z | y] of f(X) and the augmented data z = (z_1, . . . , z_n)⊤ factorizes as p[f(X) | z] · p(z | y), with

p[f(X) | z] = φ_n[f(X) − (Ω^{−1} + I_n)^{−1}(Ω^{−1}ξ + z); (Ω^{−1} + I_n)^{−1}] = φ_n(f(X) − [μ_X + Σ_X z]; Σ_X),
p(z | y) ∝ φ_n[z − ξ; I_n + Ω] ∏_{i=1}^n 1[(2y_i − 1)z_i > 0] = φ_n[z − ξ; Σ_z] ∏_{i=1}^n 1[(2y_i − 1)z_i > 0],   (9)

where Σ_X = (Ω^{−1} + I_n)^{−1}, μ_X = Σ_X Ω^{−1} ξ and Σ_z = I_n + Ω. Therefore, the joint posterior density p[f(X) | z] · p(z | y) factorizes as the product of a Gaussian for p[f(X) | z] and a multivariate truncated normal for p(z | y), obtained via component–wise truncation of N_n(ξ, Σ_z) above or below 0, depending on whether y_i = 1 or y_i = 0, respectively, for each i = 1, . . . , n. As shown in Proposition 2, by combining equations (8)–(9) with Lemma 7.1 in Azzalini and Capitanio (2014), it is possible to obtain an alternative expression for p(y_{n+1} = 1 | y) based on functionals of multivariate truncated normals. See the Appendix for a detailed proof.
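The parameters of the factorization in (9) involve only simple matrix identities; a brief sketch (ours), which also uses Σ_X Ω^{−1} = (I_n + Ω)^{−1} to avoid explicitly inverting Ω:

```python
import numpy as np

def augmented_factorization(Omega, xi):
    """Sigma_X, mu_X and Sigma_z of the factorization p[f(X)|z] p(z|y) in equation (9)."""
    n = len(xi)
    Sigma_z = np.eye(n) + Omega                 # covariance of the truncated normal p(z|y)
    Sigma_X = np.linalg.solve(Sigma_z, Omega)   # (Omega^{-1} + I_n)^{-1} = (I_n + Omega)^{-1} Omega
    mu_X = np.linalg.solve(Sigma_z, xi)         # Sigma_X Omega^{-1} xi = (I_n + Omega)^{-1} xi
    return Sigma_X, mu_X, Sigma_z
```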
Proposition 2. Under (1), the predictive probability for a new response y_{n+1} ∈ {0, 1} with predictor x_{n+1} ∈ R^q is

p(y_{n+1} = 1 | y) = 1 − p(y_{n+1} = 0 | y) = E_{z|y}[E_{f(X)|z}(E_{f(x_{n+1})|f(X)}{Φ[f(x_{n+1})]})]
 = E_{z|y}[E_{f(X)|z}[Φ(μ_{x_{n+1}} + H_{x_{n+1}} f(X); 1 + σ²_{x_{n+1}})]]
 = E_{z|y}[Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}})],   (10)

where the different quantities in (10) are defined as in equations (8) and (9), whereas E_{z|y}[·] denotes the expectation with respect to the multivariate truncated normal density p(z | y) in (9).

Leveraging Proposition 2, it is possible to evaluate p(y_{n+1} = 1 | y) via Monte Carlo methods based on independent samples from the multivariate truncated normal with density as in (9), thus producing the estimate p̂(y_{n+1} = 1 | y) = 1 − p̂(y_{n+1} = 0 | y) = Σ_{r=1}^R Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z^{(r)}]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}})/R, where z^{(1)}, . . . , z^{(R)} denote independent and identically distributed samples from p(z | y). Unfortunately, sampling from multivariate truncated normals in settings where n is larger than a few hundreds raises the same computational issues discussed in Section 2.1, i.e., the evaluation of multivariate Gaussian cumulative distribution functions (Holmes and Held, 2006; Botev, 2017; Pakman and Paninski, 2014; Durante, 2019; Fasano et al., 2019).

To address the above issue, we propose to replace the intractable sampling density p(z | y) with a mean–field variational approximation q*(z) = ∏_{i=1}^n q*(z_i) that factorizes over its marginals q*(z_1), . . . , q*(z_n). In this way, the Monte Carlo estimate for p(y_{n+1} = 1 | y) can be obtained by sampling R times from the n independent univariate approximate densities q*(z_1), . . . , q*(z_n) instead of the exact but intractable joint density p(z | y). Recalling the classical mean–field variational Bayes framework (e.g., Blei et al., 2017), the optimal approximating density q*(z) is the one that minimizes the Kullback–Leibler (kl) divergence kl[q(z) ∥ p(z | y)] = E_{q(z)}{log[q(z)/p(z | y)]} (Kullback and Leibler, 1951) to p(z | y) among all the densities in the mean–field variational family Q = {q(z) : q(z) = ∏_{i=1}^n q(z_i)}. The solution of such a minimization problem is typically not available in closed form, but can be obtained via coordinate ascent variational inference (cavi) algorithms (Bishop, 2006; Blei et al., 2017) that iteratively minimize the kl divergence with respect to one component q(z_i) at a time, keeping the others fixed at their most recent estimate q^{(t−1)}(z_{−i}) = [q^{(t)}(z_1), . . . , q^{(t)}(z_{i−1}), q^{(t−1)}(z_{i+1}), . . . , q^{(t−1)}(z_n)]. Recalling Bishop (2006), this is accomplished via the updates

q^{(t)}(z_i) ∝ exp{E_{q^{(t−1)}(z_{−i})}[log p(z_i | z_{−i}, y)]}, for each i = 1, . . . , n,   (11)

at iteration t, until convergence. In (11), the quantity p(z_i | z_{−i}, y) denotes the full conditional density of z_i. Due to the closure under conditioning property of the multivariate truncated normal (Horrace, 2005), such a quantity can be derived explicitly from p(z | y) in (9) and coincides with the density of a univariate truncated normal. In particular, we can express p(z_i | z_{−i}, y) as

p(z_i | z_{−i}, y) ∝ φ{z_i − [ξ_i + H_{z_i}(z_{−i} − ξ_{−i})]; σ²_{z_i}} 1[(2y_i − 1)z_i > 0],   (12)

where ξ_{−i} denotes the vector ξ without the i-th entry, H_{z_i} = Σ_{z[i,−i]} Σ^{−1}_{z[−i,−i]}, and σ²_{z_i} = Σ_{z[i,i]} − Σ_{z[i,−i]} Σ^{−1}_{z[−i,−i]} Σ_{z[−i,i]}. The density in (12) has an exponential kernel which is linear in z_{−i} and, hence, replacing the expression for p(z_i | z_{−i}, y) in the cavi updates reported in (11), it follows that q^{(t)}(z_i) also has a univariate truncated normal density as in (12), with z_{−i} replaced by z̄^{(t−1)}_{−i} = [E_{q^{(t)}(z_1)}(z_1), . . . , E_{q^{(t)}(z_{i−1})}(z_{i−1}), E_{q^{(t−1)}(z_{i+1})}(z_{i+1}), . . . , E_{q^{(t−1)}(z_n)}(z_n)]⊤. Each term in z̄^{(t−1)}_{−i} is the expected value of a univariate truncated normal, which is available in closed form, thus producing a simple cavi relying on closed–form updates, as outlined in Algorithm 2.

Algorithm 2: Compute (10) via Monte Carlo based on a mean–field approximation of p(z | y)
[a] Initialize z̄^{(0)} = [0, . . . , 0]⊤.
[b] Apply the cavi algorithm to obtain the optimal mean–field approximation q*(z) = ∏_{i=1}^n q*(z_i) for p(z | y):
  for t = 1 until convergence do
    for i = 1, . . . , n do
      Set the univariate truncated normal approximating density q^{(t)}(z_i) for z_i at step t equal to q^{(t)}(z_i) ∝ φ{z_i − [ξ_i + H_{z_i}(z̄^{(t−1)}_{−i} − ξ_{−i})]; σ²_{z_i}} 1[(2y_i − 1)z_i > 0], with z̄^{(t−1)}_{−i} = [E_{q^{(t)}(z_1)}(z_1), . . . , E_{q^{(t)}(z_{i−1})}(z_{i−1}), E_{q^{(t−1)}(z_{i+1})}(z_{i+1}), . . . , E_{q^{(t−1)}(z_n)}(z_n)]⊤.
  Output: q*(z) = ∏_{i=1}^n q*(z_i), where each q*(z_i) is a univariate truncated normal.
[c] Estimate (10) via Monte Carlo as in (13).

Once the optimal univariate truncated normal approximating densities q*(z_1), . . . , q*(z_n) are available, equation (10) can be easily evaluated via Monte Carlo by letting

p̂(y_{n+1} = 1 | y) = (1/R) Σ_{r=1}^R Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z*^{(r)}]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}}),   (13)

with z*^{(r)} = [z*^{(r)}_1, . . . , z*^{(r)}_n]⊤, for r = 1, . . . , R, where each z*^{(r)}_i can be efficiently sampled from the corresponding univariate truncated normal approximating density q*(z_i), independently for i = 1, . . . , n and r = 1, . . . , R. Unlike in the multivariate case, sampling from independent univariate truncated normals can be done effectively in standard statistical software, thus avoiding issues in large n settings. As outlined in the simulation studies in Section 3, such an approximate strategy massively reduces computational times without affecting accuracy.
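The cavi updates and the final estimator lend themselves to a compact implementation. The Python sketch below is our own illustration of Algorithm 2 and of estimator (13): function names, the convergence tolerance, and the use of the precision matrix of Σ_z to read off the full–conditional parameters in (12) are our choices, not part of the paper's R implementation.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def cavi_meanfield(y, xi, Sigma_z, max_iter=100, tol=1e-10):
    """Algorithm 2, steps [a]-[b]: optimal univariate truncated normal factors q*(z_i).
    Returns location mu_i and scale sig_i of each factor; the support is (0, inf)
    when y_i = 1 and (-inf, 0) when y_i = 0."""
    n, s = len(y), 2.0 * y - 1.0
    Q = np.linalg.inv(Sigma_z)               # precision of N_n(xi, Sigma_z)
    sig = 1.0 / np.sqrt(np.diag(Q))          # conditional sds sigma_{z_i} in (12)
    mu, zbar = xi.astype(float), np.zeros(n) # z^{(0)} = 0 as in step [a]
    for _ in range(max_iter):
        old = zbar.copy()
        for i in range(n):
            r = Q[i] @ (zbar - xi) - Q[i, i] * (zbar[i] - xi[i])
            mu[i] = xi[i] - sig[i] ** 2 * r  # full-conditional location in (12) at zbar_{-i}
            h = mu[i] / sig[i]               # closed-form truncated normal mean
            zbar[i] = mu[i] + s[i] * sig[i] * np.exp(norm.logpdf(h) - norm.logcdf(s[i] * h))
        if np.max(np.abs(zbar - old)) < tol:
            break
    return mu, sig

def predict_vb(y, mu, sig, mu_X, Sigma_X, mu_star, H, sigma2_star, R=20_000, seed=0):
    """Estimator (13), drawing z*^{(r)} from the independent factors q*(z_i)."""
    rng = np.random.default_rng(seed)
    s = 2.0 * y - 1.0
    lo = np.where(s > 0, -mu / sig, -np.inf)  # standardized truncation limits
    hi = np.where(s > 0, np.inf, -mu / sig)
    Z = truncnorm.rvs(lo, hi, loc=mu, scale=sig, size=(R, len(y)), random_state=rng)
    arg = mu_star + (mu_X + Z @ Sigma_X) @ H  # Sigma_X is symmetric
    return norm.cdf(arg / np.sqrt(1.0 + sigma2_star + H @ Sigma_X @ H)).mean()
```

In this sketch, each coordinate update costs O(n) after the one-off inversion of Σ_z, so a full cavi sweep is O(n²).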
3 Simulation Studies

In this section, we study the gains in accuracy and computational scalability of the methods developed in Section 2 relative to state–of–the–art competitors, which include Monte Carlo inference under the widely–used stan implementation (see R package rstan) of the Hamiltonian no–u–turn sampler (Hoffman and Gelman, 2014), and minimax tilting methods (see R package TruncatedNormal) (Botev, 2017) to evaluate the multivariate Gaussian cumulative distribution functions involved in the predictive probability expressed in equation (4).

To evaluate performance in high–dimensional settings, we generate the binary responses on the 100 × 100 unit grid G = {x = (x_1, x_2) : x_1 ∈ (1/100, 2/100, . . . , 100/100), x_2 ∈ (1/100, 2/100, . . . , 100/100)} with equally–spaced predictors, thus obtaining n = 10,000 non–overlapping configurations. At these locations, we simulate the responses y_1, . . . , y_{10,000} from independent Bernoulli distributions with probability parameters Φ[f(x_1)], . . . , Φ[f(x_{10,000})] displayed in Figure 1, where f(X) = [f(x_1), . . . , f(x_{10,000})]⊤ is a sample from a gp with mean function m(x) = 0 and squared exponential covariance kernel K(x, x′) = exp(−α · ∥x − x′∥²), where α = 30. Such a kernel is frequently used in machine learning, thus we focus on this choice for the simulation study.

Figure 1: Simulated probabilities Φ[f(x)] on the 100 × 100 grid G in the unit square, where f(x) is a zero mean gp with squared exponential covariance kernel. White circles denote the 100 unknown locations distributed randomly (left) and on a grid (right), used for prediction.

To assess performance in estimating the predictive probabilities, we also simulate the probability parameters and the associated binary responses for 100 out–of–sample units under two scenarios. As outlined in Figure 1, the first one relies on randomly distributed locations, whereas the second focuses on a grid structure. We also compare performance in lower–dimensional problems with n ∈ {250, 625, 2,500}, obtained by selecting sub–grids of G with equally spaced configurations between 0 and 1, along with their associated probability parameters and simulated binary responses.

Table 1 summarizes the performance in terms of accuracy and computational scalability of the different methods analyzed, at varying n and under the two different scenarios considered for prediction. In reporting the results, we set a conservative computational budget of one day and display the performance measures only for those routines with a running time below this computational budget. Monte Carlo inference via stan (Hoffman and Gelman, 2014) relies on the rstan package applied to model (1) in order to obtain posterior samples from f(X), which are then used to compute the predictive probabilities at the unknown locations via ordinary kriging. Such evaluations rely on 10,000 mcmc samples after a burn–in of 10,000, setting α = 30. In evaluating the performance of minimax tilting (tn) (Botev, 2017), we compute the numerator and the denominator in (4) separately via the R package TruncatedNormal, using the default settings.

Table 1: Computational efficiency and predictive performance, at varying sample size n, of stan (Hoffman and Gelman, 2014), tn (Botev, 2017), tlr (Section 2.1) and vb (Section 2.2), when the units to be predicted are distributed either randomly [random] or on a grid [grid]. time: cost in seconds to compute 100 predictive probabilities. mse: mean squared error between the estimated predictive probabilities and the true ones. auc: area under the roc curve when predicting the out–of–sample binary responses with the estimated predictive probabilities.

Method   Performance measure    n = 250    n = 625    n = 2,500    n = 10,000
stan     time [seconds]
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
tn       time [seconds]            349       2,339          —             —
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
tlr      time [seconds]            108         402       2,509        18,709
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
vb       time [seconds]
         mse [random]
         mse [grid]
         auc [random]
         auc [grid]
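The two accuracy measures in Table 1 are straightforward to compute once the predictive probabilities are available; a short sketch (ours), using scikit-learn for the roc area:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def table1_measures(p_hat, p_true, y_out):
    """mse and auc as defined in the caption of Table 1."""
    return np.mean((p_hat - p_true) ** 2), roc_auc_score(y_out, p_hat)
```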
Since the denominator in (4) is shared among all predictions, it is evaluated only once, so that the total time of TruncatedNormal is close to that of computing one multivariate Gaussian cumulative distribution function. Equation (4) is also evaluated under the novel methods proposed in Section 2.1 (tlr) and summarized in Algorithm 1, which can be implemented via simple adaptations of the R package tlrmvnmvt (Cao et al., 2020). In implementing such a routine, we set the block size to n^{1/2}, a small truncation level for the low–rank approximation, and R = 20,000 Monte Carlo samples. For both tn and tlr, we avoid setting α equal to the true value 30, but instead estimate such a quantity by applying the R packages TruncatedNormal and tlrmvnmvt to maximize, with respect to α, the marginal likelihood in (3) on a grid of 60 equally spaced α values between 15 and 45. Results are comparable, although TruncatedNormal provides slightly less noisy estimates of (3) than tlrmvnmvt, at the cost of a substantially higher running time. The estimate of α provided by tlrmvnmvt is also used in the implementation of the novel variational strategy (vb) presented in Section 2.2 and summarized in Algorithm 2. Also in this case we consider R = 20,000 Monte Carlo samples to evaluate (10). Such values are generated from the optimal univariate truncated normal approximating densities produced by the cavi in Algorithm 2, which can be implemented via minor adaptations of the source code in the GitHub repository
Probit-PFMVB (Fasano et al., 2019). All computations were run on a 2.5 GHz Intel Core i7 CPU workstation, without multithreading.

As clarified in Table 1, the methods proposed in Sections 2.1–2.2 notably reduce the running times relative to state–of–the–art competitors, thus making prediction under probit gps computationally feasible in those high–dimensional settings that arise commonly in various applications. This is especially true for the vb solution proposed in Section 2.2, which is orders of magnitude faster than its competitors. Such a notable reduction in running times under tlr and vb is crucially obtained at almost no cost in terms of accuracy in out–of–sample prediction (see auc) and in the estimation of the predictive probabilities (see mse), when compared to competitors relying on mcmc samples from the exact posterior (stan) or on exact evaluation of multivariate Gaussian cumulative distribution functions (tn).

4 Application to Saudi Arabia Windspeed

We conclude by applying the methods developed in Sections 2.1 and 2.2 to a real–world environmental application aimed at modeling whether the local windspeed exceeds a pre–specified working threshold for energy production in a given region of interest in Saudi Arabia. Wind turbines for generating electricity typically have two windspeed thresholds, of which the lower controls when the blades of the turbine start to be in motion and the higher indicates if the turbine should be switched off to avoid strong
wind damage. Here, the binary response y_i is defined as whether the windspeed at the i-th location exceeds the lower threshold, thus allowing production of wind power, which is referred to as the working threshold of wind turbines. This important application is motivated by the growing domestic energy consumption in Saudi Arabia and by the attempt to reduce the reliance on fossil fuels, thereby leading to an increasing interest in renewable energy sources, including wind (Shaahid et al., 2014; Chen et al., 2018; Tagle et al., 2019; Giani et al., 2020). The effective exploitation of such resources and the careful management of the energy stations require careful modeling and prediction, at a fine spatial resolution, of whether the local windspeed exceeds a given threshold for energy production. As we will discuss in the following, such a fine grid of observations commonly produces a sample size around tens of thousands of units. This makes state–of–the–art algorithms for probit gps computationally unfeasible in such studies, thus motivating our scalable solutions presented in Sections 2.1 and 2.2.

Figure 2: Heatmaps representing the windspeed at 140 meters high (left) and a binary version y of this measure defining whether the local windspeed is sufficiently high for energy production (dark gray: yes; light gray: no) based on the 4 m/s threshold (right) on January 21st, 2014. The red area denotes the spatial region that is used for modeling and prediction.

The windspeed dataset considered in this article is produced by the Weather Research and Forecasting (wrf) model (Yip, 2018), which constructs the weather system via partial differential equations on the mesoscale and demands strong computation capacity to serve meteorological applications (Skamarock et al., 2008). The time resolution of our data is daily, and we use the windspeed over the region of north–west Saudi Arabia on January 21st, 2014 for modeling and out–of–sample prediction. Such a region covers the wind farm at Dumat Al Jandal, which is the first wind farm in Saudi Arabia
and currently under construction, as well as the future smart city of NEOM, a strategic component of the Saudi 2030 Vision, where wind power is expected to be a key energy source. Moreover, the windspeed on January 21st, 2014 has high variability across this region, which makes the out–of–sample prediction task more challenging.

Figure 3: For the spatial region that is used for modeling and prediction, heatmaps defining whether the local windspeed is sufficiently high for energy production (dark gray: yes; light gray: no) based on the 4 m/s threshold on January 21st, 2014. Black circles denote the 100 unknown locations distributed randomly (left) and on a grid (right), used for prediction.

As shown in Figures 2 and 3, the region under analysis is obtained by intersecting the Saudi Arabia territorial map with a longitude–latitude rectangle covering the north–west of the country. Within this region we consider a fine grid of n = 9,
036 equally–spaced locations x_i = (x_{i1}, x_{i2})⊤ = (long_i, lat_i)⊤ at which we monitor whether the windspeed is either above (y_i = 1) or below (y_i = 0) the working threshold of wind turbines, for each i = 1, . . . , 9,036. We rely on model (1) with a gp having zero mean function and squared exponential covariance kernel K(x, x′) = exp(−α · ∥x − x′∥²), where α is estimated via maximization of the marginal likelihood in (3), evaluated via the tlrmvnmvt package on a grid of candidate values. The resulting estimate of α is close to 17, which corresponds to an effective range of 0.42. This relatively short effective range is consistent with the abrupt changes of the binary responses. Recalling the results in Table 1, the calculation of the predictive probabilities is only performed under the methods presented in Sections 2.1 (tlr) and 2.2 (vb), since stan and tn would be computationally impractical in such a high–dimensional setting with n = 9,036. For both methods we set α = α̂ = 17 and consider the same settings as in the simulation study in Section 3, thus obtaining running times that are comparable to the simulation scenario with n = 10,000. In particular, tlr and vb require about 168 and 35 seconds per prediction, respectively, thus confirming the feasibility of such novel methods in high–dimensional settings. We shall also emphasize that these running times are mostly affected by the matrix pre–computation operations in Algorithms 1–2, and hence could be carefully reduced via sparse matrix representations or careful algebraic operations. Out–of–sample predictive performance measured via the auc is similarly accurate under both methods, with a slightly increased improvement provided by vb. In particular, the auc for the random scenario is 0.968 for tlr and 0.971 for vb, whereas in the grid setting such a measure is 0.881 for tlr and 0.906 for vb.

5 Conclusion

This article provides novel expressions for the predictive probabilities under probit models with gp priors, relying either on multivariate Gaussian cumulative distribution functions or on functionals of multivariate truncated normals, and proposes scalable computational strategies to evaluate such quantities in common high–dimensional settings, thus covering an important gap in the literature. As highlighted in the simulation studies in Section 3, these computational gains are notable and do not sacrifice accuracy, thereby allowing tractable prediction under probit gps in applications that were previously computationally impractical, such as the Saudi Arabia windspeed study in Section 4.

The above results open up several avenues for future research. For instance, the methods in Section 2 can be adapted to any probit model with a multivariate Gaussian prior for the linear predictor. These include classical Bayesian probit regression, multivariate probit models and general additive representations relying on basis expansions. Extensions to categorical responses under a multinomial probit gp model, or to more general priors such as unified skew–normals, can also be explored by leveraging results in Durante (2019), Fasano and Durante (2020) and Benavoli et al. (2020).

Acknowledgments
This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2018-CRG7-3742.
Appendix: Proof of Theoretical Results
To prove Propositions 1 and 2, let us first state the following lemma.
Lemma 1 (Lemma 7.1 in Azzalini and Capitanio (2014)). If U ∼ N_p(0, Σ), then E[Φ_q(H⊤U + k; Ψ)] = Φ_q(k; Ψ + H⊤ΣH), for any choice of the vector k ∈ R^q, the p × q matrix H and the q × q symmetric positive–definite matrix Ψ.

Combining the closure under conditioning property of multivariate Gaussians with the above result, whose proof can be found in Azzalini and Capitanio (2014), the proofs of Propositions 1 and 2 can be obtained via the simple derivations described below.
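As a quick numerical sanity check (ours, not part of the proofs), Lemma 1 can be verified by comparing a Monte Carlo average of Φ_q(H⊤U + k; Ψ) against the closed form:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
p, q, R = 3, 2, 2_000
A = rng.standard_normal((p, p)); Sigma = A @ A.T + p * np.eye(p)
B = rng.standard_normal((q, q)); Psi = B @ B.T + q * np.eye(q)
H, k = rng.standard_normal((p, q)), rng.standard_normal(q)

U = rng.multivariate_normal(np.zeros(p), Sigma, size=R)
phi_q = mvn(np.zeros(q), Psi)
lhs = np.mean([phi_q.cdf(H.T @ u + k) for u in U])       # E[Phi_q(H^T U + k; Psi)]
rhs = mvn(np.zeros(q), Psi + H.T @ Sigma @ H).cdf(k)     # Phi_q(k; Psi + H^T Sigma H)
print(round(lhs, 3), round(rhs, 3))                      # agree up to Monte Carlo error
```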
Proof of Proposition 1.
To prove Proposition 1, first notice that, by simple application of the Bayes rule, p(y_{n+1} = 1 | y) = p(y_{n+1} = 1, y)/p(y). Hence, it suffices to show that p(y_{n+1} = 1, y) = Φ_{n+1}(D*ξ*; I_{n+1} + D*Ω*D*⊤) and p(y) = Φ_n(Dξ; I_n + DΩD⊤). Recalling our discussion in Section 2.1, p(y) is the marginal likelihood for the observed data and can be expressed as p(y) = ∫ Φ_n[D f(X); I_n] φ_n[f(X) − ξ; Ω] d f(X) = E{Φ_n[D(f(X) − ξ) + Dξ; I_n]}, where (f(X) − ξ) ∼ N_n(0, Ω). Hence, by applying Lemma 1 to this expectation, we obtain E{Φ_n[D(f(X) − ξ) + Dξ; I_n]} = Φ_n(Dξ; I_n + DΩD⊤). Such a result also clarifies equation (3). The proof of the equation p(y_{n+1} = 1, y) = Φ_{n+1}(D*ξ*; I_{n+1} + D*Ω*D*⊤) proceeds in a similar manner, after noticing that p(y_{n+1} = 1, y) = ∫ (Φ[f(x_{n+1})] · Φ_n[D f(X); I_n]) φ_{n+1}[f*(X) − ξ*; Ω*] d f*(X) = ∫ Φ_{n+1}[D* f*(X); I_{n+1}] φ_{n+1}[f*(X) − ξ*; Ω*] d f*(X) = E{Φ_{n+1}[D*(f*(X) − ξ*) + D*ξ*; I_{n+1}]}, where f*(X) = [f(X)⊤, f(x_{n+1})]⊤ ∼ N_{n+1}(ξ*, Ω*), with ξ*, Ω* and D* defined as in Proposition 1.

Proof of Proposition 2.
Recalling the results discussed in Section 2.2, the predictive probability p(y_{n+1} = 1 | y) can be defined as E_{[z,f(X),f(x_{n+1})]|y}{Φ[f(x_{n+1})]}, with the joint conditional density p[z, f(X), f(x_{n+1}) | y] factorizing as p[f(x_{n+1}) | f(X)] · p[f(X) | z] · p(z | y). Hence, by the law of total expectation, we have that p(y_{n+1} = 1 | y) = E_{z|y}[E_{f(X)|z}(E_{f(x_{n+1})|f(X)}{Φ[f(x_{n+1})]})]. Since, by equation (8), [f(x_{n+1}) | f(X)] ∼ N(μ_{x_{n+1}} + H_{x_{n+1}} f(X), σ²_{x_{n+1}}), we can apply Lemma 1 to obtain E_{f(x_{n+1})|f(X)}{Φ[f(x_{n+1})]} = Φ(μ_{x_{n+1}} + H_{x_{n+1}} f(X); 1 + σ²_{x_{n+1}}). To conclude the proof, note that, by equation (9), also [f(X) | z] ∼ N_n(μ_X + Σ_X z, Σ_X). Therefore, applying again Lemma 1 leads to E_{f(X)|z}[Φ(μ_{x_{n+1}} + H_{x_{n+1}} f(X); 1 + σ²_{x_{n+1}})] = Φ(μ_{x_{n+1}} + H_{x_{n+1}}[μ_X + Σ_X z]; 1 + σ²_{x_{n+1}} + H_{x_{n+1}} Σ_X H⊤_{x_{n+1}}).

References

Albert, J. H., and Chib, S. (1993), “Bayesian analysis of binary and polychotomous response data,”
Journal of the American Statistical Association, 88, 669–679.

Arellano-Valle, R. B., and Azzalini, A. (2006), “On the unification of families of skew-normal distributions,” Scandinavian Journal of Statistics, 33, 561–574.

Azzalini, A., and Capitanio, A. (2014), The Skew-Normal and Related Families, Cambridge University Press.

Benavoli, A., Azzimonti, D., and Piga, D. (2020), “Skew Gaussian processes for classification,” arXiv preprint arXiv:2005.12987.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), “Variational inference: A review for statisticians,” Journal of the American Statistical Association, 112, 859–877.

Botev, Z. (2017), “The normal law under linear restrictions: simulation and estimation via minimax tilting,” Journal of the Royal Statistical Society: Series B, 79, 125–148.

Brezger, A., and Lang, S. (2006), “Generalized structured additive regression based on Bayesian P-splines,” Computational Statistics & Data Analysis, 50, 967–991.

Cao, J., Genton, M. G., Keyes, D. E., and Turkiyyah, G. M. (2019), “Hierarchical-block conditioning approximations for high-dimensional multivariate normal probabilities,” Statistics and Computing, 29, 585–598.

Cao, J., Genton, M. G., Keyes, D. E., and Turkiyyah, G. M. (2020), “Exploiting low rank covariance structures for computing high-dimensional normal and student-t probabilities,” arXiv preprint arXiv:2003.11183.

Chen, W., Castruccio, S., Genton, M. G., and Crippa, P. (2018), “Current and future estimates of wind energy potential over Saudi Arabia,” Journal of Geophysical Research: Atmospheres, 123, 6443–6459.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010), “BART: Bayesian additive regression trees,” The Annals of Applied Statistics, 4, 266–298.

Chopin, N. (2011), “Fast simulation of truncated Gaussian distributions,” Statistics and Computing, 21, 275–288.

Chopin, N., and Ridgway, J. (2017), “Leave Pima Indians alone: Binary regression as a benchmark for Bayesian computation,” Statistical Science, 32, 64–87.

Choudhuri, N., Ghosal, S., and Roy, A. (2007), “Nonparametric binary regression using a Gaussian process prior,” Statistical Methodology, 4, 227–243.

Chu, W., and Ghahramani, Z. (2005), “Gaussian processes for ordinal regression,” Journal of Machine Learning Research, 6, 1019–1041.

Consonni, G., and Marin, J.-M. (2007), “Mean-field variational approximate Bayesian inference for latent variable models,” Computational Statistics & Data Analysis, 52, 790–798.

De Oliveira, V. (2005), “Bayesian inference and prediction of Gaussian random fields based on censored data,” Journal of Computational and Graphical Statistics, 14, 95–115.

Durante, D. (2019), “Conjugate Bayes for probit regression via unified skew-normal distributions,” Biometrika, 106, 765–779.

Fasano, A., and Durante, D. (2020), “A class of conjugate priors for multinomial probit models which includes the multivariate normal one,” arXiv preprint arXiv:2007.06944.

Fasano, A., Durante, D., and Zanella, G. (2019), “Scalable and accurate variational Bayes for high-dimensional binary regression models,” arXiv preprint arXiv:1911.06743.

Genton, M. G., Keyes, D. E., and Turkiyyah, G. (2018), “Hierarchical decompositions for the computation of high-dimensional multivariate normal probabilities,” Journal of Computational and Graphical Statistics, 27, 268–277.

Genz, A. (1992), “Numerical computation of multivariate normal probabilities,” Journal of Computational and Graphical Statistics, 1, 141–149.

Giani, P., Tagle, F., Genton, M. G., Castruccio, S., and Crippa, P. (2020), “Closing the gap between wind energy targets and implementation for emerging countries,” Applied Energy, 269, 115085.

Girolami, M., and Rogers, S. (2006), “Variational Bayesian multinomial probit regression with Gaussian process priors,” Neural Computation, 18, 1790–1817.

Hoffman, M. D., and Gelman, A. (2014), “The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623.

Holmes, C. C., and Held, L. (2006), “Bayesian auxiliary variable models for binary and multinomial regression,” Bayesian Analysis, 1, 145–168.

Horrace, W. C. (2005), “Some results on the multivariate truncated normal distribution,” Journal of Multivariate Analysis, 94, 209–221.

Johndrow, J. E., Smith, A., Pillai, N., and Dunson, D. B. (2019), “MCMC for imbalanced categorical data,” Journal of the American Statistical Association, 114, 1394–1403.

Kullback, S., and Leibler, R. A. (1951), “On information and sufficiency,” The Annals of Mathematical Statistics, 22, 79–86.

Neal, R. (1999), “Regression and classification using Gaussian process priors,” Bayesian Statistics, 6, 475–501.

Nelder, J. A., and Wedderburn, R. W. (1972), “Generalized linear models,” Journal of the Royal Statistical Society: Series A, 135, 370–384.

Opper, M., and Winther, O. (2000), “Gaussian processes for classification: Mean-field algorithms,” Neural Computation, 12, 2655–2684.

Pakman, A., and Paninski, L. (2014), “Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians,” Journal of Computational and Graphical Statistics, 23, 518–542.

Rasmussen, C. E., and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, MIT Press.

Riihimäki, J., Jylänki, P., and Vehtari, A. (2013), “Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood,” Journal of Machine Learning Research, 14, 75–109.

Shaahid, S., Al-Hadhrami, L. M., and Rahman, M. (2014), “Potential of establishment of wind farms in western province of Saudi Arabia,” Energy Procedia, 52, 497–505.

Skamarock, W. C., Klemp, J. B., Dudhia, J., Gill, D. O., Barker, D. M., Duda, M. G., Huang, X.-Y., Wang, W., and Powers, J. G. (2008), “A description of the Advanced Research WRF version 3,” NCAR Technical Note, vol. 113, pp. 1–125.

Tagle, F., Castruccio, S., Crippa, P., and Genton, M. G. (2019), “A non-Gaussian spatio-temporal model for daily wind speeds based on a multivariate skew-t distribution,” Journal of Time Series Analysis, 40, 312–326.

Trinh, G., and Genz, A. (2015), “Bivariate conditioning approximations for multivariate normal probabilities,” Statistics and Computing, 25, 989–996.

Yip, C. M. A. (2018), “Statistical characteristics and mapping of near-surface and elevated wind resources in the Middle East,” Ph.D. thesis.