Bayesian Non-parametric Quantile Process Regression and Estimation of Marginal Quantile Effects
Steven G. Xu and Brian J. Reich

Abstract
We propose a non-parametric method to simultaneously estimate non-crossing, non-linear quantile curves. We expand the conditional distribution function of the response in I-spline basis functions, where the coefficients are further modeled as functions of the covariates using feed-forward neural networks. By leveraging the approximation power of splines and neural networks, our model can approximate any continuous quantile function. Compared to existing methods, our method estimates all rather than a finite subset of quantiles, scales well to high dimensions, and accounts for estimation uncertainty. While the model is arbitrarily flexible, interpretable marginal quantile effects are estimated using accumulated local effect plots and variable importance measures. A simulation study shows that, compared to existing methods, our model can better recover quantiles of the response distribution when the sample size is small, and illustrative applications to birth weight and tropical cyclone intensity are presented.

Key words:
Density regression; Hamiltonian Monte Carlo; Interpretable machine learning; Simultaneous quantile estimation.

Department of Statistics, North Carolina State University, Raleigh, NC, 27695, USA

1 Introduction
Quantile regression (QR) models the statistical relationship between conditional quantiles of the response distribution and a set of covariates. It allows one to analyze the statistical relationship between covariates and non-central parts of the conditional response distribution. QR has been widely applied in studies where the research question involves modeling the tails of the conditional response distribution. Examples include modeling the 0.05 conditional quantile of the birth weight distribution to understand the determining factors of underweight newborns (Abrevaya, 2001) and modeling the 0.99 conditional quantile of the wind speed distribution to study the behavior of tropical cyclones that may cause major damage (Elsner et al., 2008).

Modeling a single quantile, however, does not reach the full potential of QR, which lies in the joint description of a set of conditional quantiles. That is, one can model multiple conditional quantiles simultaneously to investigate the functional relationship between covariate effects on the response and the quantile level. When used appropriately, multiple QR is a powerful data description tool that offers a comprehensive characterization of covariate effects on the conditional response distribution.

However, many QR methods, including the very first proposed by Koenker and Bassett Jr (1978), were designed to estimate quantile-dependent parameters for a single quantile. When these methods are used naively to estimate multiple quantiles, the natural ordering among the individually estimated quantiles is not enforced, which may lead to estimated quantile curves that cross. For example, the estimated median of the response conditioned on some combination of the covariates might be greater than an estimated upper percentile. Quantile crossing violates basic probabilistic rules, since any valid conditional quantile function should be monotonically non-decreasing in the quantile level.
It can also lead to difficult inference and interpretation, especially when separately estimated quantile curves are used to construct prediction intervals.

Many approaches have been proposed to overcome the quantile-crossing problem. Liu and Wu (2009) and Muggeo et al. (2013) proposed to estimate the quantiles sequentially under the constraint that the current quantile curve does not cross the previous one, but their estimation is sensitive to the order in which the quantiles are fitted. Chernozhukov et al. (2010) and Rodrigues and Fan (2017) suggested post-hoc rearrangement of multiple quantile estimates to enforce non-crossing, but the adjusted conditional quantile function still depends highly on the individual estimates, which lack strength borrowing across different regression quantiles.

Simultaneous estimation of multiple quantiles has received increasing attention in the QR literature. Unlike sequential estimation and post-processing, it estimates all desired quantiles at the same time under non-crossing constraints. Fitting multiple quantiles simultaneously is advantageous since the information used to estimate an individual quantile is shared with the estimation of all other quantiles. This can improve overall model performance, especially when the available data is sparse. A large literature exists on estimating non-crossing quantile curves simultaneously under a linear regression setting (Bondell et al., 2010; Liu and Wu, 2011; Reich et al., 2011; Reich, 2012; Reich and Smith, 2013; Yuan et al., 2017; Yang and Tokdar, 2017). These approaches enjoy great interpretability by allowing rate-of-change interpretation of quantile-dependent coefficients, but one drawback is that they cannot accommodate non-crossing quantile curves with complex non-linear trends.
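As a concrete illustration of the post-hoc rearrangement idea of Chernozhukov et al. (2010), the sketch below sorts separately estimated quantiles at each covariate value so that the resulting curves cannot cross. The two toy quantile curves are hypothetical, not fitted to any data.

```python
import numpy as np

# Two hypothetical, separately fitted linear quantile curves that cross:
# the tau = 0.5 curve overtakes the tau = 0.9 curve for large x.
def q50(x):
    return 1.0 + 2.0 * x

def q90(x):
    return 2.0 + 0.5 * x

x_grid = np.linspace(0.0, 2.0, 21)
# Rows correspond to quantile levels (0.5, 0.9); columns to covariate values.
q_hat = np.vstack([q50(x_grid), q90(x_grid)])

# Detect crossing: valid estimates must be non-decreasing down each
# column, i.e., in the quantile level.
crossed = bool(np.any(np.diff(q_hat, axis=0) < 0))

# Monotone rearrangement: sort the estimates across quantile levels at
# each x, which restores the ordering without refitting either curve.
q_sorted = np.sort(q_hat, axis=0)
```

The rearranged curves are non-crossing by construction, but each column is still built from the two individual estimates, which is why rearrangement does not borrow strength across quantile levels.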
Furthermore, they are not suitable for high-dimensional problems since they do not implicitly account for complex interaction effects.

To date, only a few works have considered modeling non-linear quantile curves that do not cross. Cannon (2018) proposed to model the quantile process using a neural network. He treated the quantile level as an additional covariate and imposed partial monotonicity constraints on its associated weight parameters to enforce non-crossing. Although this elegantly alleviates quantile crossing, one consequence is that quantile curves for quantile levels outside the specified range have to be estimated via extrapolation. Das and Ghosal (2018) modeled the quantile process as a weighted sum of B-spline basis functions of the quantile level, where the weights are further expanded by tensor products of B-spline series expansions of each covariate; order constraints were imposed on the spline coefficients of each covariate to ensure non-crossing. Although their method estimates the whole quantile process, it has limited model coverage due to the simplified order constraint they chose. Furthermore, tensor products of basis functions are known to not scale well to high dimensions, since the number of parameters grows exponentially with the number of covariates. A non-linear quantile process can also be estimated by inverting (or integrating and then inverting) any valid estimate of the conditional distribution function. Das and Ghosal (2018) proposed a model for the conditional cumulative distribution function (CDF) of similar form to their aforementioned quantile process model. Consequently, their CDF model also suffers from low model coverage and computational intractability in high dimensions. Izbicki and Lee (2016) projected the conditional probability density function (PDF) onto data-dependent eigenfunctions of a kernel-based operator. Their model scales well to high dimensions and also enjoys great model coverage.
However, the resulting quantile curves are often not smooth.

In this paper, we propose to simultaneously estimate non-crossing, non-linear quantile curves by specifying a Bayesian non-parametric model on the conditional distribution. While there exist other conditional distribution regression models (e.g., Holmes et al., 2012; Rothfuss et al., 2019; Li et al., 2021), we model the conditional CDF using an I-spline basis expansion in which the spline coefficients depend on the covariates through neural networks. We choose to model the distribution function instead of the quantile process because the former permits analytic derivation of the likelihood function, and therefore efficient MCMC sampling of the posterior, and for a non-parametric regression the span of potential models is the same in both cases. The spline-based model ensures that the estimated conditional CDF, and therefore the estimated conditional quantile function, is smooth, and the neural network components allow incorporation of complex covariate effects on the response distribution. A suitable output activation function is chosen for the neural network components such that the model represents a valid CDF. We name this method “QR Using I-spline Neural Network (QUINN)”. QUINN has a similar form to the density model of Das and Ghosal (2018) but comes with many improvements. Compared to their B-spline basis expansion, the proposed I-spline basis expansion reduces the order constraints on the spline coefficients to only a unit simplex constraint. Under this constraint, QUINN can approximate any continuous CDF.
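To make the simplex-constrained construction concrete, the sketch below builds a valid CDF on [0, 1] as a simplex-weighted sum of monotone basis functions and inverts it numerically to obtain non-crossing quantiles. For simplicity it uses the Bernstein special case (I-splines with no interior knots, computed here as binomial tail probabilities), and the weights are fixed by hand; in QUINN the weights would instead be produced by a covariate-dependent neural network.

```python
import math
import numpy as np

def bernstein_cdf_basis(z, n):
    """Monotone basis functions I_j(z) = P(Binomial(n, z) >= j), j = 1..n.
    Each I_j increases from 0 at z = 0 to 1 at z = 1."""
    z = np.asarray(z, dtype=float)
    basis = np.empty((len(z), n))
    for j in range(1, n + 1):
        tail = sum(math.comb(n, k) * z**k * (1 - z)**(n - k)
                   for k in range(j, n + 1))
        basis[:, j - 1] = tail
    return basis

n_basis = 6
w = np.array([0.05, 0.10, 0.30, 0.30, 0.15, 0.10])  # simplex weights (sum to 1)

grid = np.linspace(0.0, 1.0, 501)
F = bernstein_cdf_basis(grid, n_basis) @ w  # valid CDF values on the grid

# Invert the CDF numerically to obtain the quantile function at chosen levels.
taus = np.array([0.05, 0.5, 0.95])
quantiles = np.interp(taus, F, grid)
```

Because each basis function rises monotonically from 0 to 1, any simplex-weighted combination is automatically a non-decreasing CDF, so the resulting quantiles cannot cross by construction.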
Furthermore, by replacing tensor products of basis functions with neural networks, QUINN enjoys great scalability to high-dimensional problems: the number of parameters grows only linearly in the number of covariates.

One concern about estimating the quantile process by inverting a non-parametric estimator of the conditional cumulative distribution function is that covariate effects on different quantiles can be difficult to interpret. In fact, this is a natural drawback of many black-box supervised learning models that sacrifice interpretability for flexibility. In many real-world applications, understanding the covariate effects on the predicted response is of paramount importance. In the univariate case, a visual characterization of the covariate effect can be obtained by simply plotting the fitted regression line against the covariate. However, black-box models such as neural networks are more suitable for multivariate regression problems, where an accurate visualization and interpretation of each covariate’s main effect is hard to obtain. To overcome this challenge, model-agnostic methods have been developed (Friedman, 2001; Ribeiro et al., 2016; Goldstein et al., 2015) to extract interpretation from black-box models. These methods have great advantages over model-specific methods, e.g., the Gini importance for random forests (Breiman, 2001), as they can be applied to any supervised learning method. Recently, Apley and Zhu (2020) proposed accumulated local effect (ALE) plots to visualize main and second-order interaction effects of a black-box supervised learning model. Their method produces a reliable characterization of the covariate effects on the predicted response even if the covariates are correlated. In this paper, we will show that ALE plots can be applied to visualize covariate effects on predicted quantiles. We will also present ways to estimate the importance of main and second-order interaction effects.
Denote X ∈ 𝒳 ⊂ R^d as the covariate vector and Y ∈ R as the scalar response. We are interested in approximating the quantile process of the response given the covariates, Q_Y(τ | X = x), for quantile level τ ∈ (0, 1) and for all x ∈ 𝒳. If Q_Y(τ | x) is continuous and monotonically increasing in τ, then for any x ∈ 𝒳 the conditional CDF is F_Y(y | x) = Q_Y^{-1}(y | x). Thus, the monotonicity constraint on Q_Y(τ | x) can be naturally accounted for by specifying a valid model on F_Y(y | x) and then inverting it. Our method requires the response variable to have a lower and upper bound. This can be achieved by introducing a transformed response variable Z = g(Y) via a monotonic bijection g : [L_Y, U_Y] → [0, 1], where L_Y and U_Y are the bounds of Y; we consider both finite and infinite values for these bounds. In this section, we will outline our method for approximating the quantile process of the transformed response, Q_Z(τ | x), whose estimate can then be back-transformed to an estimate of Q_Y(τ | x).

I-spline basis expansion

We model the CDF of Z given x using an I-spline basis expansion. An I-spline basis expansion is defined by its support z ∈ [L, U], degree r, and knot sequence T = {t_0, t_1, ..., t_p | L = t_0 ≤ t_1 ≤ ... ≤ t_p = U}.

When plotted against x_j, the main-effect ALE ¯Q_j(τ, x_j) quantifies the difference between the average prediction conditioned on X_j = x_j and the average prediction over X. For example, ¯Q_1(τ, 1) = −2 means that E_X[Q(τ, X) | X_1 = 1] = E_X[Q(τ, X)] − 2. When plotted against x_j and x_l, the second-order ALE ¯Q_{jl}(τ, x_j, x_l) quantifies the difference between the average prediction conditioned on (X_j, X_l) = (x_j, x_l) and the average prediction over X. For example, ¯Q_{1,2}(τ, −1, 1) = 2 means that E_X[Q(τ, X) | X_1 = −1, X_2 = 1] = E_X[Q(τ, X)] + 2. The interaction ALE ¯Q^I_{jl}(τ, x_j, x_l) can be interpreted analogously to ¯Q_{jl}(τ, x_j, x_l), except that the difference is now contributed entirely by the interaction effect.

It is also useful to summarize the main and interaction ALEs with a one-number summary that can be used to rank the importance of each effect. Following Greenwell et al.
(2018), we propose to measure the overall variable importance (VI) of continuous covariates using the standard deviation of the ALE with respect to the marginal distribution of X, i.e.,

VI_j(τ) = SD[ ¯Q_j(τ, X_j) ] and VI_{jl}(τ) = SD[ ¯Q^I_{jl}(τ, X_j, X_l) ];

for categorical covariates, the standard deviation is replaced by one fourth of the range. These VI scores (and the intermediate functions ¯Q_j, ¯Q_{jl}, and ¯Q^I_{jl}) can be approximated using the partitioning schemes of Apley and Zhu (2020), as described in Appendix B.

Although for notational simplicity we have omitted the dependence of the quantile function on the parameters W, in practice the posterior uncertainty in W leads to posterior uncertainty in sensitivity metrics such as ¯Q_j(τ) and VI_j(τ). We account for this uncertainty by computing the sensitivity measures for many MCMC samples from the posterior distribution of W, giving a Monte Carlo approximation of the posterior distribution of the sensitivity measures.

4 Simulation study

4.1 Univariate and bivariate X

We first investigate performance for three cases with univariate and bivariate X:

Design 1. The covariate X is generated from U(0, 1). The response variable Y is given by

Y = X + sin(2X) + 3ε,

where ε follows the skew-normal distribution with location parameter 0, scale parameter 1, and shape parameter 4. The resulting quantile curves are parallel, and the data are highly right-skewed.

Design 2. The covariate X is generated from U(0, 1). The response variable Y is given by

Y = 3X + [0.5X + sin(3πX + 1)]ε,

where ε follows the normal distribution with mean 0 and standard deviation 1. The resulting quantile curves are linear at the median but have strong curvature at the extremes.

Design 3. The covariate vector X = (X_1, X_2) is generated from U(0, 1) × U(0, 1).
The response variable Y is given by

Y = sin(2πX_1) + cos(2πX_2) + sqrt(X_1 + X_2) ε,

where ε follows T_3, the Student's t distribution with three degrees of freedom. The data exhibit strong heteroscedasticity and heavy-tailedness.

For each design, we generate samples of sizes n ∈ {50, 100, 200}. Quantile curves (surfaces) at nineteen equally spaced quantiles τ ∈ {0.05, 0.10, ..., 0.95} are fitted to the data.

To assess the performance of the proposed method in recovering a non-linear quantile process from simulated data, we compare our method to four non-parametric non-crossing QR methods: the monotone composite QR neural network (MCQRNN) of Cannon (2018), the non-parametric simultaneous QR (NPSQR) of Das and Ghosal (2018), the non-parametric distribution function simultaneous QR (NPDFSQR) also of Das and Ghosal (2018), and the spectral series conditional density estimator (seriesCDE) of Izbicki and Lee (2016). MCQRNN is implemented in the qrnn package in R; code for NPSQR and NPDFSQR is available from the second author's webpage; and code for seriesCDE is available from the supplemental material of the online paper. Implementation details, including tuning and model selection, are given in Appendix C. For QUINN, we run NUTS for 2000 iterations and discard the first 1000 iterations as burn-in.

To compare the methods, 100 data sets were simulated for each design. For each sample size, the overall performance of each method is measured by the root mean integrated squared error (RMISE) between the actual, Q_Y(τ | x), and estimated (posterior mean), Q̂_Y(τ | x), quantile processes. To calculate the RMISE, we first divide the domain of each dimension of X by g equidistant grid points, giving the G = g^d vectors x̃_1, ..., x̃_G that span the range of X. For Designs 1 and 2, we set g = G = 101, and for Design 3 we set g = 41 and thus G = 41^2 = 1681.
The RMISE is then approximated as

RMISE(τ_k) = sqrt( (1/G) Σ_{i=1}^{G} { Q_Y(τ_k | x̃_i) − Q̂_Y(τ_k | x̃_i) }^2 )

for quantile level τ_k ∈ {0.05, 0.10, ..., 0.95}, and

RMISE_QP = sqrt( (1/19) Σ_{k=1}^{19} RMISE(τ_k)^2 )

for the entire quantile process.

The average RMISE_QP over the 100 simulated data sets for each method, along with standard errors, is shown in Table 1. We see that for each of the designs considered, QUINN and MCQRNN give the best estimation. Specifically, QUINN gives significantly better estimation when the sample size is small, whereas the performance of MCQRNN catches up when the sample size becomes large. QUINN also results in a more robust fit when data are scarce, in contrast to MCQRNN, which has drastically high variance. The only case where MCQRNN significantly outperformed QUINN is Design 3 with n = 200. This is expected, as the min-max normalization is not a bijection between [L_Y, U_Y] and [0, 1] when Y is unbounded, and results in only a truncated F_Y(y | x) being estimated for all x. This truncation acutely impacts estimates of extreme quantiles of a heavy-tailed distribution such as T_3 in a finite-sample setting. Nonetheless, QUINN still achieves similar and even better performance than existing methods when the sample size is small. The average RMISE(τ) for each method over the 100 simulated data sets is shown in Figure 1.
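The two error measures above can be computed in a few lines; a minimal sketch, assuming the true and estimated quantiles are stored as K × G arrays over the quantile levels and the covariate grid:

```python
import numpy as np

def rmise_per_level(q_true, q_hat):
    """RMISE(tau_k): root mean squared error over the covariate grid,
    computed separately for each of the K quantile levels.
    q_true, q_hat: arrays of shape (K, G)."""
    return np.sqrt(np.mean((q_true - q_hat) ** 2, axis=1))

def rmise_qp(q_true, q_hat):
    """Overall RMISE for the quantile process: root mean of the
    squared per-level RMISEs."""
    r = rmise_per_level(q_true, q_hat)
    return np.sqrt(np.mean(r ** 2))

# Tiny check: if every estimated quantile is off by exactly 1,
# both error measures equal 1.
K, G = 19, 101
q_true = np.random.default_rng(0).normal(size=(K, G))
q_hat = q_true + 1.0
```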
Table 1: Simulation results. Average RMISE_QP over the 100 simulated data sets, with standard errors in parentheses; the smallest error in each row is in bold.

Design  n    QUINN        MCQRNN       NPSQR        NPDFSQR      seriesCDE
1       50        (0.26)  1.11 (1.98)  1.13 (0.26)  1.15 (0.26)  1.11 (0.29)
1       100       (0.11)  0.65 (0.15)  1.00 (0.19)  0.89 (0.17)  0.87 (0.19)
1       200  0.49 (0.09)       (0.11)  0.96 (0.18)  0.74 (0.16)  0.71 (0.19)
2       50        (0.16)  1.18 (2.28)  1.18 (0.28)  1.20 (0.31)  0.90 (0.20)
2       100       (0.11)  0.72 (0.13)  1.19 (0.22)  0.93 (0.19)  0.74 (0.19)
2       200       (0.07)  0.53 (0.11)  1.03 (0.18)  0.76 (0.15)  0.57 (0.15)
3       50        (0.32)  1.39 (0.81)  2.23 (1.78)  2.68 (2.25)  1.23 (0.53)
3       100       (0.26)       (0.21)  2.33 (1.83)  4.04 (2.62)  0.96 (0.27)
3       200  0.81 (0.31)       (0.34)  2.16 (1.20)  3.66 (2.29)  0.86 (0.27)
4       200       (0.27)  3.21 (1.58)  -            -            3.37 (0.21)

We see that QUINN gives the best estimation of non-extreme quantiles in all cases, whereas MCQRNN gives similar or even better estimation of extreme quantiles when the data exhibit significant heavy-tailedness or skewness.

4.2 Multivariate X

A defining characteristic of a non-parametric regression model is its ability to account for complex interaction effects. Similarly, a non-parametric QR model accounts for quantile-dependent interaction effects. Therefore, we would like to assess the performance of QUINN when X is multivariate and has both quantile-independent and quantile-dependent interaction effects. In addition, we would like to assess the performance of QUINN when the data have a sparse structure, i.e., only a subset of the given covariates actually influences the response. To address these questions, we consider a ten-predictor simulation design:

Design 4. The covariate vector X = (X_1, X_2, ..., X_10) is generated uniformly from [0, 1]^10.
The quantile function of the response variable conditioned on the covariates, Q_Y(τ | x), is given by

Q_Y(τ | x) = 3(τ − 0.5) (X_1 + 3)/5 + (1/5) [X_2 + 4(X_2 − 1/2)] exp(−X_2) + (1/2) exp[ ((X_3 + 1)/2) ((X_4 − 1)/2) ] + 5(τ − 0.5) ((X_5 + 2)/5) ((X_6 + 1)/2) + 0.5 Φ^{-1}(τ),

where Φ^{-1}(·) is the standard normal quantile function. In this design, X_3 and X_4 have a quantile-independent interaction effect, and X_5 and X_6 have a quantile-dependent interaction effect. The design is sparse, as only the first six covariates are relevant.

For the multivariate design, we compare QUINN with MCQRNN and seriesCDE only. We omit NPSQR and NPDFSQR because expanding each covariate of a d-dimensional x using quadratic B-spline basis functions results in a parameter space of dimension p_DG(p_DG + 2)^d; that is, the number of parameters in NPSQR and NPDFSQR grows exponentially with the number of covariates. Even with d as small as 10, fitting NPSQR and NPDFSQR becomes computationally infeasible. We generate a sample of size n = 400 and split it into 200 training and 200 testing data points. Each model is first fit to the training data; then conditional quantile predictions at the equally spaced quantiles mentioned in Section 4.1 are calculated for each x in the testing data. For QUINN, we run NUTS for 3500 iterations and discard the first 2500 iterations as burn-in.

To measure the performance of each method, we generate 100 data sets, and for each data set we calculate RMISE(τ) between actual and predicted quantiles conditioned on the testing data points. The average RMISE(τ) over the 100 simulated data sets is shown in Figure 1(d). We see that QUINN gives substantially better estimation of the quantile process than MCQRNN and seriesCDE.
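Data from a design specified through its conditional quantile function, as in Design 4, can be generated by inverse-CDF sampling: draw U ~ Uniform(0, 1) and set Y = Q_Y(U | x). A minimal sketch using a simplified, hypothetical two-covariate quantile function (not the Design 4 formula):

```python
import numpy as np
from statistics import NormalDist

def q_y(tau, x):
    """Hypothetical conditional quantile function: a quantile-dependent
    slope on x[0], a location shift from x[1], and a Gaussian term.
    It is monotone in tau whenever x[0] >= 0."""
    return 3.0 * (tau - 0.5) * x[0] + x[1] + 0.5 * NormalDist().inv_cdf(tau)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(5000, 2))
U = rng.uniform(0.0, 1.0, size=5000)

# Inverse-CDF sampling: each Y_i has conditional quantile function q_y(., X_i).
Y = np.array([q_y(u, x) for u, x in zip(U, X)])
```

Because q_y is monotone in τ for these covariates, the simulated responses have exactly the target conditional quantiles, which is what makes quantile-function-specified designs convenient for benchmarking.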
The results show that our proposed method maintains its promising performance when x is multivariate, has complex interaction effects, and has a sparse structure.

We now demonstrate how ALE plots can be used with QUINN to visualize main and second-order interaction effects on its predicted quantiles. Design 4 is constructed such that only ¯Q_1(τ, x_1), ..., ¯Q_6(τ, x_6), ¯Q_{34}(τ, x_3, x_4), and ¯Q_{56}(τ, x_5, x_6) are non-zero for some or all τ. In addition, ¯Q_j(τ, x_j), j = 1, 5, 6, and ¯Q_{56}(τ, x_5, x_6) vary across τ. For each of the 100 simulated data sets, we calculate the ALE main effect of each covariate and the interaction effect of each pair of covariates at quantile levels τ ∈ {0.05, 0.10, ..., 0.95}. The estimated ALE main effects at quantile levels τ ∈ {0.05, 0.5, 0.95} are shown in Figure 2, where the gray lines represent individual estimates based on the 100 simulated data sets and the red line represents the true ALE calculated from the generating model. The estimated ALEs precisely capture the quantile-independent main effects of X_2, X_3, X_4, the quantile-dependent main effects of X_1, X_5, X_6, and the redundancy of X_7, X_8, X_9, X_10. We also compare VI scores calculated from the estimated effects to those from the true effects in Figure 3. The ranking of the covariates based on estimated VI scores resembles that based on the true VI scores.

The estimated interaction effects are analyzed in Figure 4, where the off-diagonal cells contain VI scores of pair-wise interaction effects. Our results show that when τ = 0.05 or 0.95, the estimated VI scores of ¯Q_{34}(τ, x_3, x_4) and ¯Q_{56}(τ, x_5, x_6), the only two interaction effects that contribute to the generating model, are significantly higher than those of all other pairs; when τ = 0.5, ¯Q_{56}(τ, x_5, x_6) is close to zero and thus difficult to distinguish from other non-contributing interaction effects.
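A simplified first-order ALE in the spirit of Apley and Zhu's partitioning scheme (not the exact Appendix B algorithm) can be sketched as follows; here f is any black-box predictor, standing in for a fitted quantile predictor at a fixed τ, and the VI score is the standard deviation of the ALE over the data.

```python
import numpy as np

def ale_main_effect(f, X, j, n_bins=20):
    """First-order ALE of covariate j for a black-box predictor f.
    Bins x_j by empirical quantiles, averages the local finite
    differences of f within each bin, accumulates them, and centers
    the curve to have (approximately) mean zero over the data."""
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_bins - 1)
    local = np.zeros(n_bins)
    for k in range(n_bins):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j], hi[:, j] = edges[k], edges[k + 1]
        local[k] = np.mean(f(hi) - f(lo))
    ale_at_edges = np.concatenate([[0.0], np.cumsum(local)])
    # Center using a midpoint approximation of the ALE at each observation.
    ale_at_obs = ale_at_edges[idx + 1] - 0.5 * local[idx]
    return edges, ale_at_edges - np.mean(ale_at_obs)

def vi_score(f, X, j, n_bins=20):
    """VI_j: standard deviation of the ALE over the observed data."""
    edges, ale = ale_main_effect(f, X, j, n_bins)
    return np.std(np.interp(X[:, j], edges, ale))
```

For a linear predictor such as f(X) = 2 X_1 + X_2, the recovered ALE of X_1 is linear with slope 2, and the VI scores of the two covariates are in roughly a 2:1 ratio.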
We further calculate the Monte Carlo probability that ¯Q_{34}(τ, x_3, x_4) or ¯Q_{56}(τ, x_5, x_6) is ranked in the top 2 among all interaction effects and show the results in Figure 5. The probability that ¯Q_{34}(τ, x_3, x_4) is ranked in the top 2 is close to 1 across all quantiles; the probability that ¯Q_{56}(τ, x_5, x_6) is ranked in the top 2 is close to 1 when τ is far from 0.5, but decreases as the magnitude of ¯Q_{56}(τ, x_5, x_6) decreases. Therefore, the sensitivity analysis consistently identified the important main and interaction effects.

Figure 2: Marginal main effect estimates for Simulation Design 4. Accumulated local effects (ALE) ¯Q_j(τ, x_j) by covariate j ∈ {1, ..., 10} and quantile level τ ∈ {0.05, 0.5, 0.95}. The gray lines represent individual ALEs calculated from the 100 simulated data sets. The red line represents the true ALE calculated from the generating model.

Figure 3: Marginal main effects importance for Simulation Design 4. Average estimated (left) and true (right) variable importance VI_j(τ) for each main effect and τ ∈ {0.05, 0.5, 0.95}; the estimates are averaged over 100 data sets and thin horizontal lines are 95% intervals.

Figure 4: Marginal interaction effects importance for Simulation Design 4. Average estimated (left) and true (right) variable importance VI_{jl}(τ) for each interaction effect; the estimates are averaged over 100 data sets.

Figure 5: Marginal interaction effects importance ranking for Simulation Design 4. Proportion of the 100 simulated data sets for which the variable importances of ¯Q^I_{34}(τ, X_3, X_4) and ¯Q^I_{56}(τ, X_5, X_6) are ranked in the top 2 among all second-order interaction effects.

5 Applications

5.1 Application to tropical cyclone data

Elsner et al. (2008) argued that the strongest tropical cyclones in the North Atlantic Basin have gotten stronger in recent years as a result of increasing ocean temperature over the Atlantic Ocean. They analyzed 291 tropical cyclones over the period 1981–2006 and used separate QRs to model the relationship between their satellite-derived lifetime-maximum wind speed (WmaxST) and year of occurrence (Year) over a set of quantile levels. They found significant upward trends for regression lines modeling quantile levels above the 70th percentile (τ > 0.7), with the estimated slope coefficient increasing with τ. One drawback of their analysis was that they did not consider non-linear QR. In this section, we present an analysis of the data provided by Elsner et al. (2008) and model the wind speed quantiles simultaneously and non-parametrically using QUINN.

First, we transform the covariate Year linearly to the unit interval such that the years 1981 and 2006 are mapped to 0 and 1, respectively. For the response variable WmaxST, we considered three transformations to the unit interval: min-max normalization, log transformation followed by min-max normalization, and CDF transformation using the power-Pareto distribution given by

F(y) = (1 − (y/σ)^{−k})^a,

where the parameters a, σ = 52, and k = 4 follow Tokdar et al. (2012) and Das and Ghosal (2018). We fit QUINN over a grid of numbers of hidden neurons V and numbers of spline knots p. We draw 2000 posterior samples using NUTS, discarding the first 1000 samples as burn-in, and choose the best configuration of transformation method, V, and p based on the WAIC of the fitted model.

The scatter plot of the cyclone intensity data is shown in Figure 6, where each point represents a tropical cyclone between 1981 and 2006. The plot is overlaid with quantile curves for τ ∈ {0.05, 0.10, ..., 0.95} estimated by QUINN. We observe that all quantile curves are smooth and show monotonically increasing trends. Furthermore, the increase is more pronounced in recent years and at higher quantile levels. This shows that tropical cyclones got stronger over the study period, with the strongest ones having the most significant increase. We also visualize the posterior estimates of the change in WmaxST quantiles between 1981 and 1990, and between 1997 and 2006, in Figure 6, i.e., WmaxST_diff = Q_Y(τ | x_2) − Q_Y(τ | x_1), where x_1 and x_2 are the transformed covariates for the two years in the comparison. We see that in both cases the posterior mean WmaxST_diff increases as τ increases, and the magnitude of WmaxST_diff between 1997 and 2006 is greater than that between 1981 and 1990, supporting our previous observations. Furthermore, the 95% credible intervals are above 0, showing that the increase in intensity over the study period is statistically significant.

Figure 6: Cyclone intensity analysis. (a) Scatter plot of satellite-derived lifetime-maximum wind speed (WmaxST) and year of occurrence (Year), overlaid with quantile curves estimated by QUINN. (b) Posterior mean difference (solid black line) of WmaxST quantiles between 1990 and 1981 across τ ∈ [0.05, 0.95], with 50% (light grey region) and 95% (dark grey region) credible bands. (c) Posterior mean difference (solid black line) of WmaxST quantiles between 2006 and 1997 across τ ∈ [0.05, 0.95], with 50% (light grey region) and 95% (dark grey region) credible bands.

5.2 Application to birth weight data

For the case of multivariate data analysis, we present an analysis of the effect of pregnancy-related factors on infant birth weight (Weight, in grams) quantiles.
This effect was first studied in detail using separate linear regressions and the 1992 and 1996 editions of the Natality Data Set (Abrevaya, 2001); it was then restudied using simultaneous linear QR and the 1997 edition of the data set (Tokdar et al., 2012). In this paper, we consider birth records from a more recent edition and present inference results from estimating the effect non-parametrically using QUINN. Our data consist of 5000 randomly chosen entries from the 2019 edition of the Natality Data Set of the United States on singleton live births to mothers recorded as Black or White, in the age group 18–45, with height between 59 and 73 inches, who smoked no more than 20 cigarettes daily during pregnancy. The list of covariates includes the gender of the child (Boy, 1 = boy, 0 = girl), length of gestation (Week), maternal age (Age, in years), race (Black, 1 = Black, 0 = White), height (Height, in inches), body mass index (BMI), weight gain (WtGain, in pounds), average daily number of cigarettes during pregnancy (NumCig), and indicators for the mother receiving sufficient prenatal care (PreVis) and having developed pregnancy hypertension (PHype). We define receiving sufficient prenatal care as having made more than 12 prenatal visits (Kriebs, 2010).

All numerical covariates are mapped to the unit interval using min-max normalization. For the response variable Weight, we considered min-max normalization and log transformation followed by min-max normalization. We fit QUINN over a grid of numbers of hidden neurons V and numbers of spline knots p. We draw 3500 posterior samples using NUTS, discarding the first 2500 samples as burn-in, and choose the best model configuration based on WAIC.

To determine which covariates have a significant impact on the birth weight distribution, we first calculate the ALE-induced VI score for each main effect across different quantiles.
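The back-transformation step, Q_Y(τ | x) = g^{-1}(Q_Z(τ | x)), only requires the monotone map g and its inverse. A minimal sketch for the min-max and log-then-min-max transformations, using illustrative bounds rather than the bounds of the paper's data:

```python
import numpy as np

# Illustrative response bounds for a min-max map g: [L, U] -> [0, 1]
# (hypothetical grams, not the actual data bounds).
L, U = 500.0, 6000.0

def g_minmax(y):
    return (y - L) / (U - L)

def g_minmax_inv(z):
    return L + z * (U - L)

def g_logminmax(y):
    lL, lU = np.log(L), np.log(U)
    return (np.log(y) - lL) / (lU - lL)

def g_logminmax_inv(z):
    lL, lU = np.log(L), np.log(U)
    return np.exp(lL + z * (lU - lL))

# Because g is monotone, quantiles transform directly:
# Q_Y(tau | x) = g^{-1}(Q_Z(tau | x)), preserving the non-crossing order.
q_z = np.array([0.2, 0.5, 0.8])   # estimated quantiles on the [0, 1] scale
q_y = g_minmax_inv(q_z)           # back-transformed quantiles in grams
```

Any strictly monotone bijection works here; the choice among candidate transformations is made by WAIC as described above.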
In particular, we would like to identify the covariates that most impact low birth weight (represented by the 0.05 quantile), typical birth weight (represented by the 0.5 quantile), and high birth weight (represented by the 0.95 quantile). Figure 7 shows the ranking of ALE main effects at τ ∈ {0.05, 0.5, 0.95}. Length of gestation, maternal height, maternal body mass index, weight gain, average daily number of cigarettes during pregnancy, maternal race, and gender of the newborn are influential on all three quantiles. In particular, length of gestation is the most influential covariate on all three quantiles, average daily number of cigarettes during pregnancy is more influential on low and typical birth weight than on high birth weight, and height is more influential on high birth weight than on low and typical birth weight.

Figure 7: Marginal main effects importance for the birth weight analysis. Posterior mean (95% credible interval) of the variable importance VI_j(τ) of each covariate.

The ALE main effects of the six most important covariates at τ ∈ {0.05, 0.5, 0.95}, as well as their VI over the entire quantile range, are plotted in Figure 8. Our results indicate that the influence of these covariates on birth weight is significantly non-constant across quantiles. Longer gestation, increasing maternal height, increasing maternal body mass index, and weight gain contribute to higher birth weight; their effects are more pronounced at the high end of the weight distribution. Increased smoking during pregnancy contributes to lower birth weight; its effect is more pronounced at the low end of the weight distribution.
Compared to White mothers, Black mothers are associated with lower birth weight; the difference is most substantial at the high end of the weight distribution.

Figure 8: Marginal main effect estimates for the birth weight analysis. The plots show the ALE ¯Q_j(τ, x_j) by quantile level for the six most important covariates: (a) Week, (b) Black, (c) Height, (d) BMI, (e) NumCig, (f) WtGain.

We further analyze the pair-wise interaction effects between length of gestation, maternal height, maternal body mass index, average daily number of cigarettes during pregnancy, and weight gain. The posterior mean and standard deviation of their variable importance at τ ∈ {0.05, 0.5, 0.95} are shown in Figure 9. The most important interaction effects all involve the length of gestation. Specifically, length of gestation and average daily number of cigarettes during pregnancy appear to have the most pronounced interaction effect at all three quantiles. A visual illustration of the joint effects of these two covariates at τ ∈ {0.05, 0.5, 0.95} using contour plots is shown in Figure 10. The results indicate that a longer gestation period is associated with higher birth weight regardless of maternal smoking habits, but the effect is clearly amplified for non-smokers. In addition, heavier maternal smoking is associated with lower birth weight only for infants born after 38 weeks.

Discussion

In this paper, we propose a novel non-parametric method to simultaneously model non-crossing, non-linear quantile curves. The model for the conditional CDF of the response variable uses an I-spline series expansion, where the spline coefficients are further modeled as functions of covariates using FNNs.
By leveraging the approximation power of splines and neural networks, our model can approximate all continuous CDFs, and thus all continuous quantile functions. We adopt a Bayesian framework by assigning prior distributions to the weight parameters and utilize the state-of-the-art NUTS to sample efficiently from the high-dimensional posterior. The performance of our model depends on the number of hidden neurons and spline knots, which are selected using WAIC. Compared to existing simultaneous quantile regression methods, our method can estimate the whole quantile process, yields a smooth estimate of the quantile function, and scales to high-dimensional regression. We also show that our model can yield meaningful interpretation via ALE plots and variable importance scores. Simulation studies show that our method can better recover the generating quantile process than existing methods, particularly when the sample size is small, and that the relative importance of main and interaction effects can be estimated accurately.

The proposed model was applied to the intensity data of tropical cyclones in the North Atlantic Basin, specifically to investigate how intensity quantiles changed between the years 1981 and 2006. Our results showed that cyclone intensity increased smoothly from 1981 to 2006, with the intensity of the strongest cyclones increasing more significantly than that of typical and weak cyclones. The proposed method was also used to analyze the relationship between birth weight and maternal characteristics of U.S. newborns, specifically to identify the important maternal characteristics
Figure 9: Marginal interaction effects importance for the birth weight analysis. Posterior mean and standard deviation of the variable importance VI_jk(τ) for interaction effects between influential covariates at τ ∈ {0.05, 0.5, 0.95}.

Figure 10: Marginal interaction effect estimates for the birth weight analysis. Posterior mean of the second-order ALE ¯Q_jl(τ) for the joint effects of length of gestation (Week) and average daily number of cigarettes (NumCig) at τ ∈ {0.05, 0.5, 0.95}.

that are associated with high and low birth weight. Our results showed that low birth weight is primarily associated with prematurity, heavy maternal smoking, and Black mothers, whereas high birth weight is primarily associated with high maternal body mass index, maternal height, and maternal weight gain.

A crucial step in fitting our model is a transformation of the original response into the unit interval.
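As a concrete illustration, the min-max step and its inverse (needed to map estimated quantiles back to the original scale) can be sketched as follows; the function name and example values are purely illustrative, not the paper's implementation.

```python
import numpy as np

def minmax_normalize(y):
    """Map a response vector onto [0, 1]; also return the back-transform."""
    lo, hi = y.min(), y.max()
    z = (y - lo) / (hi - lo)
    inverse = lambda u: u * (hi - lo) + lo  # reverts quantile estimates to the original scale
    return z, inverse

# toy birth-weight-like values (grams)
y = np.array([2500.0, 3200.0, 4100.0, 2900.0])
z, inv = minmax_normalize(y)
# z lies in [0, 1]; inv(z) recovers y exactly
```

Note that the transform is bounded by the observed sample extremes, which is exactly the truncation issue for unbounded responses discussed next.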
In this paper, we found that min-max normalization sufficed in most cases. However, when the response is unbounded, min-max normalization truncates the response distribution, potentially impacting the estimation accuracy of extreme quantiles of heavy-tailed distributions. For practical use, we recommend that different transformation methods be compared, including those that depend on the covariate vector (e.g., the conditional CDF transformation). Our current model assumes independence between observations as well as between covariates. Future work can focus on relaxing these assumptions, for example by accommodating spatial and/or temporal correlation between observations, and highly correlated covariates.

References

Abrevaya, J. (2001) The effects of demographics and maternal behavior on the distribution of birth outcomes. Empirical Economics, 247-257.

Apley, D. W. and Zhu, J. (2020) Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1059-1086.

Betancourt, M. (2017) A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.

Betancourt, M. and Girolami, M. (2015) Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 2-4.

Bondell, H. D., Reich, B. J. and Wang, H. (2010) Noncrossing quantile regression curve estimation. Biometrika, 825-838.

Breiman, L. (2001) Random forests. Machine Learning, 5-32.

Cannon, A. J. (2018) Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stochastic Environmental Research and Risk Assessment, 3207-3225.

Chernozhukov, V., Fernández-Val, I. and Galichon, A. (2010) Quantile and probability curves without crossing. Econometrica, 1093-1125.

Das, P. and Ghosal, S. (2018) Bayesian non-parametric simultaneous quantile regression for complete and grid data.
Computational Statistics & Data Analysis, 172-186.

De Boor, C. and Daniel, J. W. (1974) Splines with nonnegative B-spline coefficients. Mathematics of Computation, 565-568.

Elsner, J. B., Kossin, J. P. and Jagger, T. H. (2008) The increasing intensity of the strongest tropical cyclones. Nature, 92-95.

Friedman, J. H. (2001) Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 1189-1232.

Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 721-741.

Goldstein, A., Kapelner, A., Bleich, J. and Pitkin, E. (2015) Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 44-65.

Greenwell, B. M., Boehmke, B. C. and McCarthy, A. J. (2018) A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755.

Hoffman, M. D. and Gelman, A. (2014) The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 1593-1623.

Holmes, M. P., Gray, A. G. and Isbell, C. L. (2012) Fast nonparametric conditional density estimation. arXiv preprint arXiv:1206.5278.

Izbicki, R. and Lee, A. B. (2016) Nonparametric conditional density estimation in a high-dimensional regression setting. Journal of Computational and Graphical Statistics, 1297-1316.

Koenker, R. and Bassett Jr, G. (1978) Regression quantiles. Econometrica, 33-50.

Kriebs, J. M. (2010) Guidelines for Perinatal Care, Sixth Edition: By the American Academy of Pediatrics and the American College of Obstetricians and Gynecologists. Journal of Midwifery & Women's Health.

Li, R., Bondell, H. D. and Reich, B. J. (2021) Deep distribution regression. In press, Computational Statistics and Data Analysis.

Liu, Y. and Wu, Y.
(2009) Stepwise multiple quantile regression estimation using non-crossing constraints. Statistics and Its Interface, 299-310.

— (2011) Simultaneous multiple non-crossing quantile regression estimation using kernel constraints. Journal of Nonparametric Statistics, 415-437.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 1087-1092.

Monnahan, C. C. and Kristensen, K. (2018) No-U-turn sampling for fast Bayesian inference in ADMB and TMB: Introducing the adnuts and tmbstan R packages. PLoS ONE, e0197954.

Muggeo, V. M., Sciandra, M., Tomasello, A. and Calvo, S. (2013) Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, 519-531.

Neal, R. M. (2011) MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo (eds. S. Brooks, A. Gelman, G. L. Jones and X. L. Meng), chap. 5, 113-162. Chapman & Hall/CRC.

— (2012) Bayesian learning for neural networks, vol. 118. Springer Science & Business Media.

Reich, B. J. (2012) Spatiotemporal quantile regression for detecting distributional changes in environmental processes. Journal of the Royal Statistical Society: Series C (Applied Statistics), 535-553.

Reich, B. J., Fuentes, M. and Dunson, D. B. (2011) Bayesian spatial quantile regression. Journal of the American Statistical Association, 6-20.

Reich, B. J. and Smith, L. B. (2013) Bayesian quantile regression for censored data. Biometrics, 651-660.

Ribeiro, M. T., Singh, S. and Guestrin, C. (2016) Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386.

Rodrigues, T. and Fan, Y. (2017) Regression adjustment for noncrossing Bayesian quantile regression. Journal of Computational and Graphical Statistics, 275-284.

Rothfuss, J., Ferreira, F., Walther, S. and Ulrich, M.
(2019) Conditional density estimation with neural networks: best practices and benchmarks. arXiv preprint arXiv:1903.00954.

Salvatier, J., Wiecki, T. V. and Fonnesbeck, C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science, e55.

Stan Development Team (2019) Stan Modeling Language Users Guide and Reference Manual. URL https://mc-stan.org.

Tokdar, S. T., Kadane, J. B. et al. (2012) Simultaneous linear quantile regression: a semiparametric Bayesian approach. Bayesian Analysis, 51-72.

Yang, Y. and Tokdar, S. T. (2017) Joint estimation of quantile planes over arbitrary predictor spaces. Journal of the American Statistical Association, 1107-1120.

Yuan, Y., Chen, N. and Zhou, S. (2017) Modeling regression quantile process using monotone B-splines. Technometrics, 338-350.

Appendices

A MCMC sampling details

A.1 Posterior evaluation

The non-parametric model on the conditional PDF of Z given X = x is

f_Z(z | x, W) = Σ_{m=1}^{r+p−1} θ_m(x, W) M_{m,r}(z | T) = Σ_{m=1}^{r+p−1} [ exp{u_m(x, W)} / Σ_{i=1}^{r+p−1} exp{u_i(x, W)} ] M_{m,r}(z | T),

and the likelihood is

L(D | W) = Π_{i=1}^{n} f_Z(z_i | x_i, W) = Π_{i=1}^{n} { Σ_{m=1}^{r+p−1} [ exp{u_m(x_i, W)} / Σ_{j=1}^{r+p−1} exp{u_j(x_i, W)} ] M_{m,r}(z_i | T) },

where D = {z_i, x_i}_{i=1}^{n} denotes the observed data. Let Θ = {W, σ_w, γ} denote the set of modeling parameters and hyper-parameters; then the posterior of QUINN is

f(Θ | D) ∝ N⁺(γ | 0, a) Π_{m=1}^{r+p−1} Π_{l=0}^{V} N(W⁽²⁾_{ml} | 0, γ²) Π_{j=0}^{d} N⁺(σ_j | 0, a) Π_{l=0}^{V} Π_{j=0}^{d} N(W⁽¹⁾_{lj} | 0, σ_j²) Π_{i=1}^{n} f_Z(z_i | x_i, W),

which can be approximated using MCMC methods. Sampling from this posterior is challenging for traditional MCMC methods such as random-walk Metropolis (Metropolis et al., 1953) and the Gibbs sampler (Geman and Geman, 1984).
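Before turning to samplers, the likelihood evaluation above can be made concrete. The following sketch evaluates a conditional density of this softmax-weighted-mixture form for a single-hidden-layer network; for brevity it substitutes a piecewise-constant (degree-0) basis of normalized histogram densities for the paper's degree-2 M-splines, and all names, dimensions, and weight values are illustrative.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def f_z(z, x, W1, b1, W2, b2, knots):
    """Conditional density at z: softmax(NN(x)) weights on piecewise-constant
    densities (degree-0 basis), one per knot interval."""
    h = np.tanh(W1 @ x + b1)         # single hidden layer
    theta = softmax(W2 @ h + b2)     # simplex weights theta_m(x, W)
    widths = np.diff(knots)
    m = np.clip(np.searchsorted(knots, z, side="right") - 1, 0, len(widths) - 1)
    return theta[m] / widths[m]      # basis density is 1/width on its interval

rng = np.random.default_rng(0)
knots = np.linspace(0.0, 1.0, 6)     # p = 5 equal-width intervals on [0, 1]
d, V = 3, 4                          # covariate dimension, hidden neurons
W1, b1 = rng.normal(size=(V, d)), rng.normal(size=V)
W2, b2 = rng.normal(size=(5, V)), rng.normal(size=5)
x = rng.uniform(size=d)

# The result is a valid density: positive, and integrating to 1 over [0, 1]
mids = (knots[:-1] + knots[1:]) / 2
dens = np.array([f_z(z, x, W1, b1, W2, b2, knots) for z in mids])
area = np.sum(dens * np.diff(knots))   # exact integral of a step density
```

Because the softmax weights sum to one and each basis function integrates to one, the mixture integrates to one for every x, which is what guarantees non-crossing quantiles after inversion.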
These methods, although straightforward to implement, do not scale well to complicated posteriors with high-dimensional parameter spaces. The former explores the posterior via inefficient random walks, resulting in a low acceptance rate and wasted samples; the latter requires knowing the full conditional distribution of each parameter, which can be unrealistic in high-dimensional cases.

A.2 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) (Neal, 2011; Betancourt and Girolami, 2015; Betancourt, 2017) is a variant of MCMC that permits efficient sampling from a high-dimensional target distribution, provided that all model parameters are continuous. It has gained increasing popularity through its recent applications to inference in Bayesian neural networks (Neal, 2012). By introducing auxiliary variables r, HMC transforms the problem of sampling from f(Θ | D) to sampling from the joint distribution f(r, Θ | D) = f(r | Θ, D) f(Θ | D), where f(r | Θ, D) is the auxiliary distribution, often assumed to be multivariate normal and independent of Θ and D, i.e., f(r | Θ, D) = f(r). The joint distribution defines a Hamiltonian

H(r, Θ | D) = T(r) + V(Θ | D), where T(r) := −log f(r) and V(Θ | D) := −log f(Θ | D),

which can be used to generate states, i.e., samples of Θ and r, by simulating the Hamiltonian dynamics

∂Θ/∂t = ∂T/∂r, ∂r/∂t = −∂V/∂Θ.

At any state (Θ_t, r_t), HMC proposes the next state (Θ_{t+LΔt}, r_{t+LΔt}) by simulating the Hamiltonian dynamics for time LΔt, which is approximated by applying the leapfrog algorithm L times, each with step size Δt. Starting from an initial state, this process is repeated, and the visited states form a Markov chain.
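A minimal sketch of one such transition, with the leapfrog integrator and a Metropolis correction, targeting a standard Gaussian for illustration (the target, tuning values, and function names are illustrative; the paper's sampler additionally tunes L and Δt automatically via NUTS):

```python
import numpy as np

def leapfrog(theta, r, grad_V, L, dt):
    """Approximate Hamiltonian dynamics for time L*dt with L leapfrog steps."""
    theta, r = theta.copy(), r.copy()
    r -= 0.5 * dt * grad_V(theta)           # initial half step for momentum
    for step in range(L):
        theta += dt * r                     # full step for position
        if step < L - 1:
            r -= dt * grad_V(theta)         # full step for momentum
    r -= 0.5 * dt * grad_V(theta)           # final half step for momentum
    return theta, r

def hmc_step(theta, V, grad_V, L, dt, rng):
    """One HMC transition: draw momentum, integrate, accept/reject."""
    r = rng.normal(size=theta.shape)
    theta_prop, r_prop = leapfrog(theta, r, grad_V, L, dt)
    # Metropolis correction for the leapfrog discretization error of H = T + V
    log_accept = (V(theta) + 0.5 * r @ r) - (V(theta_prop) + 0.5 * r_prop @ r_prop)
    return theta_prop if np.log(rng.uniform()) < log_accept else theta

# Illustration: sample a 2-d standard Gaussian, so V(theta) = theta'theta / 2
rng = np.random.default_rng(0)
V = lambda t: 0.5 * (t @ t)
grad_V = lambda t: t
theta = np.zeros(2)
draws = []
for _ in range(2000):
    theta = hmc_step(theta, V, grad_V, L=10, dt=0.2, rng=rng)
    draws.append(theta)
draws = np.asarray(draws)
# sample moments should be close to the target's (mean 0, variance 1)
```

The gradient grad_V here plays the role of the analytic log-posterior gradient derived in Appendix A.4.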
Compared to random-walk Metropolis, HMC explores the target distribution more efficiently by using the gradient of the log-posterior to direct each transition of the Markov chain. Although each step is more computationally expensive than a Metropolis proposal, the Markov chain produced by HMC often yields more distant samples and a significantly higher acceptance rate.

Although HMC has high potential, its practical performance depends strongly on the values of L and Δt; a poor choice of either parameter will result in unsatisfactory exploration of the posterior. In this paper, instead of using the original HMC, which only allows manual setting of L and Δt, we use the No-U-Turn Sampler (NUTS). NUTS is an extension of HMC that implements automatic tuning of L. Furthermore, we use the dual averaging algorithm to adaptively select Δt. A detailed description of NUTS with dual averaging is presented in Algorithm 6 of Hoffman and Gelman (2014).

NUTS is implemented in many probabilistic programming frameworks, such as PyMC3 (Salvatier et al., 2016) and Stan (Stan Development Team, 2019). The computational complexity of an HMC implementation is contingent on the gradient calculation for the log-posterior, which the aforementioned high-level frameworks handle via automatic differentiation. In our experiments, we observed that automatic differentiation can be extremely time-consuming for a posterior as high-dimensional as ours. As a result, we use a low-level R implementation (Monnahan and Kristensen, 2018) that accepts analytic gradients, which we calculate manually.

A.3 Reparametrization and transformation

The hierarchical Gaussian priors W⁽¹⁾_{vw} ~ N(0, σ_w²), σ_w ~ N⁺(0, a) introduce strong correlation between W⁽¹⁾_{vw} and σ_w in the posterior, especially when the data size is small. To alleviate this issue, we consider a reparametrization:

B⁽¹⁾_{vw} iid~ N(0, 1), σ_w iid~ N⁺(0, a), W⁽¹⁾_{vw} = σ_w B⁽¹⁾_{vw},

where B⁽¹⁾_{vw} can be considered as standardized weights.
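A quick Monte Carlo illustration of why this reparametrization is valid: the centered draw (σ first, then W | σ) and the non-centered draw (independent B, then W = σB) imply the same marginal distribution for the weight. The scale a = 1 and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, n = 1.0, 200_000

# centered construction: sigma ~ N+(0, a), then W | sigma ~ N(0, sigma^2)
sigma = np.abs(rng.normal(0.0, a, size=n))      # half-Gaussian draws
w_centered = rng.normal(0.0, sigma)

# non-centered construction: B ~ N(0, 1) independent of sigma, W = sigma * B
b = rng.normal(0.0, 1.0, size=n)
w_noncentered = np.abs(rng.normal(0.0, a, size=n)) * b

# the two parameterizations give the same marginal law for W
```

The sampler, however, sees very different geometries: in the non-centered form the prior on (B, σ) is a product of independent distributions, and the coupling moves into the likelihood.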
Because B⁽¹⁾_{vw} and σ_w follow independent prior distributions, they are marginally uncorrelated in the posterior; their coupling is instead introduced through the likelihood function. Such a parameterization is called non-centered. Non-centered parameterization leads to simpler posterior geometries, thus increasing the efficiency of HMC. Similarly, W⁽²⁾_{vw} ~ N(0, γ²), γ ~ N⁺(0, a) can be reparameterized as

B⁽²⁾_{vw} iid~ N(0, 1), γ ~ N⁺(0, a), W⁽²⁾_{vw} = γ B⁽²⁾_{vw}.

HMC requires Θ to lie in an unconstrained space, and thus every parameter that has a natural constraint needs to be transformed to an unconstrained variable. After unconstrained posterior samples are drawn, they can be back-transformed to the constrained space. In Θ, the scale parameters σ_v and γ are naturally constrained to be positive. Therefore we work with their log-transformations σ̃_v = log σ_v and γ̃ = log γ, with transformed prior distributions

f(σ̃_v) = N⁺(exp(σ̃_v) | 0, a) exp(σ̃_v) and f(γ̃) = N⁺(exp(γ̃) | 0, a) exp(γ̃).

Let B = {B⁽ᵘ⁾_{vw}} and Θ̃ = {B, σ̃_w, γ̃}; then the posterior after non-centered reparameterization and constraint transformation is

f(Θ̃ | D) ∝ N⁺(exp(γ̃) | 0, a) exp(γ̃) Π_{m=1}^{r+p−1} Π_{l=0}^{V} N(B⁽²⁾_{ml} | 0, 1) Π_{j=0}^{d} N⁺(exp(σ̃_j) | 0, a) exp(σ̃_j) Π_{l=0}^{V} Π_{j=0}^{d} N(B⁽¹⁾_{lj} | 0, 1) Π_{i=1}^{n} f_Z(z_i | x_i, W),

where W⁽¹⁾_{vw} = exp(σ̃_w) B⁽¹⁾_{vw} and W⁽²⁾_{vw} = exp(γ̃) B⁽²⁾_{vw} in the likelihood function.

A.4 Analytic gradient

In this section, we provide analytic formulas for computing the gradient of log f(Θ̃ | D), which NUTS uses to generate samples of Θ̃. To start, let X denote the observed covariate matrix, M_r(z | T) denote the M-spline matrix of the transformed response vector z, 1 denote a column vector of ones, σ̃ denote the vector with elements σ̃_w, and B⁽ᵘ⁾ denote the matrix with elements B⁽ᵘ⁾_{vw}.
The log-likelihood function parametrized by Θ̃ can be written in the compact form

ℓ(D | Θ̃) = Σ_{i=1}^{n} ( log[ Σ_{m=1}^{r+p−1} exp{u_m(x_i, Θ̃)} M_{m,r}(z_i | T) ] − log[ Σ_{m=1}^{r+p−1} exp{u_m(x_i, Θ̃)} ] )
         = 1ᵀ ( log[ ( exp{U(X, Θ̃)} ⊙ M_r(z | T) ) 1 ] − log[ exp{U(X, Θ̃)} 1 ] ),

where

U(X, Θ̃) = ( 1, φ{ X̃ diag[exp(σ̃)] B⁽¹⁾ } ) [exp(γ̃) B⁽²⁾], X̃ = ( 1, X ),

and diag[exp(σ̃)] is the diagonal matrix with diagonal entries exp(σ̃). Up to additive constants, the log-prior can be written as

log f(Θ̃) = − exp(2γ̃)/(2a²) + γ̃ − vec(B⁽²⁾)ᵀ vec(B⁽²⁾)/2 − 1ᵀ[ exp(2σ̃)/(2a²) − σ̃ ] − vec(B⁽¹⁾)ᵀ vec(B⁽¹⁾)/2,

where vec(·) denotes the vectorization operator. Finally, the gradient of the log-posterior with respect to each parameter is given by

∂ log f(Θ̃ | D)/∂B⁽¹⁾ = exp(γ̃) diag[exp(σ̃)] X̃ᵀ ( V₂ V₄ − V₃ V₅ ) − B⁽¹⁾
∂ log f(Θ̃ | D)/∂B⁽²⁾ = exp(γ̃) V₁ᵀ ( V₂ [ exp{U(X, Θ̃)} ⊙ M_r(z | T) ] − V₃ exp{U(X, Θ̃)} ) − B⁽²⁾
∂ log f(Θ̃ | D)/∂σ̃ = exp(γ̃) [ diag{ X̃ᵀ ( V₂ V₄ − V₃ V₅ ) B⁽¹⁾ᵀ } ] ⊙ exp(σ̃) − exp(2σ̃)/a² + 1
∂ log f(Θ̃ | D)/∂γ̃ = exp(γ̃) ( tr{ V₁ B⁽²⁾ [ exp{U(X, Θ̃)} ⊙ M_r(z | T) ]ᵀ V₂ } − tr{ V₁ B⁽²⁾ exp{U(X, Θ̃)}ᵀ V₃ } ) − exp(2γ̃)/a² + 1,

where

V₁ = ( 1, φ{ X̃ diag[exp(σ̃)] B⁽¹⁾ } )
V₂ = diag{ 1 ⊘ [ ( exp{U(X, Θ̃)} ⊙ M_r(z | T) ) 1 ] }
V₃ = diag{ 1 ⊘ [ exp{U(X, Θ̃)} 1 ] }
V₄ = [ ( exp{U(X, Θ̃)} ⊙ M_r(z | T) ) B̄⁽²⁾ᵀ ] ⊙ φ′( X̃ diag[exp(σ̃)] B⁽¹⁾ )
V₅ = [ exp{U(X, Θ̃)} B̄⁽²⁾ᵀ ] ⊙ φ′( X̃ diag[exp(σ̃)] B⁽¹⁾ ),

⊙ denotes element-wise multiplication, ⊘ denotes element-wise division, and B̄⁽²⁾ is B⁽²⁾ after removing the first row.

A.5 Model estimation

It is well known that FNNs suffer from over-parameterization, which makes the weight parameters highly non-identifiable. In practice, MCMC for individual weights might not even converge, making Bayesian inference on the weight parameters impossible. Let W^(t), t = 1, ..., T, denote the t-th posterior sample of W. In this study, instead of using posterior estimates of the weight parameters, e.g., the posterior mean Ŵ = (1/T) Σ_{t=1}^{T} W^(t), to calculate a single estimate F_Z(z | x, Ŵ), we estimate F_Z(z | x) by its posterior mean

F̂_Z(z | x) = (1/T) Σ_{t=1}^{T} F_Z(z | x, W^(t)).

Convergence of the MCMC can be checked using the trace plot of F_Z(z | x, W^(t)) for some (z, x).

B Computing the sensitivity indices

In this section, we first briefly summarize how the main and interaction ALEs can be estimated using sample data; the full description can be found in Apley and Zhu (2020). We then explain how VI scores can be estimated based on the ALE estimates. Let x_{i,j} and x_{i,\j} denote the i-th observation of the j-th covariate and of all other covariates, respectively. The sample range of X_j is partitioned into K intervals {N_j(k) = (z_{k−1,j}, z_{k,j}] : k = 1, 2, ..., K}, where the z_{k,j} are chosen as the k/K-th sample percentiles if X_j is continuous and as the unique values otherwise.
Then the uncentered effect ¯Q^U_j(τ, x_j) can be estimated by

^¯Q^U_j(τ, x_j) = Σ_{k=1}^{k_j(x_j)} (1/n_j(k)) Σ_{i : x_{i,j} ∈ N_j(k)} [ q_j(τ, z_{k,j}, x_{i,\j}) − q_j(τ, z_{k−1,j}, x_{i,\j}) ],

where k_j(x_j) indexes the interval into which x_j falls, and n_j(k) denotes the number of sample observations contained in N_j(k), such that n = Σ_{k=1}^{K} n_j(k). Finally, ¯Q_j(τ, x_j) can be estimated by mean-centering ^¯Q^U_j(τ, x_j), i.e.,

^¯Q_j(τ, x_j) = ^¯Q^U_j(τ, x_j) − (1/n) Σ_{k=1}^{K} n_j(k) ^¯Q^U_j(τ, z_{k,j}).

For a pair of covariates {X_j, X_l}, let x_{i,{j,l}} denote the i-th observation vector of the j-th and l-th covariates, and x_{i,\{j,l}} denote all other covariates. The Cartesian product of the sample ranges of X_j and X_l can be partitioned into K² rectangular cells N_{j,l}(k, m) = (z_{k−1,j}, z_{k,j}] × (z_{m−1,l}, z_{m,l}]. Then the uncentered effect ¯Q^U_{jl}(τ, x_j, x_l) can be estimated by

^¯Q^U_{jl}(τ, x_j, x_l) = Σ_{k=1}^{k_j(x_j)} Σ_{m=1}^{k_l(x_l)} (1/n_{j,l}(k, m)) Σ_{i : x_{i,{j,l}} ∈ N_{j,l}(k,m)} [ q_{jl}(τ, z_{k,j}, z_{m,l}, x_{i,\{j,l}}) − q_{jl}(τ, z_{k−1,j}, z_{m,l}, x_{i,\{j,l}}) − { q_{jl}(τ, z_{k,j}, z_{m−1,l}, x_{i,\{j,l}}) − q_{jl}(τ, z_{k−1,j}, z_{m−1,l}, x_{i,\{j,l}}) } ],

where k_j(x_j) and k_l(x_l) index the cell into which (x_j, x_l) falls, and n_{j,l}(k, m) denotes the number of sample observations contained in N_{j,l}(k, m), such that n = Σ_{k=1}^{K} Σ_{m=1}^{K} n_{j,l}(k, m). Similarly, ¯Q_{jl}(τ, x_j, x_l) can be estimated by mean-centering ^¯Q^U_{jl}(τ, x_j, x_l), i.e.,

^¯Q_{jl}(τ, x_j, x_l) = ^¯Q^U_{jl}(τ, x_j, x_l) − (1/n) Σ_{k=1}^{K} Σ_{m=1}^{K} n_{j,l}(k, m) ^¯Q^U_{jl}(τ, z_{k,j}, z_{m,l}).
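The first-order estimator above can be sketched compactly as follows; the helper name is hypothetical, and `q` stands for any fitted predictor, e.g., a conditional quantile function evaluated at a fixed τ.

```python
import numpy as np

def ale_main_effect(q, X, j, K):
    """First-order ALE of covariate j for a predictor q(X), over K bins."""
    xj = X[:, j]
    edges = np.quantile(xj, np.linspace(0, 1, K + 1))   # z_{0,j}, ..., z_{K,j}
    bins = np.clip(np.searchsorted(edges, xj, side="left") - 1, 0, K - 1)
    local, n_k = np.zeros(K), np.zeros(K)
    for b in range(K):
        idx = np.where(bins == b)[0]
        if idx.size == 0:
            continue
        hi, lo = X[idx].copy(), X[idx].copy()
        hi[:, j], lo[:, j] = edges[b + 1], edges[b]
        local[b] = np.mean(q(hi) - q(lo))    # average local difference in bin b
        n_k[b] = idx.size
    acc = np.cumsum(local)                   # accumulate local effects
    return acc - np.sum(n_k * acc) / n_k.sum(), edges   # mean-center

# Sanity check with a linear predictor: the accumulated effect of X_0
# grows with slope 2 between successive bin edges
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
ale, edges = ale_main_effect(lambda A: 2 * A[:, 0] + A[:, 1], X, j=0, K=5)
```

Replacing the predictor differences with second differences over rectangular cells gives the second-order estimator in the same way.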
Finally, the interaction ALE is estimated by

^¯Q^I_{jl}(τ, x_j, x_l) = ^¯Q_{jl}(τ, x_j, x_l) − ^¯Q_j(τ, x_j) − ^¯Q_l(τ, x_l).

Following its definition in Section 3, VI_j(τ) can be estimated by the sample standard deviation or the sample range of ^¯Q_j(τ, z_{k,j}), i.e.,

^VI_j(τ) = sqrt{ (1/K) Σ_{k=1}^{K} [ ^¯Q_j(τ, z_{k,j}) − (1/K) Σ_{k=1}^{K} ^¯Q_j(τ, z_{k,j}) ]² } if X_j is continuous,
^VI_j(τ) = { max_k [ ^¯Q_j(τ, z_{k,j}) ] − min_k [ ^¯Q_j(τ, z_{k,j}) ] } / 4 if X_j is categorical.

By analogy, VI_{jl}(τ) can be estimated by the sample standard deviation or the sample range of ^¯Q^I_{jl}(τ, z_{k,j}, z_{m,l}).

C Simulation study implementation details

For QUINN, we first transform the response variable into the unit interval using min-max normalization. The conditional CDF of the transformed response is expanded with a second-degree I-spline (i.e., r = 2) with p equidistant knots T = {t_0, t_1, ..., t_p | t_i = i/p, i = 0, ..., p}. The FNN has a single hidden layer with V hidden neurons. We consider a grid of values for V and p, and select the best configuration of V and p based on WAIC. The scale parameter a of the half-Gaussian hyperprior for the weights is chosen to ensure a weakly informative prior. For MCQRNN, we consider neural networks with a single hidden layer, a grid of values for the number of hidden neurons V, and a grid of weight penalty coefficients λ. We select the best configuration of V and λ using 5-fold cross-validation. For NPSQR and NPDFSQR, we follow the guidelines provided by Das and Ghosal (2018) and first transform the response variable and covariate(s) into unit intervals using min-max normalization. The response variable and covariate(s) are then expanded using quadratic B-splines with the same number of equidistant knots, denoted p_DG. We fit NPSQR and NPDFSQR over grids of p_DG values.
The optimal p_DG for either model is chosen based on the AIC, in which maximum likelihood estimates are replaced by posterior means. SeriesCDE contains four tuning parameters: the number of components of the series expansion in the Y direction, N_Y; the number of components of the series expansion in the X direction, N_X; the bandwidth parameter ε of the Gaussian kernel used to construct the Gram matrix of X; and the smoothness parameter δ that controls the bumpiness of the estimated conditional density function. The tuning grids cover N_X and N_Y up to n, together with grids of values for ε and δ. Following the guidelines provided by Izbicki and Lee (2016), we first select the best configuration of N_X, N_Y, and ε using 5-fold cross-validation. We then tune δ, again using 5-fold cross-validation, with N_X, N_Y, and ε fixed at their optimal values.

Since QUINN, NPDFSQR, and seriesCDE all model the distribution function, their results have to be converted to estimates of the quantile function. For QUINN and NPDFSQR, once we obtain the non-parametric CDF estimate of the transformed response, F̂_Z(z | x), we evaluate it at 101 equidistant grid points on the unit interval for each x. The conditional quantile function Q_Z(τ | x), τ ∈ (0, 1), can then be estimated by interpolation, using {F̂_Z(z_i | x)}_{i=1}^{101} as input values and the aforementioned 101 equidistant grid points as functional output values. Finally, the quantile function of the original response, Q_Y(τ | x), can be estimated by reverting the min-max normalization. For seriesCDE, we first convert the non-parametric density function estimate f̂_Y(y | x) to F̂_Y(y | x) using numerical integration (e.g., the trapezoidal rule). We then estimate Q_Y(τ | x) using the aforementioned interpolation approach.

The above settings are used for Designs 1-3; for the multivariate Design 4, we use slightly different tuning parameters.
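The grid-based inversion described above amounts to linear interpolation with the roles of input and output swapped; a minimal sketch with a toy CDF (the function name and example CDF are illustrative):

```python
import numpy as np

def quantile_from_cdf(cdf_vals, z_grid, taus):
    """Invert a CDF evaluated on a grid by interpolation: Q(tau) = F^{-1}(tau).
    The CDF values serve as the x-coordinates of the interpolant."""
    # np.interp requires a non-decreasing x; enforce monotonicity of the estimate
    F = np.maximum.accumulate(cdf_vals)
    return np.interp(taus, F, z_grid)

z_grid = np.linspace(0.0, 1.0, 101)        # 101 equidistant grid points
cdf_vals = z_grid ** 2                     # toy CDF: Z = sqrt(U), F(z) = z^2
q = quantile_from_cdf(cdf_vals, z_grid, np.array([0.25, 0.5, 0.81]))
# q approximates [0.5, sqrt(0.5), 0.9], the true quantiles of F(z) = z^2
```

Because the interpolated inverse is non-decreasing in τ, the resulting quantile curves cannot cross.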
For QUINN, we consider a grid of values for V and p and choose the best configuration based on WAIC; values for all other parameters are the same as in Designs 1-3. For MCQRNN, we consider neural networks with either one or two hidden layers, a grid of values for the number of hidden neurons V in each hidden layer, and a grid of weight penalty coefficients λ. The best configuration of the number of layers, V, and λ is selected using 5-fold cross-validation. We do not consider neural networks with more than two hidden layers, as they are not currently implemented in qrnn.