[PDF] A machine learning approach for efficient multi-dimensional integration

Abstract

We propose a novel multi-dimensional integration algorithm using a machine learning (ML) technique. After training a ML regression model to mimic a target integrand, the regression model is used to evaluate an approximation of the integral. Then, the difference between the approximation and the true answer is calculated to correct the bias in the approximation of the integral induced by a ML prediction error. Because of the bias correction, the final estimate of the integral is unbiased and has a statistically correct error estimation. The performance of the proposed algorithm is demonstrated on six different types of integrands at various dimensions and integrand difficulties. The results show that, for the same total number of integrand evaluations, the new algorithm provides integral estimates with more than an order of magnitude smaller uncertainties than those of the VEGAS algorithm in most of the test cases.

Full PDF

AA machine learning approach for eﬃcient multi-dimensionalintegration

Boram Yoon ∗ CCS-7, Applied Computer Science, Los Alamos National Laboratory,Los Alamos, NM 87545, USA

Abstract

We propose a novel multi-dimensional integration algorithm using a machine learning (ML)technique. After training a ML regression model to mimic a target integrand, the regressionmodel is used to evaluate an approximation of the integral. Then, the diﬀerence between theapproximation and the true answer is calculated to correct the bias in the approximation of theintegral induced by a ML prediction error. Because of the bias correction, the ﬁnal estimate ofthe integral is unbiased and has a statistically correct error estimation. The performance of theproposed algorithm is demonstrated on six diﬀerent types of integrands at various dimensionsand integrand diﬃculties. The results show that, for the same total number of integrand eval-uations, the new algorithm provides integral estimates with more than an order of magnitudesmaller uncertainties than those of the VEGAS algorithm in most of the test cases.

Monte Carlo integration is a numerical method evaluating the integral of an integrand over a ﬁniteregion using random sampling. As a consequence of the random sampling, in contrast to determin-istic methods, the result of the Monte Carlo integration is an estimate of the true value that comeswith a statistical uncertainty. For higher-dimensional integral problems, the Monte Carlo integra-tion methods provide smaller uncertainties than deterministic methods, such as the trapezoidalrule , for a given number of integrand evaluations. The most widely used strategies reducing thevariance of the Monte Carlo integration estimate are importance sampling and stratiﬁed sampling.Two of the most popular algorithms implementing these strategies are VEGAS and MISER .Recently, an idea utilizing generative machine learning (ML) models to perform the importancesampling is also proposed .In this paper, we present a novel algorithm numerically evaluating multi-dimensional integralsexploiting the eﬃcient interpolation ability of ML regression algorithms. The new algorithm in-volves computations for the training of and prediction with the ML algorithms, in addition to theevaluation of the integrand. Assuming that evaluation of the integrand is computationally muchmore expensive than the ML calculations, throughout the paper, we focus on reducing the varianceof an integral estimate for a given number of integrand evaluations. ∗ [email protected] a r X i v : . [ phy s i c s . c o m p - ph ] S e p Method

Suppose that we have a ML regression model ˜ f ( x ) that approximates a multidimensional function f ( x ) ≈ ˜ f ( x ) ≡ ˜ y . Here x is the input vector of the regression model, which is called the independentvariable , and ˜ y is the output of the regression model, which is called the dependent variable . Theintegral of f ( x ) over an integration region Ω ∈ R D can be split into two integrals I = (cid:90) Ω f ( x ) d x (1)= (cid:90) Ω ˜ f ( x ) d x (cid:124) (cid:123)(cid:122) (cid:125) I (a) + (cid:90) Ω (cid:16) f ( x ) − ˜ f ( x ) (cid:17) d x (cid:124) (cid:123)(cid:122) (cid:125) I (b) . (2)Here the I (a) is an integration of the ML regression model, which does not require an evaluationof f ( x ) to calculate. Depending on the regression model, the integral I (a) may be calculatedanalytically or numerically. Since ˜ f ( x ) approximates f ( x ), the I (a) provides an approximation ofthe target integral I . The I (b) in Eq. (2) provides a correction to the bias of the approximation in I (a) . The integral can be evaluated using a numerical method, such as the Monte Carlo integration.In the Monte Carlo integration, variance of the I (b) is proportional toVar (cid:2) f ( X ) − ˜ f ( X ) (cid:3) = Var (cid:2) f ( X ) (cid:3) + Var (cid:2) ˜ f ( X ) (cid:3) − (cid:2) f ( X ) , ˜ f ( X ) (cid:3) (3) ≈ (cid:2) f ( X ) (cid:3) (cid:16) − Corr (cid:2) f ( X ) , ˜ f ( X ) (cid:3)(cid:17) , (4)where the second line assumes a good ML regression model that gives Var (cid:2) ˜ f ( X ) (cid:3) ≈ Var (cid:2) f ( X ) (cid:3) .Therefore, the I (b) can be estimated precisely with a small number of Monte Carlo samples whencorrelation between f ( X ) and ˜ f ( X ) is high. The total uncertainty of the integral, σ I , can becalculated by combining the errors of the two terms: σ I = σ I (a) + σ I (b) . The idea replacing anobservable by its approximation with a proper correction term was proposed in the ﬁeld of latticequantum chromodynamics . In this paper, we generalize the idea for multi-dimensional integralproblems using ML regression algorithms. Now the question is how to obtain a good ML regressionmodel that closely approximates the integrand. Machine learning models

In this paper, we examine three regression algorithms: Multi-layer Perceptron (MLP), GradientBoosting Decision Tree (GBDT), and Gaussian Processes (GP). MLP is a feedforward artiﬁcialneural network that produces outputs from inputs based on multiple layers of perceptrons . Themodel is ﬂexibly applicable to various kinds of data and scales well up to a large number of data.GBDT is a sequence of shallow decision trees such that each successive decision tree compensatesfor the prediction error of its predecessor . The model provides a good regression performancewith no complicated tuning of hyperparameters and pre-processing of training data. An integrationof the GBDT regression models can be calculated analytically because the model is simply a setof intervals of input variables and their output values . GP regression is a nonparametric modelthat ﬁnds an optimal covariance kernel function explaining training data . The model is good atinterpolating the observations and works well with a small data set. Analytic integrability of theregression model depends on kernel choice. For example, in the case of the Radial Basis Function2RBF) kernel, which is one of the most popular kernels in GP, prediction of an input vector is givenby the dot product of a Gaussian function of the input vector and a constant vector, as describedin Eq. (7), so its analytic integration is given by error functions. Training data

Building a ML regression model approximating f ( x ) requires training samples of { ( x i , f ( x i )) } N train i =1 .To minimize the prediction error for a given number of training data, N train , it is essential to collectthe training samples near the peaks of the function (importance sampling) and where the functionchanges rapidly (stratiﬁed sampling). Such training data can be sampled by utilizing conventionalnumerical integration algorithms, such as the VEGAS, which includes eﬃcient sampling algorithmsbased on the importance sampling and stratiﬁed sampling. When the peaks of the function arelocalized, the training samples obtained using VEGAS build a much more accurate ML regressionmodel than those from a uniform sampling method. Data scaling

Many ML regression algorithms beneﬁt from scaling the dependent variable. Especially when thedependent variable varies by orders of magnitude within the range of interest, which is a typicalsituation in diﬃcult multi-dimensional integral problems, the data scaling plays a crucial role inobtaining a good regression performance. The most widely used scaling algorithms are min-maxscaling and standardization: y (cid:48) = y − min( y )max( y ) − min( y ) [Min-max scaling] , y (cid:48) = y − yσ y [Standardization] , (5)where y (cid:48) is the scaled variable, min( y ) and max( y ) are the minimum and maximum of y , y is theaverage of y , and σ y is the standard deviation of y . For the data with large scale variation, however,these scaling methods are dominated by the data of large magnitude and lose sensitivity to thedata of small values.To avoid the scale issue, we use the nth-root scaling deﬁned as y (cid:48) = sgn( y ) · | y | /n , (6)where sgn( y ) is the sign of y , and n is a positive integer. This is a strictly monotonic transformationwhose inverse is y = sgn( y (cid:48) ) · | y (cid:48) | n . The optimal value of n can be obtained using the training data.Taking a small portion (e.g. 10–50%) of the training data as a validation dataset, one can train aregression model on the remaining training samples with various choices of n and ﬁnd the optimalvalue of n that gives the minimum prediction error on the validation dataset. Once the n isdetermined, a ﬁnal regression model can be obtained using the full training data.This nth-root scaling plays a crucial role in building a good regression model for most of theintegral problems. In this study, we standardize the y after the nth-root scaling to maximize theregression performance. Evaluation of I (a) and I (b) When n = 1, I (a) of Eq. (2) can be calculated analytically for certain regression algorithms. When n >

1, however, the ML predictions should be processed by the inverse of the nth-root transforma-tion, so the analytic integral becomes complicated. For example, GP regression with a RBF kernel3or a nth-root scaled data can be written as˜ f (cid:48) ( x ) = N train (cid:88) i =1 α i exp (cid:20) − l || x − x i || (cid:21) , (7)where || x || is the Euclidean norm of x , x i are the training data, and α i and l are constants that aredetermined from the training. To obtain the prediction of the integrand ˜ f ( x ), the GP regressionneeds to be transformed as ˜ f ( x ) = sgn( ˜ f (cid:48) ( x )) | ˜ f (cid:48) ( x ) | n . For a positive integrand, the power of n ofEq. (7) can be expanded analytically, but the number of terms is large for large N and n .For simplicity, we use a numerical method, the VEGAS algorithm, to evaluate I (a) and I (b) .Since the peaks of the f ( x ) are ﬂattened by subtracting the ˜ f ( x ) in the integrand of I (b) , a simpleMonte Carlo integration works well for I (b) . However, the VEGAS outperforms the simple MonteCarlo integration when the regression is not accurate enough. In this section, we present numerical experiments of the proposed integration algorithm using ML.The precisions of the integral estimates are compared with those of the VEGAS algorithm, which isone of the best performing algorithms on the market , at a similar number of integrand evaluations. Test integrands

In order to test the performance of the numerical integration, we use the six families of the inte-grands proposed in Ref. : f ( x ) = cos(2 πw + c · x ) [Oscillatory] ,f ( x ) = D (cid:81) i =1 c − i + ( x i − w i ) [Product peak] ,f ( x ) = 1(1 + c · x ) D +1 [Corner peak] ,f ( x ) = exp( − (cid:80) Di =1 c i ( x i − w i ) ) [Gaussian] ,f ( x ) = exp( − c · | x − w | ) [ C -function] ,f ( x ) = (cid:40) x > w or x > w , exp( c · x ) otherwise . [Discontinuous] . (8)Here, D is the dimension of x , and w i ∈ [0 ,

1) is the parameter that is supposed to shift the peaksof the integrand without changing the diﬃculty of the integral problem. One exception is f ( x ), asthe small value of w or w makes the function to be localized in small region and makes the integralproblem diﬃcult. To avoid the unwanted eﬀect, we restrict w i ∈ [0 . , .

9) for f ( x ). c i is a positiveparameter that controls the diﬃculty of the integral. In general, increasing the value of c i increasesthe diﬃculty of the integral problem. To ﬁx the diﬃculty of the integral, we randomly choose c i from a uniform distribution in [0 ,

1) and renormalize the vector by multiplying a constant factorso that || c || = (cid:80) i | c i | becomes the target constant. In this study, we carry out the integration4or 36 diﬀerent random choices of w and c and take average performance. To ﬁx the integrationdiﬃculty, we normalize c to three diﬀerent values of || c || = 1 ,

3, and 8. Integration is performedin a D -dimensional unit hypercube, and the results are compared at three diﬀerent dimensions of D = 5 ,

8, and 10.

ML regression algorithms and hyperparameters

For the implementation of the MLP, GP, and GBDT regression algorithms, we use the scikit-learnpython library . For MLP, four hidden layers of 128, 128, 128, and 16 neurons with rectiﬁedlinear unit (ReLU) activation functions are used. Training is performed using Adam optimizationalgorithm with the learning rate of 10 − . Training updates are performed until there is no decreaseof the validation score with a tolerance of 10 − for 20 epochs with 10% of validation fraction. ForGP, RBF with a constant kernel is used, and length scale and constant are determined using L-BFGS-B optimizer . For GBDT, we use 1000 weak estimators with a learning rate of 0 .

01 anda subsampling ratio of 0 .

3. The maximum depth of each decision tree is limited to 4. Note thathere we use a relatively large number of estimators with a small subsampling ratio so that theregression output becomes a smooth function in x . In this proof-of-principal study, we did notexplore extensive phase space of the hyperparameters but take those generally considered to be areasonable choice.The powers of the nth-root scaling n ∈ [1 ,

50] for MLP and GP regressions are determined byusing 20% of training data as a validation dataset. The performance of the GBDT algorithm is notvery sensitive to the scaling but n > n = 1 case. So, we use aﬁxed number n = 3 for GBDT regression. VEGAS setup

For the VEGAS numerical integration, we use Lepage’s VEGAS python library . The library hastwo damping parameters: α and β . The parameter α controls the remapping of the integrationvariables in the VEGAS adaptation. A smaller value gives the slower grid adaption for a conser-vative estimate. Here, we use α = 0 .

5, which is the default value of the library, for most of thecalculations. One exception is the discontinuous integrand family f ( x ), which is more diﬃcult toevaluate than other integrand families and requires a large number of samples per iteration or slowgrid adaptation to converge to the exact integral solution. To make the VEGAS integral stable, weuse α = 0 . f ( x ). The parameter β controls the redistribution of integrand evaluations in thestratiﬁed sampling. β = 1 is the theoretically optimal value, and β = 0 means no redistribution.Here, we use β = 0 .

75, which is the default value of the library.Another important parameters are the number of iterations for the VEGAS grid adaptation( N itn ) and the approximated number of integrand evaluations per iteration ( N eval ). These param-eters are set diﬀerently for diﬀerent VEGAS tasks performed in this study: (1) calculation of thetarget integral I in Eq. (1) for a comparison, (2) sampling the training data, (3) calculation of I (b) in Eq. (2), and (4) calculation of I (a) in Eq. (2). • In task (1), we use N itn = 20 at two diﬀerent values of N eval = 500, and 1000. When β = 0,the total number of integrand evaluations will be N itn × N eval . Because of the redistributionthat happens when β >

0, however, the total number of integrand evaluations for this taskdrops to around N ≈ N itn × N eval / In task (2), we use N itn = 10 for the most of the integrand families with the same N eval values used in task (1). In this task, all the integrand calls, (cid:0) x , f ( x ) (cid:1) , are collected as the MLtraining data. The total number of integrand evaluations in this task is N train ≈ N/

2. Twoexceptions are f ( x ), where we use N itn = 14, and f ( x ), where we use N itn = 6, which arethe choices that stabilize the VEGAS integration of task (3). The choice of N itn determinesthe ratio of the number of integrand evaluations for the training ( N train ) and bias correction( N crxn ). The parameter can be tuned for a given problem so that it minimizes the integraluncertainty. In this proof-of-principal study, however, we take N itn = 10, which makes theratio N train : N crxn ≈ • In task (3), our target total number of integrand evaluations is N − N train , so that the totalnumber of integrand evaluations in the ML integrator is the same as that of the VEGASintegration in task (1). For that, we set N eval = ( N − N train ) / α = 0 . N crxn , is close to N − N train . • In task (4), we use N itn = 30 and set N eval to those used in task (1) multiplied by a factor of1000. We stop the VEGAS iteration when the error of I (a) becomes smaller than 20% of theerror of I (b) .VEGAS integral estimates are obtained by taking a weighted average of the estimates from eachVEGAS iteration. Whenever the p-value of the weighted average is smaller than 0.05, we discardthe results and rerun the VEGAS integration with a diﬀerent random seed. Results

Table 1 shows the precision gain of the proposed integration algorithm over VEGAS,Gain = σ I of VEGAS σ I of ML Integrator . (9)Total number of integrand evaluations of the ML integrator ( N train + N crxn ) is similar to that of theVEGAS integration ( N ). The full list of N , N train , N crxn , and precision of the integral algorithmsare given in Appendix A.The best performing ML algorithms are GP for the integrand families 1–4, MLP for the inte-grand family 5, and GBDT for the integrand family 6. Fig. 1 clearly explains these results: GPwith a RBF kernel shows very good performance in describing smooth functions but fails in C and discontinuous functions. MLP shows mediocre performance for the all functional forms, andGBDT, which is a combination of the discrete decision trees, outperforms the MLP in describingthe discontinuous integrands.For all test cases, the ML integrator performs better than VEGAS. The gain is higher when D is smaller and when || c || is smaller. Also, the gain tends to be increased when N is larger,which indicates a better scaling behavior than VEGAS. In case of the integrations with the GPregression algorithm, the σ I of the ML integrator is up to four orders of magnitude smaller thanthat of VEGAS. When GP is eﬃciently applied, the diﬀerence between the target integrand andits ML prediction is tiny, which makes the value and error of I (b) small. As a result, the ﬁnal error6 f ( x , x i > = . ) x f(x) MLP GBDT GP -0.03-0.02-0.01 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0 0.2 0.4 0.6 0.8 1 (f − f ~ ) ( x , x i > = . ) x MLP GBDT GP f ( x , x i > = . ) x f(x) MLP GBDT GP -0.02 0 0.02 0.04 0.06 0.08 0 0.2 0.4 0.6 0.8 1 (f − f ~ ) ( x , x i > = . ) x MLP GBDT GP -10 0 10 20 30 40 50 60 0 0.2 0.4 0.6 0.8 1 f ( x , x i > = . ) x f(x) MLP GBDT GP -50-40-30-20-10 0 10 20 30 40 0 0.2 0.4 0.6 0.8 1 (f − f ~ ) ( x , x i > = . ) x MLP GBDT GP

Figure 1: Integrands f i ( x ) and their ML predictions ˜ f i ( x ) (left), and the prediction errors f i ( x ) − ˜ f i ( x ) (right) for N ≈ || c || = 8 .

0. Those are plotted as a function of x , while rest of the x are ﬁxed to x i =2 , ,..., = 0 . || c || ≈ ≈ N ) for three dimensions ( D ) and three integrand diﬃculties( || c || ) on the six integrand families listed in Eq. (8). The results are averaged over 36 randomsamples. The numbers in the parentheses are the standard deviation of the mean.is dominated by the error of I (a) , which can be improved without increasing the number of f ( x )evaluations. As an example, below shows I (a) and I (b) of the integrands shown in Fig. 1: I = I (a) + I (b) f : 0 . . − . GP ] f : 0 . . . M LP ] f : 14 . . . GBDT ] (10)Since the ML integrator uses VEGAS, it inherits the potential instability of the VEGAS forsmall N eval , which introduces a systematic bias in the integration results. In general, the instabilitycan be avoided by increasing N or decreasing the value of α ; a detailed description of how to dealwith the instability is given in Ref. . For the ML integrator, the most fragile part is the integrationfor I (b) = (cid:82) ( f − ˜ f ) d x . When such instability is observed, for a given N , one can increase the ratio of N train for a better prediction or increase the ratio of N crxn for a more stable integrand evaluation,depending on the integral problem. It is also important to use a ML regression algorithm thatyields a smooth ˜ f ( x ). As shown in the right column of Fig. 1, a non-smooth ˜ f ( x ), such as the onefrom GBDT, makes f ( x ) − ˜ f ( x ) highly oscillating and the integral diﬃcult to evaluate. Among thethree regression algorithms used in this study, we ﬁnd that the smooth prediction of GP gives themost stable integration for I (b) . To check the instability, we do not manually tune the integration8arameters for each integral problem but use a general setting for most of the calculations. Asa result, we could observe a few integral results deviating from the true answer by more than4 σ , mostly in case the integral families 3 and 6. Since the number of such occurrences is smallcompared to the total number of random samples, the inclusion of these occurrences does notchange the average results, so we did not exclude these results from our average estimation. Thenumber of more than 4 σ deviations for each integral problem is given in the tables in Appendix A. In this paper, we proposed a novel algorithm calculating multi-dimensional integrals using MLregression algorithms. In this algorithm, a ML regression model is trained to mimic the targetintegrand and is used to estimate an approximated integral. Any bias of the estimate induced bythe ML prediction error is corrected by using a bias correction term, as described in Eq. (2), sothat the ﬁnal integral result could have a statistically correct estimation of the uncertainty. Twoessential prescriptions for obtaining a good the training eﬃciency are (1) collecting training samplesusing the VEGAS algorithm, and (2) scaling the training data using the nth-root scaling deﬁnedin Eq. (6).The performance of the proposed ML integrator is compared with that of the VEGAS algorithmon six diﬀerent integrand families listed in Eq. (8). Three ML regression algorithms of MLP,GBDT, and GP are examined, and the best performing algorithm is selected for each integrandfamily. For all test cases, the ML integrator shows better performance than the VEGAS for a giventotal number of integrand evaluations. In most of the cases, the ML integrator is able to provideintegration results with more than an order of magnitude smaller uncertainty than the VEGASalgorithm. The performance gain is presented in Table 1. As a proof-of-principal study, we did nottune the ML and algorithm hyperparameters for each integral problem. By tuning the parameterstailored to a speciﬁc problem, one would be able to further reduce the integration error of the MLintegrator.We ﬁnd that the performance and the stability of the proposed algorithm largely depend onthe smoothness of the regression output. Developing a ML algorithm speciﬁcally targeting the MLintegrator will be able to improve the performance and stability of the algorithm. One possibleapproach is to augment the training data by adding a small amount of noise to the training data ,which could improve the smoothness of the MLP and GBDT models. We also ﬁnd that the GPregression algorithm with a RBF kernel fails in describing C and discontinuous functions becauseof the singular points in the integrands. For a given integrand with known such singular points, onewould be able to build a combination of multiple GP models deﬁned on each domain divided bythe singular points for a better performance. It will be also promising to explorer diﬀerent types ofkernels or to develop a hybrid model of decision tree and GP that can be generically applicablefor such integrands. Acknowledgments

Computations were carried out using Institutional Computing at Los Alamos National Laboratory.This work was supported by the U.S. Department of Energy, Oﬃce of Science, Oﬃce of High EnergyPhysics under Contract No. 89233218CNA000001, and the LANL LDRD program.9

Tables of results

In Tables 2–7, we show the results of VEGAS and ML integrations algorithms.

References [1] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery,

Numerical Recipes 3rdEdition: The Art of Scientiﬁc Computing , 3rd ed. (Cambridge University Press, USA, 2007).[2] G. Peter Lepage, Journal of Computational Physics , 192 (1978).[3] G. P. Lepage, VEGAS - an adaptive multi-dimensional integration program , Tech. Rep. CLNS-447 (Cornell Univ. Lab. Nucl. Stud., Ithaca, NY, 1980).[4] J. Bendavid, (2017), arXiv:1707.00028 [hep-ph] .[5] G. S. Bali, S. Collins, and A. Schafer, Comput. Phys. Commun. , 1570 (2010),arXiv:0910.3970 [hep-lat] .[6] T. Blum, T. Izubuchi, and E. Shintani, Phys. Rev.

D88 , 094503 (2013), arXiv:1208.4349[hep-lat] .[7] B. Yoon, T. Bhattacharya, and R. Gupta, Phys. Rev. D , 014504 (2019), arXiv:1807.05971[hep-lat] .[8] R. Rojas,

Neural Networks: A Systematic Introduction (Springer-Verlag, Berlin, Heidelberg,1996).[9] L. Breiman, J. Friedman, C. Stone, and R. Olshen,

Classiﬁcation and Regression Trees , TheWadsworth and Brooks-Cole statistics-probability series (Taylor & Francis, 1984).[10] J. H. Friedman, Ann. Statist. , 1189 (2001).[11] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning , Adaptive Compu-tation and Machine Learning (MIT Press, Cambridge, MA, USA, 2006) p. 248.[12] T. Hahn, Comput. Phys. Commun. , 78 (2005), arXiv:hep-ph/0404043 .[13] A. Genz, “A package for testing multiple integration subroutines,” in

Numerical Integration:Recent Developments, Software and Applications , edited by P. Keast and G. Fairweather(Springer Netherlands, Dordrecht, 1987) pp. 337–340.[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,M. Perrot, and E. Duchesnay, Journal of Machine Learning Research , 2825 (2011).[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” (2014), citearxiv:1412.6980Comment: Published as a conference paper at the 3rd International Conferencefor Learning Representations, San Diego, 2015.10 ≈ || c || =1.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 5(3) × − . × − . × − . × − . × − . × − MLP 1 . × − . × − . × − . × − . × − . × − GBDT 2(1) × − . × − . × − . × − . × − . × − GP 2(1) × − . × − . × − . × − . × − . × − Gain MLP 37.9(3.4) 41.2(1.5) 6.91(18) 42.2(1.3) 14.06(57) 1.177(65)GBDT 5.06(41) 22.10(86) 1.166(34) 21.83(94) 3.77(10) 10.03(51)GP 407(29) 6181(319) 329.0(4.1) 6223(258) 2.785(73) 0.261(22) N > σ GBDT – – – – – 1N ≈ || c || =3.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 1 . × − . × − . × − . × − . × − . × − MLP 3(1) × − . × − . × − . × − . × − . × − GBDT 6(3) × − . × − . × − . × − . × − . × − GP 5(2) × − . × − . × − . × − . × − . × − Gain MLP 32.7(1.1) 8.73(28) 6.59(31) 7.87(21) 5.50(19) 1.141(66)GBDT 2.18(11) 4.33(16) 1.129(36) 3.95(14) 1.647(45) 6.49(42)GP 196.8(1.7) 1321(67) 211.46(98) 1151(47) 1.105(30) 0.253(19) N > σ VEG – – – – – 2N ≈ || c || =8.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 5(2) × − . × − . × − . × − . × − . × − MLP 1 . × − . × − × − . × − . × − . × − GBDT 9(9) × − . × − . × − . × − . × − . × − GP 2(1) × − . × − . × − . × − . × − . × − Gain MLP 35.72(90) 3.381(81) 7.47(46) 4.52(20) 3.09(11) 0.942(72)GBDT 1.200(43) 1.471(64) 0.997(33) 2.15(13) 0.901(37) 3.05(17)GP 181.32(57) 47.2(7.9) 137.6(4.5) 346.2(9.4) 0.526(11) 0.220(13) N > σ VEG – – – – – 2

Table 2: Number of integrand evaluations ( N , N train , and N crxn ), precision of the integral ( σ I / | I | ),gain deﬁned in Eq. (9), and N > σ of VEGAS (VEG) and ML (MLP, GBDT, and GP) integrationalgorithms for N ≈ D = 5. The results are averaged over 36 random samples. Thenumbers in the parentheses are the standard deviation of the mean. Here, N > σ is the number ofintegration results that are more than 4 σ away from the true answer out of the 36 samples; onlythe non-zero values are presented. 11 ≈ || c || =1.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 9(6) × − . × − . × − . × − . × − . × − MLP 2(1) × − . × − . × − . × − . × − × − GBDT 4(2) × − . × − . × − . × − . × − . × − GP 5(3) × − . × − . × − . × − . × − . × − Gain MLP 40.5(4.0) 46.37(86) 5.89(16) 48.03(96) 8.88(29) 0.714(36)GBDT 4.53(37) 30.15(95) 0.752(17) 29.19(52) 3.863(76) 8.62(59)GP 352(28) 8292(262) 206.8(2.5) 8918(177) 3.25(10) 0.234(16) N > σ VEG – – – – – 1GBDT – – – – – 1N ≈ || c || =3.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 1 . × − . × − . × − . × − . × − . × − MLP 3(1) × − . × − . × − . × − . × − . × − GBDT 1 . × − . × − . × − . × − . × − . × − GP 1 . × − . × − . × − . × − . × − . × − Gain MLP 30.2(2.1) 9.17(22) 5.83(14) 8.88(19) 3.64(14) 0.603(29)GBDT 1.970(74) 5.52(13) 0.592(15) 5.143(86) 1.542(27) 4.32(22)GP 140.5(1.6) 1219(100) 130.4(1.2) 1606(46) 1.270(41) 0.231(16) N > σ VEG – – – – – 2N ≈ || c || =8.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 3(1) × − . × − . × − . × − . × − . × − MLP 1(1) × − . × − . × − . × − . × − . × − GBDT 1(1) × − . × − . × − . × − . × − . × − GP 5(2) × − . × − . × − . × − . × − . × − Gain MLP 32.98(82) 2.588(49) 7.03(31) 2.309(45) 1.613(63) 0.741(51)GBDT 1.191(27) 1.420(44) 0.450(10) 1.319(33) 0.707(13) 2.276(71)GP 106.5(2.8) 20.2(2.3) 57.2(2.5) 319.4(6.5) 0.568(16) 0.252(18) N > σ VEG – – – – – 2MLP – – 1 – – –

Table 3: Number of integrand evaluations, precision of the integral, precision gain, and N > σ for N ≈ D = 8. The notations are the same as Table 2.12 ≈ || c || =1.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 6(4) × − . × − . × − . × − . × − . × − MLP 2(1) × − . × − . × − . × − . × − . × − GBDT 4(3) × − . × − . × − . × − . × − . × − GP 6(4) × − . × − . × − . × − . × − . × − Gain MLP 35.3(3.1) 41.41(88) 5.05(15) 41.78(99) 6.48(23) 0.593(31)GBDT 4.37(38) 28.57(93) 0.572(12) 26.95(58) 3.825(66) 8.21(42)GP 345(29) 8533(250) 178.4(2.2) 8941(189) 3.55(11) 0.245(15) N > σ VEG – – – – – 1N ≈ || c || =3.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 1 . × − . × − . × − . × − . × − . × − MLP 3(1) × − . × − . × − . × − . × − . × − GBDT 1 . × − . × − . × − . × − . × − . × − GP 1 . × − . × − . × − . × − . × − . × − Gain MLP 25.2(1.8) 9.51(27) 4.81(15) 9.39(20) 2.477(75) 0.573(32)GBDT 1.759(57) 6.31(12) 0.4266(88) 6.002(87) 1.511(28) 4.34(23)GP 118.7(1.3) 972(106) 98.7(1.4) 2028(56) 1.405(40) 0.247(15) N > σ VEG – – – – – 3N ≈ || c || =8.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 3(1) × − . × − . × − . × − . × − . × − MLP 1 . × − . × − . × − . × − . × − . × − GBDT 6(4) × − . × − . × − . × − . × − . × − GP 8(4) × − . × − . × − . × − . × − . × − Gain MLP 25.99(72) 2.389(34) 5.87(26) 2.102(38) 1.184(50) 0.595(39)GBDT 1.076(20) 1.472(40) 0.2780(81) 1.321(20) 0.676(14) 2.005(72)GP 53.6(1.5) 17.4(1.7) 34.6(1.4) 385.8(9.8) 0.622(16) 0.241(17) N > σ VEG – – – – – 3MLP – – 5 – – –GBDT – – 1 – – –

Table 4: Number of integrand evaluations, precision of the integral, precision gain, and N > σ for N ≈ D = 10. The notations are the same as Table 2.13 ≈ || c || =1.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 3(2) × − . × − . × − . × − . × − . × − MLP 7(4) × − . × − . × − . × − . × − . × − GBDT 1 . × − . × − . × − . × − . × − . × − GP 1(1) × − . × − . × − . × − . × − . × − Gain MLP 41.7(3.9) 39.9(1.3) 7.65(16) 39.2(1.2) 16.67(54) 1.531(75)GBDT 4.51(36) 19.22(54) 1.076(27) 18.42(62) 3.288(97) 14.52(66)GP 357(24) 5770(311) 298.9(2.9) 5497(235) 2.546(65) 0.243(18) N > σ – – – – – – –N ≈ || c || =3.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 5(2) × − . × − . × − . × − . × − . × − MLP 1 . × − . × − . × − . × − . × − . × − GBDT 4(2) × − . × − . × − . × − . × − . × − GP 3(1) × − . × − . × − . × − . × − . × − Gain MLP 35.43(98) 8.67(18) 6.90(28) 7.76(17) 7.88(25) 1.262(68)GBDT 2.05(11) 3.84(14) 0.975(27) 3.70(12) 1.503(54) 7.47(34)GP 174.7(1.6) 1241(44) 200.16(92) 1084(44) 1.076(25) 0.234(15) N > σ – – – – – – –N ≈ || c || =8.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 2 . × − . × − . × − . × − . × − . × − MLP 5(2) × − . × − × − . × − . × − . × − GBDT 2(1) × − . × − . × − . × − . × − . × − GP 1 . × − . × − . × − . × − . × − . × − Gain MLP 40.6(1.0) 3.564(87) 7.47(41) 4.13(22) 3.96(12) 1.350(80)GBDT 0.997(35) 1.379(57) 0.825(24) 1.86(11) 0.770(32) 3.55(16)GP 157.38(47) 121(20) 144.3(6.4) 279.4(7.4) 0.4730(90) 0.205(14) N > σ VEG – – – – – 1

Table 5: Number of integrand evaluations, precision of the integral, precision gain, and N > σ for N ≈ D = 5. Notations are the same as Table 2.14 ≈ || c || =1.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 7(5) × − . × − . × − . × − . × − . × − MLP 9(5) × − . × − . × − . × − . × − . × − GBDT 2(1) × − . × − . × − . × − . × − . × − GP 3(2) × − . × − . × − . × − . × − . × − Gain MLP 60.6(5.4) 61.4(1.2) 8.80(23) 62.8(1.2) 16.69(51) 0.791(35)GBDT 4.84(38) 33.42(90) 0.802(21) 32.06(52) 4.129(89) 10.95(57)GP 359(28) 9130(307) 215.1(2.8) 9649(203) 3.650(98) 0.230(14) N > σ VEG – – – – – 1N ≈ || c || =3.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 1 . × − . × − . × − . × − . × − . × − MLP 1 . × − . × − . × − . × − . × − . × − GBDT 7(3) × − . × − . × − . × − . × − . × − GP 6(2) × − . × − . × − . × − . × − . × − Gain MLP 46.9(2.7) 11.98(22) 7.66(30) 11.21(22) 6.64(23) 0.786(36)GBDT 2.104(85) 6.06(15) 0.620(17) 5.715(72) 1.656(39) 5.04(20)GP 147.2(1.5) 1592(89) 142.3(1.2) 1698(49) 1.436(38) 0.228(16) N > σ VEG – – – – – 3GBDT – – 1 – – –N ≈ || c || =8.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 2 . × − . × − . × − . × − . × − . × − MLP 7(3) × − . × − . × − . × − . × − . × − GBDT 3(2) × − . × − . × − . × − . × − . × − GP 3(1) × − . × − . × − . × − . × − . × − Gain MLP 51.3(1.7) 3.383(78) 8.81(44) 2.851(50) 3.075(89) 1.126(68)GBDT 1.252(32) 1.546(47) 0.453(10) 1.368(28) 0.741(17) 2.746(84)GP 132.3(2.0) 39.7(5.0) 97.2(2.9) 319.5(7.7) 0.638(15) 0.269(20) N > σ VEG – – – – – 1MLP – 1 3 – – –

Table 6: Number of integrand evaluations, precision of the integral, precision gain, and N > σ for N ≈ D = 8. Notations are the same as Table 2.15 ≈ || c || =1.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 5(3) × − . × − . × − . × − . × − . × − MLP 9(6) × − . × − . × − . × − . × − × − GBDT 3(2) × − . × − . × − . × − . × − . × − GP 4(3) × − . × − . × − . × − . × − . × − Gain MLP 59.3(5.0) 57.6(1.1) 8.16(21) 57.39(95) 12.95(51) 0.696(33)GBDT 4.52(40) 32.5(1.1) 0.598(13) 30.96(53) 3.918(71) 9.84(58)GP 339(27) 9092(218) 180.5(2.0) 9535(118) 3.78(10) 0.218(11) N > σ – – – – – – –N ≈ || c || =3.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 9(4) × − . × − . × − . × − . × − . × − MLP 1 . × − . × − . × − . × − . × − . × − GBDT 9(4) × − . × − . × − . × − . × − . × − GP 8(3) × − × − . × − . × − . × − . × − Gain MLP 39.9(1.9) 12.32(27) 6.57(20) 11.70(19) 5.14(19) 0.722(41)GBDT 1.832(71) 6.70(14) 0.4227(74) 6.373(81) 1.561(28) 4.65(26)GP 118.1(1.1) 1679(111) 116.3(1.1) 1976(47) 1.488(40) 0.227(13) N > σ VEG – – – – – 1N ≈ || c || =8.0Integrand family 1 2 3 4 5 6 N N train N crxn σ I / | I | VEG 2(1) × − . × − . × − . × − . × − . × − MLP 7(4) × − . × − . × − . × − . × − . × − GBDT 1(1) × − . × − . × − . × − . × − . × − GP 3(1) × − . × − . × − . × − . × − . × − Gain MLP 42.69(95) 3.181(47) 7.67(36) 2.645(44) 2.39(10) 0.872(56)GBDT 1.100(26) 1.570(43) 0.2983(86) 1.371(19) 0.690(12) 2.198(93)GP 88.7(2.6) 35.6(4.0) 56.9(2.1) 372.1(9.0) 0.657(15) 0.236(17) N > σ VEG – – – – – 2MLP – – 7 – – –GP – – 1 – – –

Table 7: Number of integrand evaluations, precision of the integral, precision gain, and N > σ for N ≈ D = 8. Notations are the same as Table 2.1616] R. Byrd, P. Lu, J. Nocedal, and C. Zhu, SIAM Journal of Scientiﬁc Computing , 1190(1995).[17] P. Lepage, “gplepage/vegas: vegas version 3.4.5,” (2020).[18] P. Lepage, “vegas Documentation,” (2020).[19] L. Holmstrom and P. Koistinen, IEEE Transactions on Neural Networks , 24 (1992).[20] G. An, Neural Computation , 643 (1996).[21] E. Schulz, M. Speekenbrink, and A. Krause, Journal of Mathematical Psychology85