Large-scale Uncertainty Estimation and Its Application in Revenue Forecast of SMEs
Zebang Zhang, Kui Zhao, Kai Huang, Quanhui Jia, Yanming Fang, Quan Yu
Ant Financial Services Group
{zebang.zhzb, zhaokui.zk, kevin.hk, quanhui.jia, yanming.fym, jingmin.yq}@antfin.com

Abstract
The economic and banking importance of the small and medium enterprise (SME) sector is well recognized in contemporary society. Business credit loans are very important for the operation of SMEs, and revenue is a key indicator of credit limit management. Therefore, it is very beneficial to construct a reliable revenue forecasting model. If the uncertainty of an enterprise's revenue forecast can be estimated, a more proper credit limit can be granted. The natural gradient boosting approach estimates the uncertainty of prediction with a multi-parameter boosting algorithm based on the natural gradient. However, its original implementation is not easy to scale to big-data scenarios, and it is computationally expensive compared to state-of-the-art tree-based models (such as XGBoost). In this paper, we propose Scalable Natural Gradient Boosting Machines (SN-GBM), which is simple to implement, readily parallelizable, interpretable, and yields high-quality predictive uncertainty estimates. According to the characteristics of the revenue distribution, we derive an uncertainty quantification function. We demonstrate that our method can distinguish between accurately and inaccurately forecast samples in revenue forecasting for SMEs. Moreover, interpretability can be naturally obtained from the model, satisfying financial needs.
Introduction

The economic and banking importance of the small and medium enterprise (SME) sector is well recognized in contemporary society [Biggs, 2002]. Business loans are very important for the operation of SMEs. However, it is also acknowledged that these actors in the economy may be underserved, especially in terms of finance [Lloyd-Reason and Mughan, 2006]. This has led to significant debate on the best methods to serve this sector. A substantial portion of the SME sector may not have the security required for conventional collateral-based bank lending, nor high enough returns to attract formal venture capitalists and other risk investors. The effective management of lending to SMEs can contribute significantly to the overall growth and profitability of banks [Abbott and others, 2011]. Banks have traditionally relied on a combination of documentary sources of information, interviews and visits, and the personal knowledge and expertise of managers in assessing the risk of business loans. But today, financial institutions have also begun to use big data and machine learning to manage credit risk for credit loans [Khandani et al., 2010]. Revenue is a key indicator of credit limit management. Therefore, it is very beneficial to construct an effective revenue forecasting model for credit limit management.

Forecasting the revenue of SMEs is a very challenging task. Traditional machine learning methods for financial regression tasks like revenue forecasting, such as Gradient Boosting Machines (GBMs) [Friedman, 2001], utilize the nonlinear transformation of decision trees to get more robust predictions. But for regression tasks, current popular models such as GBMs can only provide point estimates (forecast expectations or medians) and cannot quantify the predictive uncertainty. In financial tasks, it is crucial to estimate the uncertainty in a forecast. The real revenue of SMEs is heteroscedastically distributed: small enterprises with relatively unstable operating conditions have larger variance than medium enterprises with relatively stable operating conditions. A proper credit limit cannot be granted if the uncertainty of an enterprise's revenue forecast cannot be estimated. This is especially the case when the predictions are directly related to automated decision making, as probabilistic uncertainty estimates are important in determining manual fall-back alternatives in the workflow [Kruchten, 2016]. In order to quantify the uncertainty, we need to upgrade from point estimation models to probabilistic prediction models. Probabilistic prediction, the approach where the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify those uncertainties.

Bayesian methods and non-Bayesian methods are the state of the art in probabilistic uncertainty estimation. Bayesian methods naturally generate predictive uncertainty by integrating predictions over the posterior, but we are only interested in predictive uncertainty and do not focus on the concrete procedure of generating uncertainty in predicting the revenue of SMEs. In practice, Bayesian methods, such as Bayesian neural network models and Bayesian Additive Regression Trees (BART) [Chipman et al., 2010], are often harder to implement and computationally slower to train compared to non-Bayesian methods. Moreover, sampling-based Bayesian methods generally require good statistical expertise and thus lead to poor ease-of-use.
Natural Gradient Boosting (NGBoost) [Duan et al., 2019], the state-of-the-art non-Bayesian method, uses the natural gradient to address the challenge of simultaneously boosting multiple parameters with base learners. Its authors demonstrate empirically that NGBoost performs competitively with other models in its predictive uncertainty estimates as well as on traditional metrics. They use the decision tree from scikit-learn [Pedregosa et al., 2011] as the base learner, a single-machine algorithm that supports exact greedy splitting. NGBoost can therefore only work on small data sets due to this single-machine limit.

In this paper, we further derive the natural gradient to make it suitable for large-scale financial scenarios. We study the Fisher information of the normal distribution and find that the updating procedure of its natural gradient can be further optimized. For the normal distribution, we propose a more efficient updating method for the natural gradient, which can dramatically improve computational efficiency. The base learner of SN-GBM is classification and regression trees (CART) [Breiman, 2017], the most popular algorithm for tree induction. Compared with NGBoost, SN-GBM adopts a more efficient distributed decision tree based on an approximate splitting algorithm as the tree-based learner, which improves computational efficiency and robustness. We derive an uncertainty quantification function to distinguish between accurate and inaccurate samples. In financial scenarios, interpretability is always demanded because of transparency requirements, so we provide two kinds of interpretability, including interpretability of uncertainty. Through the uncertainty interpretability, we can know the factors that cause the predictive uncertainty. In addition, we utilize the uncertainty outcome to optimize the procedure of solving regression problems, such as feature selection.

We summarize our contributions as follows:

1. We propose SN-GBM for large-scale uncertainty estimation in real industry and provide interpretability of the model.
2. We apply an uncertainty estimation algorithm to revenue forecasting of SMEs for the first time.
3. We explore a range of uses of uncertainty estimation in regression tasks, which can bring a new modeling perspective.
Related Work

Sales Forecast.
Since there have been few published works about revenue forecasting, we refer to research about sales forecasting. Sales often determine revenue, and sales forecasting plays a prominent role in business strategy for generating revenue. The previous month's sales are found to be among the most prominent parameters influencing the sales forecast in [Sharma and Sinha, 2012]; previous revenue is likewise an important factor in our revenue forecast. The most commonly used techniques for sales forecasting include statistically based approaches like time series and regression methods, and computational intelligence methods like the fuzzy back-propagation network (FBPN). [Chang and Wang, 2006] and [Sharma and Sinha, 2012] both use FBPN for sales forecasting. The FBPN algorithm performs more robustly than traditional multiple linear regression algorithms in [Sharma and Sinha, 2012], which indicates that nonlinear models are more appropriate for nonlinear regression tasks such as sales forecasting.
Gradient Boosting Machines.
Gradient Boosting Machines [Friedman, 2001] is a widely used machine learning algorithm, due to its efficiency, accuracy, and interpretability. It has been shown to give state-of-the-art results on structured data (such as in Kaggle competitions). Popular scalable implementations of tree-boosting methods include [Chen and Guestrin, 2016] and [Ke et al., 2017]. We are motivated in part by the empirical achievements of tree-based methods, although they only provide homoscedastic regression. One of the key problems in tree boosting is finding the best split feature value. [Chen and Guestrin, 2016] efficiently supports exact greedy splitting for the single-machine version, as well as an approximate splitting algorithm. We also draw on several engineering optimizations in XGBoost and LightGBM.
Uncertainty Estimation.
Approaches to probabilistic forecasting can broadly be distinguished as Bayesian or non-Bayesian. Bayesian approaches (which include a prior and a likelihood) that leverage decision trees for structured input data include [Chipman et al., 2010], [Lakshminarayanan et al., 2016] and [He et al., 2019]. Bayesian NNs learn a distribution over weights to estimate predictive uncertainty [Lakshminarayanan et al., 2017]. Bayesian approaches cost expensive computational resources and do not lend themselves to distributed implementations. We are only interested in predictive uncertainty and do not pay attention to the concrete process of generating uncertainty, so Bayesian approaches are not in our consideration. A non-Bayesian approach similar to our work is [Duan et al., 2019], which takes a natural gradient method to solve the multi-parameter boosting problem. Such a heteroskedastic approach to capturing uncertainty has also been called aleatoric uncertainty estimation [Kendall and Gal, 2017]. As with NGBoost, uncertainty that arises due to dataset shift or out-of-distribution inputs [Ovadia et al., 2019] is not in the scope of our work.
Method

In the theory of the algorithm, we mainly refer to the work of NGBoost. First, we clarify how NGBoost uses the natural gradient to implement probabilistic prediction. Then we demonstrate the improvements we have made on the basis of NGBoost, including a more efficient updating method for the natural gradient. Traditional models can only output the interpretability of the expectation, while SN-GBM outputs two kinds of interpretability, including interpretability of uncertainty. Finally, we implement robust and interpretable Scalable Natural Gradient Boosting based on the decision tree from Spark, which is significantly faster than NGBoost.
The target of traditional regression prediction methods is to estimate $\mathbb{E}[y|x]$, while the target of probabilistic forecasting is to estimate $P_\theta(y|x)$, where $x$ is a vector of observed features, $y$ is the prediction target, and $\theta \in \mathbb{R}^p$ are the parameters of the target distribution. Take the normal distribution for example: $\theta = [\mu, \sigma]$ (to be more specific, different $x$ have different parameters $\mu, \sigma$, that is, $\theta = [\mu(x), \sigma(x)]$).

Proper Scoring Rules
Fitting different targets needs different loss functions. Probabilistic estimation requires a "proper scoring rule" as the optimization objective. A proper scoring rule $S$ takes as input a forecasted probability distribution $P$ and one observation $y$, and the true distribution of the outcomes gets the best score in expectation [Gneiting and Raftery, 2007]. In mathematical notation, a scoring rule is a proper scoring rule if and only if it satisfies

$$\mathbb{E}_{y \sim Q}[S(Q, y)] \le \mathbb{E}_{y \sim Q}[S(P, y)] \quad \forall P, Q \qquad (1)$$

where $Q$ represents the true distribution of outcomes $y$, and $P$ is any distribution. When a proper scoring rule is used as the loss function during model training, the model converges toward outputting calibrated probabilities. In fact, maximum likelihood estimation (MLE), a method of estimating the parameters of a probability distribution, satisfies the above property. The divergence it induces from one distribution $Q$ to another $P$ is the familiar KL divergence:

$$D_S(Q \| P) = D_{\mathcal{L}}(Q \| P) = \mathbb{E}_{y \sim Q}\left[\log \frac{Q(y)}{P(y)}\right] \qquad (2)$$

It has the nice property of being invariant to the choice of parametrization [Dawid and Musio, 2014]. We discuss the importance of this property in later sections.
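To make Eq. (1) concrete, the following minimal NumPy sketch (ours, for illustration; all names are illustrative) checks empirically that the Gaussian negative log-likelihood, the scoring rule used throughout this paper, assigns the true distribution the lowest expected score:

import numpy as np

def gaussian_nll(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2); a proper scoring rule."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100_000)  # draws from the true Q

print(gaussian_nll(2.0, 1.5, y).mean())  # true parameters: lowest expected score
print(gaussian_nll(2.0, 3.0, y).mean())  # mis-specified scale scores worse
print(gaussian_nll(0.0, 1.5, y).mean())  # mis-specified location scores worse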
Natural Gradient

Gradient descent is the most commonly used method to optimize an objective function. The ordinary gradient of a scoring rule $S$ is the direction of steepest ascent (fastest increase in infinitesimally small steps). That is,

$$\nabla S(\theta, y) \propto \lim_{\epsilon \to 0} \operatorname*{arg\,max}_{d : \|d\| = \epsilon} S(\theta + d, y) \qquad (3)$$

However, the ordinary gradient is not invariant to reparametrization. To be more specific, if we transform $\theta$ into $\psi = z(\theta)$, then $P_{\theta + d\theta}(y) \ne P_{\psi + d\psi}(y)$. Therefore, different reparametrization approaches will affect the updating path of the parameters. Again, we return later to why we need invariance to reparametrization.

The generalized natural gradient is the direction of steepest ascent in Riemannian space, which is invariant to parametrization, and is defined as:

$$\tilde{\nabla} S(\theta, y) \propto \lim_{\epsilon \to 0} \operatorname*{arg\,max}_{d : D_S(P_\theta \| P_{\theta + d}) = \epsilon} S(\theta + d, y) \qquad (4)$$

When choosing MLE as the proper scoring rule, we get:

$$\tilde{\nabla} S(\theta, y) = \tilde{\nabla} \mathcal{L}(\theta, y) \propto \mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y) \qquad (5)$$

where $\mathcal{I}_{\mathcal{L}}(\theta)$ is the Fisher information carried by an observation about $P_\theta$. Note that a Fisher information matrix is calculated for each sample.

In this section, we take the normal distribution as an example to demonstrate how to implement efficient large-scale distribution estimation.
Simplify Computation
The key of NGBoost is to calculate $\tilde{\nabla}\mathcal{L}(\theta, y)$, which is equal to calculating $\mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y)$. NGBoost computes $\mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y)$ by solving a system of linear equations, whose cost is cubic in the number of distribution parameters $N$ (here $N = 2$). This cost is relatively high for a single-machine algorithm. Moreover, solving a system of linear equations is not conducive to implementing distributed parallel algorithms. We find that, under the premise of the normal distribution, a more direct method for calculating the natural gradient can be derived.

The normal distribution is the most commonly used probability distribution. Many forecasting targets follow the normal distribution or can be transformed into one (such as the log-normal distribution). So we optimize the natural gradient calculation for the normal distribution. For the normal distribution, the distribution parameters are $\theta = [\mu, \psi]$, where $\psi = \log(\sigma)$. By further derivation, we get:

$$\nabla \mathcal{L}(\theta, y) = \begin{bmatrix} \frac{\mu - y}{\sigma^2} \\ 1 - \frac{(\mu - y)^2}{\sigma^2} \end{bmatrix} \qquad (6)$$

Actually, the inverse of the Fisher information matrix can also be derived simply. The Fisher information of the normal distribution is as follows:

$$\mathcal{I}_{\mathcal{L}}(\theta) = \mathbb{E}\begin{bmatrix} \frac{1}{\sigma^2} & \frac{2(-\mu + y)}{\sigma^2} \\ \frac{2(-\mu + y)}{\sigma^2} & \frac{2(\mu - y)^2}{\sigma^2} \end{bmatrix} = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & 2 \end{bmatrix} \qquad (7)$$

Then, we can get:

$$\mathcal{I}_{\mathcal{L}}(\theta)^{-1} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0.5 \end{bmatrix} \qquad (8)$$

Finally, we derive the result of the natural gradient:

$$\tilde{\nabla}\mathcal{L}(\theta, y) \propto \mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y) = \begin{bmatrix} \mu - y \\ 0.5\left(1 - \frac{(\mu - y)^2}{\sigma^2}\right) \end{bmatrix} = \begin{bmatrix} \mu - y \\ 0.5\left(1 - (\mu - y)^2 \exp(-2\psi)\right) \end{bmatrix} \qquad (9)$$

The second form is preferred because the CPU computes multiplication much faster than division. As we can see from the first component, NGBoost computes the expectation $\mu$ in the same way as an ordinary gradient boosting machine that targets mean squared error (MSE).
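As a sanity check on Eqs. (6)-(9), the short NumPy sketch below (ours, for illustration) computes the natural gradient both by the generic linear-solve route and by the closed form derived above; the two agree:

import numpy as np

def natural_gradient_closed_form(mu, psi, y):
    """Closed-form natural gradient for N(mu, exp(psi)^2), Eq. (9)."""
    g_mu = mu - y
    g_psi = 0.5 * (1.0 - (mu - y) ** 2 * np.exp(-2.0 * psi))
    return np.array([g_mu, g_psi])

def natural_gradient_linear_solve(mu, psi, y):
    """Generic route: solve I(theta) d = grad, as done per sample in NGBoost."""
    sigma2 = np.exp(2.0 * psi)
    grad = np.array([(mu - y) / sigma2,
                     1.0 - (mu - y) ** 2 / sigma2])   # Eq. (6)
    fisher = np.array([[1.0 / sigma2, 0.0],
                       [0.0, 2.0]])                   # Eq. (7)
    return np.linalg.solve(fisher, grad)

mu, psi, y = 1.3, 0.2, 2.0
print(natural_gradient_closed_form(mu, psi, y))
print(natural_gradient_linear_solve(mu, psi, y))      # identical result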
Scalable Natural Gradient Boosting

Gradient boosting is effectively a functional gradient descent algorithm. In order to fit multiple parameters of the distribution, we need multiple sets of trees, each set fitting one parameter. Taking the normal distribution as an example, we use two sets of trees to fit $\mu$ and $\log(\sigma)$: the range of GBM output is $(-\infty, +\infty)$, but the range of $\sigma$ is $(0, +\infty)$. Reparametrizing $\sigma \in (0, +\infty)$ to $\psi = \log(\sigma)$, $\psi \in (-\infty, +\infty)$, is consistent with GBM output. This is one of the important reasons why the natural gradient is needed: the natural gradient has the desirable property of being invariant to reparametrization. Another reason to use the natural gradient is to enable the same updating step size for the two new trees when the two sets of trees are updated at each stage. This is because, through the adjustment by $\mathcal{I}_{\mathcal{L}}(\theta)^{-1}$, the gradient is scaled to the same scale both across samples and across parameters ("optimally pre-scaled").

Apache Spark is a popular open-source platform for large-scale data processing, which is especially well-suited for iterative machine learning tasks [Zaharia et al., 2010]. MLlib [Meng et al., 2016] provides ensembles of decision trees for classification and regression problems. Its decision trees use many state-of-the-art techniques from the PLANET project [Panda et al., 2009], such as data-dependent feature discretization to reduce communication costs. Based on the decision trees from Spark ML, we implement scalable natural gradient boosting machines as a tree- and feature-parallel system. Since there is no dependency between the two base learners at each iteration, the two trees for the two parameters can be constructed in parallel.
Algorithm 1: Scalable Natural Gradient Boosting for Normal Distribution

Data: dataset $D = \{x_i, y_i\}_{i=1}^{n}$
Input: boosting iterations $M$, learning rate $\eta$, tree learner $f$, normal distribution with parameters $\mu$ and $\psi = \log \sigma$, proper scoring rule MLE
Output: scalings and tree learners $\{\rho^{(m)}, f^{(m)}\}_{m=1}^{M}$

Initialize $\mu^{(0)} \leftarrow \frac{1}{n}\sum_{i=1}^{n} y_i$, $\psi^{(0)} \leftarrow \log\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$
for $m \leftarrow 1, \ldots, M$ do
    for $i \leftarrow 1, \ldots, n$ do
        $g(\mu_i)^{(m)} \leftarrow \mu_i^{(m-1)} - y_i$
        $g(\psi_i)^{(m)} \leftarrow \frac{1 - (\mu_i^{(m-1)} - y_i)^2 \exp(-2\psi_i^{(m-1)})}{2}$
    end
    $f_\mu^{(m)} \leftarrow \mathrm{fit}(\{x_i, g(\mu)^{(m)}\}_{i=1}^{n})$
    $f_\psi^{(m)} \leftarrow \mathrm{fit}(\{x_i, g(\psi)^{(m)}\}_{i=1}^{n})$
    $\rho^{(m)} \leftarrow \arg\min_\rho \sum_{i=1}^{n} \mathrm{MLE}(\mu_i^{(m-1)} - \rho \cdot f_\mu^{(m)}(x_i),\ \psi_i^{(m-1)} - \rho \cdot f_\psi^{(m)}(x_i),\ y_i)$
    for $i \leftarrow 1, \ldots, n$ do
        $\mu_i^{(m)} \leftarrow \mu_i^{(m-1)} - \eta\,(\rho^{(m)} \cdot f_\mu^{(m)}(x_i))$
        $\psi_i^{(m)} \leftarrow \psi_i^{(m-1)} - \eta\,(\rho^{(m)} \cdot f_\psi^{(m)}(x_i))$
    end
end

The overall training procedure is summarized in Algorithm 1. For the normal distribution, $g(\mu)$ and $g(\psi)$ are the natural gradients of $\mu$ and $\psi$, respectively. In each iteration, the two tree learners $f_\mu$ and $f_\psi$ are constructed in parallel. The scaling factor $\rho$ is chosen to minimize the MLE scoring rule in the form of a line search. We multiply it by the global update step $\eta$ and then update the parameters $\mu$ and $\psi$.
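For illustration, here is a minimal single-machine Python sketch of Algorithm 1. It assumes scikit-learn's DecisionTreeRegressor as the base learner in place of the Spark ML trees described above, and a coarse grid search stands in for the exact line search over rho; function and variable names are ours:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_sn_gbm(X, y, M=100, eta=0.1, max_depth=6):
    """Single-machine sketch of Algorithm 1 (Spark parallelism omitted)."""
    n = len(y)
    mu = np.full(n, y.mean())
    psi = np.full(n, np.log(y.std(ddof=1)))
    trees = []
    for _ in range(M):
        # Natural gradients for the normal distribution, Eq. (9).
        g_mu = mu - y
        g_psi = 0.5 * (1.0 - (mu - y) ** 2 * np.exp(-2.0 * psi))
        f_mu = DecisionTreeRegressor(max_depth=max_depth).fit(X, g_mu)
        f_psi = DecisionTreeRegressor(max_depth=max_depth).fit(X, g_psi)
        d_mu, d_psi = f_mu.predict(X), f_psi.predict(X)

        # Line search for the scaling factor rho (coarse grid for brevity).
        def nll(rho):
            m, p = mu - rho * d_mu, psi - rho * d_psi
            return np.sum(p + 0.5 * (y - m) ** 2 * np.exp(-2.0 * p))
        rho = min((2.0 ** k for k in range(-4, 4)), key=nll)

        mu = mu - eta * rho * d_mu
        psi = psi - eta * rho * d_psi
        trees.append((rho, f_mu, f_psi))
    return trees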
Interpretability of Uncertainty

For each sample, SN-GBM outputs two prediction results: the forecast expectation $\mu$ and the variance $\sigma^2$. Theoretically, the smaller the variance, the narrower the distribution, and the more accurate the prediction. The heteroscedasticity of data often gives rise to uncertainty, and heteroscedasticity often occurs when there is a large difference among the sizes of the observations. So we use the variance to estimate the uncertainty of the prediction results. In tree-based models, feature importance is often used as a factor in interpreting the model's decisions. SN-GBM is composed of two sets of trees: an expectation set and a variance set. We provide two approaches to computing the feature importance of the variance (see the sketch below):

1. Weight: the number of times a feature is used to split the data across the variance trees.
2. Gain: the average gain of the feature when it is used in the variance trees.

By the feature importance of the variance, we can know which features affect the uncertainty of the prediction and their correlation scores.
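As an illustration of the two measures, the sketch below (ours; it reuses the hypothetical fit_sn_gbm output from the earlier sketch) aggregates "weight" and "gain" scores over the variance trees:

import numpy as np

def variance_feature_importance(trees, n_features, kind="weight"):
    """Aggregate importance over the variance (psi) trees from fit_sn_gbm.

    kind="weight": how often a feature is used to split across variance trees.
    kind="gain":   mean impurity-based importance scikit-learn assigns the
                   feature, standing in for average split gain.
    """
    scores = np.zeros(n_features)
    for _, _, f_psi in trees:
        tree = f_psi.tree_
        if kind == "weight":
            used = tree.feature[tree.feature >= 0]  # negative values mark leaves
            np.add.at(scores, used, 1)
        else:
            scores += f_psi.feature_importances_
    return scores / (len(trees) if kind == "gain" else 1)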
Uncertainty Quantification

We propose an approach for quantifying the uncertainty of a forecasting target whose original distribution is non-normal. To provide reliable and accurate predictions, we derive an uncertainty quantification function for revenue forecasting. Through the uncertainty quantification function, we can know the approximate probability of an accurate prediction for each sample. In addition, we propose a brand-new feature selection approach based on the feature importance of the variance, which can improve the precision of uncertainty quantification.

The normal distribution is the most commonly used probability distribution. According to the central limit theorem, if an object is affected by multiple factors, no matter what the distribution of each factor is, the average of the results follows a normal distribution. The normal distribution is symmetric, but many real-world distributions are asymmetric. In fact, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal. A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape, and one Box-Cox transformation is the log transformation. The real revenue distribution is close to a log-normal distribution; after the log transformation, the revenue distribution becomes normal, as shown in Figure 1.

Figure 1: (a) the real revenue distribution; (b) the log-transformed revenue.

In regression tasks, we usually consider not only the error between the prediction and the observation but also the ratio between the error and the observation. In revenue forecasting, we train the SN-GBM model to fit $\ln(Y)$, where $\ln(Y) \sim N(\mu, \sigma^2)$, so the $\mu$ and $\sigma^2$ that the model outputs are the expectation and variance of $\ln(Y)$. In fact, we need to estimate the uncertainty of the original revenue forecast by the relative standard deviation, that is, the ratio of the standard deviation of $Y$ to the expectation of $Y$. If the random variable $\ln(Y)$ has a normal distribution, then the exponential of $\ln(Y)$, $Y = \exp(\ln Y)$, has a log-normal distribution. Let $R$ denote the relative standard deviation:
$$R = \frac{\sqrt{[\exp(\sigma^2) - 1]\exp(2\mu + \sigma^2)}}{\exp(\mu + \sigma^2/2)} = \sqrt{\exp(\sigma^2) - 1} \qquad (10)$$

Because $\sqrt{\exp(\sigma^2) - 1}$ increases monotonically with $\sigma$, we can also use $\sigma$ to measure the relative standard deviation of $Y$.
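A tiny sketch of the resulting uncertainty quantification function in Eq. (10) (illustrative code, ours):

import numpy as np

def relative_std(sigma):
    """Relative standard deviation of Y = exp(ln Y), ln Y ~ N(mu, sigma^2), Eq. (10)."""
    return np.sqrt(np.exp(sigma ** 2) - 1.0)

# R depends only on sigma, so sigma alone can rank predictive uncertainty.
for s in (0.1, 0.3, 0.5):
    print(f"sigma = {s:.1f} -> R = {relative_std(s):.3f}")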
Variance-based Feature Selection

Data from many real-world applications can be high dimensional, and features of such data are usually highly redundant. Identifying informative features has become an important step in data mining, not only to circumvent the curse of dimensionality but to reduce the amount of data for processing. Feature selection is the process of automatically or manually selecting those features that contribute most to the prediction variable or output of interest. One commonly used approach is to filter features using the feature importance output by a tree-based model. Traditional tree-based models select the features that contribute most to the forecast expectation, based on the feature importance of the expectation. This method often ignores the correlation between features and uncertainty (here, variance). Based on the feature importance of the variance in SN-GBM, we can select features that are highly correlated with predictive uncertainty. Some features may have low expectation importance but high variance importance; such features may not improve point estimation performance but may improve the accuracy of the distribution estimation. So, in the future, we can combine the feature importance of expectation and variance to select features.

Experiments

Our experiments use datasets from a large fintech services group. This group has served tens of millions of SMEs, and one of its most significant scenarios is credit limit management. Our goal is to forecast the revenue of SMEs in the next six months. This is a time-series regression task, so we mainly choose historical revenue and trade data of SMEs as the features for constructing the model. We extract twelve sub-datasets from January to December 2018 and five sub-datasets from January to May 2019. The first twelve months form the training set and the last five months form the test set. We extract 215 features related to revenue for each enterprise. The size of the training sample is 10 million.
Evaluation Metrics

The traditional evaluation metric is the mean absolute percentage error (MAPE) of the forecast expectations (i.e., $\hat{\mathbb{E}}[y|x]$), which usually expresses accuracy as a percentage. Because the $\mu$ output by SN-GBM is the expectation of $\ln(y)$, our MAPE formula is defined as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\exp(\mu_i) - y_i}{y_i}\right| \qquad (11)$$

However, MAPE does not capture predictive uncertainty. The quality of predictive uncertainty is captured by the average negative log-likelihood (NLL) as measured on the test set. NLL is calculated as follows:

$$\mathrm{NLL} = -\frac{1}{n}\sum_{i=1}^{n}\log \hat{P}_\theta(y_i | x_i) \qquad (12)$$

where $\hat{P}_\theta$ is the probability density function of the normal distribution, $\hat{P}_\theta(y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}$.

In addition, in order to apply the results of predictive uncertainty to credit limit management more intuitively, we add an evaluation metric, ACCURACY, which indicates the proportion of samples with prediction errors within 30%. In a later section, we briefly introduce how to utilize this metric for more refined credit limit management.
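The three metrics as a short NumPy sketch (ours; it assumes the NLL is evaluated on the log-transformed target, matching the model's normal output):

import numpy as np

def mape(mu, y):
    """Eq. (11): the expectation of y is exp(mu) since the model fits ln(y)."""
    return np.mean(np.abs((np.exp(mu) - y) / y))

def nll(mu, sigma, y_log):
    """Eq. (12): average Gaussian negative log-likelihood of ln(y)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                   + (y_log - mu) ** 2 / (2 * sigma ** 2))

def accuracy(mu, y, tol=0.30):
    """Share of samples whose relative prediction error is within 30%."""
    return np.mean(np.abs((np.exp(mu) - y) / y) <= tol)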
Results of Uncertainty Quantification
We compare SN-GBM with several regression models commonly used in financial scenarios, such as XGBoost [Chen and Guestrin, 2016] and GBDT [Friedman, 2001]. For a fair comparison, we set the learning rate to 0.3, the number of iterations to 300, and the depth of trees to 6 for all algorithms. Our experimental results show that SN-GBM is comparable to state-of-the-art tree-based models in point estimation performance, as shown in Table 1.

Algorithm   201901           201902           201903           201904           201905
            MAPE   ACC.      MAPE   ACC.      MAPE   ACC.      MAPE   ACC.      MAPE   ACC.
GBDT        0.310  0.687     0.305  0.712     0.294  0.736     0.320  0.724     0.293  0.742
XGBoost

Table 1: Comparison of point estimation performance on the revenue scenario. SN-GBM offers competitive point estimation performance in terms of MAPE and ACCURACY.

We sort the prediction results by the uncertainty quantification function $\sigma$ and then divide them into 10 equal-sized buckets. The uncertainty vs. accuracy results are shown in Figure 2. The curve of accuracy against predictive uncertainty is monotonically decreasing. If the application demands an accuracy of x%, we can trust the model only in cases where the uncertainty is less than the corresponding threshold. For example, the ACCURACY of the top 50% of samples (uncertainty levels 1 to 5) is above 90%. Because we have great confidence in the prediction results of the top 50% of samples, for these enterprises we can directly use the predicted revenue as a reference factor for their credit limit. For other enterprises, we need to multiply the predicted revenue by a factor before using it.

Figure 2: Accuracy vs. uncertainty curves. The abscissa indicates the predictive uncertainty level from 1 (low uncertainty) to 10 (high uncertainty). (a) MAPE performance on the five test subsets. (b) ACCURACY performance on the five test subsets.
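The bucketing behind Figure 2 can be sketched as follows (our illustrative code): sort by sigma, split into ten equal-sized buckets, and compute ACCURACY per bucket:

import numpy as np

def accuracy_by_uncertainty_decile(mu, sigma, y, tol=0.30):
    """ACCURACY per uncertainty decile, from low sigma (bucket 1) to high (10)."""
    order = np.argsort(sigma)                         # low uncertainty first
    hit = np.abs((np.exp(mu) - y) / y) <= tol
    return [hit[idx].mean() for idx in np.array_split(order, 10)]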
Interpretability
The interpretability of SN-GBM includes expectation feature importance and variance feature importance. An example of the top 20 important features with respect to expectation in the revenue scenario is shown in Figure 3. From this figure, we observe that features with high expectation feature importance do not necessarily have high variance feature importance.
Figure 3: Top 20 important features with respect to expectation. The "mean importance" indicates the expectation feature importance, and the "variance importance" indicates the variance feature importance.
Feature Selection
For time-series regression, variance-type features are often better able to describe the predictive uncertainty. We append three time-series variance features to the 215 original features: the revenue variance over the past 3 months, over the past 6 months, and over the past 12 months. We compare the point estimation and distribution estimation performance of models with 215 features and with 218 features (the three appended revenue-variance features), respectively. The results of point estimation are shown in Figure 4. Appending the revenue-variance features does not bring a significant improvement in point prediction accuracy. But as Figure 5 shows, the accuracy of the distribution estimation is significantly improved, relatively speaking.
Figure 4: Comparison of point estimation performance. Model 1: 215 features; Model 2: 218 features.

Figure 5: Comparison of distribution estimation performance. Model 1: 215 features; Model 2: 218 features.
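A sketch of how such variance-type features can be derived (ours; the column names enterprise_id, month, and revenue are hypothetical, not from the paper):

import pandas as pd

def add_revenue_variance_features(df):
    """Append rolling revenue-variance features (3/6/12 months) per enterprise.

    Assumes one row per enterprise per month with columns
    ['enterprise_id', 'month', 'revenue'].
    """
    g = df.sort_values("month").groupby("enterprise_id")["revenue"]
    for w in (3, 6, 12):
        df[f"revenue_var_{w}m"] = g.transform(lambda s, w=w: s.rolling(w).var())
    return df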
Conclusion

In this paper, we propose a large-scale uncertainty estimation approach named SN-GBM to predict the revenue of SMEs. The revenue distribution of SMEs is log-normal; after a log transformation, it is close to a normal distribution. For the normal distribution, we further derive the natural gradient to make it suitable for large-scale financial scenarios. We derive an uncertainty quantification function for the original, log-normal distribution. In particular, we provide interpretability for predictive uncertainty: through the uncertainty interpretability, we can know the factors that cause the predictive uncertainty. Experimental results show that we can effectively distinguish between accurate and inaccurate samples on a large-scale real-world dataset, which is significantly beneficial for refined credit limit management. Variance-type features can improve the accuracy of the distribution estimation; in the future, it is worth considering retaining variance-type features when constructing regression models.

References

[Abbott and others, 2011] Lewis F Abbott et al.
The Management of Business Lending: A Survey, volume 2. Industrial Systems Research, 2011.

[Biggs, 2002] Tyler Biggs. Is small beautiful and worthy of subsidy? Literature review. 2002.

[Breiman, 2017] Leo Breiman. Classification and Regression Trees. Routledge, 2017.

[Chang and Wang, 2006] Pei-Chann Chang and Yen-Wen Wang. Fuzzy Delphi and back-propagation model for sales forecasting in PCB industry. Expert Systems with Applications, 30(4):715-726, 2006.

[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.

[Chipman et al., 2010] Hugh A Chipman, Edward I George, Robert E McCulloch, et al. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266-298, 2010.

[Dawid and Musio, 2014] Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules. Metron, 72(2):169-183, 2014.

[Duan et al., 2019] Tony Duan, Anand Avati, Daisy Yi Ding, Sanjay Basu, Andrew Y Ng, and Alejandro Schuler. NGBoost: Natural gradient boosting for probabilistic prediction. arXiv preprint arXiv:1910.03225, 2019.

[Friedman, 2001] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.

[Gneiting and Raftery, 2007] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, 2007.

[He et al., 2019] Jingyu He, Saar Yalov, and P Richard Hahn. XBART: Accelerated Bayesian additive regression trees. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1130-1138, 2019.

[Ke et al., 2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146-3154, 2017.

[Kendall and Gal, 2017] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584, 2017.

[Khandani et al., 2010] Amir E Khandani, Adlar J Kim, and Andrew W Lo. Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance, 34(11):2767-2787, 2010.

[Kruchten, 2016] N Kruchten. Machine learning meets economics. 2016.

[Lakshminarayanan et al., 2016] Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests for large-scale regression when uncertainty matters. In Artificial Intelligence and Statistics, pages 1478-1487, 2016.

[Lakshminarayanan et al., 2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402-6413, 2017.

[Lloyd-Reason and Mughan, 2006] Lester Lloyd-Reason and Terry Mughan. Removing barriers to SME access to international markets. 2006.

[Meng et al., 2016] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1):1235-1241, 2016.

[Ovadia et al., 2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, 2019.

[Panda et al., 2009] Biswanath Panda, Joshua S Herbach, Sugato Basu, and Roberto J Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426-1437, 2009.

[Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.

[Sharma and Sinha, 2012] Rashmi Sharma and Ashok K Sinha. Sales forecast of an automobile industry. International Journal of Computer Applications, 53(12), 2012.

[Zaharia et al., 2010] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.