Likelihood-Free Gaussian Process for Regression
Yuta Shikuri
Yokohama, Japan [email protected]
Abstract
Gaussian process regression can flexibly represent the posterior distribution of an interest parameter provided that information on the likelihood is sufficient. However, in some cases, we have little knowledge regarding the probability model. For example, when investing in a financial instrument, the probability model of cash flow is generally unknown. In this paper, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which allows representation of the posterior distributions of interest parameters for scalable problems without directly setting their likelihood functions. The LFGP establishes clusters in which the probability distributions of the targets can be considered identical, and it approximates the likelihood of the interest parameter in each cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. We expect that the proposed framework will contribute significantly to likelihood-free modeling, especially from the perspective of fewer assumptions for the probability model and low computational costs for scalable problems.
Gaussian process regression (GPR) is a type of Bayesian nonparametric regression method that allows flexible representation of the posterior distribution of an interest parameter. However, some drawbacks to GPR have been reported in the literature, and several studies have endeavored to overcome these limitations and extend the applicability of the method. The most severe drawback of GPR is the computational cost, which is $O(N^3)$ for data size $N$. Consequently, the application of GPR to scalable problems has been a significant area of focus in current research. The variational inducing-point approach [Titsias, 2009, Candela and Rasmussen, 2005], which minimizes the Kullback–Leibler divergence to approximate the posterior distribution, is the most popular approach among methods for reducing the computational cost and is highlighted in some key works [Csató and Opper, 2000, Shen et al., 2005, Bui and Turner, 2014]. Hensman et al. [2015] extended the likelihood to a non-Gaussian (i.e., free-form) likelihood by estimating the hyperparameters via the inducing-point framework and Markov-chain Monte Carlo procedures.

These works have contributed toward modeling the posterior distribution of an interest parameter more flexibly. However, although the free-form likelihood is a powerful tool, difficulties are encountered when the probability model is unknown. A typical example is the modeling of cash flows for investing in financial instruments [Thu and Xuan, 2018, Sidehabi et al., 2016]. The demand for machine learning frameworks for algorithmic trading has been rapidly increasing in recent years. Traders are typically challenged with predicting asset fluctuations using nonparametric models trained on large amounts of historical data. However, setting the probability model in such a situation is generally infeasible because the cash flows of financial instruments are quite complex.

Demands for likelihood-free modeling exist not only in the field of investments in financial instruments but also in other fields such as ecology and biology. To satisfy this demand, various types of likelihood-free inference methods have been proposed. One idea that is common among these methods is representing the probability distribution of targets through a repetitive process of simulating data and evaluating the discrepancies between the simulated and observed data [Gutmann and Corander, 2016]. In general, these methods have high computational cost, and we do not know how the parameters affect the discrepancy. Wood [2010] proposed a method that uses a synthetic likelihood to approximate the probability distribution of targets to a multivariate normal distribution by the central limit theorem. The computational cost of that method is lower than that of other methods. In addition, the discrepancies between the simulated and the observed data can be evaluated by comparing the individual maximum likelihood estimators for the data. Their method is simple and powerful, and our work is profoundly inspired by it. However, the utility of the method is limited to cases where the generative processes of the targets are clear even though the likelihoods are intractable.

In this study, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which represents the posterior distribution of an interest parameter in the form of a typical GPR without setting the likelihood function directly.
We approximate the likelihood to a Gaussian within the framework of GPR using the asymptotic normality of the maximum likelihood estimator. The concept of our approach is similar to that of Wood [2010] from the perspective of the Gaussian approximation. However, we do not assume any generative processes and instead focus on establishing identically distributed clusters to obtain the maximum likelihood estimators of an interest parameter. To verify the performance of the LFGP, we modeled some pseudo datasets and the binary option (BO), which is a type of derivative instrument. Both the methods and their results suggest that the LFGP is capable of representing the posterior distribution of an interest parameter even if only little knowledge of its probability model is available. In addition, we show that the LFGP is suitable for scalable problems.

Contribution.
We consider that the LFGP will significantly contribute to likelihood-free modeling, especially from the perspective of fewer assumptions for the probability models and lower computational costs for scalable problems.
Gaussian processes (GPs) are distributions over functions [Rasmussen and Williams, 2006], such that
$$ f \sim \mathcal{GP}(\nu(\cdot), k(\cdot,\cdot)), \qquad (1) $$
where $\nu : \mathbb{R}^d \to \mathbb{R}$ is the mean function and $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is the covariance function. Given an observed dataset $\mathcal{D} = \{X, y\} = \{x_i, y_i\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, the posterior distribution of $f = \{f(x_i)\}_{i=1}^n$, $f : \mathbb{R}^d \to \mathbb{R}$, is stated as
$$ \mathrm{P}(f \mid \mathcal{D}, \Lambda, \theta) \propto \mathrm{P}(y \mid f, \Lambda)\, \mathrm{N}(f \mid \nu_X, K_{X,X;\theta}), \qquad (2) $$
where $\nu_X = \{\nu(x_i)\}_{i=1}^n$, $[K_{X,X';\theta}]_{s,t} = k(x_s, x'_t \mid \theta)$, $\theta$ are the hyperparameters of the prior distribution, $\Lambda = \{\Lambda(x_i)\}_{i=1}^n$ are the parameters of the probability model other than $f$, $\Lambda : \mathbb{R}^d \to \mathbb{R}^{p-1}$, and $p \in \mathbb{N}$ is the number of parameters.

Training and Prediction.
In the context of GPR, our main concern is maximizing the log marginal likelihood
$$ \mathcal{L}(\Lambda, \theta \mid \mathcal{D}) = \log \mathrm{P}(y \mid X, \Lambda, \theta) = \log \int \mathrm{P}(y \mid f, \Lambda)\, \mathrm{N}(f \mid \nu_X, K_{X,X;\theta})\, \mathrm{d}f. \qquad (3) $$
The hyperparameters are optimized as
$$ \Lambda^*, \theta^* = \arg\max_{\Lambda, \theta}\; \mathcal{L}(\Lambda, \theta \mid \mathcal{D}). \qquad (4) $$
If the probability model $\mathrm{P}(y \mid f, \Lambda)$ is Gaussian, then we can analytically obtain the gradients $\partial \mathcal{L}(\Lambda, \theta \mid \mathcal{D}) / \partial \Lambda$ and $\partial \mathcal{L}(\Lambda, \theta \mid \mathcal{D}) / \partial \theta$ for the optimization. The form of the log marginal likelihood is given as
$$ \mathcal{L}_{\mathrm{normal}}(\Lambda, \theta \mid \mathcal{D}) = -\tfrac{1}{2} (y - \nu_X)^{\mathrm{T}} K^{-1} (y - \nu_X) - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log(2\pi), \qquad (5) $$
where $K = K_{X,X;\theta} + \Lambda$, $\Lambda = \sigma^2 I_n$, and $0 < \sigma \in \mathbb{R}$. In addition, the posterior distribution of $f^* = \{f(x^*_i)\}_{i=1}^{n^*}$ for a new input $X^* = \{x^*_i\}_{i=1}^{n^*}$ with $x^*_i \in \mathbb{R}^d$ is Gaussian. Further, its mean and variance are given as
$$ \mathrm{E}[f^* \mid \mathcal{D}, X^*, \theta^*, \sigma^*] = \nu_{X^*} + K_{X^*,X;\theta^*} (K_{X,X;\theta^*} + \sigma^{*2} I_n)^{-1} (y - \nu_X), \qquad (6) $$
$$ \mathrm{Var}[f^* \mid \mathcal{D}, X^*, \theta^*, \sigma^*] = K_{X^*,X^*;\theta^*} + \sigma^{*2} I_{n^*} - K_{X^*,X;\theta^*} (K_{X,X;\theta^*} + \sigma^{*2} I_n)^{-1} K_{X,X^*;\theta^*}, \qquad (7) $$
where $\sigma^*$ is the optimized value of $\sigma$. Despite these simplifications, GPR contributes significantly toward modeling various problems flexibly. However, the computational cost $O(n^3)$ of computing the inverse matrix $K^{-1}$ is unacceptable from the view of scalability, and the probability model $\mathrm{P}(y \mid f, \Lambda)$ is unknown in some cases.
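For concreteness, the following minimal sketch (Python with NumPy; an illustration, not the paper's implementation) computes the posterior mean and covariance of eqs. (6) and (7) under a Gaussian likelihood with the RBF covariance of eq. (8) below, assuming $\nu \equiv 0$ and fixed hyperparameters.

```python
import numpy as np

def rbf_kernel(A, B, C=1.0, length=1.0):
    """RBF covariance k(a, b) = C * exp(-0.5 * ||a - b||^2 / length^2), cf. eq. (8)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return C * np.exp(-0.5 * sq / length**2)

def gp_posterior(X, y, X_new, sigma=0.1, C=1.0, length=1.0):
    """Posterior mean/covariance of f* under a Gaussian likelihood (eqs. (6)-(7), nu = 0)."""
    K = rbf_kernel(X, X, C, length) + sigma**2 * np.eye(len(X))
    K_star = rbf_kernel(X_new, X, C, length)
    L = np.linalg.cholesky(K)                      # the O(n^3) step that limits scalability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_new, X_new, C, length) + sigma**2 * np.eye(len(X_new)) - v.T @ v
    return mean, cov

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mean, cov = gp_posterior(X, y, np.linspace(-2, 2, 5)[:, None])
```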
Covariance Function.
The radial basis function (RBF) kernel is a basic covariance function that can be written as
$$ k_{\mathrm{RBF}}(x_s, x'_t \mid \theta) \equiv C \exp\!\left( -\tfrac{1}{2} (x_s - x'_t)^{\mathrm{T}} \mathrm{diag}(l)^{-2} (x_s - x'_t) \right), \qquad (8) $$
where $\theta = \{C, l\}$, $l = \{l_j\}_{j=1}^d$, and $0 < C, l_j \in \mathbb{R}$ for all $j$. Covariance functions can be designed flexibly by combining multiple simpler functions. For example, Lee et al. [2018] proved that a GP with a specific covariance function is equivalent to the function of a deep neural network with infinitely wide layers. However, covariance functions are limited to positive semi-definite types, which sometimes prevents the ability to design them freely. The Euclidean distance space $(\mathbb{R}^d, d_E)$ between two points $(x_s, x'_t)$ in the RBF kernel $\exp(-d_E(x_s, x'_t)^2)$ cannot be replaced by an uneven distance space [Feragen et al., 2015], where $d_E : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. To address this, some works [Guhaniyogi and Dunson, 2016, Calandra et al., 2016] transformed the original feature space into a new space instead of designing the covariance function directly.

In this section, we discuss the mechanism of the LFGP. As the premise of this discussion, we assume that the targets are independent, that the set of parameters and the set of probability distributions have a one-to-one correspondence, that the Fisher information matrix is a regular matrix, and that $\nu(\cdot) \equiv 0$.

To approximate the likelihood to a Gaussian using the asymptotic normality of the maximum likelihood estimator, we consider assigning the data points $\mathcal{D} = \{x_i, y_i\}_{i=1}^n$ to $m$ clusters, i.e., $\pi = \{\pi_i\}_{i=1}^n$, $\pi_i \in \{1, 2, \cdots, m\}$. If the targets $y = \{y_i\}_{i=1}^n$ are identically distributed in each cluster and the size $n_h$ of each cluster is sufficiently large, then the maximum likelihood estimators $y_\pi = \{y_{\pi_h}\}_{h=1}^m$ of the interest parameters $u = \{f(z_h)\}_{h=1}^m$ at the centroids $Z = \{z_h\}_{h=1}^m$, $z_h \in \mathbb{R}^d$, of the clusters approximately follow the Gaussian distribution
$$ y_\pi \sim \mathrm{N}(y_\pi \mid u, \Sigma_\pi), \qquad (9) $$
where $\Sigma_\pi = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \cdots, \sigma_m^2)$ and $\sigma_h^2 = \left( n_h\, \mathrm{E}_y\!\left[\left(\tfrac{\partial}{\partial f(z_h)} \log \mathrm{P}(y \mid f(z_h), \Lambda(z_h))\right)^2\right] \right)^{-1} \sim O(1/n_h)$ for all $h$ [Lehmann, 1999].
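A small numerical check of the behavior behind eq. (9) (Python/NumPy; the Beta(2, 5) targets, cluster size, and the use of the sample mean as the cluster-level estimator are illustrative assumptions, not part of the paper): repeated cluster-level estimates concentrate around the true parameter with standard deviation of order $n_h^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n_h, trials = 2.0, 5.0, 400, 5000   # assumed Beta(2, 5) targets in one cluster

# cluster-level estimates of the mean parameter over repeated samples of size n_h
estimates = rng.beta(a, b, size=(trials, n_h)).mean(axis=1)

true_mean = a / (a + b)
clt_std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1) * n_h))  # sqrt(Var(y) / n_h)

print(f"empirical mean {estimates.mean():.4f} vs true {true_mean:.4f}")
print(f"empirical std  {estimates.std():.4f} vs CLT  {clt_std:.4f}")
```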
Identical Distribution.
We evaluate the parameter consistency by the geometric mean $r(\pi, Z \mid \mathcal{D}, \theta)$ of the correlations between the data points and centroids in each cluster, i.e.,
$$ r(\pi, Z \mid \mathcal{D}, \theta) \equiv \left( \prod_h \prod_{i:\, \pi_i = h} \frac{k(x_i, z_h \mid \theta)}{\sqrt{k(x_i, x_i \mid \theta)\, k(z_h, z_h \mid \theta)}} \right)^{1/n}. \qquad (10) $$
If all the data points of a cluster are equivalent to its centroid, then $-\log r(\pi, Z \mid \mathcal{D}, \theta) = 0$. We optimize $m, \pi, Z$ to maximize $r(\pi, Z \mid \mathcal{D}, \theta)$ under the constraint that the size of each cluster exceeds $n_0$:
$$ m^*(\theta), \pi^*(\theta), Z^*(\theta) = \arg\min_{\substack{m, \pi, Z \\ n_1, n_2, \cdots, n_m \ge n_0}} -\log r(\pi, Z \mid \mathcal{D}, \theta). \qquad (11) $$
In this discussion, we assume that the approximation of the likelihood to a Gaussian holds for $-\log r(\pi^*(\theta), Z^*(\theta) \mid \mathcal{D}, \theta) \le \delta$ and $n_h \ge n_0$ for all $h$. Furthermore, we approximate $\Sigma_\pi$ as $\sigma_h^2 \sim O(1/n_h) \le O(1/n_0) \simeq 0$.
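With the RBF kernel of eq. (8), $k(x, x \mid \theta) = C$, so each factor in eq. (10) reduces to $\exp(-\tfrac{1}{2}(x_i - z_h)^{\mathrm{T}}\mathrm{diag}(l)^{-2}(x_i - z_h))$ and $-\log r$ becomes an average scaled squared distance to the assigned centroid. The short sketch below (Python/NumPy, illustrative rather than the paper's code) evaluates this quantity for a given assignment.

```python
import numpy as np

def neg_log_r(X, labels, Z, lengthscales):
    """-log of the geometric-mean correlation in eq. (10) for an RBF kernel.

    The normalised correlation between x_i and its centroid z_h is
    exp(-0.5 * sum_j ((x_ij - z_hj) / l_j)^2), so -log r is the average scaled
    squared distance to the centroid; minimising it matches the k-means
    objective of eq. (16).
    """
    diffs = (X - Z[labels]) / lengthscales          # (n, d) scaled residuals
    return 0.5 * np.mean(np.sum(diffs**2, axis=1))

# toy usage: two well-separated, tight clusters give a small -log r
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.1, (100, 2)), rng.normal(2, 0.1, (100, 2))])
labels = np.repeat([0, 1], 100)
Z = np.array([X[:100].mean(axis=0), X[100:].mean(axis=0)])
print(neg_log_r(X, labels, Z, lengthscales=np.array([1.0, 1.0])))
```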
Training and Prediction.
Based on these assumptions, the log marginal likelihood is given as
$$ \mathcal{L}_{\mathrm{free}}(\theta \mid \mathcal{D}, \pi, Z) = -\tfrac{1}{2}\, y_\pi^{\mathrm{T}} K_{Z,Z;\theta}^{-1} y_\pi - \tfrac{1}{2} \log |K_{Z,Z;\theta}| - \tfrac{m}{2} \log(2\pi). \qquad (12) $$
We optimize $\theta$ to maximize the log marginal likelihood $\mathcal{L}_{\mathrm{free}}(\theta \mid \mathcal{D}, \pi, Z)$ under the constraints of the aforementioned assumptions:
$$ \theta^* = \arg\max_{\theta:\; -\log r(\pi^*(\theta), Z^*(\theta) \mid \mathcal{D}, \theta) \le \delta}\; \mathcal{L}_{\mathrm{free}}(\theta \mid \mathcal{D}, \pi^*(\theta), Z^*(\theta)). \qquad (13) $$
The mean and variance of the posterior distribution of $u^* = \{f(x^*_i)\}_{i=1}^{n^*}$ for the new input $X^*$ are
$$ \mathrm{E}[u^* \mid \mathcal{D}, X^*, \theta^*] = K_{X^*,Z^*;\theta^*} K_{Z^*,Z^*;\theta^*}^{-1} y_{\pi^*}, \qquad (14) $$
$$ \mathrm{Var}[u^* \mid \mathcal{D}, X^*, \theta^*] = K_{X^*,X^*;\theta^*} - K_{X^*,Z^*;\theta^*} K_{Z^*,Z^*;\theta^*}^{-1} K_{Z^*,X^*;\theta^*}, \qquad (15) $$
where $\pi^* = \pi^*(\theta^*)$ and $Z^* = Z^*(\theta^*)$. It is usually infeasible to optimize $\theta$ directly under the given constraints, as in eq. (13). Therefore, we decompose the optimization into the two iterative processes of algorithm 1. The log marginal likelihood maximization step is usually reasonable because the computational cost of the inverse matrix is at most $O(n^3/n_0^3)$. However, distance-based clustering is generally difficult for scalability reasons [Asgharbeygi and Maleki, 2008].
Algorithm 1 Hyperparameter optimization
  Input: D = {X, y}, n_0, δ, ε, initial θ.  Output: θ_old, π, Z
  repeat
    θ_old ← θ
    m, π, Z ← argmin_{m, π, Z : n_1, ..., n_m ≥ n_0} −log r(π, Z | D, θ)    ▷ Clustering under the cluster size constraint.
    θ ← argmax_θ L_free(θ | D, π, Z)                                        ▷ Maximizing the log marginal likelihood.
  until −log r(π, Z | D, θ) ≤ δ and L_free(θ | D, π, Z) − L_free(θ_old | D, π, Z) ≤ ε

To reduce the difficulties associated with scalable clustering, we set the RBF kernel as the covariance function. The form of eq. (11) is then expressed as follows:
$$ m^*_{\mathrm{RBF}}(\theta), \pi^*_{\mathrm{RBF}}(\theta), Z^*_{\mathrm{RBF}}(\theta) = \arg\min_{\substack{m, \pi, Z \\ n_1, n_2, \cdots, n_m \ge n_0}} \sum_h \sum_{i:\, \pi_i = h} (x_i - z_h)^{\mathrm{T}} \mathrm{diag}(l)^{-2} (x_i - z_h). \qquad (16) $$
This is equivalent to the linear k-means objective [Hartigan and Wong, 1979], which is computationally light because we can obtain the centroids analytically. The cost is given by $O(nm)$ [Murphy, 2012]. However, two remaining issues are the size constraint on the clusters generated by linear k-means and the representation of uneven distances.
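Before addressing these two issues, here is a compact, simplified sketch of the alternating scheme of algorithm 1 with the k-means clustering of eq. (16) as the clustering step (Python with NumPy, SciPy, and scikit-learn; illustrative assumptions rather than the paper's implementation: the number of clusters m is fixed instead of being controlled by the size constraint, the cluster-wise sample mean stands in for the maximum likelihood estimator, and L-BFGS-B is used as the optimizer).

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans

def rbf(A, B, C, l):
    sq = ((A[:, None, :] - B[None, :, :]) / l) ** 2
    return C * np.exp(-0.5 * sq.sum(-1))

def neg_L_free(log_theta, Z, y_pi):
    """Negative of eq. (12); a small jitter keeps the Cholesky factorisation stable."""
    C, l = np.exp(log_theta[0]), np.exp(log_theta[1:])
    K = rbf(Z, Z, C, l) + 1e-6 * np.eye(len(Z))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_pi))
    return 0.5 * y_pi @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(Z) * np.log(2 * np.pi)

def fit_lfgp(X, y, m=20, n_iter=5):
    """Alternate clustering and marginal-likelihood maximisation (algorithm 1, simplified)."""
    log_theta = np.zeros(1 + X.shape[1])              # log C, log lengthscales
    for _ in range(n_iter):
        l = np.exp(log_theta[1:])
        km = KMeans(n_clusters=m, n_init=5, random_state=0).fit(X / l)
        labels, Z = km.labels_, km.cluster_centers_ * l
        y_pi = np.array([y[labels == h].mean() for h in range(m)])   # cluster-wise estimators
        res = minimize(neg_L_free, log_theta, args=(Z, y_pi), method="L-BFGS-B")
        log_theta = res.x
    return np.exp(log_theta), Z, y_pi
```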
Constraint.
We can obtain the solution of the linear k-means objective heuristically and efficiently using the expectation–maximization algorithm. However, assigning data points to clusters under cluster size constraints is usually difficult. Here, we provide algorithm 2 by applying the idea of x-means [Pelleg and Moore, 2000]: a cluster is recursively split into two clusters by linear k-means until the clusters no longer meet the division condition. This algorithm enables us to retain the cluster size constraint while maintaining the efficiency of linear k-means. The computational cost of algorithm 2 is at most $O(n^2/n_0)$. Therefore, the total cost of algorithm 1 is at most $O(n^2/n_0 + n^3/n_0^3)$.

Algorithm 2 Recursive clustering
  Input: X.  Output: π, Z ← RC(X)
  procedure RC(X)
    if 2 n_0 > |X| then
      return the assignment number π and the centroid z of X
    else
      Split X into two clusters X_1, X_2 by linear k-means.
      If min{|X_1|, |X_2|} < n_0, then split X into X_1, X_2 evenly and randomly.
      Run RC(X_1), RC(X_2) recursively.
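A minimal sketch of the recursive splitting of algorithm 2 (Python with NumPy and scikit-learn's two-cluster k-means standing in for the linear k-means step; an illustration, not the paper's implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def recursive_clustering(X, n0, rng=np.random.default_rng(0)):
    """Recursively bisect X with 2-means until clusters would drop below size n0 (algorithm 2)."""
    if 2 * n0 > len(X):                       # cannot split into two clusters of size >= n0
        return [np.arange(len(X))], [X.mean(axis=0)]
    labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X)
    if min(np.sum(labels == 0), np.sum(labels == 1)) < n0:
        labels = rng.permutation(len(X)) % 2   # fall back to an even random split
    members, centroids = [], []
    for c in (0, 1):
        idx = np.where(labels == c)[0]
        sub_members, sub_centroids = recursive_clustering(X[idx], n0, rng)
        members += [idx[m] for m in sub_members]   # map local indices back to X
        centroids += sub_centroids
    return members, centroids

# toy usage
X = np.random.default_rng(3).normal(size=(1000, 3))
members, centroids = recursive_clustering(X, n0=100)
print([len(m) for m in members])
```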
Manifold.
As shown above, the RBF kernel significantly reduces the computational cost of algorithm 2. However, we cannot guarantee that the RBF kernel appropriately represents the feature space. The RBF kernel is not capable of representing an uneven distance over the feature space because the covariance matrix must be positive semi-definite. This is a critical drawback when the data points are distributed on manifolds $(\mathcal{M}, d_{\mathcal{M}})$ embedded in Euclidean space $(\mathbb{R}^d, d_E)$. To overcome this drawback, as an option of the LFGP, we convert the original feature space to a new Euclidean space that approximately preserves the uneven distances. For this conversion, we use manifold learning methods (e.g., locally linear embedding (LLE) [Saul and Roweis, 2001], Isomap [Schoeneman et al., 2018], and UMAP [McInnes et al., 2018]) that represent the manifold distances among data points based on the k-neighbor graph.

In this section, we demonstrate two experimental results that verify the performance of the LFGP. The computing environment is Microsoft Windows 10 Pro, a 3.6 GHz Intel Core i9 processor, and 64 GB of memory. The code to replicate each experiment in this study is available at https://github.com/MLPaperCode/LFGP.

In this subsection, we present verification of the behavior of the LFGP from two perspectives using pseudo datasets (table 1). First, we determine the representations of the posterior distributions of some interest parameters. The interest parameters depend on the individual problem, and the LFGP allows us to model various types of parameters flexibly. Second, we convert the original feature space to a new Euclidean space using UMAP. The feature space sometimes has the structure of a manifold embedded in Euclidean space. However, in the framework of the LFGP, kernels other than the RBF kernel are practically difficult to use owing to the computational cost of clustering at scale. We empirically show that a preprocessing conversion before training reduces this drawback. In our experiments, we do not focus on the performance of each manifold learning method.
Setup.
We prepared three pseudo datasets (Cube, Tube, and Roll), each consisting of n data points. Each dataset consists of two clusters whose targets follow beta distributions with different shapes. The parameters mapped by f are the mean, median, variance, and skew of the beta distribution. Under these experimental conditions, we first train the LFGP on each dataset, both with and without conversion by UMAP. Next, we show the mean of the posterior distribution of each parameter on n* test data points, generated by replacing n in table 1 with n*.
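For concreteness, the sketch below (Python/NumPy) generates a Cube-like pseudo dataset in the spirit of table 1; the beta parameters, cluster boundary, and sample size used here are illustrative placeholders rather than the values of table 1.

```python
import numpy as np

def make_cube_like(n, a1=1.0, b1=3.0, a2=2.0, b2=3.0, seed=0):
    """Cube-like pseudo dataset: features uniform in a cube, split into two clusters along
    the first axis, with targets drawn from a different beta distribution in each cluster.
    The beta parameters here are placeholders, not the values of table 1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-2, 2, size=(n, 3))
    cluster = (X[:, 0] > 0).astype(int)            # cluster 1: x_1 <= 0, cluster 2: x_1 > 0
    y = np.where(cluster == 0, rng.beta(a1, b1, n), rng.beta(a2, b2, n))
    return X, y, cluster

X, y, cluster = make_cube_like(20000)
# cluster-wise sample moments play the role of the interest parameters (mean, variance, ...)
print(y[cluster == 0].mean(), y[cluster == 1].mean())
```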
Table 1: Pseudo datasets
Type  Cluster    x_{i,1}                          x_{i,2}                          x_{i,3}     y_i
Cube  i ≤ n/2    U(−2, 0)                         U(−2, 2)                         U(−2, 2)    Be(1, ·)
      i > n/2    U(0, 2)                          U(−2, 2)                         U(−2, 2)    Be(2, ·)
Tube  i ≤ n/2    cos(·iπ/n) + N(0, 0.1)           sin(·iπ/n) + N(0, 0.1)           U(−2, 2)    Be(1, ·)
      i > n/2    2 cos(·iπ/n) + N(0, 0.1)         2 sin(·iπ/n) + N(0, 0.1)         U(−2, 2)    Be(2, ·)
Roll  i ≤ n/2    (·i/n) cos(·iπ/n) + N(0, 0.1)    (·i/n) sin(·iπ/n) + N(0, 0.1)    U(−2, 2)    Be(1, ·)
      i > n/2    (·i/n) cos(·iπ/n) + N(0, 0.1)    (·i/n) sin(·iπ/n) + N(0, 0.1)    U(−2, 2)    Be(2, ·)
Performance.
From fig. 1 and fig. 2, the result with UMAP conversion in the case of Cube is some distance away from the true value of each parameter compared with the result without conversion. Meanwhile, the performance with UMAP conversion is superior to non-conversion in the cases of Tube and Roll. In particular, the result without conversion in the case of Tube is poor. We consider that the hyperparameter search along the Cartesian coordinates is difficult when data points are symmetrically distributed. These results suggest that the feature space should be converted appropriately depending on the specific problem.

Figure 1: Mean of f(x*_i) against x*_{i,·} when the dataset is converted by UMAP. Pink: i ≤ n*/2, light blue: i > n*/2. Upper: Cube. Middle: Tube. Lower: Roll. (a) mean. (b) median. (c) variance. (d) skew. n* = 200 data points are plotted. The two dashed lines are the true values of each parameter in each cluster (table 2). n = 20,000, d = 3, n_0 = 100, δ = 1, ε = 1, and the number of k-neighbors of UMAP is ·. Training in the case of Cube did not converge within the iteration limit.

Figure 2: Mean of f(x*_i) against x*_{i,·} when the dataset is not converted by UMAP. Other conditions are the same as in fig. 1. Training in the case of Roll sometimes did not converge within the iteration limit.

Table 2: True parameter values of the beta distribution
Cluster    y_i         Mean   Median   Variance   Skew
i ≤ n/2    Be(1, ·)    ·      ·        ·          ·
i > n/2    Be(2, ·)    ·      ·        ·          ·
Table 3 suggests that we can train the LFGP on scalable problems using limited iterations and in a realistic timeframe. However, the case wherein n = 100,000 and n_0 = 500 did not converge because the density of data points in each cluster was sparse. Therefore, n_0 should be set appropriately depending on the individual problem.

Table 3: Calculation time and repetition count of algorithm 1. All numerical values in this table are averaged over · trials for the case of Cube, the mean parameter, and no UMAP conversion.
             n_0 = 100                             n_0 = 500
n            training time [sec]   repetitions     training time [sec]   repetitions
100,000      199 ± 64              5.· ± ·         — *                   — *
···,000      684 ± 243             6.· ± ·         · ± 157               4.· ± ·
···,000      2,··· ± ·,414         7.· ± ·         ·,··· ± ·,834         7.· ± ·
···,000      11,··· ± ·,560        8.· ± ·         ·,··· ± ·,471         5.· ± ·
·,···,000    44,··· ± ·,805        6.· ± ·         ·,··· ± ·,059         6.· ± ·
* Training did not converge within the iteration limit.

It is generally infeasible to model asset price fluctuations because their probability models are unknown. Herein, we suggest that the LFGP enables representing the percentile points of currency exchange rate fluctuations. To verify the performance of the LFGP, we consider the BO, which is a simple derivative instrument. A BO generally pays out according to whether the currency exchange rate ends above or below a reference level at a given time, so its cash flow is literally binary in nature despite being based on the currency exchange rate. Therefore, we can model the cash flow from the BO directly without representing the percentile points of the currency exchange rate fluctuations. Using this feature of the BO, we compare the LFGP with a baseline model. The historical currency exchange rates used here are from the OANDA API (https://developer.oanda.com/rest-live-v20/introduction/).

Rule.
In this experiment, we consider the following rules for the BO provided by HighLow (https://trade.highlow.com/):
1. Predict whether a currency exchange rate will increase or decrease 30 s in advance.
2. The payout is · times the entry cost when the prediction is correct.
3. A draw is treated as incorrect (significant digits are · pips).
4. Possible entry timing is Monday to Friday, 8:00–29:00, except for New Year and holidays.

Strategy.
The entry asset is GBP/JPY because the frequency of draws is low compared with other currencies. We train the LFGP as a nonparametric model under the hypothesis that there are some patterns in rate movements over short intervals. We do not convert the feature space before training because of the lack of knowledge of the feature space. The whole procedure is simple:
1. Represent the ·th and ·th percentile points every · min with the LFGP, based on every · s of rate fluctuation over the previous d × · s.
2. Add stress to the means of the percentile points based on the posterior distributions:
$$ \int_{-\infty}^{f_H} \mathrm{N}(f^*_{\mathrm{hi}} \mid m^*_{\mathrm{hi}}, k^*_{\mathrm{hi}})\, \mathrm{d}f^*_{\mathrm{hi}} = \alpha, \qquad (17) $$
$$ \int_{-\infty}^{f_L} \mathrm{N}(f^*_{\mathrm{lo}} \mid m^*_{\mathrm{lo}}, k^*_{\mathrm{lo}})\, \mathrm{d}f^*_{\mathrm{lo}} = 1 - \alpha, \qquad (18) $$
where $f_H$ and $f_L$ are the means after stress of the upper and lower percentile points of the currency exchange rate fluctuation in 30 s, $f^*_{\mathrm{hi}}$ and $f^*_{\mathrm{lo}}$ are those percentile points, $m^*_{\mathrm{hi}}, k^*_{\mathrm{hi}}$ are the mean and variance of the posterior distribution of $f^*_{\mathrm{hi}}$, $m^*_{\mathrm{lo}}, k^*_{\mathrm{lo}}$ are the mean and variance of the posterior distribution of $f^*_{\mathrm{lo}}$, and $\alpha$ represents the degree of stress, $0 < \alpha \le 0.·$, $\alpha \in \mathbb{R}$ (see the sketch after this list).
3. Bet High if the ·th percentile point is more than · pips, and bet Low if the ·th percentile point is less than −· pips.
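The stressed values in eqs. (17) and (18) are simply Gaussian quantiles of the posterior distributions. A minimal sketch of step 2 (Python/SciPy; the variable names and numeric inputs are hypothetical, not values from the experiment):

```python
from scipy.stats import norm

def stressed_thresholds(m_hi, k_hi, m_lo, k_lo, alpha):
    """Solve eqs. (17)-(18): f_H and f_L are Gaussian quantiles of the posterior
    distributions of the upper and lower percentile points."""
    f_H = norm.ppf(alpha, loc=m_hi, scale=k_hi ** 0.5)        # alpha-quantile of the upper point
    f_L = norm.ppf(1.0 - alpha, loc=m_lo, scale=k_lo ** 0.5)  # (1 - alpha)-quantile of the lower point
    return f_H, f_L

# hypothetical posterior moments (pips) for one entry opportunity
f_H, f_L = stressed_thresholds(m_hi=0.8, k_hi=0.04, m_lo=-0.5, k_lo=0.09, alpha=0.3)
print(f_H, f_L)
```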
Baseline.
As a baseline model, we use a random forest (RF) [Kam, 1995], which is a nonparametric binary classification model. The RF is suitable as a baseline from the perspective of scalability. We represent the individual probabilities of High and Low instead of percentile points and join the rounds when the probability is higher than 1 − β, where 0 < β ≤ 0.·, β ∈ R.
Evaluation.
The training period is 2017/4/1–2018/3/31, and the evaluation period is 2018/4/1–2019/3/31. From table 4, the LFGP with d = 10, n_0 = 100 maximizes the cumulative profit for the evaluation period, and the RF does so with d = 20, depth = 10.

Table 4: Cumulative profit for the period 2018/4/1–2019/3/31 by the LFGP (δ = 1, ε = 1) and the RF (the number of trees is ·). Profit is · if the prediction for each entry is correct; the loss is · otherwise. Each value is the maximum over · trials in the case of α = 0.·, β = 0.·.
       LFGP (α = 0.·)                        RF (β = 0.·)
d      n_0 = 100   n_0 = 200   n_0 = 300     depth = 5   depth = 10   depth = 15
10     1,547       999         576           1,021       1,070        336
20     1,047       1,087       1,372         487         1,181        655
30     796         1,408       1,264         263         936          848
Figure 3: Cumulative profit against entry count for the period 2018/4/1–2019/3/31. Left: LFGP (d = 10, n = 314,···, n* = 313,···, n_0 = 300). Right: RF (d = 20, n = 312,···, n* = 312,···, depth = 10). Other conditions are the same as in table 4.
The training period is 2017/4/1–2018/3/31, and the backtesting period is 2019/4/1–2020/3/31. In addition to the evaluation results, fig. 4 suggests that the smaller the values of α and β, the greater the cumulative profit against entry count, and that the performances of the LFGP and the RF are similar. This suggests that the LFGP is capable of representing the posterior distribution of percentile points without knowledge of the probability distribution of the currency exchange rate fluctuations.

Figure 4: Cumulative profit against entry count for the period 2019/4/1–2020/3/31. Left: LFGP (d = 10, n = 314,···, n* = 316,···, n_0 = 300). Right: RF (d = 20, n = 312,···, n* = 315,···, depth = 10). Other conditions are the same as in fig. 3.

In this study, we proposed the LFGP, which is a framework for likelihood-free modeling. The main concept of the LFGP is the approximation of the likelihood in each identically distributed cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. Compared with existing methods, this approximation requires fewer assumptions for the probability models and lower computational costs for scalable problems. However, we emphasize that the asymptotic normality is based on the assumptions that the targets are independent, that the set of parameters and the set of probability distributions have a one-to-one correspondence, and that the Fisher information matrix is a regular matrix. Although it is necessary to consider whether these assumptions are valid for individual problems, we expect that the proposed method will contribute to likelihood-free modeling in several fields. A typical application is the modeling of cash flows for investing in financial instruments; such cash flows are, in general, quite complex, unlike those of binary options. At present, we are continuing our experiments to gain actual profits from foreign exchange (FX) trading, and not only by means of backtesting. We expect to report the results of this extended work in the near future.
Broader Impact
This study could have a positive impact in terms of stimulating the investment market. Novice investors must learn numerous techniques, which is sometimes a barrier to joining investment markets. Algorithmic trading might reduce this barrier with lower fees than professional funds. Further, a concentration risk may occur if investors use similar algorithmic trading models. We should be cautious, as algorithmic trading could break the investment market.
Acknowledgement
I wish to thank Dr. Naoki Hamada, Dr. Junpei Komiyama, Dr. Takashi Ohga, Atsushi Suyama, Kevin Noel, Tomoaki Nakanishi, Kohei Fukuda, Masahiro Asami, Daisuke Kadowaki, Yuji Hiramatsu, Hirokazu Iwasawa, and Korki Tomizawa for the invaluable advice provided during the writing of this paper. I would also like to thank Editage for their high-quality English language editing. I am grateful to my family for supporting my research activities.
References
Michalis K. Titsias. Variational learning of inducing variables in sparse gaussian processes. Proceedings of Machine Learning Research, 5:567–574, April 2009.

Joaquin Q. Candela and Carl E. Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, December 2005.

Lehel Csató and Manfred Opper. Sparse representation for gaussian process models. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 423–429, Cambridge, MA, USA, December 2000. MIT Press.

Yirong Shen, Andrew Y. Ng, and Matthias Seeger. Fast gaussian process regression using kd-trees. In Proceedings of the 18th International Conference on Neural Information Processing Systems, pages 1225–1232, Cambridge, MA, USA, December 2005. MIT Press.

Thang D. Bui and Richard E. Turner. Tree-structured gaussian process approximations. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 2213–2221, Cambridge, MA, USA, December 2014. MIT Press.

James Hensman, Alexander G. de G. Matthews, Maurizio Filippone, and Zoubin Ghahramani. Mcmc for variationally sparse gaussian processes. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 1648–1656, Cambridge, MA, USA, December 2015. MIT Press.

Thuy Nguyen Thi Thu and Vuong D. Xuan. Forex trading using supervised machine learning. International Journal of Engineering and Technology, 7(4.15):400–404, 2018.

Sitti W. Sidehabi, Indrabayu Amirullah, and Sofyan Tandungan. Statistical and machine learning approach in forex prediction based on empirical data. In Proceedings of the International Conference on Computational Intelligence and Cybernetics. IEEE, November 2016.

Michael U. Gutmann and Jukka Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17(125):1–47, January 2016.

Simon N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, August 2010.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, May 2018.

Aasa Feragen, Francois Lauze, and Søren Hauberg. Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of Computer Vision and Pattern Recognition, pages 3032–3042, Boston, MA, USA, October 2015. IEEE.

Rajarshi Guhaniyogi and David B. Dunson. Compressed gaussian process for manifold regression. Journal of Machine Learning Research, 17(1):2472–2497, January 2016.

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold gaussian processes for regression. In Proceedings of the International Joint Conference on Neural Networks, pages 3338–3345, Vancouver, BC, Canada, July 2016. IEEE.

E. L. Lehmann. Elements of Large-Sample Theory. Springer, 1999.

Nima Asgharbeygi and Arian Maleki. Geodesic k-means clustering. In Proceedings of the International Conference on Pattern Recognition, pages 1–4. IEEE, December 2008.

J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28(1):100–108, January 1979.

Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the International Conference on Machine Learning, pages 727–734. Morgan Kaufmann Publishers, June 2000.

Lawrence K. Saul and Sam T. Roweis. An introduction to locally linear embedding. Journal of Machine Learning Research, 7, January 2001.

Frank Schoeneman, Varun Chandola, Nils Napp, Olga Wodo, and Jaroslaw Zola. Entropy-isomap: Manifold learning for high-dimensional dynamic processes. In Proceedings of the IEEE International Conference on Big Data. IEEE, August 2018.

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. Journal of Open Source Software, 3(29), September 2018.

Ho T. Kam. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, pages 278–282, Montreal, Quebec, 1995. IEEE.