A Novel Cluster Classify Regress Model Predictive Controller Formulation; CCR-MPC
Clement Etienam, Siying Shen, Edward J O'Dwyer, Joshua Sykes
Active Building Research Programme, Imperial College London
Abstract
In this work, we develop a novel data-driven model predictive controller using advanced techniques from the field of machine learning. The objective is to regulate control signals to attain the desired internal room set-point temperature, which is affected indirectly by the external weather states. The methodology involves developing a time-series machine learning model, with either a Long Short-Term Memory network (LSTM) or a gradient boosting algorithm (XGBoost), capable of forecasting the weather states for any desired time horizon, while concurrently optimising the control signals towards the desired set point. The supervised learning model mapping the weather states together with the control signals to the room temperature is constructed using a previously developed methodology called Cluster Classify Regress (CCR), which is similar in style to, but scales better to high-dimensional datasets than, the well-known Mixture-of-Experts. The overall methodology uses CCR as a forward model within a batch or sequential optimisation approach. The overall method, called CCR-MPC, combines a time-series model for weather-state prediction, CCR for the forward model, and any numerical optimisation method for solving the inverse problem. Two numerical optimisation methods are shown: the Nelder–Mead method and a Bayesian-style iterative ensemble smoother (I-ES). Forward uncertainty quantification (forward UQ) depends on the regression model within the CCR and is attainable using a Bayesian deep neural network or a Gaussian process (GP). In this work, within the CCR formulation, we employ K-means clustering for the clustering step, an XGBoost classifier for the classification step, and 5th-order polynomial regression for the regression step. Inverse UQ can also be obtained by using an I-ES approach for solving the inverse problem, or the well-known Markov chain Monte Carlo (MCMC) approach. The developed CCR-MPC is elegant and, as seen in the numerical experiments, is able to optimise the controller to attain the desired set-point temperature.
1. Introduction
Energy efficiency is a major concern in achieving sustainability in modern society. The sustainability of smart cities depends on the availability of energy-efficient infrastructure and services. Buildings make up most of a city, and they are responsible for most of its energy consumption and emissions to the atmosphere (about 40%). Smart cities need smart buildings to achieve sustainability goals, and thermal modelling of buildings is essential in the race for energy efficiency. In this report, we introduce a novel data-driven model predictive controller (MPC). A mathematical introduction is presented to help understand the fundamental supervised learning problem of predicting the room temperature in a building, as well as forecasting the weather states over a 10-15 minute lag duration. Finally, the overall data-driven MPC is implemented on some toy problems. The rest of the report is structured as follows: Section 2 introduces the background necessary for our novel data-driven MPC methodology, consisting of supervised learning theory, time-series modelling using recurrent neural networks (RNN) and long short-term memory networks (LSTM), optimisation methods, neural networks, gradient boosting, and the cluster classify regress algorithm developed in [6] and reformulated in [8]. Sections 3 and 4 show some numerical experiments, and Section 5 gives conclusions and insights into future work.

2. Background
For a supervised learning model the following ansatz holds. Assume data $\{(x_j, y_j)\}_{j=1}^{N}$, where $x_j \in \mathbb{R}^K$ and $y_j \in \mathbb{R}^q$ for regression, or $y_j \in \{0, 1, \dots, J\}^q$ for classification, with $(x_j, y_j) \in X \times Y$ taken to be the input and output of a model. Postulating a family of ansatz functions $f(\,\cdot\,; \theta)$, parametrised by $\theta \in \mathbb{R}^p$, we can find a $\theta^*$ such that for all $j = 1, \dots, N$,

$$y_j = f(x_j; \theta^*) \qquad \text{Eqn 1(a)}$$

Here $f: X \to Y$ is the forward mapping, which may be irregular, have sharp features, be very non-linear and have noticeable discontinuities. Then for $x'$ not in the set $\{x_j\}_{j=1}^{N}$,

$$y' \approx f(x'; \theta^*) \qquad \text{Eqn 1(b)}$$

where $y'$ is the true label of $x'$. Examples include generalised linear models (GLM) and neural networks (NN). Desirable properties of such a model are:

• the $N$ data points are only seen in training, and often not all at once; memory is sublinear or constant in $N$;
• only $N$ data points and $N$ inner products are required for prediction;
• the model can be treated in a Bayesian manner;
• the number of hyper-parameters $p$ can be small (3 for Gaussian processes [5]);
• it seeks to identify the best model $f(\,\cdot\,; \theta)$ without relying on any parametric form.

We find the best $\theta^*$ which minimises the data misfit

$$\Phi(\theta) = \sum_{j=1}^{N} \ell\big(y_j, f(x_j; \theta)\big) \qquad \text{Eqn 1(c)}$$

For linear least squares this has the closed-form solution

$$\theta^* = (X^{\mathsf T} X)^{-1} X^{\mathsf T} Y \qquad \text{Eqn 1(d)}$$

where $X = [x_1, \dots, x_N]$ and $Y = [y_1, \dots, y_N]$.

An RNN model captures a time-series sequence from the following ansatz [15, 19]. Assume a hidden state $h_t \in H \equiv \mathbb{R}^{n_h}$; the dynamics are described by

$$h_t = f_h(h_{t-1}, u_t) \qquad \text{Eqn 2}$$
$$y_t = f_y(h_t) \qquad \text{Eqn 3}$$

$f_h: H \times X \to H$ is the non-linear transition function describing the dynamics of $h_t \in H$, and $f_y: H \to Y$ describes the functional mapping from the hidden state to the output variable. $h_t \in H$ is a running summary of the inputs $u_t \in X$ up to time $t$. The recurrence formula updates this summary based on its previous value $h_{t-1} \in H$, where we assume $h_0 = 0$. $f_h$ is the composition of an element-wise non-linearity with an affine transformation of both $u_t$ and $h_{t-1}$, such that

$$h_t = f_h(W_{uh} u_t + W_{hh} h_{t-1}) \qquad \text{Eqn 4}$$

where $W_{uh} \in \mathbb{R}^{n_h \times n_u}$ is the input-to-hidden weight matrix and $W_{hh} \in \mathbb{R}^{n_h \times n_h}$ is the hidden-to-hidden weight matrix. Similarly,

$$y_t = f_y(W_{hy} h_t) \qquad \text{Eqn 5}$$

where $W_{hy} \in \mathbb{R}^{n_y \times n_h}$ is the hidden-to-output weight matrix. Defining $f_h$ as a sigmoid or hyperbolic tangent function and $f_y$ as the identity function, we have

$$h_t = \tanh(W_{uh} u_t + W_{hh} h_{t-1}) \qquad \text{Eqn 6}$$
$$y_t = W_{hy} h_t \qquad \text{Eqn 7}$$

where $\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$. In (6),
$\theta \equiv \{W_{uh}, W_{hh}, W_{hy}\}$ are parameters that are estimated using the back-propagation-through-time (BPTT) method. RNNs have memory, but their long-term memory is limited: the gradient can explode or vanish while training with the BPTT algorithm [15]. An LSTM unit consists of a memory cell $c_t$, an input gate $i_t$, a forget gate $f_t$ and an output gate $o_t$. The memory cell carries the memory content of the LSTM unit, while the three gates control the amount of change to, and exposure of, that memory content.

Fig 1:
The architecture of an (a) RNN and (b) LSTM.
In Fig. 1(b), $u_t$ is the input to the memory-cell layer, and $\sigma$ is the element-wise logistic sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. Mathematically, for $n_c$ LSTM units, the forget, input and output gates are described as follows:

$$f_t = \sigma(W_{uf} u_t + W_{hf} h_{t-1}) \qquad \text{Eqn 8a}$$
$$i_t = \sigma(W_{ui} u_t + W_{hi} h_{t-1}) \qquad \text{Eqn 8b}$$
$$o_t = \sigma(W_{uo} u_t + W_{ho} h_{t-1}) \qquad \text{Eqn 8c}$$

$W_{uf} \in \mathbb{R}^{n_c \times n_u}$, $W_{ui} \in \mathbb{R}^{n_c \times n_u}$ and $W_{uo} \in \mathbb{R}^{n_c \times n_u}$ are the LSTM weights from the input $u_t$ to $f_t$, $i_t$ and $o_t$ respectively. Similarly, $W_{hf} \in \mathbb{R}^{n_c \times n_h}$, $W_{hi} \in \mathbb{R}^{n_c \times n_h}$ and $W_{ho} \in \mathbb{R}^{n_c \times n_h}$ are the weights from the hidden state $h_{t-1}$ to $f_t$, $i_t$ and $o_t$ respectively. The gates in Eqns 8a-8c control the information flow through the LSTM: the forget gate $f_t$ determines how much of the previous memory content is allowed to pass through; the input gate $i_t$ determines how much of the candidate value will be written to the memory cell; and the output gate $o_t$ determines how much of the memory content will be made available to the next layer. The candidate value $\tilde{c}_t$ and the memory cell $c_t$ are updated by

$$\tilde{c}_t = \tanh(W_{uc} u_t + W_{hc} h_{t-1}) \qquad \text{Eqn 9}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{Eqn 10}$$
$$h_t = o_t \odot \tanh(c_t) \qquad \text{Eqn 11}$$

Finally, an RNN with LSTM architecture is implemented by replacing the recurrent hidden layer with an LSTM cell.
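The gate equations above (Eqns 8-11) can be sketched in a few lines of numpy. This is a minimal illustration with random toy weights, bias terms omitted as in the text; it is not the trained machine used later:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(u_t, h_prev, c_prev, W, U):
    """One LSTM step following Eqns 8-11 (bias terms omitted).
    W: input-to-gate matrices, U: hidden-to-gate matrices."""
    f_t = sigmoid(W["uf"] @ u_t + U["hf"] @ h_prev)   # forget gate, Eqn 8a
    i_t = sigmoid(W["ui"] @ u_t + U["hi"] @ h_prev)   # input gate,  Eqn 8b
    o_t = sigmoid(W["uo"] @ u_t + U["ho"] @ h_prev)   # output gate, Eqn 8c
    cc_t = np.tanh(W["uc"] @ u_t + U["hc"] @ h_prev)  # candidate,   Eqn 9
    c_t = f_t * c_prev + i_t * cc_t                   # memory cell, Eqn 10
    h_t = o_t * np.tanh(c_t)                          # hidden state, Eqn 11
    return h_t, c_t

# Toy dimensions: n_u = 3 inputs, n_c = 2 LSTM units
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((2, 3)) for k in ("uf", "ui", "uo", "uc")}
U = {k: rng.standard_normal((2, 2)) for k in ("hf", "hi", "ho", "hc")}
h, c = np.zeros(2), np.zeros(2)
for t in range(5):                                    # roll the cell over a short sequence
    h, c = lstm_step(rng.standard_normal(3), h, c, W, U)
```

Because $h_t = o_t \odot \tanh(c_t)$ with $o_t \in (0, 1)$, the hidden state stays bounded in $(-1, 1)$ regardless of how long the cell is rolled out.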
We give a brief introduction here; for more, the reader may refer to [29]. Let $y = f(x; m) + \varepsilon$, where $\varepsilon$ is random noise. Let $X = [x_1, \dots, x_N]$ and $Y = [y_1, \dots, y_N]$. We place a prior on $m$; the aim is to sample from the posterior

$$\pi(m \mid X, Y) \propto \pi(Y \mid X, m)\, \pi(m) \qquad \text{Eqn 12}$$

A typical method is the ensemble Kalman filter (EnKF). Assuming a linear-Gaussian case, the prior pdf is expressed as

$$\pi(m) \propto c \exp\left(-\tfrac{1}{2}(m - m_{prior})^{\mathsf T} C_M^{-1} (m - m_{prior})\right) \qquad \text{Eqn 13}$$

where $m_{prior}$ is the best prior estimate (mean) of the hyper-parameters of the machine, $m$ is the vector of model hyper-parameters, $C_M$ is the covariance matrix of the model, and $c$ is a constant. Corrupting the true data,

$$d_{obs} = d_{true} + \varepsilon \qquad \text{Eqn 14}$$

where $d_{obs}$ is the observed data, $d_{true}$ is the true output data, and $\varepsilon$ is the noise that accounts for the limited accuracy of the measurement equipment. Under full Gaussian linearity, the likelihood pdf is

$$\pi(d \mid X, m) = c \exp\left(-\tfrac{1}{2}\big(d_{obs} - f(X; m)\big)^{\mathsf T} C_{D}^{-1} \big(d_{obs} - f(X; m)\big)\right) \qquad \text{Eqn 15}$$

where $C_{D}$ is the covariance matrix of the measurement noise. The conditional pdf can then be derived as

$$\pi(m \mid X, d) = c \exp\left(-\tfrac{1}{2}(m - m_{prior})^{\mathsf T} C_M^{-1}(m - m_{prior}) - \tfrac{1}{2}\big(d_{obs} - f(X; m)\big)^{\mathsf T} C_{D}^{-1}\big(d_{obs} - f(X; m)\big)\right) \qquad \text{Eqn 16}$$

where $c$ is a normalising constant. The aim is the minimisation of the objective function $Q(m)$,

$$Q(m) = O_m(m) + O_d(m) \qquad \text{Eqn 17}$$

$$Q(m) = \tfrac{1}{2}(m - m_{prior})^{\mathsf T} C_M^{-1}(m - m_{prior}) + \tfrac{1}{2}\big(d_{obs} - f(X; m)\big)^{\mathsf T} C_{D}^{-1}\big(d_{obs} - f(X; m)\big) \qquad \text{Eqn 18}$$

where $O_m(m)$ is the model mismatch term, which regularises the problem.
$O_d(m)$ is the data mismatch term. If we assume the relationship between model and predicted data is linear, i.e. $f(X; m) = Gm$, then the posterior mean satisfying Eqn 18 is obtained iteratively as

$$m^{i+1} = m^{i} + C_M G^{\mathsf T}\big(G C_M G^{\mathsf T} + C_D\big)^{-1}\big(d_{obs} - G m^{i}\big) \qquad \text{Eqn 19}$$

Let $d^{uc} = Gm$ be the solution of the forward problem. The cross-covariance and data covariance are estimated from an ensemble of $N_e$ members as

$$\tilde{C}_{MD} = \frac{1}{N_e - 1} \sum_{j=1}^{N_e} (m_j - \bar{m})\big(G(m_j) - \overline{G(m)}\big)^{\mathsf T} = C_M G^{\mathsf T} \qquad \text{Eqn 20(a)}$$

$$\tilde{C}_{DD} = \frac{1}{N_e - 1} \sum_{j=1}^{N_e} \big(G(m_j) - \overline{G(m)}\big)\big(G(m_j) - \overline{G(m)}\big)^{\mathsf T} = G C_M G^{\mathsf T} \qquad \text{Eqn 20(b)}$$

$$m^{i+1} = m^{i} + \tilde{C}_{MD}\big(\tilde{C}_{DD} + C_D\big)^{-1}\big(d_{obs} - G m^{i}\big) \qquad \text{Eqn 21}$$

where $\tilde{C}_{MD}(\tilde{C}_{DD} + C_D)^{-1}$ is the Kalman gain matrix and $i$ is the iteration index. The update in Eqn 21 can also be posed as online sequential learning of the model hyper-parameters of a parametric supervised learning model.

This sub-section re-introduces a recent algorithm developed in [6] and re-formulated in [8]. For a set of labelled data $\{(x_j, y_j)\}_{j=1}^{N}$, where $(x_j, y_j) \in X \times Y$ are taken to be the input and output of a model,

$$y_j \approx f(x_j) \qquad \text{Eqn 22}$$

$f: X \to Y$ is irregular, has sharp features, is very non-linear and has noticeable discontinuities. The output space is taken as $Y = \mathbb{R}$ and $X = \mathbb{R}^{d}$. In the first stage, we seek to cluster the training input-output pairs with a label function $T: X \times Y \to \mathcal{L} \equiv \{1, \dots, L\}$ which minimises

$$\Phi_{clust}(T) = \sum_{l=1}^{L} \sum_{(x_j, y_j) \in S_l} \ell(x_j, y_j)$$
Eqn 23

$$S_l = \{(x_j, y_j);\; T(x_j, y_j) = l\} \qquad \text{Eqn 24}$$

with $\ell$ the loss function associated to cluster $l$. For $z_j = (x_j, y_j)$,

$$\ell = |z_j - c_l| \qquad \text{Eqn 25}$$

where $c_l$ is the centroid of the points $z_j$ in cluster $l$ and $|\cdot|$ denotes the Euclidean norm. With $l_j = T(x_j, y_j)$, this gives an expanded training set $\{(x_j, y_j, l_j)\}_{j=1}^{N}$ (Eqn 26(a)).

In the classification stage, a classifier

$$f_c: X \to \mathcal{L} \qquad \text{Eqn 26(b)}$$

provides, for each $x \in X$, an estimate $f_c: x \mapsto f_c(x) \in \mathcal{L}$ such that $f_c(x_j) = l_j$ for the majority of the data. This step is crucial for the ultimate fidelity of the prediction; $\{y_j\}$ is ignored at this phase. The classification function minimises

$$\Phi_{class}(f_c) = \sum_{j=1}^{N} \ell_c\big(l_j, f_c(x_j)\big) \qquad \text{Eqn 26(c)}$$

where $\ell_c: \mathcal{L} \times \mathcal{L} \to \mathbb{R}^{+}$ is small if $f_c(x_j) = l_j$. For example, we can choose $f_c(x) = \arg\max_{l \in \mathcal{L}} g_l(x)$, where $g_l(x) > 0$, $\sum_{l=1}^{L} g_l(x) = 1$ is a soft classifier and $\ell_c(l_j, f_c(x_j)) = -\log(g_{l_j}(x_j))$ is a cross-entropy loss.

In the regression stage, a regressor

$$f_r: X \times \mathcal{L} \to Y \qquad \text{Eqn 27(a)}$$

must, for each $(x, l) \in X \times \mathcal{L}$, provide an estimate $f_r: (x, l) \mapsto f_r(x, l) \in Y$ such that $f_r(x, f_c(x)) \approx y$ for both the training and the test data. If successful, this yields a good reconstruction of

$$f: X \to Y \qquad \text{Eqn 27(b)}$$

where $f(\cdot) = f_r(\cdot, f_c(\cdot))$. The regression function can be found by minimising

$$\Phi_r(f_r) = \sum_{j=1}^{N} \ell_r\big(y_j, f_r(x_j, f_c(x_j))\big) \qquad \text{Eqn 27(c)}$$

where $\ell_r: Y \times Y \to \mathbb{R}^{+}$ is minimised when $f_r(x_j, f_c(x_j)) = y_j$; in this case it can be chosen as $\ell_r\big(y, f_r(x, f_c(x))\big) = |y - f_r(x, f_c(x))|$. The data can be partitioned into $C_l = \{j;\; f_c(x_j) = l\}$ for $l = 1, \dots, L$, and then $L$ separate regressions performed in parallel:

$$\Phi_{rl}\big(f_r(\cdot, l)\big) = \sum_{j \in C_l} \ell_r\big(y_j, f_r(x_j, l)\big) \qquad \text{Eqn 27(d)}$$

With $x = (x_1, \dots, x_d) \in \mathbb{R}^{d}$ and $y \in \mathbb{R}$, we have $|(x, y) - (x', y')|^2 = (y - y')^2 + |x - x'|^2$. Before clustering, the data are rescaled: for $i = 1, \dots, d$,

$$\tilde{x}_i = \frac{x_i - \min_{j \in \{1, \dots, N\}} x_{ij}}{\max_{j \in \{1, \dots, N\}} x_{ij} - \min_{j \in \{1, \dots, N\}} x_{ij}} \qquad \text{Eqn 28(a)}$$

$$\tilde{y} = C\, \frac{y - \min_{j \in \{1, \dots, N\}} y_j}{\max_{j \in \{1, \dots, N\}} y_j - \min_{j \in \{1, \dots, N\}} y_j} \qquad \text{Eqn 28(b)}$$

For clustering we take $C > 1$, e.g. $C = 10d$ where $d = \dim(x)$; for regression, $C = 1$.
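The three CCR stages can be sketched end-to-end on a synthetic discontinuous function. This is an illustration only: scikit-learn's GradientBoostingClassifier stands in for the XGBoost gate named in the abstract, K-means does the clustering, and the per-cluster experts are 5th-order polynomials; the target function is an assumption chosen to have a jump:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier

# Discontinuous 1-D target: the kind of map CCR is designed for
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = np.where(x < 0, np.sin(3 * x), 2 + np.cos(3 * x))

# 1) Cluster: K-means on the joint (x, y) pairs, L = 2
z = np.column_stack([x, y])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)

# 2) Classify: learn x -> cluster label
clf = GradientBoostingClassifier(random_state=0).fit(x.reshape(-1, 1), labels)

# 3) Regress: one degree-5 polynomial expert per cluster (parallelisable, Eqn 27(d))
polys = {int(l): np.polynomial.Polynomial.fit(x[labels == l], y[labels == l], 5)
         for l in np.unique(labels)}

def ccr_predict(x_new):
    """f(x) = f_r(x, f_c(x)): route each point through the gate, then its expert."""
    l_hat = clf.predict(np.asarray(x_new).reshape(-1, 1))
    return np.array([polys[int(li)](xi) for li, xi in zip(l_hat, np.ravel(x_new))])

err = np.max(np.abs(ccr_predict(x) - y))
```

A single degree-5 polynomial could not track the jump at $x = 0$, but the clustered pair of experts reconstructs each smooth branch almost exactly.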
A critique of this method is that it re-uses the data in each phase. A Bayesian formulation handles this limitation elegantly. Recall:
$$D = \{(x_j, y_j)\}_{j=1}^{N} \qquad \text{Eqn 29(a)}$$

Assume parametric models for the classifier $f_c(\cdot; \theta_c)$ and the regressors $f_r(\cdot, l; \theta_{rl})$ for $l = 1, \dots, L$, where $\theta_c = (\theta_{c1}, \dots, \theta_{cL})$ and $\theta_r = (\theta_{r1}, \dots, \theta_{rL})$, and let $\theta = (\theta_c, \theta_r)$. The posterior density has the form

$$\pi(\theta, l \mid D) \propto \prod_{j=1}^{N} \pi(y_j \mid x_j, \theta_r, l)\, \pi(l \mid x_j, \theta_c)\, \pi(\theta_c)\, \pi(\theta_r) \qquad \text{Eqn 29(b)}$$

$$\pi(y_j \mid x_j, \theta_r, l) \propto \exp\left(-\tfrac{1}{2}\,|y_j - f_r(x_j, l; \theta_{rl})|^2\right) \qquad \text{Eqn 29(c)}$$

and

$$\pi(l \mid x_j, \theta_c) = g_l(x_j; \theta_c) \qquad \text{Eqn 29(d)}$$

$$g_l(x_j; \theta_c) = \frac{\exp\big(h_l(x_j; \theta_{cl})\big)}{\sum_{k=1}^{L} \exp\big(h_k(x_j; \theta_{ck})\big)} \qquad \text{Eqn 29(e)}$$

where the $h_l(x; \theta_{cl})$ are some standard parametric regressors.

Random Forests (RF) [21] belong to the family of algorithms known as decision trees. In a decision tree, the goal is to create a prediction model that predicts an output by combining different input variables. Each node corresponds to one of the input variables, and each leaf represents a value of the target variable given the values of the input variables along the path from the root to the leaf. The random forest algorithm trains different decision trees using different subsets of the training data, as depicted in Fig. 2.
Fig. 2: Sketch representation of an RF workflow. Random subsets of the data are trained on by the RF; the randomness generates models that are not correlated with each other.

The main hyper-parameters of an RF are:

• Maximum features: the maximum number of features that the RF is allowed to try in each individual tree.
• Number of estimators: the number of trees built before taking the maximum vote or average of the predictions.
• Minimum sample leaf size: the leaf is the terminal node of a decision tree; this sets the minimum number of samples it may contain.

Advantages:

• The chance of overfitting decreases, since several different decision trees are used in the learning procedure.
• RFs apply well when no particular distribution of the data is required; for example, no data normalisation is needed.
• Parallelisation: the training of multiple trees can be parallelised (for example across different computational slots).

Disadvantages:

• RFs may suffer on smaller training datasets.
• The time to train an RF might be longer compared to other algorithms.
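As an illustration, the three hyper-parameters listed above map directly onto scikit-learn's RandomForestRegressor; the toy features and target here are assumptions, not the building dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy weather-like features -> temperature-like target (assumed, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = 15 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=300)

# The three hyper-parameters discussed above:
rf = RandomForestRegressor(
    n_estimators=200,      # number of trees built before averaging predictions
    max_features=3,        # features each tree may try at a split
    min_samples_leaf=2,    # minimum leaf size
    n_jobs=-1,             # tree training is parallelised
    random_state=0,
).fit(X, y)
score = rf.score(X, y)     # R^2 on the training set
```

Note that the raw uniform features are used without normalisation, illustrating the distribution-free property listed among the advantages.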
An artificial neural network (ANN) with a single hidden layer can be represented graphically as in Fig. 3.

Figure 3: Depiction of a neural network architecture
Formally, a one-hidden-layer MLP is a function $f: \mathbb{R}^{K} \to \mathbb{R}^{L}$, where $K$ is the size of the input vector $x$ and $L$ is the size of the output vector. In matrix notation,

$$f(x) = G\big(b^{(2)} + W^{(2)}\, s(b^{(1)} + W^{(1)} x)\big) \qquad \text{Eqn 30}$$

with bias vectors $b^{(1)}, b^{(2)}$, weight matrices $W^{(1)}, W^{(2)}$ and activation functions $G$ and $s$. The vector $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)}$ is the weight matrix connecting the input vector to the hidden layer; each row of $W^{(1)}$ represents the weights from the input units to one hidden unit. Typical choices for $s$ include

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \qquad \text{Eqn 31(a)}$$

$$\mathrm{sigmoid}(a) = \frac{1}{1 + e^{-a}} \qquad \text{Eqn 31(b)}$$

$$\mathrm{ReLU}(a) = \max(0, a) \qquad \text{Eqn 31(c)}$$

The output vector is then

$$f(x) = G\big(b^{(2)} + W^{(2)} h(x)\big) \qquad \text{Eqn 32}$$

For classification problems, class-membership probabilities can be obtained by choosing $G$ as the softmax function (in the case of multi-class classification):

$$\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_{k=1}^{L} \exp(a_k)} \qquad \text{Eqn 33}$$

where $a_i$ represents the $i$-th element of the input to the softmax, corresponding to class $i$, and $L$ is the number of classes. The result is a vector containing the probabilities that sample $x$ belongs to each class; the predicted output is the class with the highest probability. For regression the output remains

$$f(x) = b^{(2)} + W^{(2)} h(x) \qquad \text{Eqn 34}$$

where the output activation is the identity. The ANN uses different loss functions depending on the problem type. For classification the loss is the cross-entropy, which in the binary case is given by

$$\mathrm{Loss}(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y)\ln(1 - \hat{y}) + \frac{\alpha}{2}\|W\|_2^2 \qquad \text{Eqn 35}$$

where $\frac{\alpha}{2}\|W\|_2^2$ is an L2-regularisation term (aka penalty) that penalises complex models, and $\alpha > 0$ is a non-negative hyper-parameter that controls the magnitude of the penalty. For regression problems, the ANN uses the squared-error loss function,

$$\mathrm{Loss}(\hat{y}, y, W) = \frac{1}{2}\|\hat{y} - y\|_2^2 + \frac{\alpha}{2}\|W\|_2^2 \qquad \text{Eqn 36}$$
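Eqns 30-34 can be sketched directly in numpy; the toy dimensions and random weights below are assumptions for illustration, not a trained network:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # Eqn 33, shifted for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2, task="classification"):
    """One-hidden-layer MLP of Eqns 30-34: s = tanh, G = softmax (or identity)."""
    h = np.tanh(b1 + W1 @ x)           # hidden layer h(x), Eqn 30
    a = b2 + W2 @ h                    # pre-activation output
    return softmax(a) if task == "classification" else a   # Eqn 32 / Eqn 34

# Toy network: 3 inputs, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
p = mlp_forward(rng.standard_normal(3), W1, b1, W2, b2)
```

With $G$ as the softmax, the output `p` is a valid probability vector over the two classes; passing `task="regression"` returns the identity output of Eqn 34 instead.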
Starting from initial random weights, the multi-layer perceptron (MLP) minimises the loss function by repeatedly updating these weights. After computing the loss, a backward pass propagates it from the output layer to the previous layers, providing each weight parameter with an update value meant to decrease the loss. In gradient descent, the gradient $\nabla_{W} \mathrm{Loss}$ of the loss with respect to the weights is computed and deducted from $W$. More formally,

$$W^{i+1} = W^{i} - \epsilon \nabla_{W} \mathrm{Loss}^{\,i} \qquad \text{Eqn 37}$$

where $i$ is the iteration step and $\epsilon > 0$ is the learning rate. The algorithm stops when it reaches a pre-set maximum number of iterations, or when the improvement in loss falls below a certain small threshold.

We give a brief introduction to gradient boosting here; for more information the reader may refer to [28]. For a given dataset with $n$ examples and $m$ features, $D = \{(x_i, y_i)\}$ with $|D| = n$, $x_i \in \mathbb{R}^{m}$, $y_i \in \mathbb{R}$, a tree-ensemble model uses $K$ additive functions to predict the output:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad \text{Eqn 38}$$

where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ with $q: \mathbb{R}^{m} \to T$, $w \in \mathbb{R}^{T}$, is the space of regression trees (also known as CART). $q$ represents the structure of each tree, mapping an example to the corresponding leaf index, and $T$ is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$. Each regression tree contains a continuous score on each leaf; $w_j$ represents the score on the $j$-th leaf. For a given example, we use the decision rules in the trees (given by $q$) to classify it into the leaves. To learn the set of functions used in the model, we minimise the following regularised objective:

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|w\|^2 \qquad \text{Eqn 39}$$

Here $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. The second term $\Omega$ penalises the complexity of the model (i.e., the regression tree functions). The additional regularisation term helps to smooth the final learnt weights to avoid over-fitting; intuitively, the regularised objective will tend to select a model employing simple and predictive functions. The tree-ensemble model in Eqn 39 includes functions as parameters and cannot be optimised using traditional optimisation methods in Euclidean space. Instead, the model is trained in an additive manner. Formally, let $\hat{y}_i^{(t)}$ be the prediction for the $i$-th instance at the $t$-th iteration; we need to add the $f_t$ that minimises the objective

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \qquad \text{Eqn 40}$$

This means we greedily add the $f_t$ that most improves our model according to Eqn 39. A second-order approximation can be used to quickly optimise the objective in the general setting:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \qquad \text{Eqn 41}$$

where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)$ and $h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)$ are the first- and second-order gradient statistics of the loss function. Removing the constant terms gives the simplified objective at step $t$:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \qquad \text{Eqn 42}$$

Define $I_j = \{i \mid q(x_i) = j\}$ as the instance set of leaf $j$. We can rewrite Eqn 42 by expanding $\Omega$ as follows:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} \qquad \text{Eqn 43}$$

$$= \sum_{j=1}^{T} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^{2} \right] + \gamma T \qquad \text{Eqn 44}$$

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^{*}$ of leaf $j$ by

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \qquad \text{Eqn 45}$$

and calculate the corresponding optimal value by

$$\tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \qquad \text{Eqn 46}$$

Eqn 46 can be used as a scoring function to measure the quality of a tree structure $q$. Normally it is impossible to enumerate all possible tree structures $q$; a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after a split. Letting $I = I_L \cup I_R$, the loss reduction after the split is given by

$$\mathcal{L}_{split} = \tfrac{1}{2} \left[ \frac{\big(\sum_{i \in I_L} g_i\big)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \qquad \text{Eqn 47}$$

Eqn 47 is used in practice for evaluating the split candidates.
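Eqns 45 and 47 reduce to a few sums. The following numpy sketch evaluates one candidate split under the squared-error loss, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the toy data and split are assumptions for illustration:

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf weight, Eqn 45: w* = -sum(g) / (sum(h) + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left, lam, gamma):
    """Loss reduction of a candidate split, Eqn 47."""
    score = lambda gi, hi: gi.sum() ** 2 / (hi.sum() + lam)
    return 0.5 * (score(g[left], h[left]) + score(g[~left], h[~left])
                  - score(g, h)) - gamma

# Squared-error loss at y_hat = 0: g_i = y_hat - y_i, h_i = 1
y = np.array([1.0, 1.2, 5.0, 5.3])
y_hat = np.zeros(4)
g, h = y_hat - y, np.ones(4)

# Candidate split separating the two obvious groups
left = np.array([True, True, False, False])
gain = split_gain(g, h, left, lam=1.0, gamma=0.0)
```

With $\lambda = 0$ the optimal leaf weight of Eqn 45 reduces to the mean residual of the leaf, which is the familiar behaviour of an unregularised regression tree.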
3. Numerical Experiment 1 (Toy Problem)

Logic of the novel CCR-MPC:

• The weather-state variables are $a$. Assuming $a \in [\beta]^{k}$, their evolution is $a_{t+1} = f(a_t) + \varepsilon$. In a sequential approach, once the weather states become known, the parameters of the time-series machine $f$ can be relearned; using this logic, $f' = \mathrm{Relearn}[f, \hat{a}_t]$, where $t$ is the time step and $\hat{a}_t$ is the true weather state. We then set $f = f'$ for the next time-step optimisation, until the final time, where $f$ is the time-series machine.
• The controller is $u \in [\alpha]^{m}$, for $j = 1, \dots, m$, where $m$ is the dimension of the control parameters; the controller has a direct influence in maintaining the indoor set-point temperature.
• Hence the current weather states $a$ and the controller $u \in [\alpha]^{m}$ are inputs to the forward problem, $Y = F(a, u) + \varepsilon$, where $Y$ is the true temperature, $S$ is the set-point temperature, and $F$ is the supervised learning model learned with either CCR, a DNN or an RF. The forward problem is then posed as an optimisation problem to track the set point:

$$u^{*} = \arg\min_{u} \|S - Y\|^2 + \text{constraints}$$

where "constraints" could be any economic or realistic building-physics conditions we impose on the optimisation problem. The implementation code can be found in the GitHub repository: https://github.com/clementetienam/Data_Driven_MPC_Controller_Using_CCR
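The receding-horizon logic above can be sketched with scipy's Nelder-Mead optimiser. The two stand-in functions below are assumptions playing the roles of the time-series machine $f$ and the forward model $F$; they are not the trained models from the repository:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-ins for the learned machines (assumed linear/sinusoidal for illustration):
def weather_forecast(t):                        # plays the role of f
    return np.array([10 + 5 * np.sin(t / 24)])  # one weather state, e.g. ambient temp

def forward_model(weather, u):                  # plays the role of F (CCR/DNN/RF)
    return 0.5 * weather[0] + 0.8 * u[0] - 0.1 * u[1]

set_point = 21.0                                # desired room temperature [C]
schedule = []
for t in range(6):                              # receding-horizon loop
    w = weather_forecast(t)
    # Solve argmin_u ||S - F(w, u)||^2 with Nelder-Mead (constraints omitted here)
    res = minimize(lambda u: (set_point - forward_model(w, u)) ** 2,
                   x0=np.array([1.0, 1.0]), method="Nelder-Mead")
    schedule.append(res.x)
    tracked = forward_model(w, res.x)
```

Each loop iteration mirrors one step of the sequential formulation: forecast the weather state, then solve the inverse problem for the control signals before moving to the next time step.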
A deep neural network (DNN) was trained to map the weather states together with the control variables to the zone mean temperature. We compute the $R^2$ accuracy for both the model learning and the set-point tracking as

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\big(y_i^{true} - y_i^{pred}\big)^2}{\sum_{i=1}^{N}\big(y_i^{true} - \overline{y^{true}}\big)^2} \qquad \text{Eqn 48}$$

where $y^{true}$ is the true data or set point and $y^{pred}$ is the model reconstruction. The model discomfort is then

$$\mathrm{Discomfort}\;(\mathrm{C}) = \frac{1}{N}\sum_{i=1}^{N} \big|\, y_i^{true} - y_i^{pred} \,\big| \qquad \text{Eqn 49}$$

In this experiment, the Nelder-Mead optimisation method [24] and the iterative ensemble smoother (I-ES) [29] are compared as the optimisation methods.
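The two metrics of Eqns 48 and 49 are a few lines of numpy; the toy temperature series below is an assumption for illustration:

```python
import numpy as np

def r2_accuracy(y_true, y_pred):
    """Eqn 48: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def discomfort(y_true, y_pred):
    """Eqn 49: mean absolute deviation, in degrees C."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([18.0, 19.0, 21.0, 22.0, 20.0])   # e.g. measured room temperature
y_pred = y_true + 0.1                                # a near-perfect reconstruction
r2 = r2_accuracy(y_true, y_pred)
d = discomfort(y_true, y_pred)
```

A perfect reconstruction gives $R^2 = 1$ and a discomfort of 0 C; a constant offset of 0.1 C leaves $R^2$ near 1 while the discomfort reports the offset directly.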
Box Configuration
Inputs: โข Environment: Site Outdoor Air Drybulb Temperature [C] โข Environment: Site Outdoor Air Wetbulb Temperature [C] โข Environment: Site Outdoor Air Relative Humidity [%] โข Environment: Site Wind Speed [m/s] โข Environment: Site Wind Direction [deg] โข Environment: Site Horizontal Infrared Radiation Rate per Area [W/m2] โข Environment: Site Diffuse Solar Radiation Rate per Area [W/m2] โข Environment: Site Direct Solar Radiation Rate per Area [W/m2] โข THERMAL ZONE: BOX: Zone Outdoor Air Wind Speed [m/s]
Outputs: โข THERMAL ZONE: BOX: Zone Mean Air Temperature [C]
Fig. 4:
Correlation map of the features to the outputs for Machine 1
Fig. 5: (a, b) $R^2$ (%) training and test accuracy on Air Temperature [C] (Hourly). (b, e) Histogram plot showing $f(x_j; \theta^*) - y^{true}$. (c) Superimposed plot of $f(x_j; \theta^*)$ and $y^{true}$.

Ground Source Heat Pump (GSHP) configuration
Inputs: โข Environment: Site Outdoor Air Drybulb Temperature [C] โข Environment: Site Outdoor Air Wetbulb Temperature [C] โข Environment: Site Outdoor Air Relative Humidity [%] โข Environment: Site Wind Speed [m/s] โข Environment: Site Wind Direction [deg] โข Environment: Site Horizontal Infrared Radiation Rate per Area [W/m2] โข Environment: Site Diffuse Solar Radiation Rate per Area [W/m2] โข Environment: Site Direct Solar Radiation Rate per Area [W/m2] โข THERMAL ZONE: BOX: Zone Outdoor Air Wind Speed [m/s] โข GSHPCLG: Heat Pump Electric Power [W]-control signal โข GSHPCLG: Heat Pump Source Side Inlet Temperature [C]-control signal โข GSHPHEATING: Heat Pump Electric Power [W]-control signal โข GSHPHEATING: Heat Pump Source Side Inlet Temperature [C]-control signal
Outputs: โข THERMAL ZONE: BOX: Zone Mean Air Temperature [C]
Fig. 6: (a, b) $R^2$ (%) training and test accuracy on Air Temperature [C] (Hourly). (b, e) Histogram plot showing $f(x_j; \theta^*) - y^{true}$. (c) Superimposed plot of $f(x_j; \theta^*)$ and $y^{true}$.

Fig. 7:
LSTM performance on the hourly prediction of the Box dataset. Each series has been learned and forecasted into the future using only its output at the previous time step. The lookback is 7 time steps, i.e. $t, t-1, t-2, \dots, t-6$ are used to predict $t+1$.

Fig. 8:
Open-loop batch optimisation. The room temperature is forecasted "blindly" with the LSTM model, and the GSHP control parameters are optimised to the set temperature. The optimisation algorithm is Nelder-Mead in panel (a) and I-ES in panel (b).
Fig.9:
Closed-loop sequential optimisation. The room temperature is initially forecasted "blindly" with the LSTM model and then corrected to the true temperature reading (from a sensor), and the GSHP control parameters are optimised to the set temperature. The optimisation algorithm is Nelder-Mead in panel (a) and I-ES in panel (b).
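The I-ES branch of these optimisations can be sketched with the ensemble update of Eqns 20-21 applied to an ensemble of control vectors; the linear forward model $G$ and all numbers here are assumptions for illustration, not the GSHP model:

```python
import numpy as np

def ies_update(M, D, d_obs, Cd):
    """One ensemble-smoother step (Eqn 21) on an ensemble of control vectors.
    M: (n_m, Ne) parameter ensemble; D: (n_d, Ne) predicted-data ensemble."""
    Ne = M.shape[1]
    A = M - M.mean(axis=1, keepdims=True)      # parameter anomalies
    Y = D - D.mean(axis=1, keepdims=True)      # data anomalies
    Cmd = A @ Y.T / (Ne - 1)                   # Eqn 20(a)
    Cdd = Y @ Y.T / (Ne - 1)                   # Eqn 20(b)
    K = Cmd @ np.linalg.inv(Cdd + Cd)          # Kalman gain
    return M + K @ (d_obs[:, None] - D)

# Toy inverse problem: find controls u with G u = set point (assumed linear G)
rng = np.random.default_rng(0)
G = np.array([[0.8, -0.1]])
set_point = np.array([21.0])
Cd = np.array([[1e-4]])
M = rng.normal(10.0, 5.0, size=(2, 100))       # prior control ensemble
for _ in range(5):                              # iterate the smoother
    D = G @ M + rng.normal(0, 1e-2, size=(1, 100))
    M = ies_update(M, D, set_point, Cd)
residual = abs((G @ M.mean(axis=1))[0] - set_point[0])
```

Unlike Nelder-Mead, the I-ES carries a whole ensemble of candidate controls, so the spread of the final ensemble also provides the inverse-UQ estimate mentioned in the abstract.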
4. Numerical Experiment 2

The first step is developing a good time-series model that can predict the states (5 variables) accurately. Fig. 10 below shows the approach using 50% of the dataset as the training set and 50% as the test set. The time-series model is XGBoost. From the "date-time" index in pandas, which has the format "01/01/2010 00:00:00", 8 new features (pseudo-inputs) were created, namely: 'hour', 'day of week', 'quarter', 'month', 'year', 'day of year', 'day of month' and 'week of year'. Five time-series machines were modelled overall, mapping these 8 features to each of the 5 weather-state variables.
Fig. 10:
XGboost time series model for predicting the states. (a-e) are for states 'Ambient temperature (C)','Ground temperature (C)', 'Global irradiance (W/m2)', 'Direct irradiance (W/m2)','Diffuse irradiance (W/m2)'
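The pseudo-input construction described above can be sketched with pandas, assuming an hourly DatetimeIndex in the stated format:

```python
import pandas as pd

# Hourly index in the '01/01/2010 00:00:00' format used by the dataset (assumed length)
idx = pd.date_range("2010-01-01", periods=500, freq="h")
df = pd.DataFrame(index=idx)

# The 8 pseudo-inputs created from the date-time index
df["hour"] = df.index.hour
df["day_of_week"] = df.index.dayofweek
df["quarter"] = df.index.quarter
df["month"] = df.index.month
df["year"] = df.index.year
df["day_of_year"] = df.index.dayofyear
df["day_of_month"] = df.index.day
df["week_of_year"] = df.index.isocalendar().week.to_numpy()
```

Each of the 5 weather-state regressors then takes this 8-column frame as its input matrix, one fitted model per state variable.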
Next, we construct a forward problem that maps these 5 weather states together with the 2 controller inputs, "Heat pump heat supply (kW)" and "Heat pump electrical load (kW)", making 7 variables in all, to predict the internal temperature. Several regression algorithms were tested, as seen in Fig. 11(b) below.
Fig. 11: (a) Correlation map of the data. (b) Various regression algorithms to approximate the forward mapping of 7 to 1 on test data.
Overall, Random Forest showed the best $R^2$ (%) accuracy. But Random Forest fails in optimising to the set-point temperature: during optimisation, the prediction from Random Forest is based on an ensemble of decision trees, and ensemble averaging for discrete Heaviside-like signals such as the set point (17.5 C / 21 C) gave an inaccurate reconstruction, as shown in Fig. 12.

Fig. 12: XGBoost time-series model and Random Forest forward model. The discomfort is 3.821 C, due to the ensemble-averaging nature of the Random Forest algorithm.
With that in mind, deterministic algorithms such as a deep neural network (DNN) and polynomial regression were tested. We first trained the forward models and obtained the $R^2$ (%) accuracy for (a) the DNN machine and (b) polynomial regression of degree 9, as depicted in Fig. 13.

Fig. 13: $R^2$ (%) accuracy on test data for the forward problem (7 to 1). (a) DNN and (b) polynomial regression, degree 9.

The controller was then re-run for the test set-point data using XGBoost as the time-series machine. The discomfort for each model is depicted in Figs. 14 and 15 for the DNN and polynomial regression respectively, with polynomial regression having the lower discomfort.
Table 1 shows the summary of various combinations of the XGBoost time-series machine with 4 promising forward models, with polynomial regression coming 1st and the DNN coming 2nd.

Fig. 14:
Data-driven model: XGBoost for modelling the time series and a DNN for modelling the forward problem. Optimisation is with Nelder-Mead.
Fig. 15:
Data-driven model: XGBoost for modelling the time series and polynomial regression (degree 9) for modelling the forward problem. Optimisation is with Nelder-Mead.
Model | Discomfort (C)
XGboost + DNN |
XGboost + Polynomial regression |
XGboost + Random Forest |
XGboost + Nearest Neighbour |

Table 1: Discomfort of the four XGBoost-based combinations.

We then re-ran the sequence, substituting the forward model with the Cluster Classify Regress algorithm (CCR) [6, 8], and obtained better performance. The prediction is for 500 steps ahead, i.e. 7500 minutes into the future (about 5 days in advance). The forward mapping was able to trace the discontinuity of the data and better solve the inverse problem. Fig. 16 shows the data-driven MPC using CCR as a forward model. Conclusively, the approach using a CCR model as the forward model gave better performance than one using a single machine for the forward model.
Figure 16: Data-driven MPC models. (a) The time-series machine is an XGBoost model; the forward model is a CCR model where the gate is an XGBoost classifier and the experts (2) are 9th-degree polynomial regressors. (b) The time-series machine is an XGBoost model; the forward model is a CCR model where the gate is a Random Forest and the experts (2) are 9th-degree polynomial regressors. (c) XGBoost for modelling the time series and polynomial regression (degree 9) for modelling the forward problem. (d) XGBoost for modelling the time series and a DNN for modelling the forward problem. Optimisation is with Nelder-Mead throughout.
Model | Discomfort (C)
(a) XGboost + CCR (XGboost / Polynomial regression) |
(b) XGboost + CCR (XGboost / Random Forest) |
(c) XGboost + Polynomial regression |
(d) XGboost + Nearest Neighbour |

Table 2: Discomfort of the four combinations shown in Figure 16.
5. Conclusion & Future Work
We have developed a novel model predictive controller called
CCR-MPC, for optimising the conditions of a building to track a desired set point. The approach is data-driven and scales easily to any size of dataset. As with any other machine learning construction, the quality of the surrogate model depends strongly on the quality of the training dataset; hence domain knowledge is important in interpreting results from this novel controller. The CCR-MPC is elegant and, from the numerical results, is able to track and maintain the desired set-point temperature in a predictive manner. Inverse and forward UQ are also naturally embedded, from the components of the forward CCR model to the type of optimisation used in solving the inverse problem. Future work will apply a direct reinforcement learning approach to the MPC formulation, giving a natural logic in which the controller learns directly from its environment and chooses the best course of action for future time horizons.
6. Acknowledgement
CE developed the idea and mathematics for the novel CCR-MPC approach, coded the numerical implementation in Python, and wrote the document. SS prepared the data set for toy problem 1 and gave valuable civil engineering knowledge during code implementation. ED prepared the data set for the Imperial College data and gave valuable civil engineering knowledge during code implementation. JS directed the research and gave valuable civil engineering knowledge and feedback during code implementation.
References
[1] Alex Gorodetsky and Youssef Marzouk. Efficient localization of discontinuities in complex computational simulations. SIAM Journal on Scientific Computing, 36(6):A2584-A2610, 2014.
[2] Anne Gelb and Eitan Tadmor. Spectral reconstruction of piecewise smooth functions from their discrete data. ESAIM: Mathematical Modelling and Numerical Analysis, 36(2):155-175, 2002.
[3] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[6] Clement Etienam, Kody Law, and Sara Wade. Ultra-fast deep mixtures of Gaussian process experts. arXiv preprint arXiv:2006.13309, 2020.
[7] David Adalsteinsson and James A. Sethian. A fast level set method for propagating interfaces. Journal of Computational Physics, 118(2):269-277, 1995.
[8] David E. Bernholdt, Mark R. Cianciosa, David L. Green, Jin M. Park, Kody J. H. Law, and Clement Etienam. Cluster, classify, regress: A general method for learning discontinuous functions. Foundations of Data Science, 1(4):491, 2019.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Dirk Pflüger, Benjamin Peherstorfer, and Hans-Joachim Bungartz. Spatially adaptive sparse grids for high-dimensional data-driven problems. Journal of Complexity, 26(5):508-522, 2010.
[11] Dmitry Batenkov. Complete algebraic reconstruction of piecewise-smooth functions from Fourier data. Mathematics of Computation, 84(295):2329-2350, 2015.
[12] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[13] Habib N. Najm, Bert J. Debusschere, Youssef M. Marzouk, Steve Widmer, and O. P. Le Maître. Uncertainty quantification in chemical systems. International Journal for Numerical Methods in Engineering, 80(6-7):789-814, 2009.
[14] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numerica, 13:147-269, 2004.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[17] John D. Jakeman, Richard Archibald, and Dongbin Xiu. Characterization of discontinuities in high-dimensional stochastic problems on adaptive sparse grids. Journal of Computational Physics, 230(10):3977-3997, 2011.
[18] Karla Monterrubio-Gómez, Lassi Roininen, Sara Wade, Theo Damoulas, and Mark Girolami. Posterior inference for sparse hierarchical non-stationary models. arXiv preprint arXiv:1804.01431, 2018.
[19] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[20] Knut S. Eckhoff. Accurate reconstructions of functions of finite regularity from truncated Fourier series expansions. Mathematics of Computation, 64(210):671-690, 1995.
[21] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[22] Matthew M. Dunlop, Marco A. Iglesias, and Andrew M. Stuart. Hierarchical Bayesian level set inversion. Statistics and Computing, 27(6):1555-1584, 2017.
[23] Michael Yu Wang, Xiaoming Wang, and Dongming Guo. A level set method for structural topology optimization. Computer Methods in Applied Mechanics and Engineering, 192(1-2):227-246, 2003.
[24] M. J. D. Powell. Direct search algorithms for optimization calculations. In Acta Numerica 1998, A. Iserles (Ed.), Cambridge University Press, Cambridge, UK, pp. 287-336, 1998.
[25] Rick Archibald, Anne Gelb, Rishu Saxena, and Dongbin Xiu. Discontinuity detection in multivariate space for stochastic simulations. Journal of Computational Physics, 228(7):2676-2689, 2009.
[26] Rick Archibald, Anne Gelb, and Jungho Yoon. Polynomial fitting for edge detection in irregularly sampled signals and images. SIAM Journal on Numerical Analysis, 43(1):259-279, 2005.
[27] Stefano Conti and Anthony O'Hagan. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference, 140(3):640-651, 2010.
[28] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754, 2016.