A Novel Cluster Classify Regress Model Predictive Controller Formulation; CCR-MPC
Clement Etienam, Siying Shen, Edward J O'Dwyer, Joshua Sykes
Active Building Research Programme, Imperial College London
Abstract
In this work, we develop a novel data-driven model predictive controller using advanced techniques from the field of machine learning. The objective is to regulate control signals to attain the desired internal room set-point temperature, which is affected indirectly by the external weather states. The methodology involves developing a time-series machine learning model, with either a Long Short-Term Memory network (LSTM) or a gradient boosting algorithm (XGBoost), capable of forecasting the weather states for any desired time horizon, while concurrently optimising the control signals towards the desired set point. The supervised learning model mapping the weather states together with the control signals to the room temperature is constructed using a previously developed methodology called Cluster Classify Regress (CCR), which is similar in style to, but scales better to high-dimensional datasets than, the well-known Mixture-of-Experts. The overall methodology uses CCR as a forward model within a batch or sequential optimisation approach. The overall method, called CCR-MPC, combines a time-series model for weather-state prediction, CCR for the forward model, and any numerical optimisation method for solving the inverse problem. Two numerical optimisation methods are shown: the Nelder–Mead method and a Bayesian-style iterative ensemble smoother (I-ES). Forward uncertainty quantification (forward UQ) depends on the regression model within the CCR and is attainable using a Bayesian deep neural network or a Gaussian process (GP). In this work, within the CCR formulation, we employ K-means clustering for the clustering step, an XGBoost classifier for the classification step, and 5th-order polynomial regression for the regression step. Inverse UQ can also be obtained by using an I-ES approach for solving the inverse problem, or the well-known Markov chain Monte Carlo (MCMC) approach. The developed CCR-MPC is elegant and, as seen in the numerical experiments, is able to optimise the controller to attain the desired set-point temperature.
1. Introduction
Energy efficiency is a major concern in achieving sustainability in modern society. The sustainability of smart cities depends on the availability of energy-efficient infrastructure and services. Buildings make up most of a city, and they are responsible for most of its energy consumption and emissions to the atmosphere (about 40%). Smart cities need smart buildings to achieve sustainability goals, and thermal modelling of buildings is essential in the race for energy efficiency. In this report, we introduce a novel data-driven model predictive controller (MPC). A mathematical introduction is presented to help understand the fundamental supervised learning problem of predicting the room temperature in a building, as well as forecasting the weather states over a 10-15 minute lag duration. Finally, the overall data-driven MPC is implemented on some toy problems. The rest of the report is structured as follows: Section 2 introduces the background necessary for our novel data-driven MPC methodology, consisting of supervised learning theory, time-series modelling using recurrent neural networks (RNN) and long short-term memory networks (LSTM), optimisation methods, neural networks, gradient boosting, and the cluster classify regress algorithm developed in [6] and reformulated in [8]. Sections 3 and 4 show some numerical experiments, and Section 5 gives conclusions and insights into future work.

2. Background
For a supervised learning model the following ansatz holds. Assume data $\{(x_j, y_j)\}_{j=1}^{N}$, where $x_j \in \mathbb{R}^K$ and $y_j \in \mathbb{R}^q$ for regression, or $y_j \in \{0, 1, \dots, J\}^q$ for classification, with $(x_j, y_j) \in X \times Y$ taken to be the input and output of a model. Postulating a family of ansatz functions $f(\,\cdot\,; \theta)$, parametrised by $\theta \in \mathbb{R}^p$, we can find a $\theta^*$ such that for all $j = 1, \dots, N$,

$$y_j = f(x_j; \theta^*) \qquad \text{Eqn 1(a)}$$

Here $f: X \to Y$ is the forward mapping, which may be irregular, have sharp features, be very non-linear and have noticeable discontinuities. Then for $x'$ not in the set $\{x_j\}_{j=1}^{N}$,

$$y' \approx f(x'; \theta^*) \qquad \text{Eqn 1(b)}$$

where $y'$ is the true label of $x'$. Examples include generalised linear models (GLM) and neural networks (NN). Desirable properties of such a model are:

• the $N$ data points are only seen in training, and often not all at once; memory is sublinear or constant in $N$;
• only $N$ data points and $N$ inner products are required for prediction;
• the model can be treated in a Bayesian manner;
• the number of hyper-parameters $p$ can be small (3 for Gaussian processes [5]);
• it seeks to identify the best model $f(\,\cdot\,; \theta)$ without relying on any parametric form.

We find the best $\theta^*$ which minimises the data misfit

$$\Phi(\theta) = \sum_{j=1}^{N} \ell\big(y_j, f(x_j; \theta)\big) \qquad \text{Eqn 1(c)}$$

For linear least squares this has the closed-form solution

$$\theta^* = (X^{\mathsf T} X)^{-1} X^{\mathsf T} Y \qquad \text{Eqn 1(d)}$$

where $X = [x_1, \dots, x_N]$ and $Y = [y_1, \dots, y_N]$.

An RNN model captures a time-series sequence from the following ansatz [15, 19]. Assume a hidden state $h_t \in H \equiv \mathbb{R}^{n_h}$; the dynamics are described by

$$h_t = f_h(h_{t-1}, u_t) \qquad \text{Eqn 2}$$
$$y_t = f_y(h_t) \qquad \text{Eqn 3}$$

$f_h: H \times X \to H$ is the non-linear transition function describing the dynamics of $h_t \in H$, and $f_y: H \to Y$ describes the functional mapping from the hidden state to the output variable. $h_t \in H$ is a running summary of the inputs $u_t \in X$ up to time $t$. The recurrence formula updates this summary based on its previous value $h_{t-1} \in H$, where we assume $h_0 = 0$. $f_h$ is the composition of an element-wise non-linearity with an affine transformation of both $u_t$ and $h_{t-1}$, such that

$$h_t = f_h(W_{uh} u_t + W_{hh} h_{t-1}) \qquad \text{Eqn 4}$$

where $W_{uh} \in \mathbb{R}^{n_h \times n_u}$ is the input-to-hidden weight matrix and $W_{hh} \in \mathbb{R}^{n_h \times n_h}$ is the hidden-to-hidden weight matrix. Similarly,

$$y_t = f_y(W_{hy} h_t) \qquad \text{Eqn 5}$$

where $W_{hy} \in \mathbb{R}^{n_y \times n_h}$ is the hidden-to-output weight matrix. Defining $f_h$ as a sigmoid or hyperbolic tangent function and $f_y$ as the identity function, we have

$$h_t = \tanh(W_{uh} u_t + W_{hh} h_{t-1}) \qquad \text{Eqn 6}$$
$$y_t = W_{hy} h_t \qquad \text{Eqn 7}$$

where $\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$. In (6),
$\theta \equiv \{W_{uh}, W_{hh}, W_{hy}\}$ are parameters that are estimated using the back-propagation-through-time (BPTT) method. RNNs have memory, but their long-term memory is limited: the gradient can explode or vanish while training with the BPTT algorithm [15]. An LSTM unit consists of a memory cell $c_t$, an input gate $i_t$, a forget gate $f_t$ and an output gate $o_t$. The memory cell carries the memory content of the LSTM unit, while the three gates control the amount of change to, and exposure of, that memory content.

Fig 1:
The architecture of an (a) RNN and (b) LSTM.
In Fig. 1(b), $u_t$ is the input to the memory-cell layer, and $\sigma$ is the element-wise logistic sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. Mathematically, for $n_c$ LSTM units, the forget, input and output gates are described as follows:

$$f_t = \sigma(W_{uf} u_t + W_{hf} h_{t-1}) \qquad \text{Eqn 8a}$$
$$i_t = \sigma(W_{ui} u_t + W_{hi} h_{t-1}) \qquad \text{Eqn 8b}$$
$$o_t = \sigma(W_{uo} u_t + W_{ho} h_{t-1}) \qquad \text{Eqn 8c}$$

$W_{uf} \in \mathbb{R}^{n_c \times n_u}$, $W_{ui} \in \mathbb{R}^{n_c \times n_u}$ and $W_{uo} \in \mathbb{R}^{n_c \times n_u}$ are the LSTM weights from the input $u_t$ to $f_t$, $i_t$ and $o_t$ respectively. Similarly, $W_{hf} \in \mathbb{R}^{n_c \times n_h}$, $W_{hi} \in \mathbb{R}^{n_c \times n_h}$ and $W_{ho} \in \mathbb{R}^{n_c \times n_h}$ are the weights from the hidden state $h_{t-1}$ to $f_t$, $i_t$ and $o_t$ respectively. The gates in Eqns 8a-8c control the information flow through the LSTM: the forget gate $f_t$ determines how much of the previous memory content is allowed to pass through; the input gate $i_t$ determines how much of the candidate value will be written to the memory cell; and the output gate $o_t$ determines how much of the memory content will be made available to the next layer. The candidate value $\tilde{c}_t$ and the memory cell $c_t$ are updated by

$$\tilde{c}_t = \tanh(W_{uc} u_t + W_{hc} h_{t-1}) \qquad \text{Eqn 9}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{Eqn 10}$$
$$h_t = o_t \odot \tanh(c_t) \qquad \text{Eqn 11}$$

Finally, an RNN with LSTM architecture is implemented by replacing the recurrent hidden layer with an LSTM cell.
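The gate equations above (Eqns 8-11) can be sketched in a few lines of numpy. This is a minimal illustration with random toy weights, bias terms omitted as in the text; it is not the trained machine used later:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(u_t, h_prev, c_prev, W, U):
    """One LSTM step following Eqns 8-11 (bias terms omitted).
    W: input-to-gate matrices, U: hidden-to-gate matrices."""
    f_t = sigmoid(W["uf"] @ u_t + U["hf"] @ h_prev)   # forget gate, Eqn 8a
    i_t = sigmoid(W["ui"] @ u_t + U["hi"] @ h_prev)   # input gate,  Eqn 8b
    o_t = sigmoid(W["uo"] @ u_t + U["ho"] @ h_prev)   # output gate, Eqn 8c
    cc_t = np.tanh(W["uc"] @ u_t + U["hc"] @ h_prev)  # candidate,   Eqn 9
    c_t = f_t * c_prev + i_t * cc_t                   # memory cell, Eqn 10
    h_t = o_t * np.tanh(c_t)                          # hidden state, Eqn 11
    return h_t, c_t

# Toy dimensions: n_u = 3 inputs, n_c = 2 LSTM units
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((2, 3)) for k in ("uf", "ui", "uo", "uc")}
U = {k: rng.standard_normal((2, 2)) for k in ("hf", "hi", "ho", "hc")}
h, c = np.zeros(2), np.zeros(2)
for t in range(5):                                    # roll the cell over a short sequence
    h, c = lstm_step(rng.standard_normal(3), h, c, W, U)
```

Because $h_t = o_t \odot \tanh(c_t)$ with $o_t \in (0, 1)$, the hidden state stays bounded in $(-1, 1)$ regardless of how long the cell is rolled out.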
We give a brief introduction here; for more, the reader may refer to [29]. Let $y = f(x; m) + \varepsilon$, where $\varepsilon$ is random noise. Let $X = [x_1, \dots, x_N]$ and $Y = [y_1, \dots, y_N]$. We place a prior on $m$; the aim is to sample from the posterior

$$\pi(m \mid X, Y) \propto \pi(Y \mid X, m)\, \pi(m) \qquad \text{Eqn 12}$$

A typical method is the ensemble Kalman filter (EnKF). Assuming a linear-Gaussian case, the prior pdf is expressed as

$$\pi(m) \propto c \exp\left(-\tfrac{1}{2}(m - m_{prior})^{\mathsf T} C_M^{-1} (m - m_{prior})\right) \qquad \text{Eqn 13}$$

where $m_{prior}$ is the best prior estimate (mean) of the hyper-parameters of the machine, $m$ is the vector of model hyper-parameters, $C_M$ is the covariance matrix of the model, and $c$ is a constant. Corrupting the true data,

$$d_{obs} = d_{true} + \varepsilon \qquad \text{Eqn 14}$$

where $d_{obs}$ is the observed data, $d_{true}$ is the true output data, and $\varepsilon$ is the noise that accounts for the limited accuracy of the measurement equipment. Under full Gaussian linearity, the likelihood pdf is

$$\pi(d \mid X, m) = c \exp\left(-\tfrac{1}{2}\big(d_{obs} - f(X; m)\big)^{\mathsf T} C_{D}^{-1} \big(d_{obs} - f(X; m)\big)\right) \qquad \text{Eqn 15}$$

where $C_{D}$ is the covariance matrix of the measurement noise. The conditional pdf can then be derived as

$$\pi(m \mid X, d) = c \exp\left(-\tfrac{1}{2}(m - m_{prior})^{\mathsf T} C_M^{-1}(m - m_{prior}) - \tfrac{1}{2}\big(d_{obs} - f(X; m)\big)^{\mathsf T} C_{D}^{-1}\big(d_{obs} - f(X; m)\big)\right) \qquad \text{Eqn 16}$$

where $c$ is a normalising constant. The aim is the minimisation of the objective function $Q(m)$,

$$Q(m) = O_m(m) + O_d(m) \qquad \text{Eqn 17}$$

$$Q(m) = \tfrac{1}{2}(m - m_{prior})^{\mathsf T} C_M^{-1}(m - m_{prior}) + \tfrac{1}{2}\big(d_{obs} - f(X; m)\big)^{\mathsf T} C_{D}^{-1}\big(d_{obs} - f(X; m)\big) \qquad \text{Eqn 18}$$

where $O_m(m)$ is the model mismatch term, which regularises the problem.
$O_d(m)$ is the data mismatch term. If we assume the relationship between model and predicted data is linear, i.e. $f(X; m) = Gm$, then the posterior mean satisfying Eqn 18 is obtained iteratively as

$$m^{i+1} = m^{i} + C_M G^{\mathsf T}\big(G C_M G^{\mathsf T} + C_D\big)^{-1}\big(d_{obs} - G m^{i}\big) \qquad \text{Eqn 19}$$

Let $d^{uc} = Gm$ be the solution of the forward problem. The cross-covariance and data covariance are estimated from an ensemble of $N_e$ members as

$$\tilde{C}_{MD} = \frac{1}{N_e - 1} \sum_{j=1}^{N_e} (m_j - \bar{m})\big(G(m_j) - \overline{G(m)}\big)^{\mathsf T} = C_M G^{\mathsf T} \qquad \text{Eqn 20(a)}$$

$$\tilde{C}_{DD} = \frac{1}{N_e - 1} \sum_{j=1}^{N_e} \big(G(m_j) - \overline{G(m)}\big)\big(G(m_j) - \overline{G(m)}\big)^{\mathsf T} = G C_M G^{\mathsf T} \qquad \text{Eqn 20(b)}$$

$$m^{i+1} = m^{i} + \tilde{C}_{MD}\big(\tilde{C}_{DD} + C_D\big)^{-1}\big(d_{obs} - G m^{i}\big) \qquad \text{Eqn 21}$$

where $\tilde{C}_{MD}(\tilde{C}_{DD} + C_D)^{-1}$ is the Kalman gain matrix and $i$ is the iteration index. The update in Eqn 21 can also be posed as online sequential learning of the model hyper-parameters of a parametric supervised learning model.

This sub-section re-introduces a recent algorithm developed in [6] and re-formulated in [8]. For a set of labelled data $\{(x_j, y_j)\}_{j=1}^{N}$, where $(x_j, y_j) \in X \times Y$ are taken to be the input and output of a model,

$$y_j \approx f(x_j) \qquad \text{Eqn 22}$$

$f: X \to Y$ is irregular, has sharp features, is very non-linear and has noticeable discontinuities. The output space is taken as $Y = \mathbb{R}$ and $X = \mathbb{R}^{d}$. In the first stage, we seek to cluster the training input-output pairs with a label function $T: X \times Y \to \mathcal{L} \equiv \{1, \dots, L\}$ which minimises

$$\Phi_{clust}(T) = \sum_{l=1}^{L} \sum_{(x_j, y_j) \in S_l} \ell(x_j, y_j)$$
Eqn 23

$$S_l = \{(x_j, y_j);\; T(x_j, y_j) = l\} \qquad \text{Eqn 24}$$

with $\ell$ the loss function associated to cluster $l$. For $z_j = (x_j, y_j)$,

$$\ell = |z_j - c_l| \qquad \text{Eqn 25}$$

where $c_l$ is the centroid of the points $z_j$ in cluster $l$ and $|\cdot|$ denotes the Euclidean norm. With $l_j = T(x_j, y_j)$, this gives an expanded training set $\{(x_j, y_j, l_j)\}_{j=1}^{N}$ (Eqn 26(a)).

In the classification stage, a classifier

$$f_c: X \to \mathcal{L} \qquad \text{Eqn 26(b)}$$

provides, for each $x \in X$, an estimate $f_c: x \mapsto f_c(x) \in \mathcal{L}$ such that $f_c(x_j) = l_j$ for the majority of the data. This step is crucial for the ultimate fidelity of the prediction; $\{y_j\}$ is ignored at this phase. The classification function minimises

$$\Phi_{class}(f_c) = \sum_{j=1}^{N} \ell_c\big(l_j, f_c(x_j)\big) \qquad \text{Eqn 26(c)}$$

where $\ell_c: \mathcal{L} \times \mathcal{L} \to \mathbb{R}^{+}$ is small if $f_c(x_j) = l_j$. For example, we can choose $f_c(x) = \arg\max_{l \in \mathcal{L}} g_l(x)$, where $g_l(x) > 0$, $\sum_{l=1}^{L} g_l(x) = 1$ is a soft classifier and $\ell_c(l_j, f_c(x_j)) = -\log(g_{l_j}(x_j))$ is a cross-entropy loss.

In the regression stage, a regressor

$$f_r: X \times \mathcal{L} \to Y \qquad \text{Eqn 27(a)}$$

must, for each $(x, l) \in X \times \mathcal{L}$, provide an estimate $f_r: (x, l) \mapsto f_r(x, l) \in Y$ such that $f_r(x, f_c(x)) \approx y$ for both the training and the test data. If successful, this yields a good reconstruction of

$$f: X \to Y \qquad \text{Eqn 27(b)}$$

where $f(\cdot) = f_r(\cdot, f_c(\cdot))$. The regression function can be found by minimising

$$\Phi_r(f_r) = \sum_{j=1}^{N} \ell_r\big(y_j, f_r(x_j, f_c(x_j))\big) \qquad \text{Eqn 27(c)}$$

where $\ell_r: Y \times Y \to \mathbb{R}^{+}$ is minimised when $f_r(x_j, f_c(x_j)) = y_j$; in this case it can be chosen as $\ell_r\big(y, f_r(x, f_c(x))\big) = |y - f_r(x, f_c(x))|$. The data can be partitioned into $C_l = \{j;\; f_c(x_j) = l\}$ for $l = 1, \dots, L$, and then $L$ separate regressions performed in parallel:

$$\Phi_{rl}\big(f_r(\cdot, l)\big) = \sum_{j \in C_l} \ell_r\big(y_j, f_r(x_j, l)\big) \qquad \text{Eqn 27(d)}$$

With $x = (x_1, \dots, x_d) \in \mathbb{R}^{d}$ and $y \in \mathbb{R}$, we have $|(x, y) - (x', y')|^2 = (y - y')^2 + |x - x'|^2$. Before clustering, the data are rescaled: for $i = 1, \dots, d$,

$$\tilde{x}_i = \frac{x_i - \min_{j \in \{1, \dots, N\}} x_{ij}}{\max_{j \in \{1, \dots, N\}} x_{ij} - \min_{j \in \{1, \dots, N\}} x_{ij}} \qquad \text{Eqn 28(a)}$$

$$\tilde{y} = C\, \frac{y - \min_{j \in \{1, \dots, N\}} y_j}{\max_{j \in \{1, \dots, N\}} y_j - \min_{j \in \{1, \dots, N\}} y_j} \qquad \text{Eqn 28(b)}$$

For clustering we take $C > 1$, e.g. $C = 10d$ where $d = \dim(x)$; for regression, $C = 1$.
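The three CCR stages can be sketched end-to-end on a synthetic discontinuous function. This is an illustration only: scikit-learn's GradientBoostingClassifier stands in for the XGBoost gate named in the abstract, K-means does the clustering, and the per-cluster experts are 5th-order polynomials; the target function is an assumption chosen to have a jump:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier

# Discontinuous 1-D target: the kind of map CCR is designed for
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = np.where(x < 0, np.sin(3 * x), 2 + np.cos(3 * x))

# 1) Cluster: K-means on the joint (x, y) pairs, L = 2
z = np.column_stack([x, y])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)

# 2) Classify: learn x -> cluster label
clf = GradientBoostingClassifier(random_state=0).fit(x.reshape(-1, 1), labels)

# 3) Regress: one degree-5 polynomial expert per cluster (parallelisable, Eqn 27(d))
polys = {int(l): np.polynomial.Polynomial.fit(x[labels == l], y[labels == l], 5)
         for l in np.unique(labels)}

def ccr_predict(x_new):
    """f(x) = f_r(x, f_c(x)): route each point through the gate, then its expert."""
    l_hat = clf.predict(np.asarray(x_new).reshape(-1, 1))
    return np.array([polys[int(li)](xi) for li, xi in zip(l_hat, np.ravel(x_new))])

err = np.max(np.abs(ccr_predict(x) - y))
```

A single degree-5 polynomial could not track the jump at $x = 0$, but the clustered pair of experts reconstructs each smooth branch almost exactly.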
A critique of this method is that it re-uses the data in each phase. A Bayesian formulation handles this limitation elegantly. Recall:
$$D = \{(x_j, y_j)\}_{j=1}^{N} \qquad \text{Eqn 29(a)}$$

Assume parametric models for the classifier $f_c(\cdot; \theta_c)$ and the regressors $f_r(\cdot, l; \theta_{rl})$ for $l = 1, \dots, L$, where $\theta_c = (\theta_{c1}, \dots, \theta_{cL})$ and $\theta_r = (\theta_{r1}, \dots, \theta_{rL})$, and let $\theta = (\theta_c, \theta_r)$. The posterior density has the form

$$\pi(\theta, l \mid D) \propto \prod_{j=1}^{N} \pi(y_j \mid x_j, \theta_r, l)\, \pi(l \mid x_j, \theta_c)\, \pi(\theta_c)\, \pi(\theta_r) \qquad \text{Eqn 29(b)}$$

$$\pi(y_j \mid x_j, \theta_r, l) \propto \exp\left(-\tfrac{1}{2}\,|y_j - f_r(x_j, l; \theta_{rl})|^2\right) \qquad \text{Eqn 29(c)}$$

and

$$\pi(l \mid x_j, \theta_c) = g_l(x_j; \theta_c) \qquad \text{Eqn 29(d)}$$

$$g_l(x_j; \theta_c) = \frac{\exp\big(h_l(x_j; \theta_{cl})\big)}{\sum_{k=1}^{L} \exp\big(h_k(x_j; \theta_{ck})\big)} \qquad \text{Eqn 29(e)}$$

where the $h_l(x; \theta_{cl})$ are some standard parametric regressors.

Random Forests (RF) [21] belong to the family of algorithms known as decision trees. In a decision tree, the goal is to create a prediction model that predicts an output by combining different input variables. Each node corresponds to one of the input variables, and each leaf represents a value of the target variable given the values of the input variables along the path from the root to the leaf. The random forest algorithm trains different decision trees using different subsets of the training data, as depicted in Fig. 2.
Fig. 2: Sketch representation of an RF workflow. Random subsets of the data are trained on by the RF; the randomness generates models that are not correlated with each other.

The main hyper-parameters of an RF are:

• Maximum features: the maximum number of features that the RF is allowed to try in each individual tree.
• Number of estimators: the number of trees built before taking the maximum vote or average of the predictions.
• Minimum sample leaf size: the leaf is the terminal node of a decision tree; this sets the minimum number of samples it may contain.

Advantages:

• The chance of overfitting decreases, since several different decision trees are used in the learning procedure.
• RFs apply well when no particular distribution of the data is required; for example, no data normalisation is needed.
• Parallelisation: the training of multiple trees can be parallelised (for example across different computational slots).

Disadvantages:

• RFs may suffer on smaller training datasets.
• The time to train an RF might be longer compared to other algorithms.
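As an illustration, the three hyper-parameters listed above map directly onto scikit-learn's RandomForestRegressor; the toy features and target here are assumptions, not the building dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy weather-like features -> temperature-like target (assumed, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = 15 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=300)

# The three hyper-parameters discussed above:
rf = RandomForestRegressor(
    n_estimators=200,      # number of trees built before averaging predictions
    max_features=3,        # features each tree may try at a split
    min_samples_leaf=2,    # minimum leaf size
    n_jobs=-1,             # tree training is parallelised
    random_state=0,
).fit(X, y)
score = rf.score(X, y)     # R^2 on the training set
```

Note that the raw uniform features are used without normalisation, illustrating the distribution-free property listed among the advantages.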
An artificial neural network (ANN) with a single hidden layer can be represented graphically as in Fig. 3.

Figure 3: Depiction of a neural network architecture
Formally, a one-hidden-layer MLP is a function $f: \mathbb{R}^{K} \to \mathbb{R}^{L}$, where $K$ is the size of the input vector $x$ and $L$ is the size of the output vector. In matrix notation,

$$f(x) = G\big(b^{(2)} + W^{(2)}\, s(b^{(1)} + W^{(1)} x)\big) \qquad \text{Eqn 30}$$

with bias vectors $b^{(1)}, b^{(2)}$, weight matrices $W^{(1)}, W^{(2)}$ and activation functions $G$ and $s$. The vector $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)}$ is the weight matrix connecting the input vector to the hidden layer; each row of $W^{(1)}$ represents the weights from the input units to one hidden unit. Typical choices for $s$ include

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \qquad \text{Eqn 31(a)}$$

$$\mathrm{sigmoid}(a) = \frac{1}{1 + e^{-a}} \qquad \text{Eqn 31(b)}$$

$$\mathrm{ReLU}(a) = \max(0, a) \qquad \text{Eqn 31(c)}$$

The output vector is then

$$f(x) = G\big(b^{(2)} + W^{(2)} h(x)\big) \qquad \text{Eqn 32}$$

For classification problems, class-membership probabilities can be obtained by choosing $G$ as the softmax function (in the case of multi-class classification):

$$\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_{k=1}^{L} \exp(a_k)} \qquad \text{Eqn 33}$$

where $a_i$ represents the $i$-th element of the input to the softmax, corresponding to class $i$, and $L$ is the number of classes. The result is a vector containing the probabilities that sample $x$ belongs to each class; the predicted output is the class with the highest probability. For regression the output remains

$$f(x) = b^{(2)} + W^{(2)} h(x) \qquad \text{Eqn 34}$$

where the output activation is the identity. The ANN uses different loss functions depending on the problem type. For classification the loss is the cross-entropy, which in the binary case is given by

$$\mathrm{Loss}(\hat{y}, y, W) = -y \ln \hat{y} - (1 - y)\ln(1 - \hat{y}) + \frac{\alpha}{2}\|W\|_2^2 \qquad \text{Eqn 35}$$

where $\frac{\alpha}{2}\|W\|_2^2$ is an L2-regularisation term (aka penalty) that penalises complex models, and $\alpha > 0$ is a non-negative hyper-parameter that controls the magnitude of the penalty. For regression problems, the ANN uses the squared-error loss function,

$$\mathrm{Loss}(\hat{y}, y, W) = \frac{1}{2}\|\hat{y} - y\|_2^2 + \frac{\alpha}{2}\|W\|_2^2 \qquad \text{Eqn 36}$$
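Eqns 30-34 can be sketched directly in numpy; the toy dimensions and random weights below are assumptions for illustration, not a trained network:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # Eqn 33, shifted for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2, task="classification"):
    """One-hidden-layer MLP of Eqns 30-34: s = tanh, G = softmax (or identity)."""
    h = np.tanh(b1 + W1 @ x)           # hidden layer h(x), Eqn 30
    a = b2 + W2 @ h                    # pre-activation output
    return softmax(a) if task == "classification" else a   # Eqn 32 / Eqn 34

# Toy network: 3 inputs, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
p = mlp_forward(rng.standard_normal(3), W1, b1, W2, b2)
```

With $G$ as the softmax, the output `p` is a valid probability vector over the two classes; passing `task="regression"` returns the identity output of Eqn 34 instead.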
Starting from initial random weights, the multi-layer perceptron (MLP) minimises the loss function by repeatedly updating these weights. After computing the loss, a backward pass propagates it from the output layer to the previous layers, providing each weight parameter with an update value meant to decrease the loss. In gradient descent, the gradient $\nabla_{W} \mathrm{Loss}$ of the loss with respect to the weights is computed and deducted from $W$. More formally,

$$W^{i+1} = W^{i} - \epsilon \nabla_{W} \mathrm{Loss}^{\,i} \qquad \text{Eqn 37}$$

where $i$ is the iteration step and $\epsilon > 0$ is the learning rate. The algorithm stops when it reaches a pre-set maximum number of iterations, or when the improvement in loss falls below a certain small threshold.

We give a brief introduction to gradient boosting here; for more information the reader may refer to [28]. For a given dataset with $n$ examples and $m$ features, $D = \{(x_i, y_i)\}$ with $|D| = n$, $x_i \in \mathbb{R}^{m}$, $y_i \in \mathbb{R}$, a tree-ensemble model uses $K$ additive functions to predict the output:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad \text{Eqn 38}$$

where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ with $q: \mathbb{R}^{m} \to T$, $w \in \mathbb{R}^{T}$, is the space of regression trees (also known as CART). $q$ represents the structure of each tree, mapping an example to the corresponding leaf index, and $T$ is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$. Each regression tree contains a continuous score on each leaf; $w_j$ represents the score on the $j$-th leaf. For a given example, we use the decision rules in the trees (given by $q$) to classify it into the leaves. To learn the set of functions used in the model, we minimise the following regularised objective:

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|w\|^2 \qquad \text{Eqn 39}$$

Here $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. The second term $\Omega$ penalises the complexity of the model (i.e., the regression tree functions). The additional regularisation term helps to smooth the final learnt weights to avoid over-fitting; intuitively, the regularised objective will tend to select a model employing simple and predictive functions. The tree-ensemble model in Eqn 39 includes functions as parameters and cannot be optimised using traditional optimisation methods in Euclidean space. Instead, the model is trained in an additive manner. Formally, let $\hat{y}_i^{(t)}$ be the prediction for the $i$-th instance at the $t$-th iteration; we need to add the $f_t$ that minimises the objective

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \qquad \text{Eqn 40}$$

This means we greedily add the $f_t$ that most improves our model according to Eqn 39. A second-order approximation can be used to quickly optimise the objective in the general setting:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \qquad \text{Eqn 41}$$

where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)$ and $h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)$ are the first- and second-order gradient statistics of the loss function. Removing the constant terms gives the simplified objective at step $t$:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \qquad \text{Eqn 42}$$

Define $I_j = \{i \mid q(x_i) = j\}$ as the instance set of leaf $j$. We can rewrite Eqn 42 by expanding $\Omega$ as follows:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} \qquad \text{Eqn 43}$$

$$= \sum_{j=1}^{T} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^{2} \right] + \gamma T \qquad \text{Eqn 44}$$

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^{*}$ of leaf $j$ by

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \qquad \text{Eqn 45}$$

and calculate the corresponding optimal value by

$$\tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \qquad \text{Eqn 46}$$

Eqn 46 can be used as a scoring function to measure the quality of a tree structure $q$. Normally it is impossible to enumerate all possible tree structures $q$; a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after a split. Letting $I = I_L \cup I_R$, the loss reduction after the split is given by

$$\mathcal{L}_{split} = \tfrac{1}{2} \left[ \frac{\big(\sum_{i \in I_L} g_i\big)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \qquad \text{Eqn 47}$$

Eqn 47 is used in practice for evaluating the split candidates.
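Eqns 45 and 47 reduce to a few sums. The following numpy sketch evaluates one candidate split under the squared-error loss, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the toy data and split are assumptions for illustration:

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf weight, Eqn 45: w* = -sum(g) / (sum(h) + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left, lam, gamma):
    """Loss reduction of a candidate split, Eqn 47."""
    score = lambda gi, hi: gi.sum() ** 2 / (hi.sum() + lam)
    return 0.5 * (score(g[left], h[left]) + score(g[~left], h[~left])
                  - score(g, h)) - gamma

# Squared-error loss at y_hat = 0: g_i = y_hat - y_i, h_i = 1
y = np.array([1.0, 1.2, 5.0, 5.3])
y_hat = np.zeros(4)
g, h = y_hat - y, np.ones(4)

# Candidate split separating the two obvious groups
left = np.array([True, True, False, False])
gain = split_gain(g, h, left, lam=1.0, gamma=0.0)
```

With $\lambda = 0$ the optimal leaf weight of Eqn 45 reduces to the mean residual of the leaf, which is the familiar behaviour of an unregularised regression tree.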
3. Numerical Experiment 1 (Toy Problem)

Logic of the novel CCR-MPC:

• The weather-state variables are $a$. Assuming $a \in [\beta]^{k}$, their evolution is $a_{t+1} = f(a_t) + \varepsilon$. In a sequential approach, once the weather states become known, the parameters of the time-series machine $f$ can be relearned; using this logic, $f' = \mathrm{Relearn}[f, \hat{a}_t]$, where $t$ is the time step and $\hat{a}_t$ is the true weather state. We then set $f = f'$ for the next time-step optimisation, until the final time, where $f$ is the time-series machine.
• The controller is $u \in [\alpha]^{m}$, for $j = 1, \dots, m$, where $m$ is the dimension of the control parameters; the controller has a direct influence in maintaining the indoor set-point temperature.
• Hence the current weather states $a$ and the controller $u \in [\alpha]^{m}$ are inputs to the forward problem, $Y = F(a, u) + \varepsilon$, where $Y$ is the true temperature, $S$ is the set-point temperature, and $F$ is the supervised learning model learned with either CCR, a DNN or an RF. The forward problem is then posed as an optimisation problem to track the set point:

$$u^{*} = \arg\min_{u} \|S - Y\|^2 + \text{constraints}$$

where "constraints" could be any economic or realistic building-physics conditions we impose on the optimisation problem. The implementation code can be found in the GitHub repository: https://github.com/clementetienam/Data_Driven_MPC_Controller_Using_CCR
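The receding-horizon logic above can be sketched with scipy's Nelder-Mead optimiser. The two stand-in functions below are assumptions playing the roles of the time-series machine $f$ and the forward model $F$; they are not the trained models from the repository:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-ins for the learned machines (assumed linear/sinusoidal for illustration):
def weather_forecast(t):                        # plays the role of f
    return np.array([10 + 5 * np.sin(t / 24)])  # one weather state, e.g. ambient temp

def forward_model(weather, u):                  # plays the role of F (CCR/DNN/RF)
    return 0.5 * weather[0] + 0.8 * u[0] - 0.1 * u[1]

set_point = 21.0                                # desired room temperature [C]
schedule = []
for t in range(6):                              # receding-horizon loop
    w = weather_forecast(t)
    # Solve argmin_u ||S - F(w, u)||^2 with Nelder-Mead (constraints omitted here)
    res = minimize(lambda u: (set_point - forward_model(w, u)) ** 2,
                   x0=np.array([1.0, 1.0]), method="Nelder-Mead")
    schedule.append(res.x)
    tracked = forward_model(w, res.x)
```

Each loop iteration mirrors one step of the sequential formulation: forecast the weather state, then solve the inverse problem for the control signals before moving to the next time step.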
A deep neural network (DNN) was trained to map the weather states together with the control variables to the zone mean temperature. We compute the $R^2$ accuracy for both the model learning and the set-point tracking as

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\big(y_i^{true} - y_i^{pred}\big)^2}{\sum_{i=1}^{N}\big(y_i^{true} - \overline{y^{true}}\big)^2} \qquad \text{Eqn 48}$$

where $y^{true}$ is the true data or set point and $y^{pred}$ is the model reconstruction. The model discomfort is then

$$\mathrm{Discomfort}\;(\mathrm{C}) = \frac{1}{N}\sum_{i=1}^{N} \big|\, y_i^{true} - y_i^{pred} \,\big| \qquad \text{Eqn 49}$$

In this experiment, the Nelder-Mead optimisation method [24] and the iterative ensemble smoother (I-ES) [29] are compared as the optimisation methods.
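The two metrics of Eqns 48 and 49 are a few lines of numpy; the toy temperature series below is an assumption for illustration:

```python
import numpy as np

def r2_accuracy(y_true, y_pred):
    """Eqn 48: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def discomfort(y_true, y_pred):
    """Eqn 49: mean absolute deviation, in degrees C."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([18.0, 19.0, 21.0, 22.0, 20.0])   # e.g. measured room temperature
y_pred = y_true + 0.1                                # a near-perfect reconstruction
r2 = r2_accuracy(y_true, y_pred)
d = discomfort(y_true, y_pred)
```

A perfect reconstruction gives $R^2 = 1$ and a discomfort of 0 C; a constant offset of 0.1 C leaves $R^2$ near 1 while the discomfort reports the offset directly.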
Box Configuration
Inputs: โข Environment: Site Outdoor Air Drybulb Temperature [C] โข Environment: Site Outdoor Air Wetbulb Temperature [C] โข Environment: Site Outdoor Air Relative Humidity [%] โข Environment: Site Wind Speed [m/s] โข Environment: Site Wind Direction [deg] โข Environment: Site Horizontal Infrared Radiation Rate per Area [W/m2] โข Environment: Site Diffuse Solar Radiation Rate per Area [W/m2] โข Environment: Site Direct Solar Radiation Rate per Area [W/m2] โข THERMAL ZONE: BOX: Zone Outdoor Air Wind Speed [m/s]
Outputs: โข THERMAL ZONE: BOX: Zone Mean Air Temperature [C]
Fig. 4:
Correlation map of the features to the outputs for Machine 1
Fig. 5: (a, b) $R^2$ (%) training and test accuracy on Air Temperature [C] (Hourly). (b, e) Histogram plot showing $f(x_j; \theta^*) - y^{true}$. (c) Superimposed plot of $f(x_j; \theta^*)$ and $y^{true}$.

Ground Source Heat Pump (GSHP) configuration
Inputs: โข Environment: Site Outdoor Air Drybulb Temperature [C] โข Environment: Site Outdoor Air Wetbulb Temperature [C] โข Environment: Site Outdoor Air Relative Humidity [%] โข Environment: Site Wind Speed [m/s] โข Environment: Site Wind Direction [deg] โข Environment: Site Horizontal Infrared Radiation Rate per Area [W/m2] โข Environment: Site Diffuse Solar Radiation Rate per Area [W/m2] โข Environment: Site Direct Solar Radiation Rate per Area [W/m2] โข THERMAL ZONE: BOX: Zone Outdoor Air Wind Speed [m/s] โข GSHPCLG: Heat Pump Electric Power [W]-control signal โข GSHPCLG: Heat Pump Source Side Inlet Temperature [C]-control signal โข GSHPHEATING: Heat Pump Electric Power [W]-control signal โข GSHPHEATING: Heat Pump Source Side Inlet Temperature [C]-control signal
Outputs: โข THERMAL ZONE: BOX: Zone Mean Air Temperature [C]
Fig. 6: (a, b) $R^2$ (%) training and test accuracy on Air Temperature [C] (Hourly). (b, e) Histogram plot showing $f(x_j; \theta^*) - y^{true}$. (c) Superimposed plot of $f(x_j; \theta^*)$ and $y^{true}$.

Fig. 7:
LSTM performance on the hourly prediction of the Box dataset. Each series has been learned and forecasted into the future using only its output at the previous time step. The lookback is 7 time steps, i.e. $t, t-1, t-2, \dots, t-6$ are used to predict $t+1$.

Fig. 8:
Open-loop batch optimisation. The room temperature is forecasted "blindly" with the LSTM model, and the GSHP control parameters are optimised to the set temperature. The optimisation algorithm is Nelder-Mead in panel (a) and I-ES in panel (b).
Fig.9:
Closed-loop sequential optimisation. The room temperature is initially forecasted "blindly" with the LSTM model and then corrected to the true temperature reading (from a sensor), and the GSHP control parameters are optimised to the set temperature. The optimisation algorithm is Nelder-Mead in panel (a) and I-ES in panel (b).
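The I-ES branch of these optimisations can be sketched with the ensemble update of Eqns 20-21 applied to an ensemble of control vectors; the linear forward model $G$ and all numbers here are assumptions for illustration, not the GSHP model:

```python
import numpy as np

def ies_update(M, D, d_obs, Cd):
    """One ensemble-smoother step (Eqn 21) on an ensemble of control vectors.
    M: (n_m, Ne) parameter ensemble; D: (n_d, Ne) predicted-data ensemble."""
    Ne = M.shape[1]
    A = M - M.mean(axis=1, keepdims=True)      # parameter anomalies
    Y = D - D.mean(axis=1, keepdims=True)      # data anomalies
    Cmd = A @ Y.T / (Ne - 1)                   # Eqn 20(a)
    Cdd = Y @ Y.T / (Ne - 1)                   # Eqn 20(b)
    K = Cmd @ np.linalg.inv(Cdd + Cd)          # Kalman gain
    return M + K @ (d_obs[:, None] - D)

# Toy inverse problem: find controls u with G u = set point (assumed linear G)
rng = np.random.default_rng(0)
G = np.array([[0.8, -0.1]])
set_point = np.array([21.0])
Cd = np.array([[1e-4]])
M = rng.normal(10.0, 5.0, size=(2, 100))       # prior control ensemble
for _ in range(5):                              # iterate the smoother
    D = G @ M + rng.normal(0, 1e-2, size=(1, 100))
    M = ies_update(M, D, set_point, Cd)
residual = abs((G @ M.mean(axis=1))[0] - set_point[0])
```

Unlike Nelder-Mead, the I-ES carries a whole ensemble of candidate controls, so the spread of the final ensemble also provides the inverse-UQ estimate mentioned in the abstract.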
4. Numerical Experiment 2

The first step is developing a good time-series model that can predict the states (5 variables) accurately. Fig. 10 below shows the approach using 50% of the dataset as the training set and 50% as the test set. The time-series model is XGBoost. From the "date-time" index in pandas, which has the format "01/01/2010 00:00:00", 8 new features (pseudo-inputs) were created, namely: 'hour', 'day of week', 'quarter', 'month', 'year', 'day of year', 'day of month' and 'week of year'. Five time-series machines were modelled overall, mapping these 8 features to each of the 5 weather-state variables.
Fig. 10:
XGboost time series model for predicting the states. (a-e) are for states 'Ambient temperature (C)','Ground temperature (C)', 'Global irradiance (W/m2)', 'Direct irradiance (W/m2)','Diffuse irradiance (W/m2)'
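The pseudo-input construction described above can be sketched with pandas, assuming an hourly DatetimeIndex in the stated format:

```python
import pandas as pd

# Hourly index in the '01/01/2010 00:00:00' format used by the dataset (assumed length)
idx = pd.date_range("2010-01-01", periods=500, freq="h")
df = pd.DataFrame(index=idx)

# The 8 pseudo-inputs created from the date-time index
df["hour"] = df.index.hour
df["day_of_week"] = df.index.dayofweek
df["quarter"] = df.index.quarter
df["month"] = df.index.month
df["year"] = df.index.year
df["day_of_year"] = df.index.dayofyear
df["day_of_month"] = df.index.day
df["week_of_year"] = df.index.isocalendar().week.to_numpy()
```

Each of the 5 weather-state regressors then takes this 8-column frame as its input matrix, one fitted model per state variable.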
Next, we construct a forward problem that maps these 5 weather states together with the 2 controller inputs, "Heat pump heat supply (kW)" and "Heat pump electrical load (kW)", making 7 variables in all, to predict the internal temperature. Several regression algorithms were tested, as seen in Fig. 11(b) below.
Fig. 11: (a) Correlation map of the data. (b) Various regression algorithms to approximate the forward mapping of 7 to 1 on test data.
Overall, Random Forest showed the best $R^2$ (%) accuracy. But Random Forest fails in optimising to the set-point temperature: during optimisation, the prediction from Random Forest is based on an ensemble of decision trees, and ensemble averaging for discrete Heaviside-like signals such as the set point (17.5 C / 21 C) gave an inaccurate reconstruction, as shown in Fig. 12.

Fig. 12: XGBoost time-series model and Random Forest forward model. The discomfort is 3.821 C, due to the ensemble-averaging nature of the Random Forest algorithm.
With that in mind, deterministic algorithms such as a deep neural network (DNN) and polynomial regression were tested. We first trained the forward models and obtained the $R^2$ (%) accuracy for (a) the DNN machine and (b) polynomial regression of degree 9, as depicted in Fig. 13.

Fig. 13: $R^2$ (%) accuracy on test data for the forward problem (7 to 1). (a) DNN and (b) polynomial regression, degree 9.

The controller was then re-run for the test set-point data using XGBoost as the time-series machine. The discomfort for each model is depicted in Figs. 14 and 15 for the DNN and polynomial regression respectively, with polynomial regression having the lower discomfort.
Table 1 shows the summary of various combinations of the XGBoost time-series machine with 4 promising forward models, with polynomial regression coming 1st and the DNN coming 2nd.

Fig. 14:
Data-driven model: XGBoost for modelling the time series and a DNN for modelling the forward problem. Optimisation is with Nelder-Mead.
Fig. 15:
Data-driven model: XGBoost for modelling the time series and polynomial regression (degree 9) for modelling the forward problem. Optimisation is with Nelder-Mead.
Model | Discomfort (C)
XGboost + DNN |
XGboost + Polynomial regression |
XGboost + Random Forest |
XGboost + Nearest Neighbour |

Table 1: Discomfort of the four XGBoost-based combinations.

We then re-ran the sequence, substituting the forward model with the Cluster Classify Regress algorithm (CCR) [6, 8], and obtained better performance. The prediction is for 500 steps ahead, i.e. 7500 minutes into the future (about 5 days in advance). The forward mapping was able to trace the discontinuity of the data and better solve the inverse problem. Fig. 16 shows the data-driven MPC using CCR as a forward model. Conclusively, the approach using a CCR model as the forward model gave better performance than one using a single machine for the forward model.
Figure 16: Data-driven MPC models. (a) The time-series machine is an XGBoost model; the forward model is a CCR model where the gate is an XGBoost classifier and the experts (2) are 9th-degree polynomial regressors. (b) The time-series machine is an XGBoost model; the forward model is a CCR model where the gate is a Random Forest and the experts (2) are 9th-degree polynomial regressors. (c) XGBoost for modelling the time series and polynomial regression (degree 9) for modelling the forward problem. (d) XGBoost for modelling the time series and a DNN for modelling the forward problem. Optimisation is with Nelder-Mead throughout.
Model | Discomfort (C)
(a) XGboost + CCR (XGboost / Polynomial regression) |
(b) XGboost + CCR (XGboost / Random Forest) |
(c) XGboost + Polynomial regression |
(d) XGboost + Nearest Neighbour |

Table 2: Discomfort of the four combinations shown in Figure 16.
5. Conclusion & Future Work
We have developed a novel model predictive controller called
CCR-MPC, for optimising the conditions of a building to track a desired set point. The approach is data-driven and scales easily to any size of dataset. As with any other machine learning construction, the quality of the surrogate model depends strongly on the quality of the training dataset; hence domain knowledge is important in interpreting results from this novel controller. The CCR-MPC is elegant and, from the numerical results, is able to track and maintain the desired set-point temperature in a predictive manner. Inverse and forward UQ are also naturally embedded, from the components of the forward CCR model to the type of optimisation used in solving the inverse problem. Future work will apply a direct reinforcement learning approach to the MPC formulation, giving a natural logic in which the controller learns directly from its environment and chooses the best course of action for future time horizons.
6. Acknowledgement
CE developed the idea and mathematics for the novel CCR-MPC approach, coded the numerical implementation in Python, and wrote the document. SS prepared the data set for toy problem 1 and gave valuable civil engineering knowledge during code implementation. ED prepared the data set for the Imperial College data and gave valuable civil engineering knowledge during code implementation. JS directed the research and gave valuable civil engineering knowledge and feedback during code implementation.
References
[1] Alex Gorodetsky and Youssef Marzouk. Efficient localization of discontinuities in complex computational simulations. SIAM Journal on Scientific Computing, 36(6):A2584-A2610, 2014.
[2] Anne Gelb and Eitan Tadmor. Spectral reconstruction of piecewise smooth functions from their discrete data. ESAIM: Mathematical Modelling and Numerical Analysis, 36(2):155-175, 2002.
[3] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[6] Clement Etienam, Kody Law, and Sara Wade. Ultra-fast deep mixtures of Gaussian process experts. arXiv preprint arXiv:2006.13309, 2020.
[7] David Adalsteinsson and James A. Sethian. A fast level set method for propagating interfaces. Journal of Computational Physics, 118(2):269-277, 1995.
[8] David E. Bernholdt, Mark R. Cianciosa, David L. Green, Jin M. Park, Kody J. H. Law, and Clement Etienam. Cluster, classify, regress: A general method for learning discontinuous functions. Foundations of Data Science, 1(4):491, 2019.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Dirk Pflüger, Benjamin Peherstorfer, and Hans-Joachim Bungartz. Spatially adaptive sparse grids for high-dimensional data-driven problems. Journal of Complexity, 26(5):508-522, 2010.
[11] Dmitry Batenkov. Complete algebraic reconstruction of piecewise-smooth functions from Fourier data. Mathematics of Computation, 84(295):2329-2350, 2015.
[12] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[13] Habib N. Najm, Bert J. Debusschere, Youssef M. Marzouk, Steve Widmer, and O. P. Le Maître. Uncertainty quantification in chemical systems. International Journal for Numerical Methods in Engineering, 80(6-7):789-814, 2009.
[14] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numerica, 13:147-269, 2004.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[17] John D. Jakeman, Richard Archibald, and Dongbin Xiu. Characterization of discontinuities in high-dimensional stochastic problems on adaptive sparse grids. Journal of Computational Physics, 230(10):3977-3997, 2011.
[18] Karla Monterrubio-Gómez, Lassi Roininen, Sara Wade, Theo Damoulas, and Mark Girolami. Posterior inference for sparse hierarchical non-stationary models. arXiv preprint arXiv:1804.01431, 2018.
[19] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[20] Knut S. Eckhoff. Accurate reconstructions of functions of finite regularity from truncated Fourier series expansions. Mathematics of Computation, 64(210):671-690, 1995.
[21] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[22] Matthew M. Dunlop, Marco A. Iglesias, and Andrew M. Stuart. Hierarchical Bayesian level set inversion. Statistics and Computing, 27(6):1555-1584, 2017.
[23] Michael Yu Wang, Xiaoming Wang, and Dongming Guo. A level set method for structural topology optimization. Computer Methods in Applied Mechanics and Engineering, 192(1-2):227-246, 2003.
[24] M. J. D. Powell. Direct search algorithms for optimization calculations. In Acta Numerica 1998, A. Iserles (Ed.), Cambridge University Press, Cambridge, UK, pp. 287-336, 1998.
[25] Rick Archibald, Anne Gelb, Rishu Saxena, and Dongbin Xiu. Discontinuity detection in multivariate space for stochastic simulations. Journal of Computational Physics, 228(7):2676-2689, 2009.
[26] Rick Archibald, Anne Gelb, and Jungho Yoon. Polynomial fitting for edge detection in irregularly sampled signals and images. SIAM Journal on Numerical Analysis, 43(1):259-279, 2005.
[27] Stefano Conti and Anthony O'Hagan. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference, 140(3):640-651, 2010.
[28] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754, 2016.