Exploring Bayesian Surprise to Prevent Overfitting and to Predict Model Performance in Non-Intrusive Load Monitoring
Richard Jones, Christoph Klemenjak, Stephen Makonin, Ivan V. Bajic
Richard Jones
School of Engineering Science, Simon Fraser University, [email protected]
Christoph Klemenjak
Institute of Networked and Embedded Systems, University of Klagenfurt, [email protected]
Stephen Makonin
School of Engineering Science, Simon Fraser University, [email protected]
Ivan V. Bajić
School of Engineering Science, Simon Fraser University, [email protected]
ABSTRACT
Non-Intrusive Load Monitoring (NILM) is a field of research focused on segregating constituent electrical loads in a system based only on their aggregated signal. Significant computational resources and research time are spent training models, often using as much data as possible, perhaps driven by the preconception that more data equates to more accurate models and better performing algorithms. When has enough prior training been done? When has a NILM algorithm encountered new, unseen data? This work applies the notion of Bayesian surprise to answer these questions, which are important for both supervised and unsupervised algorithms. We quantify the degree of surprise between the predictive distributions (termed postdictive surprise), as well as between the transitional probabilities (termed transitional surprise), before and after a window of observations. We compare the performance of several benchmark NILM algorithms supported by NILMTK, in order to establish a useful threshold on the two combined measures of surprise. We validate the use of transitional surprise by exploring the performance of a popular Hidden Markov Model as a function of surprise threshold. Finally, we explore the use of a surprise threshold as a regularization technique to avoid overfitting in cross-dataset performance. Although the generality of the specific surprise threshold discussed herein may be suspect without further testing, this work provides clear evidence that a point of diminishing returns of model performance with respect to dataset size exists. This has implications for future model development and dataset acquisition, as well as aiding in model flexibility during deployment.
Non-Intrusive Load Monitoring (NILM), often referred to as load disaggregation, dates back to the seminal work presented in [20]. In a nutshell, NILM describes the problem of identifying the electrical appliances present within a time series consisting of a sequence of (power) measurements taken at a central point in the distribution grid of a building. As a recently published review [17] shows, the number of NILM techniques relying on machine learning approaches, especially Deep Learning, has increased significantly in recent years. Compared to traditional NILM techniques, Deep Learning methods require considerably larger amounts of training data. Motivated by this, research groups have invested considerable effort in collecting and publishing energy datasets. Energy datasets are the outcome of measurement campaigns in one or several buildings with the aim of collecting energy consumption data at both the aggregate and load/appliance levels [38]. In recent years, more and more energy datasets have emerged (e.g., [26, 32, 34, 35], to name a few), which can vary considerably in terms of complexity, methodology, appliance characteristics and usage patterns, setting, etc. (e.g., see [27, 38]).

With some datasets spanning several years of collection, considerable time and computational resources are spent in training new models. Newer approaches to NILM increasingly adopt deep learning methods (e.g., [19, 29]), which can involve millions of tunable parameters, not to mention the often arduous process of hyperparameter tuning. It stands to reason, then, that effectively isolating the most important segments of a dataset relative to a model could improve time-to-deployment as well as potentially regularize against overfitting. A common technique to truncate training time is to monitor the model's loss metric over a validation partition of the dataset. However, the entire available training set is used in an epoch before evaluation on the validation set is made. Given the wide variation in dataset complexity, arbitrarily training on a subset of the available data runs the significant risk of missing important relationships between appliance modes, or even missing appliance modes entirely.

In an online setting, a common approach for disaggregation is to deploy generalized models that are subsequently specialized to a given house by an additional round of training [44]. In these cases, appliance-level performance metrics are unavailable, and optimization of a model is instead left to crude estimates of performance such as internal consistency between proposed appliance profiles or extracted features, the fraction of the total energy assigned, convergence of model parameters to specific values, etc. As a result, it can be difficult to know how much data is necessary to re-train generalized models. In a real use-case, consumers need to know when a NILM solution is accurate enough to be trusted. Additionally, appliances in a modern home can change abruptly. The addition, removal, or replacement of appliances in a home can quickly render inflexible models obsolete. To ensure the longevity of NILM solutions in residential homes, some measure of the novelty of incoming data is needed. If data can be recognized as even potentially useful in updating an existing model, these issues can be addressed.

The concept of novelty in incoming data has a model-specific dimension, in that different models may learn different features of the data.
Clearly, data exhibiting novel features relative to those the model has already learned would qualify as novel or "surprising". Generalizing this notion of novelty is difficult and not amenable to a one-size-fits-all approach. However, there is also a way in which data can be intrinsically surprising, in the sense that specific appliance modes can be activated for the first time or exhibit abnormal behaviour. Moreover, appliances such as dish washers or clothes washers/dryers are multi-sequence machines with many user-operated programs. Data exposing new relationships between previously observed appliance modes may also qualify as intrinsically surprising. We approach both of these data-specific notions of novelty through the framework of Bayesian surprise.

The remainder of this paper is structured as follows: Section 2 gives a brief overview of the motivations behind Bayesian surprise and some of the previous work in the area. Section 3 relates these concepts to NILM by modeling appliance activations in a non-parametric Gaussian mixture model and introducing postdictive surprise. Additionally, we introduce the concept of transitional surprise by simply modeling the relationships between appliance states in a Markovian sense. Section 4 shows some preliminary results, highlighting (1) the diminishing returns of increased amounts of similar data, (2) the potential "model-agnostic" regularization effect of training data truncation, and (3) the usefulness of transitional surprise to (crudely) approximate system dynamics. Finally, Section 5 provides some insights into the conducted experiments and some suggestions for further development of the concept of surprise in NILM.
The literature defines surprise as the result of a discrepancy between expectation and observation, where expectation stems from experience gained through observation [3]. Several measures of Bayesian surprise have been proposed in related work. In [6], several measures of surprise are derived for outlier detection in normal models. On the basis of comparative studies, the authors recommend partial posterior predictive p-value and plug-in measures. With regard to sequential (Bayesian) learning, Itti and Baldi [2, 22] define
Bayesian surprise to be a measure of dissimilarity used to assess the effect of data D on the belief distributions of an observer. This means that Bayesian surprise can be understood as the distance (i.e., dissimilarity) between the prior distribution P(M) and the posterior distribution P(M | D) over a set 𝓜 of possible models:

\[ \forall M \in \mathcal{M}, \quad P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}, \tag{1} \]

\[ S(D, \mathcal{M}) = d\,[\,P(M \mid D),\; P(M)\,], \tag{2} \]

where the relative entropy, or Kullback–Leibler (KL) divergence, is suggested to serve as the distance measure d in the initial proposal of [1]. Instead of the KL divergence, the Jensen–Shannon and Cauchy–Schwarz divergences can be used as well to compute Bayesian surprise, as done in [21]. Itti and Baldi's interpretation of Bayesian surprise has found application in various forms: de-biasing of thematic maps [9], automatic detection of landmarks in computer vision [39], detection of salient acoustic events [40], identification of calcifications in mammogram images [10], and determining suitable thresholds for extreme value models [30].

In [28], Bayesian updating of an agent's beliefs was grouped into two general categories. First, Bayesian surprise is the term given to the change in beliefs over latent variables, i.e., the divergence between the prior and posterior over unobservable quantities inferred through observations. Second, postdictive surprise refers to the divergence between the prior and posterior predictive distributions, quantifying the surprise over observable quantities. In [13], the concept of confidence-corrected surprise is developed, in which the degree of commitment to a particular generative model influences the extent to which observations update an agent's beliefs. However, given that the intent of the present work is to develop a "model-agnostic" formulation for NILM datasets, surprise in the present work is restricted to a fixed model (i.e., |𝓜| = 1): to serve as a data-centric overfitting counter-measure that is applicable to all NILM techniques, it must be determinable without reference to the particular model being trained.

In [21], the authors propose a Bayesian surprise metric based on the Cauchy–Schwarz divergence to differentiate between useful information and redundant observations during online learning of mixtures of Gaussians. The main motivation behind this measure is to prevent outliers from significantly changing the model parameters as well as to restrict redundant samples from over-specifying component parameters, which would lead to overfitting. In the context of online learning, our work can be considered somewhat of an extension of [21] to non-parametric methods, rather than storing outliers and instantiating new components based on Gaussian Mean Shifting. However, the main focus of the present work is to use GMMs to explore the point at which the data is no longer surprising with respect to improving the performance of any model. By contrast, [21] uses the concept of Bayesian surprise within a GMM to optimize its own clustering performance.

A natural approach to characterize the novelty of incoming data is to examine the change in the signal and compare it to the changes so far observed. In other words, clustering on the first-differences of the signal permits an intuitive notion of surprising data: appliance events not yet seen.
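As a concrete, purely illustrative sketch of this idea, the snippet below extracts candidate events wherever the absolute first difference of the aggregate power exceeds a threshold; the function name and the 30 W threshold are assumptions for illustration, not values taken from the experiments in this work.

```python
import numpy as np

def extract_events(power, min_step=30.0):
    """Return (indices, step sizes) where the aggregate power jumps by more
    than `min_step` watts between consecutive samples.

    `power` is a 1-D array of aggregate readings; `min_step` is an
    illustrative threshold, not a value taken from the paper."""
    diffs = np.diff(power)                      # first differences of the signal
    idx = np.where(np.abs(diffs) > min_step)[0]
    return idx, diffs[idx]

# Example: a synthetic aggregate with two appliance activations.
agg = np.array([100, 100, 1600, 1600, 1600, 100, 100, 350, 350, 100], dtype=float)
events_idx, events = extract_events(agg)
print(events_idx, events)   # steps of roughly +/-1500 W and +/-250 W
```

The magnitudes of these steps are what the mixture model introduced below clusters into appliance states.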
Following the basic appliance characterizations in [20], simple ON-OFF or multi-state appliances can have their initial activations modelled as Gaussian around some mean value. However, transient characteristics of appliances, such as the consumption spike at the start of a fridge's condenser cycle, can result in a highly varying activation value. Moreover, the consistency of these initial activations is dependent on sampling frequency. We consequently preprocess the data using a fast, steady-state block filter developed in [23]. This filter imputes the mean value between change-points identified using an adaptive threshold on the raw power and the first-differences of the signal. The resulting steady-state power for individual appliance states is far more amenable to Gaussian modelling given its improved consistency. An example of the filter output and the corresponding raw aggregate data is shown in Figure 1.

In a typical Gaussian mixture model with K components, the likelihood is written as

\[ p(x \mid \theta_1, \ldots, \theta_K) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{3} \]

where θ_k = {μ_k, Σ_k, π_k} parameterizes component k by its mean vector μ_k, its mixing proportion π_k (where 0 ≤ π_k ≤ 1 and Σ_k π_k = 1), and its covariance matrix Σ_k. In the Bayesian context, prior distributions are placed on each component's parameters, which in turn are parameterized by a set of hyperparameters shared across components. For the sake of inferential tractability, these priors are typically conjugate to their likelihoods. In the general case, component means and covariances are unknown, requiring a normal-inverse-Wishart joint prior, described for each component by

\[ \Sigma \sim \mathrm{IW}(\nu, \Delta), \qquad \mu \mid \Sigma \sim \mathcal{N}(\varphi, \Sigma / \kappa). \tag{4} \]

Here, IW is the inverse-Wishart distribution with covariance/scale matrix Δ and degrees of freedom ν. Similarly, the conditionally normal prior on the component means is parameterized by a base mean, φ, and a covariance scaled by another hyperparameter, κ. The mixing proportions are typically given a Dirichlet conjugate prior with hyperparameter α:

\[ \pi \mid \alpha \sim \mathrm{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_K), \tag{5} \]

where the α_i's are the "pseudo-count" prior observations of the i-th component. Typically the prior is symmetric, such that α_1 = α_2 = ... = α_K = α.

This construction allows the parameters and weights of the K Gaussian components to be sampled according to the data, often in Markov Chain Monte Carlo methods such as Gibbs sampling. Despite their inherent flexibility, GMMs are a parametric method, i.e., one of fixed dimensionality. Shifts in component weights when observing new data may be surprising, but this is a more gradual shift, and the predictive distribution will converge to a relatively stationary distribution that accounts for the prevalence of each component. Instead, the intuitively surprising aspect of new data is the instantiation of a new component/appliance state. This requires an extension of Gaussian mixtures into nonparametric methods, which we briefly overview.

In order to achieve an unbounded set of mixing components and their respective mixing proportions, we introduce the Dirichlet Process (DP). The DP is a stochastic process that generates random probability measures which follow a Dirichlet distribution for every finite partition of some measurable space [14]. It is uniquely defined by a base measure on the measurable space and a concentration parameter, similar to the finite-dimensional Dirichlet distribution.
The more intuitive "stick-breaking" picture of the DP was provided by [41], which naturally motivates the use of DPs in mixture models as a nonparametric prior. In the stick-breaking procedure, the infinite sequence of mixing proportions is generated by drawing from a GEM distribution, described by

\[ \nu_i \mid \alpha \sim \mathrm{Beta}(1, \alpha), \quad i = \{1, 2, \ldots\}, \qquad \pi_i = \nu_i \prod_{\ell=1}^{i-1} (1 - \nu_\ell), \quad i = \{1, 2, \ldots\}. \tag{6} \]

This process can be understood by imagining a unit probability stick being continually partitioned, with the proportion of the remaining stick to be broken off chosen according to a beta distribution parameterized by (1, α).

A draw from the DP (i.e., G ~ DP(α, G_0)) is a discrete, infinite random object that can be expressed by

\[ G = \sum_{i=1}^{\infty} \pi_i\, \delta_{\theta_i}, \quad i = \{1, 2, \ldots\}, \tag{7} \]

where θ_i is the i-th of the countably infinite atoms drawn i.i.d. from a base distribution, G_0. That is,

\[ \theta_i \mid G_0 \overset{\text{i.i.d.}}{\sim} G_0. \tag{8} \]

In our case, G_0 is typically the joint conjugate prior for the means and covariances which specify the Gaussian components (i.e., the normal-inverse-Wishart distribution, Equation 4) [18]. In other words, the atoms of the DP parameterize Gaussians centered around the base hyperparameters. The concentration hyperparameter, α, determines the extent to which the atoms cluster around G_0. Marginalization over the infinite sequence of mixture proportions in the so-called Chinese Restaurant Process (see [15]) exposes the preferential attachment of the cluster assignments. This is integral to instantiating as few components as necessary given the observed data. Hierarchical models involving hyperpriors over the hyperparameters of G_0 can be constructed to guard against poor model initializations; however, we restrict our attention to the simpler case of fixed hyperparameters.

To compute the postdictive surprise, we require the predictive density, given by

\[ p(x_{N+1} \mid x_1, \ldots, x_N, \alpha, G_0) = \int p(x \mid \theta)\, p(\theta \mid x_1, \ldots, x_N, \alpha, G_0)\, d\theta. \tag{9} \]
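The stick-breaking construction in Equation (6) is easy to simulate; the truncated sketch below, with an illustrative truncation level and α value, shows how a small α concentrates the mixing proportions on only a few components.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw truncated stick-breaking (GEM) mixing proportions.

    Each nu_i ~ Beta(1, alpha); pi_i = nu_i * prod_{l < i} (1 - nu_l)."""
    rng = np.random.default_rng(rng)
    nu = rng.beta(1.0, alpha, size=truncation)                 # stick-break fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    return nu * remaining                                      # mixing proportions pi_i

# Small alpha concentrates mass on few components; large alpha spreads it out.
print(stick_breaking_weights(alpha=1.0, truncation=10, rng=0).round(3))
print(stick_breaking_weights(alpha=10.0, truncation=10, rng=0).round(3))
```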
Figure 1: Example of Steady-state Block-filter output, from [23]
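The following sketch conveys what "imputing the mean value between change-points" means in practice. It is a simplified stand-in with a fixed threshold and plain segment means, not the adaptive block filter of [23]; the threshold value is an illustrative assumption.

```python
import numpy as np

def simple_steady_state_filter(power, step_threshold=30.0):
    """Replace each run of samples between detected change-points with its mean.

    `step_threshold` is an illustrative fixed threshold; the filter in [23]
    adapts its threshold to the raw power and its first differences."""
    power = np.asarray(power, dtype=float)
    change_points = np.where(np.abs(np.diff(power)) > step_threshold)[0] + 1
    boundaries = np.concatenate(([0], change_points, [len(power)]))
    filtered = np.empty_like(power)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        filtered[start:end] = power[start:end].mean()   # steady-state value
    return filtered

agg = [100, 102, 99, 1605, 1598, 1601, 101, 100]
print(simple_steady_state_filter(agg))
```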
However, the DP prior precludes an analytic closed form for the posterior distribution, p(θ | x_1, ..., x_N, α, G_0). Although MCMC methods are a common way of approximating such densities, inference of model parameters by sampling methods is typically slow and scales poorly as the number of parameters or data points increases [31]. Additionally, convergence metrics are heuristic at best. In contrast, variational methods select a simpler family of distributions whose posterior density is ideally able to approximate the true posterior by optimizing a set of variational parameters. These parameters are optimized with respect to the evidence lower bound (ELBO), a bound on the log marginal likelihood of the data, which is straightforwardly related to the divergence between the variational posterior and the true posterior. Thus, convergence – at least to a local optimum – is well-defined. Variational inference methods for Dirichlet Process mixture models were first introduced in [7], and in this paper we make use of the scikit-learn implementation [37], available as of version 0.18.

The variational approximation is proposed to take the following form:

\[ q(\nu, \theta, z) = \prod_{k=1}^{K-1} q_{\gamma_k}(\nu_k) \prod_{k=1}^{K} q_{\tau_k}(\theta_k) \prod_{n=1}^{N} q_{\phi_n}(z_n). \tag{10} \]

Here, {γ, τ, φ} are the variational parameters subject to coordinate ascent optimization. The q_γ are beta distributions parameterized by the individual stick lengths, ν_k. The q_τ are in our case Gaussians parameterized by θ_k = {μ_k, Σ_k}, although extension to general exponential families is possible. The q_{φ_n} are multinomial, parameterized by indicator variables z_n, which denote the component to which the observation x_n is assigned. To speed up inference, a truncation on the maximum number of possible states is imposed on the variational approximation, similar to truncation in methods such as blocked Gibbs sampling [7]. This value, K, is itself a variational parameter which can be fixed or optimized with respect to the ELBO. K was fixed in our work to 30 unique components. Under this approximation, the resulting posterior predictive distribution needed for computing postdictive surprise can be neatly factored as expectations with respect to the variational distribution:

\[ p(x_{N+1} \mid x_1, \ldots, x_N, \alpha, G_0) \approx \sum_{k=1}^{K} \mathbb{E}_q[\pi_k]\, \mathbb{E}_q[\,p(x_{N+1} \mid \theta_k)\,]. \tag{11} \]

For many machine learning algorithms, decay in the postdictive surprise might be sufficient to demarcate useful data from superfluous data during training. However, it is often the case that temporal relationships between appliance states are learned and contribute to inference. Such methods include Hidden Markov Models (HMMs) and their many extensions, more recent deep learning techniques such as those based on Recurrent Neural Networks, and many more. In the interest of simplicity, we restrict the notion of "transitional surprise" to the Markovian sense. That is, we treat the state sequence as a Markov chain, such that the current state of the system is determined only by the state before it. For a system of K appliance states, this transitional surprise constitutes comparing the rows of the K × K transition matrix.
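A minimal sketch of the pipeline, using scikit-learn's variational Dirichlet Process mixture (BayesianGaussianMixture), is given below. It treats the fitted mixture density as the approximate posterior predictive of Equation (11), compares densities before and after a window of events on a discretized grid, and builds an empirical K × K transition matrix from hard component assignments; the precise window-by-window formulation follows below (Equations 12 and 13). The grid, divergence choice, pseudo-counts, and synthetic data are illustrative assumptions, and the transitional comparison here is a coarser before/after version of the per-step sum used in the paper.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_dp_gmm(events, K=30, seed=0):
    """Variational DP-GMM over 1-D event magnitudes (truncation level K)."""
    model = BayesianGaussianMixture(
        n_components=K,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
        random_state=seed,
    )
    model.fit(events.reshape(-1, 1))
    return model

def kl_on_grid(log_p, log_q, grid_step):
    """Discretized KL divergence between two densities evaluated on the same grid."""
    p = np.exp(log_p) * grid_step
    q = np.exp(log_q) * grid_step
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def postdictive_surprise(old_events, new_window, grid):
    """Divergence between the mixture predictive densities before/after a window."""
    before = fit_dp_gmm(old_events)
    after = fit_dp_gmm(np.concatenate([old_events, new_window]))
    step = grid[1, 0] - grid[0, 0]
    return kl_on_grid(before.score_samples(grid), after.score_samples(grid), step)

def transition_matrix(labels, K):
    """Row-normalized empirical transition matrix with a small pseudo-count."""
    T = np.full((K, K), 1e-3)
    for a, b in zip(labels[:-1], labels[1:]):
        T[a, b] += 1.0
    return T / T.sum(axis=1, keepdims=True)

def transitional_surprise(labels_before, labels_after, K):
    """Summed row-wise KL divergence between successive transition matrices."""
    T0, T1 = transition_matrix(labels_before, K), transition_matrix(labels_after, K)
    return float(sum(np.sum(T0[k] * np.log(T0[k] / T1[k])) for k in range(K)))

# Illustrative usage on synthetic event magnitudes (watts).
rng = np.random.default_rng(0)
old = np.concatenate([rng.normal(1500, 20, 200), rng.normal(250, 10, 200)])
window = rng.normal(700, 15, 50)                       # a previously unseen state
grid = np.linspace(0.0, 2000.0, 400).reshape(-1, 1)
print(postdictive_surprise(old, window, grid))

model = fit_dp_gmm(np.concatenate([old, window]))
labels = model.predict(np.concatenate([old, window]).reshape(-1, 1))
print(transitional_surprise(labels[:-50], labels, model.n_components))
```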
This approximation to the dynamics is clearly crude, but even weak convergence of the transition matrix to some stationary form can prove useful.

To summarize, for each sliding window of w events, preceded by N events, we compute the (approximate) postdictive surprise as

\[ S_o = d\big[\, p(x_{N+1} \mid x_{1:N}, \alpha, G_0) \,\|\, p(x_{N+w+1} \mid x_{1:(N+w)}, \alpha^{*}, G_0) \,\big], \tag{12} \]

where d is some divergence metric (usually the Kullback–Leibler divergence), and α* is the posterior update for the concentration parameter if a prior was placed on it. Over the same window of w events, we compute the transitional surprise over the truncated maximum number of states K as

\[ S_t = \sum_{i=1}^{w} \sum_{k=1}^{K} d\big[\, T_k(z_{1:N+i}) \,\|\, T_k(z_{1:N+i+1}) \,\big], \tag{13} \]

where at time t, T_{j,k} = p(z_{t+1} = k | z_t = j). The notation T_k(z_{1:N+i}) denotes the k-th transition row built using the event indicators z for observations 1, 2, ..., N + i.

In order to simplify the concept of a surprise threshold under which data is no longer considered surprising, S_o and S_t are normalized according to their maximum values. Since the initial value of the above divergences can certainly be exceeded as observations are made, the maxima were updated and the preceding surprise values were renormalized to the revised maxima. Since in an online setting it would be unreasonable to wait indefinitely for surprising windows, we suggest a patience parameter, ρ. In the experiments that follow, ρ was held fixed.

To explore the usefulness of a surprise threshold, we made use of NILMTK, an open-source toolkit developed for NILM research [4, 5]. NILMTK includes implementations of several benchmark algorithms, including the following neural network approaches:

(1) Denoising Autoencoders (DAE) treat load disaggregation as a noise reduction problem, in which the aggregate signal is seen as a noisy version of an appliance signal. This special kind of neural network is typically implemented following a symmetrical architecture; it was originally introduced to perform representation learning [8].

(2) Recurrent Neural Networks (RNN) have been successfully applied to a variety of time series problems. For NILM, RNNs were proposed in [24], where the networks were trained to detect signatures of appliances within smart meter data. In this work, the RNN architecture proposed by [29] is used, which incorporates Long Short-Term Memory (LSTM) cells.

(3) Sequence-to-Sequence (Seq2Seq) optimization is a technique using neural networks, introduced in [43]. The basic idea of this approach is to learn the mapping between the aggregate input window and the output window, which is a sequence of power consumption values associated with a certain appliance.

(4) The Sequence-to-Point (Seq2Point) technique builds on neural networks and is closely related to Sequence-to-Sequence optimization. The main difference between these two techniques lies in the output layer of the architecture, where Seq2Point was designed to output only the midpoint of the output window [43].

(5) The WindowGRU architecture, introduced in [29], relies on Gated Recurrent Units (GRU). Compared to architectures based on LSTMs, this architecture is simpler and integrates fewer neurons per layer, and was therefore shown to be more computationally efficient while having a lower memory demand.

To establish a relationship between algorithm performance and the proposed surprise metrics, three houses from the REFIT dataset [35] were selected for study using the above disaggregation methods. The appliances included in these experiments were the dish washer, the washing machine, the refrigerator, the kettle, and the toaster. The Mean-Absolute Error (MAE) was used as a performance metric, defined by
\[ \mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \lvert \hat{x}_t - x_t \rvert, \tag{14} \]

where N is the number of samples and x̂_t is the predicted load at sample t. For each house, the available data was split into a training set and a test set by a 90%/10% split, and 15% of the training set was reserved for validation. The surprise metric was computed on the remaining training data, such that each algorithm was trained and validated on the same data. Each algorithm was trained over 15 epochs using Adam optimization with a batch size of 1024 samples. For a given house, each algorithm had its random seed fixed across surprise-based training set reductions, removing initialization variability from its appliance-averaged performance. Preprocessing of the data, such as normalization, was handled internally by NILMTK.

Figures 2, 3, and 4 show the behaviour of the MAE for the average appliance across the benchmark methods for houses 2, 3, and 5, respectively. The postdictive and transitional surprise were computed using the Jensen–Shannon divergence, defined between two distributions p and q by
Figure 2: Appliance-averaged MAE performance, REFIT House 2

\[ d_{JS}(p \,\|\, q) = \tfrac{1}{2}\, d_{KL}(p \,\|\, m) + \tfrac{1}{2}\, d_{KL}(q \,\|\, m), \tag{15} \]

where m is the point-wise mean of p and q, and d_{KL} is the Kullback–Leibler divergence, given by

\[ d_{KL}(p \,\|\, q) = \sum_{x \in X} p(x) \log\!\left(\frac{p(x)}{q(x)}\right). \tag{16} \]

Given the max-value normalization, the postdictive and transitional surprise values can be interpreted as the fraction of the maximum observed surprise, rather than as the value of the JS divergence itself.

Although no sharp transition exists between an optimally and a sub-optimally sized training set, the behaviour of these algorithms' MAE in the three REFIT houses suggests that performance can indeed stagnate. Additional similar data, especially in houses 2 and 3, seems unlikely to appreciably improve performance. An example surprise threshold is shown in Figures 2, 3, and 4 as a dotted grey line, indicating an approximate point where performance began to plateau. This cutoff was chosen as a joint threshold over postdictive and transitional surprise (Equation 17), requiring

\[ S_o(w{:}w{+}\rho) \le 0.01 \tag{17} \]

jointly with a corresponding bound on S_t(w:w+ρ), where again w is the window size and ρ is the patience parameter. We used this threshold for further study regarding the potential regularizing effect of a surprise-based training cutoff.

In [36], disaggregation performance on unseen homes in the same dataset as well as in different datasets was examined. By their choice of architectures, the authors restricted the number of tunable parameters relative to the existing literature. They also made use of early stopping with an aggressive patience parameter to terminate training. With these complexity and temporal regularization methods, they showed intra- and inter-dataset transferability with minimal performance losses relative to their chosen baseline. Nevertheless, these methods still make use of all available training data. Bayesian surprise metrics provide an attractive alternative or supplement to early stopping, since they instead truncate the training set itself. We examined the MAE performance of each algorithm when trained on the full REFIT house 3 and on the surprise-based subset determined by the joint threshold in Equation 17. Table 1 shows the appliance-averaged MAE performance of each benchmark method when tested on REFIT house 5. All but one method showed improved cross-house transferability with a restricted training set, giving some substance to the claim that truncating the training set may provide regularization against overfitting.
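The joint threshold with patience can be applied as in the sketch below; the default threshold values and the helper name are illustrative assumptions rather than the exact settings used to produce the figures.

```python
def surprise_cutoff(S_o, S_t, thr_o=0.01, thr_t=0.1, patience=5):
    """Return the index of the first window after which both normalized surprise
    sequences stay below their thresholds for `patience` consecutive windows,
    or None if no such point exists. Threshold values here are illustrative."""
    run = 0
    for i, (so, st) in enumerate(zip(S_o, S_t)):
        run = run + 1 if (so <= thr_o and st <= thr_t) else 0
        if run >= patience:
            return i - patience + 1   # training data beyond this window is truncated
    return None

# Example with made-up, already max-normalized surprise traces.
S_o = [1.0, 0.6, 0.2, 0.05, 0.008, 0.006, 0.004, 0.003, 0.002, 0.001]
S_t = [1.0, 0.9, 0.5, 0.30, 0.120, 0.080, 0.060, 0.050, 0.040, 0.030]
print(surprise_cutoff(S_o, S_t))
```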
Figure 3: Appliance-averaged MAE performance, REFIT House 3

Table 1: REFIT cross-house (3 → 5) MAE for full and cutoff training

Benchmark Method    Full Training    Cutoff Training
WindowGRU           37.83            ↓
DAE                 34.78            ↓
RNN                 32.54            ↓
Seq2Seq             –                ↑
Seq2Point           26.85            ↓

Finally, to illustrate the usefulness of including the concept of transitional surprise, we explored the performance of a popular super-state Hidden Markov Model (SSHMM) [33]. Clearly, a Markovian model should suffice to show whether our Markovian notion of transitional surprise is useful. We used house 1 from the Rainforest Automation Energy (RAE) dataset [34], which consists of two blocks: a 9-day block beginning on February 7, 2016, and a 63-day block beginning March 6, 2016. Block 1 was used as the test set, and block 2 (and its surprise-based subset) was used for training the models. The seven appliances used for training were the clothes washer and dryer, refrigerator, dish washer, furnace/hot water unit, and the heat pump.

Figure 5 shows Van Rijsbergen's effectiveness measure (defined simply as 1 − F1-score) as a function of the cutoff point during training. This measure decays slightly faster than the transitional surprise, but significantly after the postdictive surprise has converged. This lends credence to the claim that postdictive surprise is an unreliable metric for terminating training in the general case. The difference in decay rate between transitional surprise and the effectiveness measure is understandable given that the SSHMM by definition encodes the Markovian dynamics between super-states of the user's home. The super-state of the home at a given instant in time can be thought of as the complete description of the home, denoting the operational mode of each appliance in the house. Each instant in time increments the underlying transition distributions between super-states of the home, rather than individual appliance states. This will in general encode the state dynamics more efficiently, since more information is used per time-step. Nevertheless, the basic notion of transitional surprise introduced here provides a useful overestimate of the learning rate of the system dynamics. Notably, the behaviour of the effectiveness measure in this case calls into question the specific values chosen for the joint threshold in Equation 17.
Figure 4: Appliance-averaged MAE performance, REFIT House 5

Further evaluation over all available datasets is needed to further narrow down acceptable threshold values.
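For reference, the effectiveness measure plotted in Figure 5 is simply one minus the F1-score; a minimal computation with scikit-learn, on made-up labels, looks as follows.

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]   # illustrative per-sample appliance states
y_pred = [1, 0, 0, 1, 0, 1]
effectiveness = 1.0 - f1_score(y_true, y_pred)
print(round(effectiveness, 3))   # lower is better
```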
Ultimately, the concept of surprise involves comparison over distributions as they are updated given new observations. The most useful such distributions are unavoidably model-specific. For example, surprise could be defined relative to the latent space in methods such as the DAE, or it could be defined relative to nonlinear auto-regressive dependencies in more complex graphical models. Nevertheless, there are features intrinsic to the data itself that can be used to predict the usefulness of more data in a model-agnostic way. This work explored a postdictive surprise defined over the likelihood of a non-parametric GMM. The mixture model was updated with windows of events defined by first-differences in the block-filtered raw signal exceeding a pre-specified threshold. Furthermore, we explored a transitional surprise defined in a Markovian sense, described by the transitional relationships between latent states as determined by the state assignments of the GMM. This crude approximation to the system dynamics was shown to be useful relative to a strictly postdictive notion of surprise, at least in an HMM-based application. An approximate joint threshold was determined by examining the MAE performance of five benchmark methods supported by NILMTK over three REFIT homes. This threshold was used to explore the potential regularizing effect of a surprise-based training cutoff. This is similar to the use of early stopping, which is a common method to protect against overfitting and aid in the transferability of learned parameters. Relative to training over the full REFIT house 3, training on the surprise-based subset showed improved MAE for all but one method when testing over REFIT house 5. This supports the claim that Bayesian surprise can be a useful metric for predicting overfitting and can potentially improve generalization to unseen houses or datasets.

Further experiments may show that convergence of transitional and postdictive surprise is only weakly indicative of a plateau in model performance, and that models continue to improve when using additional, repetitive data. In that case, it is unlikely that researchers would make use of a surprise-based cutoff in the final training of a particular model. However, during development it may be highly desirable to merely gauge the effectiveness of new methods or network modifications without spending copious amounts of time retraining using all available data. In these cases, truncating the training set using surprise-based methods allows a significant reduction in research costs, both in terms of computational time spent training and research time spent trying to optimize what may prove to be fruitless methods.

Figure 5: Effectiveness measure averaged over 7 appliances as a function of surprise-based training cutoff
Moreover, postdictive surprise using non-parametric mixture models naturally extends to online settings, where deployed NILM algorithms quickly become obsolete without the flexibility to adapt to new appliances or appliance replacements.

Lastly, this work suggests a general rule of diversity over quantity of data. This may help inform the development of future datasets, improving time-to-publication for dataset producers as well as expediting dataset availability for the research community as a whole.

An important extension of the current work is to explore cross-dataset performance. The similarity between the two REFIT houses in Table 1 is likely unrepresentative of the general use-case for NILM. Also left to future work is exploring alternative models for transitional surprise, such as constructing super-states from the observed appliance modes. Additionally, future work may include sub-modelling for each component observed in the non-parametric mixture model. This would permit modelling multiple appliance modes in the same range of power values, where Bayesian surprise could further be computed over the sub-model parameters. This extension would be highly valuable in an online setting to track new appliance mode activations.
REFERENCES
[1] Pierre Baldi. 2002. A computational theory of surprise. In Information, Coding and Mathematics. Springer, 1–25.
[2] Pierre Baldi and Laurent Itti. 2010. Of bits and wows: A Bayesian theory of surprise with applications to attention. Neural Networks 23, 5 (2010), 649–666.
[3] Andrew Barto, Marco Mirolli, and Gianluca Baldassarre. 2013. Novelty or surprise? Frontiers in Psychology.
[4] Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, and Mani Srivastava. 2014. NILMTK: An Open Source Toolkit for Non-intrusive Load Monitoring. In Proceedings of the 5th International Conference on Future Energy Systems (e-Energy).
[5] Nipun Batra, Rithwik Kukunuri, Ayush Pandey, Raktim Malakar, Rajat Kumar, Odysseas Krystalakos, Mingjun Zhong, Paulo Meira, and Oliver Parson. 2019. Towards Reproducible State-of-the-Art Energy Disaggregation. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys).
[6] M. J. Bayarri and J. Morales. 2003. Bayesian measures of surprise for outlier detection. Journal of Statistical Planning and Inference.
[7] David M. Blei and Michael I. Jordan. 2006. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis 1, 1 (2006), 121–143.
[8] Energy and Buildings 158 (2018), 1461–1474.
[9] Michael Correll and Jeffrey Heer. 2016. Surprise! Bayesian weighting for de-biasing thematic maps. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2016), 651–660.
[10] Inês Domingues and Jaime S. Cardoso. 2014. Using Bayesian surprise to detect calcifications in mammogram images. IEEE, 1091–1094.
[11] M. D'Incecco, S. Squartini, and M. Zhong. 2020. Transfer Learning for Non-Intrusive Load Monitoring. IEEE Transactions on Smart Grid 11, 2 (2020), 1419–1429.
[12] Marco Fagiani, Roberto Bonfigli, Emanuele Principi, Stefano Squartini, and Luigi Mandolini. 2019. A non-intrusive load monitoring algorithm based on non-uniform sampling of power data and deep neural networks. Energies 12, 7 (2019), 1371.
[13] Mohammadjavad Faraji, Kerstin Preuschoff, and Wulfram Gerstner. 2018. Balancing New against Old Information: The Role of Puzzlement Surprise in Learning. Neural Computation 30, 1 (Jan. 2018), 34–83. https://doi.org/10.1162/neco_a_01025
[14] Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1, 2 (1973), 209–230. https://doi.org/10.1214/aos/1176342360
[15] Emily Fox. 2009. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Ph.D. Dissertation. Massachusetts Institute of Technology.
[16] Eduardo Gomes and Lucas Pereira. 2020. PB-NILM: Pinball guided deep non-intrusive load monitoring. IEEE Access.
[17] Sustainable Cities and Society (2020), 102411.
[18] Dilan Görür and Carl Edward Rasmussen. 2010. Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution. Journal of Computer Science and Technology 25, 4 (July 2010), 653–664.
[19] A. Harell, S. Makonin, and I. V. Bajić. 2019. Wavenilm: A Causal Neural Network for Power Disaggregation from the Complex Power Signal. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8335–8339.
[20] George W. Hart. 1985. Prototype Nonintrusive Appliance Load Monitor. Technical Report. MIT Energy Laboratory and Electric Power Research Institute.
[21] Erion Hasanbelliu, Kittipat Kampa, Jose C. Principe, and James T. Cobb. 2012. Online learning using a Bayesian surprise metric. In The 2012 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[22] Laurent Itti and Pierre F. Baldi. 2006. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems. 547–554.
[23] R. Jones, A. Rodriguez-Silva, and S. Makonin. 2020. Increasing the Accuracy and Speed of Universal Non-Intrusive Load Monitoring (UNILM) Using a Novel Real-Time Steady-State Block Filter. 1–5.
[24] Jack Kelly and William Knottenbelt. 2015. Neural NILM: Deep Neural Networks Applied to Energy Disaggregation. In Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments (BuildSys).
[25] Jihyun Kim, Thi-Thu-Huong Le, and Howon Kim. 2017. Nonintrusive load monitoring based on advanced deep learning and novel signature. Computational Intelligence and Neuroscience.
[26] Scientific Data 7, 1 (2020), 1–17.
[27] Christoph Klemenjak, Stephen Makonin, and Wilfried Elmenreich. 2020. Towards comparability in non-intrusive load monitoring: on data and performance evaluation. IEEE, 1–5.
[28] Antonio Kolossa, Bruno Kopp, and Tim Fingscheidt. 2015. A Computational Analysis of the Neural Bases of Bayesian Inference. NeuroImage 106 (2015), 222–337. https://doi.org/10.1016/j.neuroimage.2014.11.007
[29] Odysseas Krystalakos, Christoforos Nalmpantis, and Dimitris Vrakas. 2018. Sliding Window Approach for Online Energy Disaggregation Using Artificial Neural Networks. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence (SETN).
[30] Jeong Lee, Yanan Fan, and Scott A. Sisson. 2015. Bayesian threshold selection for extremal models using measures of surprise. Computational Statistics & Data Analysis 85 (2015), 84–99.
[31] J. S. Liu. 2008. Monte Carlo Strategies in Scientific Computing. Springer.
[32] Stephen Makonin, Bradley Ellert, Ivan V. Bajić, and Fred Popowich. 2016. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014. Scientific Data 3, 1 (2016), 160037. https://doi.org/10.1038/sdata.2016.37
[33] S. Makonin, F. Popowich, I. V. Bajić, B. Gill, and L. Bartram. 2016. Exploiting HMM Sparsity to Perform Online Real-Time Nonintrusive Load Monitoring. IEEE Transactions on Smart Grid 7, 6 (2016), 2575–2585.
[34] Stephen Makonin, Z. Jane Wang, and Chris Tumpach. 2018. RAE: The Rainforest Automation Energy dataset for smart grid meter data analysis. Data 3, 1 (2018), 8.
[35] David Murray, Lina Stankovic, and Vladimir Stankovic. 2017. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study. Scientific Data 4, 1 (2017), 160122. https://doi.org/10.1038/sdata.2016.122
[36] D. Murray, L. Stankovic, V. Stankovic, S. Lulic, and S. Sladojevic. 2019. Transferability of Neural Network Approaches for Low-rate Energy Disaggregation. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8330–8334.
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[38] Lucas Pereira and Nuno Nunes. 2018. Performance evaluation in non-intrusive load monitoring: Datasets, metrics, and tools – A review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 6 (2018), e1265.
[39] Ananth Ranganathan and Frank Dellaert. 2009. Bayesian surprise and landmark detection. IEEE, 2017–2023.
[40] Boris Schauerte and Rainer Stiefelhagen. 2013. "Wow!" Bayesian surprise for salient acoustic event detection. IEEE, 6402–6406.
[41] Jayaram Sethuraman. 1994. A Constructive Definition of the Dirichlet Prior. Statistica Sinica.
[42] IEEE Transactions on Smart Grid 9, 5 (2018), 4669–4678.
[43] Chaoyun Zhang, Mingjun Zhong, Zongzuo Wang, Nigel Goddard, and Charles Sutton. 2018. Sequence-to-Point Learning with Neural Networks for Non-Intrusive Load Monitoring. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).
[44] Ahmed Zoha, Alexander Gluhak, Muhammad Imran, and Sutharshan Rajasegarar. 2012. Non-Intrusive Load Monitoring Approaches for Disaggregated Energy Sensing: A Survey. Sensors 12 (2012), 16838–16866. https://doi.org/10.3390/s121216838