Investigating the Performance Gap between Testing on Real and Denoised Aggregates in Non-Intrusive Load Monitoring

Abstract

Prudent and meaningful performance evaluation of algorithms is essential for the progression of any research field. In the field of Non-Intrusive Load Monitoring (NILM), performance evaluation can be conducted on real-world aggregate signals provided by smart energy meters, or on artificial superpositions of individual load signals (i.e., denoised aggregates). It has long been suspected that testing on these denoised aggregates provides better evaluation results, mainly due to the fact that the signal is less complex. Complexity in real-world aggregate signals increases with the number of unknown/untracked loads. Although this is a known performance reporting problem, an investigation into the actual performance gap between real and denoised testing is still pending. In this paper, we examine the performance gap between testing on real-world and denoised aggregates with the aim of bringing clarity to this matter. Starting with an assessment of noise levels in datasets, we find significant differences in test cases. We give broad insights into our evaluation setup comprising three load disaggregation algorithms, two of them relying on neural network architectures. The results presented in this paper, based on studies covering three scenarios with ascending noise levels, show a strong tendency towards load disaggregation algorithms providing significantly better performance on denoised aggregate signals. A closer look into the outcome of our studies reveals that all appliance types could be subject to this phenomenon. We conclude the paper by discussing aspects that could be causing these considerable gaps between real and denoised testing in NILM.


Christoph Klemenjak

Institute of Networked and Embedded Systems, University of Klagenfurt, Austria

Stephen Makonin

School of Engineering Science, Simon Fraser University, Canada

Wilfried Elmenreich

Institute of Networked and Embedded Systems, University of Klagenfurt, Austria

Preprint, October 6, 2020

Introduction

Effective energy management in smart grids requires a fair amount of monitoring and controlling of electrical load to achieve optimal energy utilization and, ultimately, reduce energy consumption [1]. With regard to individual buildings, load monitoring can be implemented in an intrusive or non-intrusive fashion. The latter is often referred to as Non-Intrusive Load Monitoring (NILM) or load disaggregation. NILM, dating back to the seminal work presented in [2], comprises a set of techniques to identify active electrical appliance signals from the aggregate load signal reported by a smart meter [3].

Performance evaluation of NILM algorithms can be carried out in a noised or denoised manner, where the difference lies in the aggregate signal considered as input. Whereas noised scenarios employ signals (i.e., time series) obtained from smart meters, denoised testing scenarios consider superpositions of individual appliance signals (i.e., denoised aggregates). Figure 1 and Figure 2 illustrate such real and denoised signals for two households found in NILM datasets.

Figure 1: Real and denoised aggregate in the case of UK-DALE house 5

Figure 2: Real and denoised aggregate in the case of REFIT house 2

While a large proportion of contributions proposed for NILM is being evaluated following noised testing scenarios, exceptions to this unwritten rule can be observed [4]. The problem with this matter lies in the complexity of the test setup, as denoised aggregates are suspected to pose simpler disaggregation problems [5]. Consequently, our hypothesis claims that the same disaggregation algorithm applied to the denoised signal version of a real-world aggregate signal results in considerably better performance, thus communicating a distorted picture of the capabilities of the presented algorithm.

This paper presents a study with a focus on the difference between denoised and real-world signal testing scenarios in the context of performance evaluation in NILM. On the basis of test runs considering data of 15 appliances extracted from three datasets with considerably different noise levels, we strive towards bringing clarity to this widely disregarded question. We incorporate one basic as well as two load disaggregation approaches based on neural networks to obtain a broad understanding of whether or not noise levels of aggregate power signals impact energy estimation performance. Finally, we discuss how the disaggregation performance is affected by signal noise levels with regard to different appliance types.

Related Work

Despite the possibly far-reaching implications of this aspect for NILM, relatively little is understood about the actual performance gap between real and denoised testing. In [5], the hypothesis of denoised testing resulting in better performance was expressed first. Further, the authors introduce a measure to assess the noise level of aggregate signals. This measure has found application in a limited number of studies, in which the noise level was reported alongside the performance of load disaggregation algorithms on real-world aggregates [6], [7]. However, no comparison to the denoised testing case has been conducted. In [8], the noise levels of several NILM datasets were determined. The authors report basic parameters of several NILM datasets and find that noise levels in real aggregate signals vary significantly among observed datasets.

Few attempts have been made to evaluate NILM algorithms on both real and denoised aggregates, such as presented for the AFAMAP approach in [9]. In subsequent work [10], an improved version of denoising autoencoders for NILM has been proposed by means of comparison studies to the state of the art at that time. Although the authors have not investigated the performance gap between real and denoised testing, a tendency can be derived for this particular case in both contributions, confirming the motivation for the studies presented in this paper.

Assessing Signal Noise Levels

NILM has been approached in a variety of ways that can be categorized into event detection and energy estimation approaches [11]. In this investigation, we put an emphasis on the energy estimation viewpoint, as it can be seen as the precursor of the event detection stage in the disaggregation process. We define NILM as the problem of generating estimates $[\hat{x}_t^{(1)}, \ldots, \hat{x}_t^{(M)}]$ of the actual power consumption $[x_t^{(1)}, \ldots, x_t^{(M)}]$ of $M$ electrical appliances at time $t$ given only the aggregated power consumption $y_t$, where the aggregate power signal $y_t$ consists of

$$y_t = \sum_{i=1}^{M} x_t^{(i)} + \eta_t \tag{1}$$

that is, $M$ appliance-level signals $x_t^{(i)}$ and a residual term $\eta_t$. The residual term comprises (measurement) noise as well as the sum of unmetered electrical load [8]. To quantify the share of unmetered load in an aggregate signal, the noise-aggregate ratio NAR, defined as

$$\mathrm{NAR} = \frac{\sum_{t=1}^{T} \eta_t}{\sum_{t=1}^{T} y_t} = \frac{\sum_{t=1}^{T} \left| y_t - \sum_{i=1}^{M} x_t^{(i)} \right|}{\sum_{t=1}^{T} y_t} \tag{2}$$

was introduced in [5]. This ratio can be computed for any type of power signal, provided that readings of the aggregate and individual appliances are available. A NAR of 0.15 reports that 15% of the observed power signal can be attributed to the residual term. Hence, the ratio indicates to what degree information on the aggregate's components is available.

To get an impression of NAR levels to be expected in real-world settings, we compute this ratio for households embedded in the energy datasets AMPds2 [12], COMBED [13], ECO [14], iAWE [15], REFIT [16], and UK-DALE [17]. These datasets were selected because of their compatibility with NILMTK, a toolkit that enables reproducible NILM experiments [18, 19]. We excluded from consideration the dataset BLUED [20] due to the lack of sub-metered power data, and Tracebase [21] and GREEND [22] due to the lack of household aggregate power data.

Table 1: Noise levels in NILM datasets

| Dataset | House | Duration [days] | Meters | Power types (aggregate) | Power types (submeter) | NAR [%] |
|---|---|---|---|---|---|---|
| AMPds2 | 1 | 730 | 20 | P, Q, S | P, Q, S | 17.8 |
| COMBED | 1 | 28 | 13 | P | P | 34.4 |
| ECO | 1 | 236 | 7 | P, Q | P | 67.0 |
| ECO | 2 | 245 | 12 | P, Q | P | 5.9 |
| ECO | 3 | 57 | 7 | P, Q | P | 97.0 |
| ECO | 4 | 211 | 8 | P, Q | P | 70.5 |
| ECO | 5 | 219 | 8 | P, Q | P | 84.7 |
| ECO | 6 | 124 | 7 | P, Q | P | 66.0 |
| iAWE | 1 | 60 | 10 | P, Q, S | P, Q, S | 50.0 |
| REFIT | 1 | 639 | 9 | P | P | 64.5 |
| REFIT | 2 | 617 | 9 | P | P | 65.1 |
| REFIT | 3 | 614 | 9 | P | P | 55.5 |
| REFIT | 4 | 634 | 9 | P | P | 52.5 |
| REFIT | 5 | 648 | 9 | P | P | 52.3 |
| UK-DALE | 1 | 658 | 52 | P, S | P, S | 33.3 |
| UK-DALE | 2 | 110 | 18 | P, S | P | 41.2 |
| UK-DALE | 3 | 35 | 4 | S | P | - |
| UK-DALE | 4 | 114 | 5 | S | P | - |
| UK-DALE | 5 | 107 | 24 | P, S | P | 27.5 |
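The noise-aggregate ratio of Equation 2 can be sketched in a few lines. The following is a minimal illustration, not the NILMTK implementation: the function name `noise_aggregate_ratio` and the toy readings are assumptions made here, and the sketch presumes that aggregate and appliance readings share the same AC power type and the same sampling instants.

```python
import numpy as np


def noise_aggregate_ratio(aggregate, appliances):
    """Noise-aggregate ratio following Equation 2.

    aggregate:  array of shape (T,), readings of the household aggregate.
    appliances: array of shape (M, T), readings of the M submetered appliances.
    Both inputs are assumed to use the same AC power type.
    """
    aggregate = np.asarray(aggregate, dtype=float)
    appliances = np.asarray(appliances, dtype=float)
    if appliances.ndim != 2 or appliances.shape[1] != aggregate.shape[0]:
        raise ValueError("appliances must have shape (M, T) matching the aggregate")
    # Residual term: aggregate minus the superposition of all submetered signals.
    residual = aggregate - appliances.sum(axis=0)
    return np.abs(residual).sum() / aggregate.sum()


# Toy readings in watts (hypothetical values, for illustration only).
y = np.array([100.0, 200.0, 150.0, 50.0])
x = np.array([
    [60.0, 120.0, 100.0, 0.0],
    [20.0, 50.0, 20.0, 40.0],
])
nar = noise_aggregate_ratio(y, x)
print(f"NAR = {nar:.2f}")
```

In this toy case the residual sums to 90 W against an aggregate sum of 500 W, so 18% of the aggregate would be attributed to unmetered load and noise.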
We summarize the derived values in Table 1 in conjunction with further statistics on the measurement campaign, such as duration or number of submeters.

Generally speaking, measurement campaigns strive to record the energy consumption and other parameters of interest in households or industrial facilities over a certain time period. Though sharing this common aim, considerable differences can be observed in the way past campaigns have been conducted. As Table 1 shows, durations range from a couple of days to several years of data, which impacts the number of appliance activations and events found in the final dataset. Further, we identify considerable variations with regard to AC power types as well as the number of submeters installed during campaigns. It should be pointed out that there seems to be a lack of consistency in the sense that measurement setups differ not only between datasets but also within some of the campaigns considered by our comparison (e.g., UK-DALE).

As concerns the noise-aggregate ratio (NAR), we observe considerable variations across datasets and households. Interestingly, the NAR ranges between a few percent, as is the case for household 2 in the ECO dataset, and an excessive 84.7% in household 5 of the same dataset. Further, there are indications that the number of submeters used in the course of dataset collection can, but does not necessarily, have an impact on the noise level of the household's aggregate signal, since it is decisive what kind of appliances are left out during a measurement campaign (low-power appliances vs. big consumers). As concerns house 1 to house 5 in REFIT, we consistently observe moderate to high noise levels, which may be the result of the low number of submeters incorporated in the measurement campaign.
On the other hand, it should be noted that the measurement campaign conducted to obtain REFIT shows remarkable consistency in the sense that the exact same number of submeters has been applied to every single household in the study and, more importantly, the same AC power type has been considered at aggregate and appliance level at every site. In contrast to that, Table 1 reveals that in the case of house 3 and 4 in UK-DALE, apparent power was recorded on aggregate level, whereas active power was considered on appliance level only. As our definition of NAR demands the same AC power type on aggregate and submeter level, no such ratio could be computed in those cases. The same applies to all sites of the REDD [23] dataset, according to the NILMTK dataset converter. For this reason, REDD has not been considered in this study.

Evaluation Setup

To gain a comprehensive understanding of the impact of noise on the disaggregation performance of algorithms, we selected three households with ascending levels of residual noise: household 2 of the ECO dataset [14] with a NAR of 5.9%, household 5 of the UK-DALE dataset [17] with a NAR of 27.5%, and household 2 of the REFIT dataset [16] with a NAR of 65.1%. This way, we incorporate one instance each for disaggregation problems with low, moderate, and high noise levels. For every household considered, we selected five electrical appliances and made our best effort to consider a wide range of appliance types. We extracted 244 days for ECO, 82 days for UK-DALE, and 275 days for REFIT while applying a sampling interval of 10 s. The amount of data used per household was governed by availability in the case of ECO and UK-DALE, as can be learned from Table 1. As concerns REFIT, the considered time window was March 1st to December 1st of 2014. We split datasets into training set (70%), validation set (15%), and test set (15%). This splitting was applied to all three households. We considered the smart meter signal as present in datasets and obtained the denoised version of the aggregate by superposition of the individual appliance signals following:

$$y_t = \sum_{i=1}^{M} x_t^{(i)} \tag{3}$$

For experimental evaluations, we utilize the latest version of NILMTK. The toolkit integrates several basic benchmark algorithms as well as load disaggregation algorithms based on Deep Neural Networks (DNN). In the course of experiments, we consider the traditional CO approach and two approaches based on DNNs:

• The Combinatorial Optimization (CO) algorithm, introduced in [2], has been used repeatedly in literature to serve as baseline [18]. The CO algorithm estimates the power demand of appliances and their operational mode. Similar to the Knapsack problem [24], estimation is performed by finding the combination of concurrently active appliances that minimizes the difference between aggregate signal and the sum of power demands.

• Recurrent Neural Networks (RNN) are a subclass of neural networks that have been developed to process time series and related sequential data [25]. RNNs were first proposed for NILM in [26]; we employ the implementation presented in [27], which incorporates Long Short-Term Memory (LSTM) cells. Provided a sequence of aggregate readings as input, the RNN estimates the power consumption of the electrical appliance it was trained to detect for each newly observed input sample.

• The Sequence-to-point (S2P) technique, relying on convolutional neural networks, follows a sliding window approach in which the network predicts the midpoint element of an output time window based on an input sequence consisting of aggregate power readings [28]. The basic idea behind this method is to implement a non-linear regression between input window and midpoint element, which has been applied successfully for speech and image processing [29]. In a recent benchmarking study of NILM approaches, S2P was observed to be amongst the most advanced disaggregation techniques at that time [30].

While the CO approach does not need to be parametrized, we set the number of training epochs to 25 during training of neural networks. Further, we employ an input sequence length of 49 for LSTM, inspired by [27], and 99 for S2P, as suggested in [18].

In this study, we utilize two error metrics to assess the performance of load disaggregation algorithms. The first is a well-known, common metric used in signal processing, the Mean Absolute Error (MAE), defined as:

$$\mathrm{MAE}^{(i)} = \frac{1}{T} \cdot \sum_{t=1}^{T} \left| \hat{x}_t^{(i)} - x_t^{(i)} \right| \tag{4}$$

where $x_t$ is the actual power consumption, $\hat{x}_t$ the estimated power consumption, and $T$ represents the number of samples. The best possible value is zero and, as we estimate the power consumption of appliances, it is measured in Watts. As second metric, we incorporate a metric defined by NILM scholars in [31], the Normalized Disaggregation Error (NDE), defined as:

$$\mathrm{NDE}^{(i)} = \sqrt{\frac{\sum_{t=1}^{T} \left( \hat{x}_t^{(i)} - x_t^{(i)} \right)^2}{\sum_{t=1}^{T} \left( x_t^{(i)} \right)^2}} \tag{5}$$

In contrast to the MAE, the NDE represents a dimensionless metric and, more importantly, the NDE belongs to the class of normalized metrics. This allows for fair comparisons of disaggregation performance between appliance types [8].

Footnote: NILMTK dataset converter metadata for REDD, https://github.com/nilmtk/nilmtk/tree/master/nilmtk/dataset_converters/redd/metadata

Results

Table 2: Mean Absolute Error (MAE) in Watts for real and denoised testing

| Household | Appliance | CO Real | CO Den | LSTM Real | LSTM Den | S2P Real | S2P Den |
|---|---|---|---|---|---|---|---|
| ECO (2), NAR = 5.9% | audio system | 37.7 | 32.3 | 6.4 | 5.6 | 6.8 | 5.9 |
| ECO (2), NAR = 5.9% | dishwasher | 43.1 | 40.2 | 7.5 | 3.9 | 5.8 | 3.6 |
| ECO (2), NAR = 5.9% | fridge | 41.8 | 49.8 | 9.5 | 11.7 | 7.5 | 8.5 |
| ECO (2), NAR = 5.9% | kettle | 17.3 | 42.7 | 4.2 | 2.5 | 3.2 | 1.3 |
| ECO (2), NAR = 5.9% | lamp | 62.2 | 47.1 | 28.9 | 16.4 | 28.4 | 16.5 |
| UK-DALE (5), NAR = 27.5% | dishwasher | 87.6 | 95.1 | 12.2 | 3.6 | 8.3 | 4.1 |
| UK-DALE (5), NAR = 27.5% | electric oven | 50.7 | 39.2 | 20.7 | 9.0 | 17.4 | 9.1 |
| UK-DALE (5), NAR = 27.5% | electric stove | 131.3 | 40.1 | 7.3 | 6.1 | 6.4 | 4.4 |
| UK-DALE (5), NAR = 27.5% | fridge | 161.4 | 141.4 | 25.7 | 21.1 | 20.6 | 16.4 |
| UK-DALE (5), NAR = 27.5% | washing machine | 74.8 | 51.4 | 28.6 | 14.8 | 17.3 | 13.7 |
| REFIT (2), NAR = 65.1% | dishwasher | 96.0 | 41.3 | 31.4 | 9.0 | 25.3 | 8.5 |
| REFIT (2), NAR = 65.1% | fridge | 57.8 | 22.8 | 23.0 | 10.4 | 23.9 | 12.4 |
| REFIT (2), NAR = 65.1% | kettle | 79.1 | 9.2 | 9.8 | 3.3 | 9.7 | 3.2 |
| REFIT (2), NAR = 65.1% | microwave | 70.8 | 46.6 | 2.7 | 1.5 | 2.8 | 1.1 |
| REFIT (2), NAR = 65.1% | washing machine | 101.5 | 41.0 | 21.6 | 12.6 | 24.1 | 11.3 |

Table 3: Normalised Disaggregation Error (NDE) for real and denoised testing

| Household | Appliance | CO Real | CO Den | LSTM Real | LSTM Den | S2P Real | S2P Den |
|---|---|---|---|---|---|---|---|
| ECO (2), NAR = 5.9% | audio system | 1.83 | 1.74 | 0.48 | 0.42 | 0.47 | 0.44 |
| ECO (2), NAR = 5.9% | dishwasher | 0.96 | 0.78 | 0.44 | 0.26 | 0.34 | 0.22 |
| ECO (2), NAR = 5.9% | fridge | 1.8 | 2.07 | 0.45 | 0.5 | 0.38 | 0.36 |
| ECO (2), NAR = 5.9% | kettle | 1.3 | 1.66 | 0.51 | 0.34 | 0.48 | 0.22 |
| ECO (2), NAR = 5.9% | lamp | 1.27 | 1.18 | 0.74 | 0.5 | 0.74 | 0.52 |
| UK-DALE (5), NAR = 27.5% | dishwasher | 1.44 | 1.55 | 0.76 | 0.39 | 0.61 | 0.34 |
| UK-DALE (5), NAR = 27.5% | electric oven | 1.51 | 0.98 | 0.66 | 0.38 | 0.53 | 0.33 |
| UK-DALE (5), NAR = 27.5% | electric stove | 2.9 | 1.82 | 0.66 | 0.58 | 0.6 | 0.42 |
| UK-DALE (5), NAR = 27.5% | fridge | 3.16 | 3.24 | 0.64 | 0.6 | 0.55 | 0.46 |
| UK-DALE (5), NAR = 27.5% | washing machine | 1.47 | 1.16 | 0.63 | 0.35 | 0.42 | 0.32 |
| REFIT (2), NAR = 65.1% | dishwasher | 1.06 | 0.74 | 0.56 | 0.19 | 0.48 | 0.18 |
| REFIT (2), NAR = 65.1% | fridge | 1.4 | 1.81 | 0.7 | 0.48 | 0.68 | 0.48 |
| REFIT (2), NAR = 65.1% | kettle | 1.33 | 0.43 | 0.48 | 0.2 | 0.46 | 0.2 |
| REFIT (2), NAR = 65.1% | microwave | 4.54 | 3.84 | 0.85 | 0.45 | 0.84 | 0.36 |
| REFIT (2), NAR = 65.1% | washing machine | 2.04 | 1.42 | 0.82 | 0.5 | 0.75 | 0.45 |
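The entries of Table 2 and Table 3 are values of the two metrics from Equations 4 and 5. As a minimal sketch of how these metrics may be computed for a single appliance, consider the following; the function names and the toy signals are assumptions made for illustration and do not reproduce the NILMTK code used in the study.

```python
import numpy as np


def mean_absolute_error(estimate, truth):
    """MAE following Equation 4, in the unit of the inputs (here: watts)."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.mean(np.abs(estimate - truth))


def normalized_disaggregation_error(estimate, truth):
    """NDE following Equation 5: root of squared error over squared true signal."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.sqrt(np.sum((estimate - truth) ** 2) / np.sum(truth ** 2))


# Toy appliance signal and estimate in watts (hypothetical values).
truth = np.array([0.0, 100.0, 100.0, 0.0])
estimate = np.array([10.0, 90.0, 110.0, 0.0])

mae = mean_absolute_error(estimate, truth)
nde = normalized_disaggregation_error(estimate, truth)
print(f"MAE = {mae:.1f} W, NDE = {nde:.3f}")
```

Because the NDE divides by the energy of the true signal, it stays comparable between a low-power and a high-power appliance, whereas the MAE scales with the power level of the appliance.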

We summarize the outcome of our investigations in Table 2 for the MAE and Table 3 with regard to the NDE. For several appliances per household, we compare the disaggregation performance of CO, LSTM, and S2P when applied to the real-world aggregate signal, denoted as Real, and the denoised aggregate signal, denoted as Den.

In virtually all cases, we observe a strong tendency towards disaggregation algorithms providing better performance on denoised aggregate signals. In the context of error metrics such as MAE and NDE, this means that the error observed on the real aggregate is larger than the error on the denoised aggregate. This holds true for almost all households and appliances considered, though some exceptions were identified: we spot a few cases in Table 2, namely the fridge and kettle in ECO as well as the dishwasher in UK-DALE, showing the opposite trend for the CO algorithm. The same applies to all fridges with regard to the NDE metric, as Table 3 reports. It should be pointed out that in those cases, the performance of CO on the real-world and denoised aggregate signal shows a considerable gap when compared to LSTM and S2P. Therefore, and because CO is a trivial benchmarking algorithm, we claim that these cases can be neglected.

As concerns LSTM and S2P, we identify a single contradictory observation, namely in the case of the fridge in ECO's household 2. In this particular case, we observed that testing on the real-world aggregate signal results in marginally better performance. One explanation for this could be the extremely low NAR in this scenario, 5.9%, and the fridge belonging to the category of appliances with a recurrent pattern [30].

Having identified a clear tendency towards CO, LSTM, and S2P providing significantly better performance in the denoised signal case, i.e., lower MAE and NDE, we turn our attention to the open question of whether or not there exists a link between noise level and the magnitude of the performance gap between Real and Den. To investigate this further, we define the performance gap to be the distance between the error on the real aggregate signal and the error observed when testing on the denoised aggregate signal:

$$\Delta\mathrm{MAE} = \mathrm{MAE}_{\mathrm{real}} - \mathrm{MAE}_{\mathrm{denoised}} \tag{6}$$

$$\Delta\mathrm{NDE} = \mathrm{NDE}_{\mathrm{real}} - \mathrm{NDE}_{\mathrm{denoised}} \tag{7}$$

We derive ΔMAE for the cases presented in Table 2 and illustrate an excerpt of found gaps in Figure 3 for ECO, Figure 4 for UK-DALE, and Figure 5 for REFIT, where the focus of this discussion lies on the two approaches based on neural networks.

Figure 3: Performance gap with regard to MAE for ECO house 2

We observe clear gaps for both NILM approaches based on neural nets, LSTM and S2P. The illustrations show that neither approach seems to be resilient to noise. This is particularly interesting as approaches relying on LSTM cells as well as sequence-to-sequence learning have received increased interest lately [30, 32, 33, 34, 35]. Further, we identify higher performance gaps in test cases on REFIT's house 2 compared to house 5 of UK-DALE in this study. This is particularly apparent when comparing the performance gap for the dishwasher across households, where we measure a ΔMAE many times higher in the case of REFIT. Also, we observe performance gaps twice as high for the fridge on REFIT compared to UK-DALE. The only exception to this trend is the case of LSTM for washing machines, where the performance gap of the LSTM network is smaller on REFIT than on UK-DALE.

Nevertheless, it should be stressed that comparisons based on non-normalized metrics can be, but do not have to be, misleading in some cases, since two appliances of the same kind (e.g., two dishwashers) may differ significantly in terms of power consumption.
Furthermore, metrics are designed to measure specific aspects of algorithms and hence, considering several metrics during performance evaluation results in a broader understanding of the capabilities of algorithms. For these reasons, we also derived performance gaps with regard to NDE, ΔNDE, for the test cases presented in Table 3 and illustrate derived gaps in Figure 6 for UK-DALE and Figure 7 for REFIT.

In the case of fridges, we observe substantially lower performance gaps on UK-DALE for both networks. We suspect that this is a result of the comparably high amount of noise in REFIT house 2: disaggregating the real-world aggregate signal represents a bigger challenge than in the case of the denoised counterpart, especially when estimating the power consumption of low-power household appliances such as fridges.

Figure 4: Performance gap with regard to MAE for UK-DALE house 5

Figure 5: Performance gap with regard to MAE for REFIT house 2

Figure 6: Performance gap with regard to NDE for UK-DALE house 5

Figure 7: Performance gap with regard to NDE for REFIT house 2

Interestingly, we observe considerable performance gaps not only when estimating the power consumption of low-power appliances but also for appliances with moderate or high power consumption such as dishwashers and washing machines, as can be learned from Figure 8 and Figure 9. In both cases, UK-DALE and REFIT, we measure the highest ΔNDE in the case of the dishwasher. A comparison of performance gaps for dishwashers in Figure 8 reveals that while we measure similar performance gaps in UK-DALE and REFIT, the performance gap in the case of ECO is significantly smaller. We hypothesize this is the result of the marginal noise level measured in house 2 of ECO. More importantly, we observe that even in cases of marginal noise levels, an apparent difference in terms of disaggregation error exists between real and denoised testing in this example.

A recent benchmarking study involving eight disaggregation algorithms found that S2P outperformed competing neural network architectures and concluded that S2P ranks amongst the most promising NILM approaches [30]. As concerns performance of NILM algorithms interpreted as disaggregation error between estimated power consumption and true power consumption of appliances, we find that S2P outperforms LSTM in 11 of 15 cases for the MAE metric and in 14 of 15 cases when the NDE metric is considered. Furthermore, in the vast majority of test runs, the S2P approach shows lower performance gaps than the network composed of LSTM cells in the sense of ΔMAE and ΔNDE.

Figure 8: Performance gap with regard to NDE for dishwashers

Figure 9: Performance gap with regard to NDE for washing machines
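The gap values discussed in this section can be recomputed directly from Table 2. The following sketch is an illustration and not part of the original evaluation code: it copies the LSTM and S2P columns of Table 2, applies Equation 6 per appliance, and counts how often S2P attains a lower MAE than LSTM on the real aggregate. The names `TABLE_2_MAE` and `performance_gap` are chosen here for illustration only.

```python
# MAE in watts copied from Table 2, per (household, appliance):
# (LSTM real, LSTM denoised, S2P real, S2P denoised).
TABLE_2_MAE = {
    ("ECO", "audio system"): (6.4, 5.6, 6.8, 5.9),
    ("ECO", "dishwasher"): (7.5, 3.9, 5.8, 3.6),
    ("ECO", "fridge"): (9.5, 11.7, 7.5, 8.5),
    ("ECO", "kettle"): (4.2, 2.5, 3.2, 1.3),
    ("ECO", "lamp"): (28.9, 16.4, 28.4, 16.5),
    ("UK-DALE", "dishwasher"): (12.2, 3.6, 8.3, 4.1),
    ("UK-DALE", "electric oven"): (20.7, 9.0, 17.4, 9.1),
    ("UK-DALE", "electric stove"): (7.3, 6.1, 6.4, 4.4),
    ("UK-DALE", "fridge"): (25.7, 21.1, 20.6, 16.4),
    ("UK-DALE", "washing machine"): (28.6, 14.8, 17.3, 13.7),
    ("REFIT", "dishwasher"): (31.4, 9.0, 25.3, 8.5),
    ("REFIT", "fridge"): (23.0, 10.4, 23.9, 12.4),
    ("REFIT", "kettle"): (9.8, 3.3, 9.7, 3.2),
    ("REFIT", "microwave"): (2.7, 1.5, 2.8, 1.1),
    ("REFIT", "washing machine"): (21.6, 12.6, 24.1, 11.3),
}


def performance_gap(real_error, denoised_error):
    """Performance gap as in Equations 6 and 7: error on real minus error on denoised."""
    return real_error - denoised_error


gaps = {}
s2p_wins_real = 0
for key, (lstm_real, lstm_den, s2p_real, s2p_den) in TABLE_2_MAE.items():
    # Round to one decimal, matching the precision of Table 2.
    gaps[key] = {
        "LSTM": round(performance_gap(lstm_real, lstm_den), 1),
        "S2P": round(performance_gap(s2p_real, s2p_den), 1),
    }
    if s2p_real < lstm_real:
        s2p_wins_real += 1

for (household, appliance), gap in gaps.items():
    print(f"{household:8s} {appliance:16s} LSTM {gap['LSTM']:6.1f} W   S2P {gap['S2P']:6.1f} W")
print(f"S2P has the lower MAE on the real aggregate in {s2p_wins_real} of {len(TABLE_2_MAE)} cases")
```

With these values, the count agrees with the 11 of 15 cases stated above, and the LSTM gap for the dishwasher comes out at 8.6 W on UK-DALE against 22.4 W on REFIT. A negative gap, as for the fridge in ECO, marks a case where testing on the real aggregate gave the lower error.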


Discussion

Figure 10: An excerpt of estimates provided by S2P for the fridge in REFIT house 2 when applied to the real aggregate.

Figure 11: An excerpt of estimates provided by S2P for the fridge in REFIT house 2 when applied to the denoised aggregate.

Insights obtained from testing on three households with considerably different NAR levels reveal that in the majority of test runs, testing on the denoised aggregate signal leads to substantially lower estimation errors and therefore, higher estimation accuracy. A few cases showing the contrary trend were observed but can be reasonably explained. While this apparent performance gap can be attributed to a variety of aspects, we suspect two of them to have a decisive impact on this matter:

First, denoised aggregates are obtained by superposition of individual appliance signals. As such, they contain fewer appliance activations and consumption patterns than aggregates obtained from smart meters. Particularly when estimating the power consumption of low-power appliances, such activations have the potential to hinder load disaggregation algorithms from providing accurate power consumption estimates. Such cases were repeatedly observed during our studies on REFIT, where a NAR of 65.1% was measured. As depicted in Figure 10 and Figure 11, we detected several cases where concurrent operation of appliances with moderate or high power consumption (e.g., dishwasher, electric stove, or washing machine) resulted in significant deviations when estimating the power consumption of the fridge.
Not only did we observe such cases for the basic benchmarking algorithm CO but also for the advanced NILM approaches RNN and S2P, which leads to the presumption that, despite remarkable advances in the state of the art, at least a part of those algorithms may still be prone to noise levels in aggregate signals.

Second, we observe a substantially higher number of false positive estimates in predictions based on real-world aggregate signals than in estimates generated from denoised aggregate signals. False positives in this context mean that the NILM algorithms predicted the appliance to consume energy at times when this was not the case. Such false positives impact the outcome of performance evaluations two-fold, as they increase the disaggregation error and decrease the estimation accuracy of NILM algorithms. We observed repeatedly that in the real-world case, the number of false-positive estimates is considerably higher than in the denoised case. We presume that those false positives are the result of algorithms confusing appliances with similar power consumption levels.

Based on the insights gained in this study, we can, however, not confirm a clear link between noise level, measured in NAR, and the magnitude of the performance gap between testing on real and denoised aggregates. We suspect this is due to the fact that every load disaggregation problem poses individual challenges to load disaggregation algorithms, making a comparison between moderate and high noise levels cumbersome. Though such a positive correlation between noise level and the magnitude of the performance gap could not be confirmed by our evaluation, we demonstrated that it has to be expected that testing on denoised aggregates results in lower disaggregation errors in the majority of test runs. Yet, we would like to stress the need for further investigation into the complexity of load disaggregation problems.
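The false-positive effect described above can be reproduced on a toy example. The sketch below is a deliberately simplified brute-force variant of combinatorial optimization written for this illustration; it is not the NILMTK implementation used in the study. The rated powers in `RATED_POWER`, the unmetered 90 W load, and the 10 W on-threshold are hypothetical values chosen so that the unmetered load resembles the fridge.

```python
from itertools import product

import numpy as np

# Hypothetical two-state appliances: name -> rated power in watts.
RATED_POWER = {"fridge": 100.0, "kettle": 2000.0}


def combinatorial_optimization(aggregate, rated_power):
    """Per sample, pick the on/off combination whose summed power is closest to the aggregate."""
    names = list(rated_power)
    powers = np.array([rated_power[name] for name in names])
    # All 2^M on/off combinations, shape (2^M, M).
    states = np.array(list(product([0, 1], repeat=len(names))), dtype=float)
    combo_power = states @ powers
    aggregate = np.asarray(aggregate, dtype=float)
    best = np.argmin(np.abs(aggregate[:, None] - combo_power[None, :]), axis=1)
    estimates = states[best] * powers  # shape (T, M)
    return {name: estimates[:, j] for j, name in enumerate(names)}


def count_false_positives(estimate, truth, on_threshold=10.0):
    """Count samples where the appliance is estimated as on while it was actually off."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return int(np.sum((estimate > on_threshold) & (truth <= on_threshold)))


# Ground-truth appliance signals in watts.
truth = {
    "fridge": np.array([100.0, 100.0, 0.0, 0.0]),
    "kettle": np.array([0.0, 2000.0, 0.0, 2000.0]),
}
denoised = truth["fridge"] + truth["kettle"]   # superposition as in Equation 3
unmetered = np.array([0.0, 0.0, 90.0, 0.0])    # residual term: an untracked load
real = denoised + unmetered                    # aggregate as in Equation 1

est_real = combinatorial_optimization(real, RATED_POWER)
est_den = combinatorial_optimization(denoised, RATED_POWER)
fp_real = count_false_positives(est_real["fridge"], truth["fridge"])
fp_den = count_false_positives(est_den["fridge"], truth["fridge"])
print(f"fridge false positives: real = {fp_real}, denoised = {fp_den}")
```

On the denoised aggregate every sample is explained exactly by a combination of tracked appliances, so no false positive occurs. On the real aggregate, the untracked 90 W load is closest to the fridge's rated power and is attributed to it, which yields one false positive in this toy case.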

Conclusions

Motivated by the use of both real and denoised aggregates in the evaluation of NILM algorithms in related work, we have investigated the performance gap observed between artificial sums of individual signals and signals obtained from real power meters. First, we utilized a noise measure, the noise-aggregate ratio NAR, to determine the noise level of real-world aggregate signals found in energy datasets. We find that noise levels vary substantially between households. We give insights on the experimental setup employed in our studies, comprising one basic and two more advanced NILM algorithms applied to data from three households with ascending noise levels. Our results show that in virtually all evaluation runs, a significant performance gap between the real and the denoised signal testing case can be identified, provided a sufficiently high noise-aggregate ratio. Though some exceptions were observed, those cases can be well explained. Hence, we claim that testing on denoised aggregate signals can lead to a distorted image of the actual capabilities of load disaggregation algorithms in some cases, and ideally, its application should be well-considered when developing algorithms for real-world settings.

References

[1] R Gopinath, Mukesh Kumar, C Prakash Chandra Joshua, and Kota Srinivas. Energy management using non-intrusive load monitoring techniques-state-of-the-art and future research directions. Sustainable Cities and Society, page 102411, 2020.
[2] George W. Hart. Prototype Nonintrusive Appliance Load Monitor. Technical report, MIT Energy Laboratory and Electric Power Research Institute, 1985.
[3] Hajer Salem, Moamar Sayed-Mouchaweh, and Moncef Tagina. A Review on Non-intrusive Load Monitoring Approaches Based on Machine Learning, pages 109–131. Springer International Publishing, Cham, 2020.
[4] Fernando Marcos Wittmann, Juan Camilo López, and Marcos J Rider. Nonintrusive load monitoring algorithm using mixed-integer linear programming. IEEE Transactions on Consumer Electronics, 64(2):180–187, 2018.
[5] Stephen Makonin and Fred Popowich. Nonintrusive load monitoring (nilm) performance evaluation. Energy Efficiency, 8(4):809–814, 2015.
[6] Stephen Makonin, Fred Popowich, Ivan V Bajić, Bob Gill, and Lyn Bartram. Exploiting hmm sparsity to perform online real-time nonintrusive load monitoring. IEEE Transactions on Smart Grid, 7(6):2575–2585, 2015.
[7] B. Zhao, K. He, L. Stankovic, and V. Stankovic. Improving event-based non-intrusive load monitoring using graph signal processing. IEEE Access, 6:53944–53959, 2018.
[8] Christoph Klemenjak, Stephen Makonin, and Wilfried Elmenreich. Towards comparability in non-intrusive load monitoring: on data and performance evaluation. 2020.
[9] Roberto Bonfigli, Emanuele Principi, Marco Fagiani, Marco Severini, Stefano Squartini, and Francesco Piazza. Non-intrusive load monitoring by using active and reactive power in additive factorial hidden markov models. Applied Energy, 208:1590–1607, 2017.
[10] Roberto Bonfigli, Andrea Felicetti, Emanuele Principi, Marco Fagiani, Stefano Squartini, and Francesco Piazza. Denoising autoencoders for non-intrusive load monitoring: improvements and comparative evaluation. Energy and Buildings, 158:1461–1474, 2018.
[11] Lucas Pereira and Nuno Nunes. Performance evaluation in non-intrusive load monitoring: Datasets, metrics, and tools - a review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6), 2018.
[12] Stephen Makonin, Bradley Ellert, Ivan V. Bajic, and Fred Popowich. Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014. Scientific Data, 3(160037):1–12, 2016.
[13] Nipun Batra, Oliver Parson, Mario Berges, Amarjeet Singh, and Alex Rogers. A comparison of non-intrusive load monitoring methods for commercial and residential buildings. arXiv:1408.6595, 2014.
[14] Christian Beckel, Wilhelm Kleiminger, Romano Cicchetti, Thorsten Staake, and Silvia Santini. The eco data set and the performance of non-intrusive load monitoring algorithms. Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, pages 80–89, 2014.
[15] Nipun Batra, Manoj Gulati, Amarjeet Singh, and Mani B Srivastava. It's different: Insights into home energy consumption in india. In Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings, pages 1–8, 2013.
[16] David Murray, Lina Stankovic, and Vladimir Stankovic. An electrical load measurements dataset of united kingdom households from a two-year longitudinal study. Scientific data, 4(1):1–12, 2017.
[17] Jack Kelly and William Knottenbelt. The uk-dale dataset, domestic appliance-level electricity demand and whole-house demand from five uk homes. Scientific data, 2(1):1–14, 2015.
[18] Nipun Batra, Rithwik Kukunuri, Ayush Pandey, Raktim Malakar, Rajat Kumar, Odysseas Krystalakos, Mingjun Zhong, Paulo Meira, and Oliver Parson. Towards reproducible state-of-the-art energy disaggregation. In

Proceedings of the 6th ACM International Conference on Systems for Energy-Efﬁcient Buildings, Cities, andTransportation , pages 193–202, 2019.[19] Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh,and Mani Srivastava. Nilmtk: an open source toolkit for non-intrusive load monitoring.

Proceedings of the 5thinternational conference on Future energy systems , pages 265–276, 2014.[20] Kyle Anderson, Adrian Ocneanu, Diego Benitez, Derrick Carlson, Anthony Rowe, and Mario Berges. BLUED: afully labeled public dataset for Event-Based Non-Intrusive load monitoring research. In

Proceedings of the 2ndKDD Workshop on Data Mining Applications in Sustainability (SustKDD) , Beijing, China, August 2012.[21] Andreas Reinhardt, Paul Baumann, Daniel Burgstahler, Matthias Hollick, Hristo Chonov, Marc Werner, andRalf Steinmetz. On the accuracy of appliance identiﬁcation based on distributed load metering data. In , pages 1–9. IEEE, 2012.[22] Andrea Monacchi, Dominik Egarter, Wilfried Elmenreich, Salvatore D’Alessandro, and Andrea M Tonello.Greend: An energy consumption dataset of households in italy and austria. , pages 511–516, 2014.[23] J. Zico Kolter and Matthew J. Johnson. Redd: A public data set for energy disaggregation research. In

Workshopon Data Mining Applications in Sustainability (SIGKDD), San Diego, CA , volume 25, pages 59–62, 2011.[24] A. Rodriguez-Silva and S. Makonin. Universal Non-Intrusive Load Monitoring (UNILM) Using Filter Pipelines,Probabilistic Knapsack, and Labelled Partition Maps. In , pages 1–6, 2019.[25] R Di Pietro and GD Hager. Handbook of medical image computing and computer assisted intervention.

Chapter ,21:503–519, 2019.[26] Jack Kelly and William Knottenbelt. Neural NILM: Deep neural networks applied to energy disaggregation.In

Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efﬁcient BuiltEnvironments (BuildSys) , 2015.[27] Odysseas Krystalakos, Christoforos Nalmpantis, and Dimitris Vrakas. Sliding window approach for online energydisaggregation using artiﬁcial neural networks. In

Proceedings of the 10th Hellenic Conference on ArtiﬁcialIntelligence (SETN) , 2018.[28] Chaoyun Zhang, Mingjun Zhong, Zongzuo Wang, Nigel Goddard, and Charles Sutton. Sequence-to-point learningwith neural networks for non-intrusive load monitoring. In

Proceedings of the 32nd AAAI Conference on ArtiﬁcialIntelligence (AAAI) , 2018.[29] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalch-brenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprintarXiv:1609.03499 , 2016.[30] Andreas Reinhardt and Christoph Klemenjak. How does load disaggregation performance depend on datacharacteristics? insights from a benchmarking study. In

Proceedings of the Eleventh ACM International Conferenceon Future Energy Systems , e-Energy ‘20, pages 167–177, New York, NY, USA, 2020. Association for ComputingMachinery.[31] J Zico Kolter and Tommi Jaakkola. Approximate inference in additive factorial hmms with application to energydisaggregation. In

Artiﬁcial intelligence and statistics , pages 1472–1482, 2012.[32] Maria Kaselimi, Nikolaos Doulamis, Anastasios Doulamis, Athanasios Voulodimos, and Eftychios Protopapadakis.Bayesian-optimized bidirectional lstm regression model for non-intrusive load monitoring. In

ICASSP 2019-2019IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 2747–2751. IEEE,2019. 10

PREPRINT - O

CTOBER

6, 2020[33] Maria Kaselimi, Nikolaos Doulamis, Athanasios Voulodimos, Eftychios Protopapadakis, and Anastasios Doulamis.Context aware energy disaggregation using adaptive bidirectional lstm models.

IEEE Transactions on Smart Grid ,2020.[34] Lukas Mauch and Bin Yang. A new approach for supervised power disaggregation by using a deep recurrent lstmnetwork. In , pages 63–67.IEEE, 2015.[35] Ke Wang, Haiwang Zhong, Nanpeng Yu, and Q Xia. Nonintrusive load monitoring based on sequence-to-sequencemodel with attention mechanism. In