A Non-Intrusive Load Monitoring Approach for Very Short Term Power Predictions in Commercial Buildings
Karoline Brucke, Stefan Arens, Jan-Simon Telle, Thomas Steens, Benedikt Hanke, Karsten von Maydell, Carsten Agert
Abstract
This paper presents a new algorithm to extract device profiles fully unsupervised from three-phase reactive and active aggregate power measurements. The extracted device profiles are applied for the disaggregation of the aggregate power measurements using particle swarm optimization. Finally, this paper provides a new approach for short term power predictions using the disaggregation data. For this purpose, a state changes forecast for every device is carried out by an artificial neural network and converted into a power prediction afterwards by reconstructing the power from the state changes and the device profiles. The forecast horizon is 15 minutes. To demonstrate the developed approaches, three-phase reactive and active aggregate power measurements of a multi-tenant commercial building are used. The granularity of the data is 1 s. In this work, 52 device profiles are extracted from the aggregate power data. The disaggregation shows a very accurate reconstruction of the measured power with a percentage energy error of approximately 1 %. The developed indirect power prediction method applied to the measured power data outperforms two persistence forecasts and an artificial neural network designed for 24 h day-ahead power predictions working in the power domain.
Keywords:
Non-intrusive load monitoring, energy disaggregation, power prediction, unsupervised learning, neural networks
1. Introduction
Due to a higher share of renewable energies and the increasing electrification of our society, the electricity grid is facing more challenges such as instabilities and sudden increases in energy supply or demand. A possible solution to avoid overloading without a massive increase in grid infrastructure is energy management on both the supply and the demand side of the electricity grid [1]. Energy management relies, among other things, on high quality forecasts of the electricity supply and demand for different time horizons of seconds to months [2, 3, 4]. Predictions for seconds to minutes are referred to as very short term predictions. The systems for which predictions are made range from high voltage grids down to the device level [5, 6, 7]. Power predictions on the demand side include random human behavior and thus show erratic and highly volatile patterns. Very short term predictions in particular are highly influenced by randomness and are more difficult to carry out than long term predictions [8]. This behavior is particularly evident in buildings such as households and industrial or commercial buildings. However, commercial and industrial buildings show a more repetitive demand than households due to the division of working time and non-working time through, for instance, shift work and repetitive tasks. Thus, there is much potential to carry out high quality power predictions in commercial buildings, which additionally mostly have a higher electricity demand than households, as [10] shows for Germany.

∗ Corresponding author, email address: [email protected]
∗∗ All authors were with: DLR Institute of Networked Energy Systems, Carl-von-Ossietzky-Str. 15, 26129 Oldenburg, Germany

For power predictions, mostly machine learning methods, in particular artificial neural networks (ANNs), are applied in order to learn interrelations between past and future consumption data [11, 12]. Power prediction methods in general mainly work directly with the quantity to be predicted and are often not able to predict sudden events resulting in sharp power rises [8, 12, 9]. In this work, we introduce a different approach to power predictions based on non-intrusive load monitoring (NILM), which was first described by Hart in 1992 [13]. NILM is also called energy disaggregation because it divides aggregate consumption data into the contributions of single devices. Energy disaggregation aims for a description of the state of every appliance in a building without a massive increase in metering infrastructure and has been carried out by various methods in the past [14, 15]. Most disaggregation methods rely on model building based on prior knowledge and on labeled data sets, as in [16], which are mostly not available in reality. Therefore, NILM approaches that work without labeled data sets are required.

Disaggregation produces additional knowledge of the building which can be used for different purposes such as power predictions. In [17], the authors incorporate appliance usage patterns to improve the performance of load forecasting, and in [18] the authors use NILM and a subsequent clustering of similarly behaving appliances as a preprocessing step for their forecast algorithm.

In this work, we present a new approach to power predictions using the state data of devices produced by our NILM approach. Thus, we develop a state prediction of devices and carry out a power prediction by reconstructing the aggregate power from the state data with respect to the according device profiles. Therefore, we present an unsupervised NILM approach whose produced state data is used for short term power predictions. For this purpose, we first state our developed algorithm, which is able to extract device profiles from aggregate consumption data unsupervised with methods from machine learning and statistics.
No prior knowledge of the building is required beforehand to calculate the device profiles. Afterwards, we disaggregate the aggregate signal using particle swarm optimization as developed by the authors of this paper and extensively covered in [19].

Figure 1 shows this procedure visually with past aggregate consumption data being disaggregated to the device level. Afterwards, the state of devices gets predicted and the aggregate power consumption is reconstructed according to the state prediction.

Figure 1: Graphical representation of power predictions based on power disaggregation. The aggregate power signal (top) is disaggregated until t = 0. This results in the power contributions of different devices (bottom). From t = 0 the state of devices is predicted and thereafter the aggregate power signal gets reconstructed from the state data of single devices.

In Section 2 we present the used data set. Thereafter, the methodology is outlined in Section 3, where we start by describing the assumed disaggregation problem in Section 3.1. Afterwards, the developed and used methods are presented, including device profile extraction, disaggregation with particle swarm optimization and very short term power prediction using an artificial neural network. In order to show results, the developed methods are exemplarily applied to the power data of a commercial building in Section 4. After a discussion of results in Section 5, we finish with a conclusion in Section 6.
2. Data Description
In this work, we use power data of one measuring point in a multi-tenant commercial building as a test data set for our developed methods. The granularity is 1 s. The data represents a production facility and workshop and contains six features: three phases of active and reactive power. The six features of the power measurements are referred to as P_1 . . . P_6, whereas P_1 . . . P_3 represent the three active power phases and P_4 . . . P_6 represent the three reactive power phases respectively. The data set includes the power measurements from December 1st, 2018 until March 29th, 2019. On average, 0.0023 % of data points are missing. Gaps are filled with the last known value. We use the power analyzer UMG 604 PRO from Janitza Electronics. According to the manufacturer, the measuring error is less than 0.4 %, which we neglect in this work [20]. Table 1 shows four key indicators of the used data set.

Table 1: Data analysis of the measured power data set

  Property                      Value
  P_min [kW]                    2.261
  P_max [kW]                    98.95
  Energy mean per day [kWh]     534.5
  P_mean [kW]                   22.27
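The gap-filling rule mentioned above (carry the last known value forward) can be sketched as follows. This is an illustrative snippet, not the authors' code; `fill_gaps` is a hypothetical helper name and `None` stands in for a missing sample.

```python
def fill_gaps(series):
    """Replace missing samples (None) with the last known value."""
    filled, last = [], None
    for v in series:
        if v is None:
            v = last  # reuse the last known measurement
        filled.append(v)
        last = v
    return filled
```

For example, `fill_gaps([2.3, None, None, 2.5])` yields `[2.3, 2.3, 2.3, 2.5]`.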
3. Methodology
In the first of the following sections, the assumed formulation of the disaggregation problem is stated. Secondly, the device profile extraction method is presented. Thereafter, the used disaggregation method, PSO, is described briefly. The disaggregation based power prediction procedure is outlined in the last of the following sections.
We assume a very similar formulation of the disaggregation problem as in [19]. The aggregate power at time t ∈ {0, 1, . . . , T}, called P(t) ∈ ℝ⁶, is assumed to be a linear combination of device profiles according to their state changes as described in the following equation:

P(t) = \sum_{i,\tilde{t}\,:\,s_i(\tilde{t})=1} s_i(\tilde{t})\, l_i(t-\tilde{t}) + \sum_{i,\tilde{t}\,:\,s_i(\tilde{t})=-1} s_i(\tilde{t})\, \mathbb{1}_{(\tilde{t},T)}(t)\, p_i + \epsilon(t)    (1)

The device profile of device i ∈ {1, 2, . . . , M} contains a dynamic profile l_i and a power value of the stable operating state p_i ∈ ℝ⁶, with τ_i being the (typical) time until this state is reached. This behavior is shown in Figure 2. S ∈ {0, 1, −1}^{T×M} denotes the so-called state-changes-matrix, with s_i(t) being the entry in the t-th row and the i-th column of S. If s_i(t) = 1, the device i is switched on at time t, if s_i(t) = −1 it is switched off, and for s_i(t) = 0 the state of device i remains the same. ε(t) is referred to as always-on-component or noise.

Figure 2: Graphical representation of a theoretical device profile including dynamic behavior in the beginning and the stable operating state after τ_i.

Figure 3: Graphical representation of the separation of complex appliance signatures into simple device profiles. The left profile contains repetitive patterns and is divided into three characteristic simpler profiles that represent the characteristic patterns.

Given these assumptions for the aggregate power signal, the following optimization problem has to be solved:

\min_S E(P, P_S)    (2)

P denotes the measured aggregate power signal. P_S denotes the reconstructed or approximated power according to Equation 1 using the state changes matrix S and the device profiles l_i. E(P, P_S) represents an error function of P and P_S. The state changes matrix S and the device profiles l_i have to be found in order to minimize the error E.

For device profile extraction, we assume a device to have a binary state, i.e. the device is only in the state ON or OFF.
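The additive model of Eq. 1 can be sketched in a few lines of numpy, under our reading of the equation (a switch-on adds the dynamic profile and afterwards the stable power p_i; a switch-off removes the stable power); the helper name and single-feature simplification are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reconstruct_power(S, profiles, p_stable, noise=0.0):
    """Linear-combination power model (cf. Eq. 1), single power feature.
    S: (T, M) array with entries in {-1, 0, 1}; profiles[i]: dynamic profile l_i;
    p_stable[i]: stable-state power held after the profile ends."""
    T, M = S.shape
    P = np.full(T, noise, dtype=float)
    for i in range(M):
        for t_on in np.flatnonzero(S[:, i] == 1):
            L = len(profiles[i])
            end = min(t_on + L, T)
            P[t_on:end] += profiles[i][: end - t_on]  # dynamic part l_i
            P[end:] += p_stable[i]                    # hold the stable state
        for t_off in np.flatnonzero(S[:, i] == -1):
            P[t_off:] -= p_stable[i]                  # switch-off removes stable load
    return P
```

With one device switched on at t = 2 (profile [1, 2], stable power 2) and off at t = 8, the model produces the expected ramp, plateau and drop.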
Stand-by modes or different operational modes of one appliance would be described as individual device profiles. This applies for complex programs of some appliances as well. Fig. 3 shows a graphical and generic representation of the division of a complex appliance signature into simplified device profiles.

The device profile extraction algorithm firstly detects times of events in the aggregate power signal by identifying peaks in the derivative of the aggregate power signal. Then, events are clustered using the k-means algorithm to determine the characteristics when switching the specific device types on or off. Afterwards, the clusters are cleaned and merged. In order to determine the typical run-time of the device, i.e. the length of the device profile, the clusters are split using Gaussian Mixture Models according to the characteristic ON-duration. Finally, median blending is used to extract the device profiles from the aggregate power signal.
We start by identifying when state changes of devices take place. For this purpose, we use the derivative of the measured aggregate power signal P, which is denoted by ∆P : {0, . . . , T − 1} → ℝ⁶ and is calculated according to Equation 3, where t + 1 denotes the subsequently measured point in time with respect to t. Due to the measuring frequency of 1 Hz, the relation simplifies to:

\Delta P(t) = \frac{P(t+1) - P(t)}{(t+1) - t} = \frac{P(t+1) - P(t)}{1\,\mathrm{s}}    (3)

We assume that a state change takes place when a sharp increase or decrease in the measured power is observable. These inflection points in the aggregate power signal result in maxima or minima in the derivative. Maxima are referred to as ON-events and minima as OFF-events in the following. In order to identify events, we take the sum of active phases P_tot ∈ ℝ^T with P_tot = P_1 + P_2 + P_3 into account. We perform a peak analysis in the derivative of the sum of the three phases of active power, ∆P_tot ∈ ℝ^{T−1}. For the peak analysis, we take all values of ∆P_tot into account which are above a threshold value ε_threshold, thus |∆P_tot(t)| ≥ ε_threshold. The threshold has to be chosen with respect to the given power data. The choosing process of a peak threshold could be automated in the future. We assume that the process of switching a device on or off is completed within 1 s. When ∆P_tot(t) is an event, we denote the respective time by t_p and call t_p the event-time. We introduce the following peak criterion, which defines a time t to be or not to be an ON-event time t_p:

t = t_p \Leftrightarrow \Delta P_{tot}(t-1) < \Delta P_{tot}(t) \wedge \Delta P_{tot}(t+1) < \Delta P_{tot}(t) \wedge \Delta P_{tot}(t) \geq \varepsilon_{threshold}    (4)

Equation 4 accordingly applies for OFF-events with reversed signs. The set of N events is referred to as D = {∆P(t_{p,1}), . . . , ∆P(t_{p,N})}.

The relation between active and reactive power has been shown to be distinctive for a specific device type [21].
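The peak criterion of Eq. 4 translates directly into code; the sketch below is a minimal illustration (hypothetical helper name), returning the ON-event times of a 1 Hz signal.

```python
import numpy as np

def on_event_times(P_tot, eps):
    """ON-event times t_p: local maxima of the derivative above eps (cf. Eq. 4)."""
    dP = np.diff(P_tot)  # 1 Hz sampling, so dP is the per-second derivative
    events = []
    for t in range(1, len(dP) - 1):
        if dP[t - 1] < dP[t] > dP[t + 1] and dP[t] >= eps:
            events.append(t)
        # OFF-events: the same criterion with reversed signs (not shown)
    return events
```

For the signal `[0, 0, 4, 10, 11, 11]` the derivative peaks at t = 2, which is the only event above a threshold of 3.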
Therefore, we assume the increase or decrease in the three phases of active and reactive power at the time of an event to be characteristic for the specific device type. With this assumption, we can cluster the extracted events ∆P(t_p) according to their values in all six power features to distinguish the device types. For the cluster analysis we use the well known k-means cluster algorithm. It is assumed that the specific patterns of an ON-event correspond to those of an OFF-event with reversed signs. Clustering is therefore only performed for the ON-events, and the OFF-events are afterwards assigned to the cluster centers with reversed signs with the smallest deviation. The k-means cluster algorithm divides a given data set D = {∆P(t_{p,1}), . . . , ∆P(t_{p,N})} into K clusters in such a way that the Euclidean distance of each data point to the nearest cluster center is minimized. The number of clusters K has to be given. This can be formalized as:

\min_{r_{nk}, \vec{c}_k} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, |\Delta P(t_{p,n}) - \vec{c}_k|^2    (5)

r_{nk} = 1 if the event ∆P(t_{p,n}) belongs to the cluster k and r_{nk} = 0 for all other clusters. Cluster centers are denoted by \vec{c}_k ∈ ℝ⁶ and the according cluster is the set of assigned events denoted by c_k. The k-means cluster algorithm solves the minimization problem using the expectation-maximization method [23]. In order to determine the optimal number of clusters K_opt for the given events, the Calinski-Harabasz score (CH) is used, which is defined by [24]:

\mathrm{CH} = \frac{N-K}{K-1} \cdot \frac{\sum_{c_k \in C} |c_k| \, |\vec{c}_k - \vec{D}|^2}{\sum_{c_k \in C} \sum_{\Delta P(t_{p,i}) \in c_k} |\Delta P(t_{p,i}) - \vec{c}_k|^2}    (6)

N denotes the number of events. The center of the whole data set D is denoted by \vec{D} and C denotes the set of clusters c_k. The cardinality of cluster k is denoted by |c_k|.
The CH becomes maximum for the optimal K and calculates a ratio between the separation of the clusters and the compactness within each cluster. It is multiplied by the pre-factor (N − K)/(K − 1) to prevent overfitting, because a larger number of clusters K must not always result in a higher value of the CH than a smaller number of clusters. In order to obtain K_opt, we perform a k-means clustering for K ∈ {2, . . . , 50} and calculate the CH every time. We choose 50 as the upper limit to confine the computing time. An adaptive method that increases K until the CH is decreasing again would be possible as well.

After the first clustering of the extracted events, we perform a cleaning step of the clusters analogously to [21]. For this, we define outlier events ∆P̃(t_p) to be out of a 2σ-area of the respective cluster, where σ denotes the standard deviation of the respective cluster. All outliers get clustered again with fixed K̃ = 10. A second CH-analysis would be possible for the outlier events as well, but this step is simplified since this cleaning step is optional in the procedure of the extraction of device profiles. With the presented clustering procedure, the characteristic increase or decrease in all six power features when switching a device on or off is known.

In order to improve the clustering of the extracted events, we perform a merging step of clusters based on a similarity measure. The similarity of two clusters is evaluated by means of the Pearson correlation coefficient ρ ∈ [−1, 1] and the absolute percentage error (APE), calculated for every combination of two cluster centers. The Pearson correlation coefficient of two cluster centers \vec{c}_i and \vec{c}_j is defined by the following equation [25]:

\rho(\vec{c}_i, \vec{c}_j) = \frac{\sigma_{\vec{c}_i, \vec{c}_j}}{\sigma_{\vec{c}_i} \sigma_{\vec{c}_j}}    (7)

where σ_{\vec{c}_i, \vec{c}_j} denotes the covariance of \vec{c}_i and \vec{c}_j. The APE is defined by the following equation:

\mathrm{APE}(\vec{c}_i, \vec{c}_j) = \frac{|\vec{c}_i - \vec{c}_j|}{|\vec{c}_i|}    (8)

If ρ(\vec{c}_i, \vec{c}_j) is above and APE(\vec{c}_i, \vec{c}_j) is below a given threshold, clusters i and j are merged. For this, a new cluster is created and the cluster members of cluster i and j are assigned to this new cluster with the cluster center \vec{c}_{i,new} = 1/2 · (\vec{c}_i + \vec{c}_j). This calculation of the new cluster center is also applied if the cardinalities of c_i and c_j are different. The thresholds are chosen such that

\rho(\vec{c}_i, \vec{c}_j) > \rho_{threshold} \wedge \mathrm{APE}(\vec{c}_i, \vec{c}_j) < \mathrm{APE}_{threshold}    (9)

Figure 4: Graphical representation of the division of an ON-duration distribution with Gaussian Mixture Models.

It may happen that c_i satisfies the condition in Eq. 9 with multiple other clusters. If that situation applies, c_i is only merged with the cluster of highest similarity. The newly created cluster c_{i,new} is not merged again with other clusters.

For the calculation of device profiles, the typical run-time, i.e. the time in the state ON, is required. Therefore, for every ON-event in a specific cluster, we determine the time between this ON-event and the next OFF-event in that cluster. We perform this calculation for every cluster of events. The calculated time is referred to as ON-duration in the following. If there are more ON-events than OFF-events, we neglect the surplus ON-events and vice-versa.
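The merge criterion of Eqs. 7–9 can be sketched as follows. The function name and the concrete threshold values (0.9 and 0.1) are placeholder assumptions for illustration; the paper chooses its thresholds for the given data set.

```python
import numpy as np

def should_merge(c_i, c_j, rho_min=0.9, ape_max=0.1):
    """Cluster-center similarity test (cf. Eq. 9).
    Thresholds are placeholders, to be tuned for the data at hand."""
    rho = np.corrcoef(c_i, c_j)[0, 1]                      # Pearson correlation (Eq. 7)
    ape = np.linalg.norm(c_i - c_j) / np.linalg.norm(c_i)  # APE (Eq. 8)
    return rho > rho_min and ape < ape_max
```

A center and a 1 % scaled copy of it merge (ρ = 1, APE = 0.01), while a dissimilar center does not.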
For every cluster, we present all determined ON-durations in a frequency distribution and observe multiple maxima at different times. In reality, the ON-duration depends on the kind of use of the individual device, for example if a device is capable of running different programs or if the same device type is used for different tasks.

In this work, Gaussian Mixture Models (GMMs) are used to divide clusters according to characteristic ON-durations within a specific cluster. Figure 4 shows an exemplary ON-duration distribution of a cluster with a fitted GMM which divides the distribution into two sub-distributions. GMMs determine the properties of sub-distributions in an overall distribution, using only observations of the overall distribution B = (\vec{x}_1, . . . , \vec{x}_N) [26]. The a posteriori probability for the GMM is calculated as follows:

p(\theta \mid B) = \sum_{i=1}^{m} \pi_i \, \mathcal{N}(\vec{x} \mid \vec{\mu}_i, \Sigma_i)    (10)

where p(θ | B) describes the probability of the model parameters θ given the data set B. The parameters θ_i = (π_i, \vec{μ}_i, Σ_i) denote the mixing coefficients, the mean values and the covariance matrices of the i-th of m Gaussian distributions. The mean values of the Gaussian distributions represent the mean ON-duration, which will be referred to as d in the following. Therefore, the ON-duration of device i is denoted by d_i. The number of sub-distributions m has to be given beforehand. The maximum-likelihood method is used together with the expectation-maximization algorithm to obtain an optimal estimation of θ [27]. In order to determine the optimal number of Gaussian distributions m_opt in the GMM of each cluster, the Bayesian Information Criterion (BIC) is used. The BIC is a measure for comparing different models. It is defined by the following equation [28]:

\mathrm{BIC} \approx M \ln N - 2 \ln p(B \mid \theta)    (11)

N denotes the number of data points in data set B and M the number of parameters in θ.
According to this definition, the BIC is to be minimized. As soon as ∆BIC between two subsequent models M_m and M_{m+1} exceeds a given limit, the model M_{m+1} is selected and the corresponding m is called m_opt. The limit for ∆BIC to select m_opt has to be determined empirically. In general, m should be increased as long as ∆BIC is negative for two subsequent models.

Given this procedure, every cluster k is divided into m groups. The groups that emerge from one cluster share the cluster center (the characteristics at an ON-event and OFF-event) but differ in their characteristic ON-duration. An event is assigned to a group if the associated Gaussian distribution is maximum for the ON-duration of this event. From in total K clusters emerge M = \sum_{k=1}^{K} m_{opt,k} groups, which will be denoted as G_i. The ON-duration of G_i is referred to as d_i.

For the final calculation of the device profiles, median blending is used for all groups. Median blending is a method of noise reduction which we will
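The BIC-based model selection can be sketched as follows: increase m while the BIC of Eq. 11 keeps decreasing (∆BIC negative) and keep the last improving m. Helper names and the stopping rule details are illustrative assumptions.

```python
import numpy as np

def bic(loglike, n_params, n_points):
    """BIC ≈ M ln N − 2 ln p(B|θ) (cf. Eq. 11); smaller is better."""
    return n_params * np.log(n_points) - 2.0 * loglike

def select_m(loglikes, params_per_m, n_points):
    """Increase m while the BIC keeps decreasing; return the last improving m.
    loglikes[m-1]: fitted log-likelihood of the m-component GMM."""
    best_m = 1
    prev = bic(loglikes[0], params_per_m[0], n_points)
    for m in range(2, len(loglikes) + 1):
        cur = bic(loglikes[m - 1], params_per_m[m - 1], n_points)
        if cur - prev >= 0:  # ΔBIC no longer negative: stop
            break
        best_m, prev = m, cur
    return best_m
```

With log-likelihoods −500, −450, −448 and parameter counts 2, 5, 8 on 100 points, the second model is the last one that lowers the BIC.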
Figure 5: Graphical representation of the developed algorithm for device profile extraction. On the left, the operations are depicted, while the data of every step in the algorithm is presented on the right.

use in order to reduce the noise of the aggregate power signal and to isolate the device profile from this background [29]. For every element ∆P(t_p) ∈ G_i we store and normalize the aggregate power signal from t_p . . . t_p + d_i. The normalized power signal is denoted by P_norm. Normalization is carried out by dividing by the maximum power value in the stored part of the aggregate power signal. Then, the median for every point in time in the saved aggregate power signal is calculated in all six power features. In order to scale the normalized profile back to absolute power values, we use the cluster center of the respective cluster. It represents the characteristic increase in power per second when switching on a specific device type. Therefore, we integrate the cluster center by multiplying with one second. Finally, we scale back the median values by multiplying \vec{c}_k and the normalized l_i. We define the power profile of device i

l_i : \{0, . . . , d_i\} \to \mathbb{R}^6,    (12)

where d_i denotes the ON-duration, by

l_i(t) = \vec{c}_k \cdot \mathrm{median}\{P_{norm}(t_p + t) \mid t_p \text{ is an ON-event of the device}\}    (13)

for every t ∈ {0, . . . , d_i}. A prerequisite for this procedure is that there are enough events in G_i in order to reduce the noise of the aggregate power signal.

3.3. Disaggregation Procedure

The disaggregation is carried out by particle swarm optimization (PSO) as described in [19], which improves, for the disaggregation problem, on the original approach described by Hart in [13]. PSO is a metaheuristic used for multidimensional optimization problems such as the disaggregation problem presented above. In this work, we use PSO to determine the state changes matrix S. For this purpose, the extracted device profiles are used.
The PSO aims at minimizing the following error measure [19]:

E_{[a,b)}(P, P_S) = \alpha \cdot \sum_{t=a}^{b-1} (\vec{P}_S(t) - \vec{P}(t))^2 + \beta \cdot \sum_{t=a}^{b-1} (\Delta\vec{P}_S(t) - \Delta\vec{P}(t))^2    (14)

with α + β = 1 weighting the two summands. The algorithm we use in this work to carry out the disaggregation is extensively presented in [19]. In [19] it is assumed that a device profile consists of transient or dynamic behavior and a stable state reached after a specific time τ. Thus, we assume the extracted load profiles to represent the dynamic behavior of the device. The power value of the stable state is assumed to be the last non-zero power value of the specific device profile.

In this work, we present a new method for power forecasts that is based on a forecast of state changes of unknown devices. The power forecast is carried out by reconstructing the power from the state changes forecast according to Equation 1. The weekends of the used data set show very regular power curves with highly repetitive patterns. Thus, it is assumed that persistence forecasts are sufficient for weekend days. Therefore, we only consider working days, since they show more complex power demands with many sharp increases and decreases, which we are aiming to predict. For this purpose, we use an artificial neural network (ANN). ANNs have been widely used as a very powerful method for time series prediction in different fields regarding the power grid [30, 31, 32]. Especially for load and energy forecasts, ANNs are preferred due to the non-linearity and randomness within power data [8]. We aim for the ANN to learn an interrelationship between the last hour of state changes and the
Figure 6: Graphical representation of the time features given the ANN as input.

state changes within the next 15 minutes. We use a feed forward, fully connected ANN based on the supported models and functions of keras [33]. In the following, the feature selection and the hyperparameter optimization of the ANN are described. Finally, the training procedure of the ANN is outlined.
The feature selection for the ANN determines the input and target data, also called output data. All inputs and outputs have to be normalized to a range of −1 . . . 1. Firstly, we choose the state changes data from t = −3600 s . . . t = 0 s as input data for a state changes prediction for t = 0 s . . . t = 900 s. Secondly, we choose the data from t = 0 s . . . t = 900 s of seven days before as input data. Thus, for a prediction on a Thursday at 11 am, the state changes data from Thursday 10 am - 11 am as well as the state changes data from the Thursday one week before from 11 am - 11:15 am are selected. This feature selection has proven to be helpful due to the regularity in industrial and commercial data related to the weekday [34]. For M given device types, the input data set contains 2M + 3 columns. The target data contains only the future state changes data of the M devices. Thus, there are M columns in the target data set. The number of rows is determined by the size of the training data set; thus, the number of rows corresponds to the number of time steps in the training data set.

Figure 7: Graphical representation of the integration of state changes for the data preparation for the training of the ANN.

During training, the difference between output data and target data is quantified by calculating an error measure. A large proportion of the state changes data is zeros. Thus, there is a local optimum in the error measure of the ANN to only predict zeros, i.e. no state changes at all. Therefore, we perform an additional preprocessing step for the state changes data: we transform the state changes data to state data via integration. Figure 7 shows a graphical representation of the integration procedure of state changes. Essentially, the state changes are added up for every device. This step bypasses a data structure that contains many zeros. After integration, the state data is normalized to the range of [−1, 1].

The hyperparameter optimization was carried out with the help of talos and the supported random search [35].
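The integration and normalization step can be sketched as follows; this is a minimal numpy version, and the per-column max-abs scaling is our assumption for how the state data is mapped to [−1, 1].

```python
import numpy as np

def state_changes_to_states(S):
    """Integrate state changes (entries -1/0/+1) per device into cumulative
    state counts, then scale each column into [-1, 1] for the ANN."""
    states = np.cumsum(S, axis=0).astype(float)  # "add up" the state changes
    peak = np.max(np.abs(states), axis=0)
    peak[peak == 0] = 1.0  # avoid division by zero for idle devices
    return states / peak
```

For a single device with changes [1, 0, 1, −1] the cumulative states are [1, 1, 2, 1], scaled to [0.5, 0.5, 1.0, 0.5].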
Hyperparameters are all parameters of an ANN that are not adapted during training but have to be set beforehand. The following hyperparameters are considered for optimization: The number of neurons in the hidden layers sets the width of the ANN, whereas the number of hidden layers determines the depth of the ANN. The number of neurons in the hidden layers does not have to be the same for all layers; thus, the width of the ANN can vary.
Dropout describes the percentage of neurons that are neglected randomly in every hidden layer during a training step in order to increase the robustness and decrease over-fitting of the ANN [36]. The learning rate is a measure for the step size made in the training process in every iteration.
Table 2: Hyperparameters and their chosen values for the state forecast using an ANN

  Hyperparameter name          Selected value
  Neurons in hidden layers     214
  Number of hidden layers      3
  Dropout                      5 %
  Learning rate                0.01
  Batch size                   2048
  Activation function          ReLU

A larger learning rate decreases the training time but increases the risk of not fully converging into an error minimum, and vice-versa for smaller learning rates. The batch size determines the number of samples of the training data set that are processed at once. Thus, the parameters of the ANN are not adapted after every single sample has passed the ANN, but only after as many samples as the batch size. As activation function, the ReLU function proved to have the best outcome in this work. The chosen values of the optimized hyperparameters are presented in Table 2. Given these hyperparameters, the chosen model has 137868 trainable parameters.

We chose the mean squared logarithmic error (MSLE) as error measure, which is defined as follows:

\mathrm{MSLE}(y_{target}, y_{Out}) = \frac{1}{N} \sum_{i=1}^{N} \big( \log(y_{target,i} + 1) - \log(y_{Out,i} + 1) \big)^2    (15)

Due to the logarithmic character of the MSLE, it penalizes deviations at small values more heavily than error measures like the root mean squared error or the mean absolute error. This showed an improved training process given the structure of the data in this work. In order to evaluate and compare the results of the prediction, we use the same error measures as for the validation of the disaggregation results.

The training is performed with an Intel i7-6700k processor, 16 GB of RAM and a GeForce GTX 1050 graphics card with 768 CUDA cores. The data set for training includes 55 working days from January 2019 to March 2019; thus, it contains 4752000 rows and 2M + 3 columns. During training, 95 % of the data are used for training the network and 5 % are used as a validation data set. As soon as the error on the independent validation set increases, the training is stopped. This is carried out by the early stopping option of keras [33].
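Eq. 15 corresponds to the standard MSLE (as implemented, for instance, in keras); a minimal numpy sketch, assuming non-negative inputs so that the logarithms are defined:

```python
import numpy as np

def msle(y_target, y_out):
    """Mean squared logarithmic error (cf. Eq. 15), for non-negative values."""
    y_target, y_out = np.asarray(y_target), np.asarray(y_out)
    # log1p(x) computes log(x + 1) accurately for small x
    return np.mean((np.log1p(y_target) - np.log1p(y_out)) ** 2)
```

Because of the logarithm, an absolute error of the same size weighs more near zero than at large values, which matches the motivation given in the text.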
As a postprocessing step, we calculate the derivative, thus the reverse procedure of the shown integration. The output values of the used ANN are floats and not integers as assumed in Equation 1. Thus, we interpret the outputs as probabilities of state changes of the devices. In order to reconstruct the power, we allow floats and calculate a weighted sum and not a discrete sum. Therefore, Equation 1 changes as follows:

P(t) = \sum_{i,\tilde{t}\,:\,s_i(\tilde{t}) > 0.1} s_i(\tilde{t})\, l_i(t-\tilde{t}) + \sum_{i,\tilde{t}\,:\,s_i(\tilde{t}) < -0.1} s_i(\tilde{t})\, \mathbb{1}_{(\tilde{t},T)}(t)\, p_i + \epsilon(t)    (16)

with s_i(t) ∈ ℝ. We define a threshold of 0.1 for an element of the prediction to be taken into account for the reconstruction. As always-on-component ε, we give each short term prediction the last measured power value. Thus, for a prediction from t = 0 to t = 900 we set ε = P(−1).
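The thresholding used in Eq. 16 can be sketched as follows (illustrative helper name; the 0.1 threshold follows the text): predicted state-change probabilities with magnitude at or below the threshold are zeroed out before the weighted reconstruction.

```python
import numpy as np

def threshold_states(S_pred, thr=0.1):
    """Keep only predicted state changes with |s| > thr (cf. Eq. 16);
    the surviving float values act as weights in the reconstruction."""
    S = np.asarray(S_pred, dtype=float)
    return np.where(np.abs(S) > thr, S, 0.0)
```

For example, the prediction [0.05, 0.3, −0.2, −0.08] is reduced to [0.0, 0.3, −0.2, 0.0].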
4. Results
In this section we present the results of the application of the developed methods to the above described data set. For the device profile extraction we use the data from January and February 2019. Thereafter, we disaggregate the whole data set. In order to train the forecast algorithm, we use data from January until March 2019. The testing of the forecast algorithm is carried out using the last two days in the data set: 28th and 29th March, 2019. Since the forecast horizon is 15 min, we are able to perform and evaluate 188 single power predictions on the test data set. In order to validate the results of disaggregation and prediction, we use the error measures root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and the percentage energy difference (Energy_E) as in [19].

Figure 8: Clustering of ON-events and OFF-events individually for December 4th, 2018. Presented are two of six power derivative features.

In Figure 8 an exemplary cluster analysis of one day of data (December 4th, 2018) is shown for two of the six elements of ∆P(t_p). ON-events are depicted as well as OFF-events, and the symmetry to the central point zero is clearly visible. It is apparent that the relation of the increase in active and reactive power is not randomly distributed, but forms clusters. ON-events and OFF-events are clustered individually. The cluster-forming behavior becomes clearer taking all six features of the power derivative into account. Therefore, all six features of six exemplary cluster centers are presented in Figure 9. It is visible that the clusters have very distinct characteristics regarding the relation between the six features. Whereas Examples 1, 3 and 5 only show an increase in one phase of power, the other three examples seem to represent three-phase connected devices. They approximately have the same power derivative at an ON-event in all three phases in active and reactive power.
The relation between active and reactive power is very distinct. While Examples 1, 3 and 5 show almost no increase in reactive power when switched on, Examples 4 and 6 have significant reactive power increases. For Example 4, the increase in reactive power is even higher than the increase in active power.

To show the separation of clusters according to their ON-duration, Figure 10 shows two exemplary ON-duration distributions with the respective fitted GMMs. Cluster 15 from Figure 10 gets divided into two groups with approximate ON-durations of 200 s and 1000 s. On the other hand, Cluster 14 gets divided into three groups with approximate ON-durations of 250 s, 900 s and 1900 s.

Figure 9: Six examples of cluster centers and their characteristics in all features of the power derivative.

Figure 10: Two examples of GMMs for ON-duration distributions.

Given the examples for different steps of the developed algorithm, we show four exemplary device profiles in Figure 11. In total, we extracted 52 device profiles from the aggregate power data with the developed algorithm. The depicted profiles are representative for all extracted device profiles since they show the main patterns and behaviors of the device profiles we extracted. Both upper illustrations in Figure 11 show the most frequent type of device profiles: a three-phase connected device with a transient behavior in the beginning and afterwards a stable operating state where the relation between active and reactive power remains approximately the same. Additionally, Device Profiles 4 and 42 show that the relation of active and reactive power is characteristic for the specific device. The length of Profiles 4 and 42 differs as well. Device Profile 6 shows no dynamic behavior in the beginning of the profile, but it consists of a constant component and an oscillating or random component. Obviously, Device Profile 6 represents a single-phase connected device, since all power values other than the active power of one phase are close to zero.
Profile 6 has a high ON-duration d compared to Device Profiles 4 and 42. An exception among the device profiles is represented by Device Profile 37, which shows a decreasing behavior with many little but sharp increases and decreases in all power features. This kind of device profile is least frequent.

Figure 11: Four examples of device profiles extracted unsupervised from the aggregate power signal. Solid lines represent active power and dashed lines represent reactive power.
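A minimal stand-in for the event clustering described above is a plain Lloyd's k-means on six-dimensional event features. The real pipeline clusters the measured ΔP/ΔQ features of all three phases and chooses K via a statistical criterion; here the data and K are synthetic for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    """Minimal Lloyd's k-means, standing in for the clustering of
    switching events in the six-dimensional power-derivative space."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each event to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned events
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# two synthetic event clusters in a 6-D feature space (ΔP, ΔQ per phase)
X = np.vstack([rng.normal(0.0, 0.1, (40, 6)), rng.normal(5.0, 0.1, (40, 6))])
centers, labels = kmeans(X, 2)
```

In practice a library implementation with multiple restarts would replace this sketch; the point is only that well-separated switching behaviors fall into distinct clusters.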
In total, we disaggregated power data of 119 days from December 2018 to March 2019. Figure 12 shows an exemplary day of data with the sum of active phases on the left and the sum of reactive phases on the right. Beneath, the respective absolute error is shown. It is visible that the PSO is able to reconstruct the shape of the aggregate power signal over the duration of one whole day, including the repetitive patterns during the night and most of the peaks. Nevertheless, there are error peaks of up to almost 20 kW, which corresponds to approximately 25 % of the measured power at the respective time. However, these high error values occur infrequently and are of very short duration. During working time, the absolute error is higher than at night, but there is no constant offset between measured and reconstructed power. The error of the reactive power is larger than the error of the active power. At the end of the presented day, noise is present in the reconstructed power.

Table 3 shows the error values regarding all considered error measures for the working days of March 2019. The mean values of the daily evaluation of the reconstruction after disaggregation are presented as well as the respective standard deviation of the mean. The reproducibility of results is shown by the standard deviations of the mean error values, which are approximately 10 % of the respective mean values.

Figure 12: Disaggregation results for 4th December, 2018. On the left, the sum of active power is shown and on the right side the sum of reactive power. At the bottom, the respective absolute difference between measured and reconstructed power is illustrated.

Table 3: Error characteristics between the sum of active power measured and reconstructed after disaggregation for March 2019. The values are means of daily error evaluations and standard deviations of the mean values.
(March 1st, 2019 – March 31st, 2019)
RMSE [W]: 1565 ± …
Energy E [%]: 0.897 ± …

Given the data produced by the disaggregation, an ANN is trained according to the pre- and postprocessing of the data described in Section 3. The data set for testing the ANN performance consists of the 28th and 29th March, 2019. Therefore, we can calculate 188 power predictions of 15 minutes each for the test data set. The ANN is given the first hour of the test set as input data. To put the results into perspective, we compare the error measures on the test set with error measures for different persistence forecasts. Lastly, we compare the developed short term prediction based on state changes data with prediction results of an ANN mainly based on past power data with a granularity of 5 min from [34]. In [34], the authors used the same data as in this work and optimized a Long Short-Term Memory (LSTM) neural network for a 24-h-day-ahead prediction. Although the prediction horizon and the granularity are different, the power prediction from [34] represents the standard prediction procedure and therefore acts as a benchmark prediction. All error measures are calculated for the sum of active power phases. Table 4 shows the means and standard deviations of multiple error measures for the predictions. The first persistence forecast uses the power values from seven days ago, whereas the second persistence forecast uses the power values of the preceding 15 min. That means, for a prediction from t ... t + 900 s, the power values from t − 900 s ... t are taken. In comparison, Table 5 shows the respective means and standard deviations of the error measures for the ANN predictions based on disaggregation data. The ANN outperforms both persistence forecasts regarding the mean error values of all calculated error measures. Especially, the MAPE and the error in daily consumed energy are significantly smaller.

Table 4: Multiple error measures between measured and predicted power for two different persistence forecasts. Presented are the means and standard deviations of the errors. They are calculated for 188 individual predictions of 15 minutes each for the test data set 28th – 29th March, 2019.

Persistence (7 days before) / Persistence (15 min)
RMSE [W]: 6148 ± … / …
Energy E [%]: 35.25 ± … / …

Table 5: Multiple error measures between measured and predicted power of the described ANN. Presented are the means and standard deviations of the errors. They are calculated for 188 individual predictions of 15 minutes each for the test data set 28th – 29th March, 2019.

Power prediction with ANN
RMSE [W]: 3478 ± …
Energy E [%]: −0.15 ± …

Figure 13: Comparison of 24h-day-ahead power forecast and very short term power prediction based on state forecasts.

In Figure 13, the measured power and both ANN predictions are shown. The 24 h day-ahead prediction is similar to a rolling averaged power value, whereas the short term prediction based on state changes data shows more of the erratic behavior during working time with sharp increases and decreases in the power. For the model of the 24 h day-ahead prediction we can calculate the RMSE and MAE, which results in RMSE = 5124 W and MAE = 4507 W. Both values are significantly higher than the mean error values of the short term prediction of the disaggregation based ANN.
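The two persistence baselines are straightforward to state. A sketch at 1 s resolution, with hypothetical argument names (`history` is the measured power series ending at the prediction start):

```python
import numpy as np

def persistence_forecast(history, horizon=900, lag=7 * 24 * 3600):
    """Persistence baseline at 1 s resolution: predict the power values
    observed `lag` seconds earlier.

    lag = 7 days (default) gives the "7 days before" baseline,
    lag = 900 gives the "preceding 15 min" baseline.
    """
    history = np.asarray(history, dtype=float)
    start = len(history) - lag
    return history[start : start + horizon]
```

With `lag` equal to the horizon, the forecast for t ... t + 900 s is exactly the measured window t − 900 s ... t, as described in the text.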
5. Discussion
For the extraction of device profiles, the main distinguishing factor is the behavior at an ON-event. The used peak criterion to determine the ON-behavior is very simple and neglects peaks with a width of multiple time-steps. This problem could be solved with a more sophisticated peak criterion in future work.

The k-means clustering algorithm is used to determine clusters in the six-dimensional space of reactive and active power phases. Other publications, such as [21, 22], also use a clustering to differentiate between device types, but they use at most two features and not six features as in this work. In general, a clustering is more precise the more characteristic features are present [23]. Thus, we can assume that we reach a higher precision in dividing the events into clusters of device types. Other properties measured by power analyzers could be used additionally to distinguish between different device types in future work. Nevertheless, the number of necessary features should be limited regarding a realistic application in real world energy management systems and the availability of high resolution power analyzers.

During clustering, we assume that the cluster centers of OFF-events are the reversed cluster centers of ON-events. When clustering is performed for ON-events and OFF-events individually, the OFF-event cluster centers with reversed signs lie within a 0.25-σ area of the ON-event cluster centers, with σ denoting the standard deviation of the respective cluster. Therefore, the assumption can be justified. Additionally, Figure 9 shows the symmetry of ON- and OFF-events to the central point.

In order to determine the ON-event behavior, we perform a peak analysis based on the assumption that the switching procedure of a device is finished within one second. In reality, most devices show a transient behavior, mostly in the shape of an exponentially decreasing oscillation [37]. Some publications distinguish between different transient behaviors and thus different device types [38, 39]. But, compared to the measuring frequency, these processes happen on shorter timescales (within milliseconds) and can be neglected here. Only with a measuring frequency in the kHz range would the characteristic transient behavior be observable [37]. However, an exhaustive installation of measuring infrastructure able to measure in kHz is unlikely. Thus, the presented approach using a measuring frequency of 1 Hz is more realistic to be applied in local energy and power management systems. With the measuring frequency of 1 Hz in this work, an ON-event looks approximately like a step function in the aggregate power signal. Nevertheless, in most device profiles in Figure 11, a transient behavior can be observed in the first few seconds of the according profiles. Thereafter, most devices reach a stable state for as long as the profile persists. Thus, the division of the device profiles into a stable state and a dynamic behavior for the stated formulation of the disaggregation problem can be applied here.

The final step of the device profile extraction procedure is the median blending. In general, more samples to perform median blending with result in more precise results. Thus, it is very important to perform the extraction procedure on a sufficient amount of data. Especially devices which are only switched on rarely result in less accurate profiles. The chosen normalization is carried out by means of a division by the maximum power value in every sample of P_norm. With a high base load, this procedure could average out characteristic fluctuations of the device profile. Therefore, another normalization method might be appropriate if the individual profiles are of great importance and an allocation to real measured profiles is of interest. However, we focused on the high quality very short term power prediction with the focus of attention on the aggregate power signal.
Thus, small fluctuations of individual device profiles were of minor importance. The improvement of the median blending procedure or the application of other noise reduction methods for device profile extraction could be examined in future work.

In total, 52 device profiles are extracted for our data set of a commercial consumer. Since no additional knowledge of the used data is present, we cannot validate this number of device profiles. But the reason for this number of profiles is the division by ON-behavior and additionally the division of clusters into groups with similar run-time. Even a simple ohmic consumer type could therefore result in multiple device profiles.

A direct validation of the device profiles was not possible in this work due to a lack of data of the correct device profiles. Additionally, an assignment of extracted device profiles to measured, complex appliance signatures would be difficult since the extracted profiles only represent operational modes of appliances. But the good results in disaggregation and forecast show that the extracted device profiles are a satisfactory representation of the real device profiles.

The extraction procedure has similarities to non-negative blind source separation in acoustics, where the individual components and the mixing procedure are unknown [40]. Since all methods used for the device profile extraction are from statistics and unsupervised machine learning, no hyperparameters have to be optimized to apply the algorithm to a different data set. The needed hyperparameters, such as the number of clusters K or the number of Gaussian distributions in the GMMs, are determined using statistical scores or criteria. Therefore, the device profile extraction algorithm can be applied without changes to other data sets. The transferability has to be examined systematically in the future.

Figure 12 and Table 3 show that the disaggregation of this work reaches a very accurate reconstruction of the measured power.
The results are consistently good in all six phases. Thus, we can assume that the device profiles are a good representation of the real devices, and also the separation according to the ON-event behavior seems valid. Since the PSO is a metaheuristic, incorrect assignments of devices to events are possible. Nevertheless, the disaggregation procedure produces additional knowledge of the building or the respective data set without a costly model building and adaptation to the data. The aim of this work is the use of this additional knowledge for the purpose of a very short term power prediction and the examination whether this additional knowledge provides benefits for such an application. The disaggregation procedure can be justified if a disaggregation based prediction method outperforms classic prediction methods working in the power domain.

The conducted very short term forecast using state changes data shows significantly better results than multiple persistence forecasts and a forecast using an LSTM network which is optimized for 24 hour prediction with a resolution of 5 min. Thus, the LSTM predicts 288 values compared to the 900 of our short term forecast. To be noted is that the maximum achievable accuracy of the predictions is the accuracy of the reconstruction of the disaggregation. Thus, error values smaller than the reconstruction error values can only occur by chance, but not systematically. The used model is a very simple ANN for a high number of input and target features. Thus, further optimization regarding the model of the neural network and maybe the use of LSTM layers or convolutional layers could result in better forecasts. The ANN is optimized for the used data. Therefore, results could be worse when applied to another data set of state changes. The developed forecast does not rely on an exhaustive rollout of measuring infrastructure as in [9] and thus is easily transferable also with limited measuring infrastructure.
Nevertheless, the transferability has to be examined systematically in the future.

It is to be assumed that a certain proportion of state changes of devices during working time is purely coincidental. However, randomness cannot be predicted by any model. In order to assess the chances of success of applying the presented approach to other power data, the randomness of the data could be determined in advance using appropriate methods. For example, the approximate entropy method described in [41] could be used, which has already been applied to, e.g., stock prices in [42]. Additionally, instead of a deterministic prediction, one could perform a probabilistic prediction and/or work with confidence intervals for the predicted power values. This procedure could help in management decision making. In this work, we showed the advantages of the state changes data for power predictions. But the additional knowledge from device profile extraction and disaggregation could also be applied to other tasks like behavioral analysis, state analysis of the building, checking the health status of residents or employees, or giving recommendations for an intelligent power consumption regarding the availability of renewable energy. With more variable market-based electricity tariffs, even new business models would be possible using the presented approach in energy management systems.
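The approximate entropy of [41] is compact enough to sketch. The parameters m (template length) and r (tolerance) are the usual ones; the values used in the example below are illustrative, not taken from the paper:

```python
import numpy as np

def approximate_entropy(x, m=2, r=0.2):
    """Approximate entropy after Pincus [41]: low values indicate a
    regular, predictable series, high values a more random one."""
    x = np.asarray(x, dtype=float)
    n = len(x)

    def phi(m):
        # all overlapping templates of length m
        templates = np.array([x[i : i + m] for i in range(n - m + 1)])
        # for each template, count templates within Chebyshev distance r
        counts = np.array([
            np.sum(np.max(np.abs(templates - t), axis=1) <= r)
            for t in templates
        ])
        return np.mean(np.log(counts / (n - m + 1)))

    return phi(m) - phi(m + 1)
```

Applied to a strictly periodic series, the result is close to zero, while an i.i.d. noise series yields a clearly higher value, which is the property that would allow a rough a-priori assessment of predictability.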
6. Conclusions
In this work, we developed an algorithm for extracting device profiles from aggregate power data in six dimensions fully unsupervised. Since the method relies on statistical and unsupervised machine learning methods, it extracts repetitive patterns in the aggregate power data. Therefore, the extracted profiles are not necessarily full appliance signatures but one operational mode of one device. The direct validation of device profiles was not possible due to a lack of measured or correct device profiles. The transferability of the proposed device profile extraction is very high in theory, since no hyperparameters have to be optimized beforehand, but this has to be proven in future work. The disaggregation uses the extracted device profiles and shows a very accurate reconstruction. Thus, the device profiles seem to represent real appliance signatures sufficiently well. As the final application of the conducted NILM approach, the very short term prediction of power outperformed all compared predictions. Although many publications have developed or carried out various NILM algorithms, a broad application of those methods for other purposes is still missing. In this work, we showed the advantages of the additional knowledge of NILM for very short term power predictions. Our results and approaches for predictions could be combined with short term or long term power predictions working directly in the power domain. Especially for energy management systems, such combined and high quality predictions would be very valuable for decision making processes.
Acknowledgments
The authors acknowledge the financial support of the Federal Ministry for Economic Affairs and Energy of the Federal Republic of Germany for the project EG2050: EMGIMO: Neue Energieversorgungskonzepte für Mehr-Mieter-Gewerbeimmobilien (03EGB0004G and 03EGB0004A).
References

[1] G. Strbac and A.M. Khambadkone, "Demand side management: Benefits and challenges", Energy Policy, 36(12) (2008), pp. 4419-4426
[2] D. Tran and A.M. Khambadkone, "Energy management for lifetime extension of energy storage system in micro-grid applications", IEEE Transactions on Smart Grid, 4(3) (2013), pp. 1289-1296
[3] D. Arcos-Aviles, J. Pascual, F. Guinjoan, L. Marroyo, P. Sanchis and M. Marietta, "Low complexity energy management strategy for grid profile smoothing of a residential grid-connected microgrid using generation and demand forecasting", Applied Energy, 205 (2017), pp. 69-84
[4] C. Wan et al., "Photovoltaic and solar power forecasting for smart grid energy management", CSEE Journal of Power and Energy Systems, 1(4) (2015), pp. 38-46
[5] L. Pedersen, J. Stang and R. Ulseth, "Load prediction method for heat and electricity demand in buildings for the purpose of planning for mixed energy distribution systems", Energy and Buildings, 40(7) (2008), pp. 1124-1134
[6] M. Beccali et al., "Forecasting daily urban electric load profiles using artificial neural networks", Energy Conversion and Management, 45(18-19) (2004), pp. 2879-2900
[7] H. Li et al., "A hybrid annual power load forecasting model based on generalized regression neural network with fruit fly optimization algorithm", Knowledge-Based Systems, 37 (2013), pp. 378-387
[8] K. Lang, M. Zhang, Y. Yuan and Y. Xijian, "Short-term load forecasting based on multivariate time series prediction and weighted neural network with random weights and kernels", Cluster Computing, 22 (2019), pp. 12589-12597
[9] A.M. Alonso, F.J. Nogales and C. Ruiz, "A Single Scalable LSTM Model for Short-Term Forecasting of Massive Electricity Time-Series", arXiv preprint arXiv:1910.06640, 2020
[10] Federal Environment Agency, "Evaluation tables for Energy balance of the Federal Republic of Germany 1990 to 2018", 2019
[11] H.S. Hippert, C.E. Pedreira and R.C. Souza, "Neural networks for short-term load forecasting: A review and evaluation", IEEE Transactions on Power Systems, 16(1) (2001), pp. 44-55
[12] M.Q. Raza and A. Khosravi, "A review on artificial intelligence based load demand forecasting techniques for smart grid and buildings", Renewable and Sustainable Energy Reviews, 50 (2015), pp. 1352-1372
[13] G.W. Hart, "Nonintrusive appliance load monitoring", Proceedings of the IEEE, 80 (1992), pp. 1870-1891
[14] A. Faustine et al., "A survey on non-intrusive load monitoring methodies and techniques for energy disaggregation problem", arXiv preprint arXiv:1703.00785, 2017
[15] A. Zoha et al., "Non-intrusive load monitoring approaches for disaggregated energy sensing: A survey", Sensors, 12(12) (2012), pp. 16838-16866
[16] J.Z. Kolter and M.J. Johnson, "REDD: A public data set for energy disaggregation research", Proc. SustKDD Workshop on Data Mining Appl. in Sustain., 2011
[17] S. Welikala et al., "Incorporating appliance usage patterns for non-intrusive load monitoring and load forecasting", IEEE Transactions on Smart Grid, 10(1) (2017), pp. 448-461
[18] M. Wurm and V.C. Coroama, "Grid-level short-term load forecasting based on disaggregated smart meter data", Computer Science - Research and Development, 33(1-2) (2018), pp. 265-266
[19] K. Brucke, S. Arens, J. Telle, S. Schlüters, B. Hanke, K. von Maydell and C. Agert, "Particle Swarm Optimization for Energy Disaggregation in Industrial and Commercial Buildings", arXiv preprint arXiv:2006.12940, 2020
[20] Janitza electronics GmbH, Power Quality Analyser UMG 604-PRO - User manual and technical data, 2017
[21] S.K.K. Ng, J. Liang and J.W.M. Cheng, "Automatic Appliance Load Signature Identification by Statistical Clustering", Hong Kong, pp. 1-6
[22] M. Zeifman, S.R. Shaw and J.L. Kirtley, "Disaggregation of home energy display data using probabilistic approach", IEEE Transactions on Consumer Electronics, 58(1) (2012), pp. 23-31
[23] C.M. Bishop, "K-means Clustering", Pattern Recognition and Machine Learning, Springer, New York, 2006
[24] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis", Communications in Statistics - Theory and Methods, 3(1) (1974), pp. 1-27
[25] R.G. McClarren, "Pearson Correlation", Uncertainty Quantification and Predictive Computational Science, Springer, Cham, Switzerland, 2018
[26] C.M. Bishop, "Mixture of Gaussians", Pattern Recognition and Machine Learning, Springer, New York, 2006
[27] C.M. Bishop, "EM for Gaussian Mixtures", Pattern Recognition and Machine Learning, Springer, New York, 2006
[28] C.M. Bishop, "Model Comparison and BIC", Pattern Recognition and Machine Learning, Springer, New York, 2006
[29] S. Amri, W. Barhoumi and E. Zagrouba, "Unsupervised background reconstruction based on iterative median blending and spatial segmentation", Proceedings of IEEE International Conference on Imaging Systems and Techniques, IST 2010, Thessaloniki, Greece, pp. 411-416
[30] S. Al-Dahidi, O. Ayadi, M. Alrbai and J. Adeeb, "Ensemble approach of optimized artificial neural networks for solar photovoltaic power prediction", IEEE Access, 7 (2019), pp. 81741-81758
[31] T. Kim and S. Cho, "Predicting residential energy consumption using CNN-LSTM neural networks", Energy, 182 (2019), pp. 72-81
[32] A. Torabi, S.A.K. Mousavy, V. Dashti, M. Saeedi and N. Yousefi, "A new prediction model based on cascade NN for wind power prediction", Computational Economics, 182(3) (2019), pp. 1219-1243
[33] F. Chollet, "Keras", 2015, retrieved from https://github.com/fchollet/keras
[34] T. Steens, J. Telle, B. Hanke, K. Maydell, C. Agert, G. di Modica, B. Engelb and M. Grottke, "A Forecast Based Load Management Approach For Commercial Buildings Comparing LSTM And Standardized Load Profile Techniques", arXiv preprint arXiv:2007.06832, 2020
[35] Autonomio Talos [Computer software], 2019, retrieved from http://github.com/autonomio/talos
[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 15 (2014), pp. 1929-1958
[37] G. Balzer and C. Neumann, "Switching operations [in German]", Switching and balancing operations in electrical networks [in German], Springer, Berlin, Germany, 2018
[38] S.B. Leeb, S.R. Shaw and J.L. Kirtley, "Transient event detection in spectral envelope estimates for nonintrusive load monitoring", IEEE Transactions on Power Delivery, 10(3) (1995), pp. 1200-1210
[39] S.B. Leeb, S.R. Shaw and J.L. Kirtley, "Load identification in nonintrusive load monitoring using steady-state and turn-on transient energy algorithms", The 2010 14th International Conference on Computer Supported Cooperative Work in Design, Shanghai, China, pp. 27-32
[40] M. Pal, R. Roy, J. Basu and M.S. Bepari, "Blind source separation: A review and analysis", Gurgaon, 2013, pp. 1-5
[41] S. Pincus, "Approximate entropy as a measure of system complexity", Proceedings of the National Academy of Sciences, 88(6) (1991), pp. 2297-2301
[42] A. Delgado-Bonal, "Quantifying the randomness of the stock markets", Scientific Reports, 9(1) (2019), pp. 1-11