A Data Imputation Model based on an Ensemble Scheme
Panagiotis Fountas
Department of Informatics and Telecommunications, University of Thessaly
Lamia, Greece. Email: [email protected]
Kostas Kolomvatsos
Department of Informatics and Telecommunications, University of Thessaly
Lamia, Greece. Email: [email protected]
Abstract—Edge Computing (EC) offers an infrastructure that acts as the mediator between the Cloud and the Internet of Things (IoT). The goal is to reduce the latency that we experience when relying on the Cloud. IoT devices interact with their environment to collect data, relaying them towards the Cloud through the EC. Various services can be provided at the EC for the immediate management of the collected data. One significant task is the management of missing values. In this paper, we propose an ensemble-based approach for data imputation that takes into consideration the spatio-temporal aspect of the collected data and the reporting devices. We propose to rely on the group of IoT devices that resemble the device reporting missing data and enhance its data imputation process. We continuously reason on the correlation of the reported streams and efficiently combine the available data. Our aim is to 'aggregate' the local view on the appropriate replacement with the 'opinion' of the group. We adopt widely known similarity techniques and a statistical modelling methodology to deliver the final outcome. We provide the description of our model and evaluate it through a high number of simulations adopting various experimental scenarios.

Keywords: Internet of Things, Edge Computing, Data Imputation
I. INTRODUCTION
The Internet of Things (IoT) consists of a huge infrastructure where numerous devices can interact with users and their environment to collect and process data. IoT devices create streams of data towards the Cloud infrastructure where advanced processing takes place. Between the Cloud and the IoT, one can meet the Edge Computing (EC) where various nodes with different computational capabilities are present. These nodes can perform the first stage of processing for the data that IoT devices collect. For instance, simple processing activities like data filtering, novelty detection, etc., or simple analytics tasks can be realized at the EC. It becomes obvious that the discussed processing can support the necessary services to eliminate the latency in responses to effectively deal with real-time applications. Computing and processing of data are now close to the IoT devices, giving the opportunity to perform analytics over the distributed streams. We have to notice that the discussed data streams are geo-located towards the EC nodes where distributed datasets are also formulated to become the subject of the aforementioned processing activities.

When focusing on data processing, one major research problem is the management of missing values. Any fault in the collected data may jeopardize the quality of outcomes of any processing activity [17]. Researchers have already focused on this problem and proposed a set of models to support the efficient provision of replacements [13]. Some of the proposed techniques involve data exclusion, missing indicator analysis, mean substitution, single imputation, multiple imputation techniques, replacement at random, etc. All of them conclude to specific formulations to deliver the replacements for every missing value. They try to respond to the critical research question of how we can manage/replace a missing value when observed in a data stream.
The majority of the proposed schemes deals with the statistical information that a stream 'conveys', adopting it to estimate the most probable value to replace the missing one. Obviously, we have to detect the hidden distribution lying behind the collected data, then, to easily replace the missing value.

The current paper consists of the continuation of our previous work in the field and advances the state of the art by proposing an ensemble-based approach to perform the envisioned data imputation scheme. Initially, we propose to use a statistical learning method to estimate missing values based on the 'experience' of the device reporting them. We rely on a linear regression model [12] due to its simplicity and the ability to be adopted to support real-time applications. Our aim is to estimate missing values based on past observations. Additionally, we rely on a 'group'-oriented approach, i.e., we get similar IoT devices to contribute to the estimation of missing values. Through the term 'similar', we denote devices that exhibit the same spatio-temporal characteristics with the device reporting every missing value. We consider devices that are in a close distance and report similar data for the phenomenon under consideration. Hence, we are able to support a group- and data-aware model that delivers the final replacements. We also perform an 'aggregation' of the 'local' view, i.e., the view of the device reporting the missing value on the replacements, with the view of the group, and propose a dynamic adaptation mechanism for the weight adopted to deliver the final outcome. Our model can be easily incorporated in a monitoring mechanism placed at the EC being directly connected with IoT devices to perform the envisioned data imputation on the fly. The aforementioned similarity between devices and their reports is realized by the well-known Cosine similarity [15] and the Mahalanobis distance [15] to expose the correlation between data streams.
We have to notice that this is a continuous process, thus, we rely on a sliding window approach to focus only on the most recent reports. The Mahalanobis distance is adopted to 'calibrate' the results retrieved by the Cosine similarity applied over the most recent reports of IoT devices. Eventually, we are able to detect the similarity between devices' reports relying on their current and past behaviour. Both the Mahalanobis distance and the Cosine similarity are widely used in many types of problems. They are characterized by simplicity and the ability of providing fast results, which is critical for (near) real-time applications. The differences of the current approach with our previous efforts [11] are as follows:
• we extend our previous model and adopt a statistical modelling process, i.e., a regression analysis, to estimate missing values based on the past observations of devices;
• we combine the 'local' view of each device with the view of the group of devices based on a dynamic adaptation scheme for the weights adopted to conclude the aggregation of the proposed replacements;
• we adopt the geometric mean for aggregating the 'opinion' of devices participating in the group of similar peers.
The remaining paper is organized as follows. Section II reports on the related work while Section III presents the problem under consideration and gives insights into our model. Section IV discusses the proposed solution and provides the relevant formulations. Section V presents our experimental evaluation efforts and gives numerical results for outlining the pros and cons of our model. Finally, in Section VI, we conclude our paper and describe our future research plans in the domain.

II. RELATED WORK
Offering novel applications in IoT is gaining significant attention in recent years. The reason is that these applications are provided close to end users, thus, we can enjoy real-time responses. Any application is performed over data, i.e., a data processing activity that is adopted to create knowledge in various forms [9]. Due to the presence of numerous devices, the volumes of the collected data become huge [4]. However, IoT devices usually exhibit limited computational capabilities that are not appropriate to perform complex processing activities. On the other side, the Cloud offers a vast processing infrastructure but, when we rely on it, we face increased latency in the provision of responses. The research community proposed the adoption of EC and Fog Computing to limit the latency by incorporating processing nodes with enhanced computational capabilities (compared to IoT devices) close to IoT and users. A comparison between EC, Fog Computing (FC), cloudlets and mobile EC is presented in [8]. The authors provide a comparative analysis of these implementations together with the necessary parameters that affect nodes' communication (e.g., physical proximity, access mediums, context awareness, power consumption, computation time).

Data storage technologies become the means for setting up the basis to perform the desired processing tasks [31]. Any storage mechanism should be optimized to decide the necessary models that maximize the access rate to data and the performance in general. The stored data can be structured or unstructured [21]. This means that, when processing should take place, we have to combine multiple data sources (e.g., databases).
The authors of [14] propose a system to facilitate and support a set of services at the edge of the network. They envision a set of clusters and adopt a controller to add devices to these clusters, thus, the system can perform a resource allocation and assign the desired processing tasks. The discussed storage models should exhibit the necessary security levels to avoid any malfunctions or unauthorized access. The blockchain technology can assist towards this direction [34]. In [10], the authors present a scheme for security management in an IoT data storage system incorporating a data pre-processing task realized at the EC. Another distributed data storage mechanism is provided by [39]. The authors propose a multiple factor replacement algorithm to manage the limited storage resources and data loss.

A critical part in the pre-processing of the collected data is the necessary processing for data imputation. The aim is to efficiently handle possible missing values. The specific research subject is widely studied by the research community and a large number of techniques have been proposed. These techniques are adopted in various application domains, exhibiting their significance for building novel applications. In some of the first attempts, data imputation has been adopted to manage sensory data [19]. Researchers propose the use of statistical metrics for concluding the replacements of missing values. The simplest approach is to replace a missing value with the mean of the incoming samples. However, the adoption of the mean cannot take into consideration the variance of data or their correlation [25], being also affected by extreme values. To take into consideration data correlations, more advanced statistical learning schemes can also be utilized. This way, we can incorporate the statistical dependencies of data into our decision making mechanism [23], [40]. Example models can be found in [7], i.e., the Auto-Regressive Integrated Moving Average and the feed-forward prediction based method.
The aforementioned statistical models are applied over historical values, thus, we have to take into consideration the processing time for concluding the final outcome. Apart from the time requirements, our decision making should also take into account the prediction error that may be present. Moreover, to focus on recent data, we can rely on a sliding window approach aiming to reduce the processing time and be aligned with 'fresh' information. Probabilistic approaches focus on the extraction of the distribution of data adopted to 'generate' the final replacement [40]. Other efforts deal with the joint distribution on the entire data set assuming a parametric density function (e.g., a multivariate normal) on the given data with estimated parameters [18]. The technique of least squares provides individual univariate regressions to impute features with missing values on all of the other dimensions based on the weighted average of the individual predictions [2], [29]. Extensions of the least squares method consist of the Predictive-Mean Matching method (PMM), where replacements are random samples drawn from a set of observed values close to the regression predictions [3], and Support Vector Regression (SVR) [38]. Other imputation models involve random forests [36], K-Nearest Neighbors (K-NN) [37], sequential K-NN [22], singular value decomposition and linear combination of a set of eigenvectors [37], [26] and Bayesian Principal Component Analysis (BPCA) [27], [28]. Probabilistic Principal Component Analysis (PPCA) and Mixed Probabilistic Principal Component Analysis (MPPCA) can also be adopted to impute data [40]. All the aforementioned techniques try to deal with data that are not linearly correlated, providing a more 'generic' model. Formal optimization can also be adopted to impute missing data with mixed continuous and categorical variables [1].
The optimization model incorporates various predictive models and can be adapted for multiple imputations.

It becomes obvious that the definition of replacements involves an uncertainty about the final adopted value. This uncertainty can be managed by technologies like Fuzzy Logic (FL). The aim is to avoid adopting crisp thresholds when deciding the final replacement. Machine Learning (ML) and Computational Intelligence (CI) can also assist in the analysis of data and the replacement of missing values, opening the room for defining intelligent schemes. Some example efforts are as follows. A hybrid method that adopts the Fuzzy C-Means (FCM) algorithm combined with a Particle Swarm Optimization (PSO) model and a Support Vector Machine (SVM) is presented in [35]. Other models involve Multi-Layer Perceptrons (MLPs) [30], Self-Organizing Maps (SOMs) [6], and Adaptive Resonance Theory (ART) [5]. The use of Neural Networks (NNs) can capture many kinds of relationships among data and they allow quick and easy modelling of the environment [24].

III. PRELIMINARIES & PROBLEM DESCRIPTION
Our scenario involves a set of IoT devices that interact with end users and their environment collecting data or performing simple processing activities. The description of the scenario 'borrows' the notation of [11] as both papers deal with the same problem. Any processing at the devices targets to produce knowledge locally, close to end users. These devices can be directly connected through the network with edge nodes to transfer the collected data in an upwards mode. The final goal is to transfer data to the Cloud infrastructure. Edge nodes act as 'sinks' that store the information provided by the IoT devices (see Figure 1 [11]). We consider that devices report multivariate vectors in the following form: $\mathbf{x} = \langle x_1, x_2, \ldots, x_M \rangle$, where M represents the number of dimensions. Edge nodes can convey the necessary components to fuse the incoming vectors and store them in the appropriate format to become the subject of further processing. Each vector is annotated by the index of the jth device reporting it to the corresponding edge node and the reporting time instance t, i.e., $\mathbf{x}^j[t] = \langle x_1^j[t], x_2^j[t], \ldots, x_M^j[t] \rangle$. Without loss of generality, we consider that N devices are connected with an edge node. After the reception of data, edge nodes should perform the desired pre-processing to prepare the data to become the subject of various tasks. In this paper, our focus is on the processing of missing values and the application of a data imputation model. We propose the use of multiple technologies to conclude the replacement for each missing value. We have to notice that missing values can refer to entire vectors or to the inputs for specific dimensions.

We also focus on the most recent measurements reported by IoT devices, i.e., we consider the W latest reports (see Table I). Actually, in Table I, we describe the two-dimensional structure for the jth individual IoT device.
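The per-device structure of the W latest M-dimensional reports can be sketched as a fixed-length buffer. The following is a minimal illustration in Python; the class and all names are ours, not part of the paper's implementation:

```python
from collections import deque

class DeviceWindow:
    """Keeps the W latest M-dimensional reports of one IoT device.

    A report may contain None entries to denote missing dimensions.
    """
    def __init__(self, W, M):
        self.W, self.M = W, M
        self.reports = deque(maxlen=W)  # the oldest report is dropped automatically

    def add(self, vector):
        assert len(vector) == self.M
        self.reports.append(list(vector))

    def column(self, dim):
        """History of a single dimension inside the window (Table I column)."""
        return [r[dim] for r in self.reports]

# Example: W = 3 latest two-dimensional (e.g., humidity/temperature) reports
win = DeviceWindow(W=3, M=2)
for v in [(0.5, 21.0), (0.6, 21.4), (0.4, 20.9), (0.7, 21.1)]:
    win.add(v)
print(win.column(0))  # the first report has fallen out of the window
```

The `deque(maxlen=W)` choice directly realizes the sliding window: appending the (W+1)th report silently evicts the oldest one.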
It is a strategic decision to incorporate in our model only fresh information. Any data received out of the window W are considered as obsolete. This is a typical scheme of a sliding window approach. W deals with the interval where data can be adopted to assist in the delivery of the replacements of missing values.

Fig. 1: The architecture considered in our scenario.

TABLE I: An example of the collected IoT devices' reports.

        1st dim.      2nd dim.      ...   Mth dim.
t=1     x_1^j[1]      x_2^j[1]      ...   x_M^j[1]
t=2     x_1^j[2]      x_2^j[2]      ...   x_M^j[2]
...     ...           ...           ...   ...
t=W     x_1^j[W]      x_2^j[W]      ...   x_M^j[W]

The proposed model is 'fired' when a missing value is detected. Then, we provide the means for calculating the replacement based on the local 'experience', i.e., the previous reports of the same device where a missing value is detected, and the experience of the group, i.e., the replacement 'proposed' by the group of devices exhibiting similar data in the spatio-temporal axis. Edge nodes rely on an ensemble scheme that identifies the correlation between data reported by devices combining current and historical information defined in W. Our model consists of the following parts: (i) a module responsible to apply a linear regression model and estimate the replacement based on the local knowledge/data (past data reported by the device where a missing value is detected); (ii) a module that realizes the correlations of the collected vectors from all the available devices; (iii) an ensemble scheme to detect the final correlation between IoT devices as 'exposed' by their reports; (iv) a module that produces the final replacement taking into consideration the local view (the outcome of the regression process) as well as the view of peer devices (devices with increased similarity with the device reporting the missing value) adopting a weighted scheme; (v) a module that dynamically adapts the weights for the local and the group view based on the deviation of the reports of the device where the missing value is present.
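The five modules can be wired as one imputation pipeline. The skeleton below is only an illustration of the data flow, with naive stand-ins for the later modules (all names are ours); Sections IV-C and IV-D replace the stand-ins with the regression estimate and the weighted geometric mean of the correlated peers:

```python
def impute(local_history, peer_reports, w_local):
    """Illustrative wiring of the modules: a local estimate is blended
    with a group estimate using the weight w_local in [0, 1]."""
    # (i) local view: last observed value as a naive stand-in for the
    # regression module of Section IV-C
    local_estimate = local_history[-1]
    # (ii)-(iv) group view: plain mean of the peers' reports as a
    # placeholder for the weighted geometric mean of Section IV-D
    group_estimate = sum(peer_reports) / len(peer_reports)
    # (v) weighted combination of the two views
    return w_local * local_estimate + (1 - w_local) * group_estimate

print(impute([20.0, 21.0], [22.0, 24.0], w_local=0.5))
```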
The deviation is calculated in the window W. For the replacement 'proposed' by the peer devices, we intend to rely only on data reported by correlated devices, assuming that their spatio-temporal contextual information will positively affect the final calculations. For instance, if IoT devices monitor the same phenomenon in the same area, they should report similar values. Hence, any missing information in individual reports can be easily managed adopting the 'view' of the remaining devices in the group.

IV. DATA IMPUTATION BASED ON DATA STREAMS CORRELATION
A. The envisioned setup
Our Prediction Based Model (PBM) adopts an ensemble scheme to extract the correlation between the IoT devices upon the following popular metrics: (i) the Cosine Similarity (CS) and (ii) the Mahalanobis Distance (MD). The use of the aforementioned metrics does not differ in the PBM compared to our previous effort, i.e., the Distance Based Model (DBM) [11]. The CS is adopted to calculate the similarity between the last reported data vectors of two different devices. Additionally, the MD targets to calculate the multi-dimensional distance between the reports of different IoT devices. In other words, we can say that the CS is applied at the 'vector' level while the MD is applied at the 'device' level. We have to notice that the latter metric considers the entire 'history' of reports as depicted by the latest W vectors. The outcomes of the CS and MD metrics are smoothly combined to deliver the final correlation between IoT devices' reports, thus, to proceed with the calculation of the replacement values. For achieving this, we consider the top-k similarity outcomes based on the devices' correlation.

B. Similarity between data streams
The CS consists of a simple, however, powerful metric for the delivery of the similarity between two non-empty vectors. The metric is defined as the cosine of the angle between them. The CS is particularly used in the positive space where the cosine of the angle is bounded in [0,1] and this value is inversely proportional to the angle between the vectors. So, the higher the angle between them, the lower the similarity becomes. The following function holds true:

$CS(\mathbf{x}^i[t], \mathbf{x}^j[t]) = \frac{\mathbf{x}^i[t] \cdot \mathbf{x}^j[t]}{\|\mathbf{x}^i[t]\| \cdot \|\mathbf{x}^j[t]\|} = \frac{\sum_{l=1}^{M} x_l^i[t]\, x_l^j[t]}{\sqrt{\sum_{l=1}^{M} (x_l^i[t])^2} \cdot \sqrt{\sum_{l=1}^{M} (x_l^j[t])^2}}$  (1)

In Eq. (1), i and j are the indexes of the vectors/IoT devices fed into the CS function. The CS is applied over pairs of IoT devices for a specific time instance t, i.e., the latest/current report. The CS gives us a view on the devices reporting similar values for the phenomenon under consideration. When a missing value is detected, we perform a set of calculations, i.e., we feed the CS function with the latest reports (i.e., vectors) of the available devices. For simplifying the calculations and avoiding any undesired errors, we do not take into consideration the dimension(s) where the missing value is present. The CS is applied for the remaining dimensions of the discussed vectors.

The MD measures the correlation between multivariate vectors as the distance between an observation point (which can be a set of observations or a single value) and a distribution. Another case of the MD application is to detect the correlation between two multivariate vectors delivered by the same dataset. Let $\vec{x}$ and $\vec{y}$ be two multivariate vectors reported by two IoT devices. The following equation holds true:

$MD(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^T S^{-1} (\vec{x} - \vec{y})}$  (2)

where S is the covariance matrix. In our case, the MD is applied on pairs of devices over the W latest vectors that each IoT device reports to the edge node.
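Both metrics admit short implementations. The sketch below assumes NumPy; the covariance matrix S is estimated here from a pooled window of reports, which is our own assumption, since the text does not fix how S is obtained:

```python
import numpy as np

def cosine_similarity(x, y):
    """Eq. (1): cosine of the angle between two report vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def mahalanobis_distance(x, y, S):
    """Eq. (2): multi-dimensional distance under covariance matrix S."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

# Two devices reporting humidity/temperature pairs
x, y = [0.52, 21.3], [0.50, 21.1]
print(round(cosine_similarity(x, y), 4))

# S estimated from the W latest reports (here pooled into one small window)
window = np.array([[0.52, 21.3], [0.50, 21.1], [0.55, 21.8], [0.48, 20.9]])
S = np.cov(window, rowvar=False)
print(mahalanobis_distance(x, y, S))
```

Note that, unlike the Euclidean distance, the MD rescales each dimension by the observed variability, so a 0.02 difference in humidity and a 0.2 degree difference in temperature are weighted comparably.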
Using this approach, we pay attention not only to the latest reports but also to the historical values of the devices. The results that arise from the application of the MD function are combined with the CS results to calculate the final correlation result.

C. Local estimation of imputed values
For the presentation of the approach, let us focus on a specific dimension considering, without loss of generality, that the incoming values are depicted by the variables $x[1], x[2], \ldots, x[W]$. Let us consider that at a time instance W+1, we observe a missing value, thus, we have to estimate $x[W+1]$. We consider $x[W+1]$ as the dependent variable and try to detect the linear relationship between $x[1], x[2], \ldots, x[W]$ and $x[W+1]$. The 'typical' approach is to adopt the following equation: $x[W+1] = f(X, B) + \epsilon = b_0 + b_1 x[1] + b_2 x[2] + \ldots + b_W x[W] + \epsilon$, where $b_0$ is the intercept, $b_i$ are the weights for the independent variables and $\epsilon$ is the error. Our aim is to find the appropriate weights $\{b_i, i = 0, 1, 2, \ldots, W\}$ that minimize the sum of squared errors, i.e., $\sum (x[W+1] - f(X, B))^2$. The least squares method is widely adopted because the estimated function $f(X, \hat{B})$ approximates the conditional expectation $E(x[W+1] \mid X)$. At this point, we omit the performed calculations that are widely studied in the research community. The interested reader can refer to [12] for further details. In any case, adopting the discussed calculations, we can easily retrieve $x[W+1]$, combining it with the replacement proposed by the group of similar devices.

D. Our ensemble approach
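Before turning to the ensemble, note that fitting the weights $b_i$ of Section IV-C needs more history than a single window: each past run of W consecutive values can serve as one training row whose target is the value that followed it. The sketch below reads the regression this way, using NumPy's least squares; it is an AR(W)-style illustration under that assumption, not necessarily the paper's exact training procedure:

```python
import numpy as np

def local_estimate(history, W):
    """Least-squares estimate of the next value from the W previous ones.

    Every run of W consecutive past values becomes one training row and
    the value that followed it becomes the target.
    """
    history = np.asarray(history, float)
    rows = [history[s:s + W] for s in range(len(history) - W)]
    X = np.hstack([np.ones((len(rows), 1)), np.array(rows)])  # column of 1s -> intercept b0
    y = history[W:]
    B, *_ = np.linalg.lstsq(X, y, rcond=None)                 # b0, b1, ..., bW
    latest = np.concatenate(([1.0], history[-W:]))            # current window
    return float(latest @ B)

# On a noiseless linear trend the estimate simply continues the trend
series = [20.0, 20.5, 21.0, 21.5, 22.0, 22.5, 23.0, 23.5]
print(round(local_estimate(series, W=3), 2))
```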
In order to combine the correlation metrics as exposed by the CS and the MD, we adopt a simple data-aware correlation model that uses the MD to 'calibrate' the result of the CS. Hence, we create a correlation scheme based on the correlation exposed by the latest reports being affected by their historical 'course'. More specifically, every result of the CS is weighted (i.e., multiplied) by w defined as $w = \frac{1}{MD}$. Our goal, by adopting this weight, is to assess a reward on IoT devices with a high historical correlation. Hence, when the MD outcome increases, we observe a decrement of the final w realization, i.e., we eliminate the weight, thus, the final correlation between two different devices. It is a strategic decision adopted in our model to pay less attention to IoT devices with a low historical correlation (i.e., a high MD) regardless of the correlation detected by the CS. On the other hand, a low MD value indicates a strong historical correlation, thus, we 'reward' the similarity of the latest vectors exposed by the CS. In some manner, the MD metric confirms or rejects the correlation depicted by the latest reports. The final correlation outcome FC is derived by the following equation:

$FC = w \cdot CS(\mathbf{x}^i[t], \mathbf{x}^j[t]), \;\; \forall i, j, \; i \neq j$  (3)

Afterwards, we rely on the top-k FC outcomes. The IoT devices that correspond to the top-k values formulate the group of the most similar reports. The discussed group becomes the basis for the calculation of the final replacement for each observed missing value.
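Putting the ensemble together, i.e., the calibrated correlation above, the top-k grouping, and the group/local aggregation developed in the remainder of this section, a compact sketch follows. The CS and MD values are assumed precomputed, the weighted geometric mean uses the MD values as exponents exactly as Eq. (4) writes them, and all names are ours:

```python
import math

def final_replacement(cs, md, reports, local_pred, sigma, k=2, alpha=1.0, beta=2.0):
    """Sketch of the PBM ensemble for one missing dimension.

    cs, md     : per-peer Cosine similarity and Mahalanobis distance
    reports    : each peer's latest value for the missing dimension
    local_pred : the regression estimate x[W+1]
    sigma      : deviation of the W latest reports of the local device
    """
    # Eq. (3): calibrate CS by w = 1/MD and keep the top-k peers
    fc = [(c / d, i) for i, (c, d) in enumerate(zip(cs, md))]
    idx = [i for _, i in sorted(fc, reverse=True)[:k]]
    # Eq. (4): weighted geometric mean of the top-k reports
    total = sum(md[i] for i in idx)
    wgm = math.prod(reports[i] ** (md[i] / total) for i in idx)
    # Eq. (5): sigmoid weight for the local view
    w_local = 1.0 / (1.0 + math.exp(alpha * sigma - beta))
    # Eq. (6): blend the local and the group views
    return w_local * local_pred + (1.0 - w_local) * wgm

est = final_replacement(cs=[0.99, 0.98, 0.90], md=[0.5, 1.0, 4.0],
                        reports=[21.0, 21.4, 25.0], local_pred=21.2, sigma=0.4)
print(round(est, 2))
```

With a small deviation sigma the sigmoid keeps w_local close to one, so the local regression dominates; a large sigma pushes the outcome towards the group's weighted geometric mean.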
The replacements are calculated taking into consideration the group 'view' and the 'local' view. For aggregating the view of each device participating in the group, we adopt the Weighted Geometric Mean (WGM) [20]. The WGM is calculated as follows:

$WGM = \left( \prod_{i=1}^{k} x_i^{MD_i} \right)^{\frac{1}{\sum_{i=1}^{k} MD_i}}$  (4)

In Eq. (4), $x_i$ represents the report of every top-k correlated device for the specific dimension where a missing value is detected and $MD_i$ is the MD between the IoT device in which we detect the missing value and the ith top-k correlated device.

We perform a set of calculations for delivering FC, the WGM and the estimated value delivered by the linear regression model, i.e., $x[W+1]$. For aggregating the 'group' with the 'local' view, we adopt a dynamic weighted scheme upon the WGM and $x[W+1]$. Actually, we propose a heuristic for calculating the weight of $x[W+1]$ as follows:

$w_{local} = \frac{1}{1 + e^{\alpha \sigma - \beta}}$  (5)

where σ depicts the deviation of the W latest reports of the device where a missing value is detected. For the realization of $w_{local}$, we rely on the aforementioned sigmoid function, i.e., we eliminate the weight of the 'local' view when σ is over a predefined threshold. The rationale is that when σ is high, there are disturbances in the distribution of data, thus, we cannot support an efficient decision making based only on the 'local' view. Then, $w_{local}$ is retrieved to be close to zero. Our PBM algorithm adopts $w_{local}$ and calculates the final replacement value as follows:

$PD = w_{local} \cdot x[W+1] + (1 - w_{local}) \cdot WGM$  (6)

It becomes obvious that the final PD is the result of an ensemble scheme that combines multiple metrics and builds upon the view of the team of similar devices (as exposed by their data) and the view (exposed by the historical reports) of the device where a missing value is present. The interesting point is that we rely on a dynamic scheme for selecting the top-k similar devices and take into consideration the distribution of the data present in the dimension where the missing value is observed.

V. EXPERIMENTATION SETUP & PERFORMANCE ASSESSMENT
A. Experimental Setup and Performance Metrics
We report on the performance of the proposed model related to its ability to correctly replace the detected missing values. Our experimental evaluation relies on two real traces, i.e., (i) the GNFUV Unmanned Surface Vehicles Sensor Data Set [16] and (ii) the Intel Berkeley Research Lab dataset¹. The former dataset (i.e., the GNFUV) consists of values of mobile sensor readings (humidity, temperature) from four Unmanned Surface Vehicles (USVs) moving in the sea according to a GPS predefined trajectory. The Intel dataset contains millions of measurements (temperature, humidity, light) retrieved by 54 sensors deployed in a lab. From this dataset, we get 15,000 measurements such that 15 sensors produced 1,000 measurements each. The aforementioned traces are adopted to simulate the streams reported by a set of IoT devices.

The evaluation of the PBM is performed in two axes, i.e., its ability to eliminate the prediction error and the capability of reducing the time required to conclude a replacement. The prediction error is estimated by the difference of the final outcome and the actual value provided by any of the above datasets. For the second axis, we rely on a wide set of experiments and get the mean time to depict the performance of the model. Our experimentation involves the 'creation' of missing values in the available datasets by randomly annotating V% of the reports (i.e., values in our datasets), from the total, as missing inputs. The following table presents the realization of each parameter as adopted in our experimental evaluation.

TABLE II: Realization of our Parameters

Short Description                                   Values
Percentage of missing values                        V = {1, 5, 10}
Size of sliding window                              W = 10
Number of correlated devices adopted in our model   k = 4
Total number of IoT devices                         N ∈ { , , }
Number of dimensions                                M = {4, 9}
Parameters adopted by our smoothing function        α = , β = 2

¹ Intel Lab Data, http://db.csail.mit.edu/labdata/labdata.html

The performance of our model is delivered through widely adopted metrics like the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). The following equations hold true:

$MAE = \frac{\sum_{l=1}^{n} |PD_l - a_l|}{n}$  (7)

$RMSE = \sqrt{\frac{\sum_{l=1}^{n} (PD_l - a_l)^2}{n}}$  (8)

In Eq. (7) and Eq. (8), n is the number of missing values, $PD_l$ is the value predicted by our model and $a_l$ is the actual value as registered in the adopted datasets. We also compare our scheme with two other models, i.e., a scheme proposed in one of our previous research efforts, i.e., the DBM [11], and a baseline model, i.e., a model that replaces missing values with the mean of the incoming reports for the specific dimension where a missing value is observed. The mean is calculated upon the devices exhibiting a high correlation with the device reporting a missing input. We name this model the Averaging Model (AM). Through the above evaluation process, we try to find out if the proposed model is capable to support real-time applications while providing efficient results in the estimation of missing values.
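The two metrics can be computed directly from Eq. (7) and Eq. (8); a minimal sketch over lists of predicted and actual values:

```python
import math

def mae(pred, actual):
    """Eq. (7): mean absolute error over the n imputed values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def rmse(pred, actual):
    """Eq. (8): root mean square error over the n imputed values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

pred, actual = [21.2, 20.8, 22.1], [21.0, 21.0, 22.0]
print(round(mae(pred, actual), 3), round(rmse(pred, actual), 3))
```

The RMSE penalizes large individual deviations more heavily than the MAE, which is why both are reported side by side in the figures.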
B. Performance Assessment
In Figures 2 & 3, we present our results for M = 4 and M = 9, respectively, when V = 1%. As we can observe, the performance of the PBM is affected by the number of devices for both experimental scenarios, resulting in an increment in the error realizations. The DBM and the AM are also negatively affected by the increment in the number of dimensions and the number of devices. The comparative assessment reveals that the PBM performs better (except one case, i.e., M = 9, N = 7) than the remaining models.

Fig. 2: MAE and RMSE for V = 1% and M = 4.
Fig. 3: MAE and RMSE for V = 1% and M = 9.

In Figures 4 & 5, we present our results for V = 5% and M ∈ {4, 9}. Now, we increase the number of missing values present in our datasets, i.e., the devices' reports. In this set of experiments, the PBM clearly outperforms the remaining models no matter the number of dimensions and devices. The PBM manages to achieve a very low error when called to provide the replacements. An interesting observation is that the PBM is negatively affected by M & N while the DBM and the AM are positively affected by the same parameters. In any case, the difference in the performance is high when we focus on a low number of dimensions and devices.

Fig. 4: MAE and RMSE for V = 5% and M = 4.
Fig. 5: MAE and RMSE for V = 5% and M = 9.

In Figures 6 & 7, we increase the number of missing values assuming V = 10% and provide our experimental outcomes for M ∈ {4, 9}. We observe a similar performance as in the previous experimental scenario. The current results confirm the ability of the PBM to outperform in the case of a high number of missing values. The error achieved by the PBM faces a slight increment when M = 4 and is decreasing when M = 9 (always considering that N increases from 5 to 15 devices). The remaining models exhibit worse performance than the PBM in all the experimental scenarios.

Fig. 6: MAE and RMSE for V = 10% and M = 4.
Fig. 7: MAE and RMSE for V = 10% and M = 9.

In Figures 8 & 9, we present our results related to the mean time required to calculate the final replacement for each missing value. In this set of experiments, we get V = 5% and M ∈ {4, 9}, respectively. As we can observe, the number of devices negatively affects the mean time per missing value both for the PBM and the DBM. This is natural, as we have to process more streams with a clear impact on the conclusion time. When we consider the PBM performance, we observe that it has better results for the error metrics than the remaining models, however, it requires more time to conclude the replacements. In the worst case, the PBM requires 500 ms to conclude a replacement (compared to 300 ms of the DBM), i.e., it exhibits a throughput of two replacements per second. In any case, these outcomes reveal that the PBM can be adopted to support real-time applications with a significant improvement in the performance (as our results for the delivered error exhibit) compared to the remaining models. It becomes obvious the trade-off between the improved performance and the increased calculation time. We can tolerate an increment in the adopted calculations, thus, the required time, upon the elimination of the error in the prediction of the appropriate replacements.

Fig. 8: Time requirements for V = 5% and M = 4.
Fig. 9: Time requirements for V = 5% and M = 9.

We have to notice that the PBM requires a 'warm up' period, up to W, to collect the necessary data and perform the envisioned calculations. This step is necessary to feed our distance/similarity and regression schemes. Additionally, a potential limitation concerns the scenario where edge nodes collect multiple missing values from multiple IoT devices at the same time instance.
In this case, our similarity models cannot efficiently conclude the replacements, as the final similarity may be wrongly concluded due to the limited dimensions participating in the above described calculations. In any case, we consider such a scenario as rare to be met in real applications.

VI. CONCLUSION AND FUTURE WORK