A Probabilistic Approach for Data Management in Pervasive Computing Applications
Kostas Kolomvatsos
Department of Informatics and Telecommunications, University of Thessaly
Papasiopoulou 2-4, 35131, Lamia, Greece
e-mail: [email protected]
Abstract—Current advances in Pervasive Computing (PC) involve the adoption of the huge infrastructures of the Internet of Things (IoT) and Edge Computing (EC). Both IoT and EC can support innovative applications around end users to facilitate their activities. Such applications are built upon the collected data and the appropriate processing demanded in the form of requests. To limit latency, instead of relying on Cloud for data storage and processing, the research community provides a number of models for data management at the EC. Requests, usually defined in the form of tasks or queries, demand the processing of specific data. A model for pre-processing the data, preparing them and detecting their statistics before requests arrive, is necessary. In this paper, we propose a promising and easy-to-implement scheme for selecting the appropriate host of the incoming data based on a probabilistic approach. Our aim is to store similar data in the same distributed datasets to have, beforehand, knowledge of their statistics while keeping their solidity at high levels. As solidity, we consider the limited statistical deviation of data; thus, we can support the storage of highly correlated data in the same dataset. Additionally, we propose an aggregation mechanism for outlier detection applied just after the arrival of data. Outliers are transferred to Cloud for further processing. When data are accepted to be locally stored, we propose a model for selecting the appropriate datasets where they will be replicated for building a fault-tolerant system. We analytically describe our model and evaluate it through extensive simulations, presenting its pros and cons.
Index Terms—Pervasive Computing, Internet of Things, Edge Computing, Data Storage, Accuracy, Probabilistic Model
I. INTRODUCTION
The combination of the Internet of Things (IoT) and Edge Computing (EC) provides a promising infrastructure for supporting innovative Pervasive Computing (PC) applications. The aim of PC is to offer ambient intelligence to end users, having devices and services interact with them. IoT devices are usually carried by end users or are present in the environment in close proximity to them, facilitating the envisioned interactions and the collection of data. Then, data become the subject of processing activities to create knowledge and support novel applications. At the EC, we can detect numerous nodes that are, in an upwards mode, connected with Cloud for transferring data and asking for more advanced processing. A new trend is the collection and storage of data at the EC to eliminate the latency in the provision of responses to various requests (instead of always relying on Cloud). EC nodes have a direct connection with IoT devices and become the hosts of a high number of geo-distributed datasets. Obviously, as new data arrive, one can observe the 'evolution' of the local datasets as depicted by their statistics. This evolution is represented by updates in the statistical information, e.g., the mean and standard deviation may be altered as new information is retrieved by the IoT devices.
Users/applications perform requests upon the distributed datasets to create knowledge or ask for analytics. Requests can have the form of tasks (e.g., apply a machine learning model and report the outcome) or queries (e.g., report the list of data that meet a specific condition). When a request is set, the 'base case' model is to launch it across the network and search for the information that end users/applications are interested in [29]. However, this involves an increased messaging overhead paid without any reason for nodes/datasets that do not match the desired conditions set by the request.
The most efficient solution is to have a view, beforehand, on the statistics of the available data and decide upon the matching between the requests and the distributed datasets. Through this approach, we can eliminate the cost of allocating tasks/queries to datasets that do not match the defined conditions, as their execution would return an empty set. A number of research efforts propose techniques for the optimal allocation of tasks/queries to a number of processing nodes, e.g., [13], [15], [18], [19], [21], [22], [23]. Another challenge is to keep the consistency and accuracy of data at high levels, as they are the critical statistical information that depicts the quality of the collected data [8]. Accuracy refers to the closeness of estimates to the (unknown) exact or true values [32], i.e., it depicts the error between the observation and the real data. Envision a new data vector arriving from an IoT device at an EC node. The discussed vector may affect the accuracy of the local dataset as it may not match the dataset's current statistics (e.g., it may be an outlier for the specific dataset). We borrow the concept of 'solidity' to represent the closeness of data into a dataset [20] and adopt it to build a model for the efficient allocation of data to the appropriate datasets. A solid dataset exhibits a high accuracy, realized when the error/difference between the involved data is low, e.g., the standard deviation may be limited. We have to notice that accuracy is significant as efficient response plans, for each type of tasks/queries, may be defined. Apart from maintaining the solidity of datasets, we have to focus on data replication to support a fault-tolerant EC infrastructure. Such an approach will provide benefits when connectivity is limited and data can be processed to deliver the required responses. In the discussed ecosystem, there is no need for additional data migration that would burden the network.
It is preferable to migrate tasks/queries instead of circulating huge volumes of data. Replication in combination with the distributed data storage may assist in the elimination of the probability of data loss and cope with IoT nodes failure as well [1].
This paper targets to support the distributed nature of the EC ecosystem and the provision of an ensemble model for outliers detection combined with methodologies for the selection of the appropriate datasets where new data vectors should be stored. EC nodes adopt a monitoring scheme for examining the incoming data and a decision making model for performing the envisioned allocations. For outliers detection, we rely on a simple, yet fast, ensemble approach extending our previous efforts in the domain [25]. Instead of adopting an individual majority voting method upon multiple outlier indicators, we study a double majority scheme to conclude if the incoming data are outliers. Furthermore, when data are not considered as outliers, we study the time series of difference 'quanta', i.e., the time series of the difference between the incoming data and the synopses of the available datasets as reported by EC nodes. We consider that nodes, at pre-defined epochs, share the synopses of their datasets to become the basis for the proposed replication process. Evidently, these quanta show the statistical 'behaviour' of data synopses exhibiting the trends in every dataset. The new data are placed at the datasets where the similarity with the synopses is high, i.e., the difference quanta are limited. In any case, our replication model is based on historical quanta realizations instead of relying on the latest one. Compared with our previous efforts [20], [25], the proposed model exhibits the following differences: (i) we provide an aggregation mechanism for the management of multiple outlier indicators upon multiple datasets instead of using a limited number of indicator functions like in [20], [25].
The proposed scheme can be applied upon any number and any type of indicators; (ii) for the replication process, we rely on a probabilistic approach upon multiple historical quanta instead of using an uncertainty driven model [20]. The adoption of the uncertainty management scheme requires the manual definition of a rule base, which is not necessary in the current model. The following list reports on the contributions of our paper:
• we provide an ensemble model for the aggregation of multiple outlier indicators upon a double majority voting scheme;
• we support the data replication process with a probabilistic model based on multiple historical synopses reported by EC nodes. Our aim is to detect the similarity of the incoming data with the available datasets upon their past trends;
• we present the outcomes of an extensive set of experiments to reveal the characteristics of the proposed approach.
The paper is organized as follows. Section II reports on the related work in the domain. In Section III, we present the preliminary information while in Section IV, we provide the analytical description of our model. In Section V, we present our experimental evaluation and in Section VI, we conclude our paper by providing our future research directions.

II. RELATED WORK
Outliers detection and management is a widely studied research domain. The interested reader can refer to [41] for an extensive survey to get insights on the proposed methodologies. These methods target to identify objects deviating from a group of other objects. Deviating objects depict an abnormal behaviour compared to the natural evolution of the collected data. If we compare univariate with multivariate data, one can argue that the latter scenario is more prone to outliers, i.e., the effects of outliers on the statistics of data have a higher impact than in the former case [33]. Recall that in the vast majority of the application domains, data are reported in a multivariate 'form', i.e., tuples/vectors, making the detection of outliers a difficult task [27]. For detecting outliers, we have to rely on statistical metrics like the covariance matrix [14]. Mahalanobis distance is the most representative metric that builds upon the covariance matrix, adopted to detect the correlations between the multiple dimensions of data vectors [30]. Other techniques involve Cook's distance [9] and the leverage model [7]. The former metric estimates variations in regression coefficients after removing each observation, one by one. The latter model performs in a similar way to the Mahalanobis distance as it is based on the study of residuals and their distance from the mean vector. Additional techniques can be found in the relevant literature, like the $\chi^2$ metric for identifying deviations from the multidimensional normality. An extension of the Mahalanobis distance is proposed in [27], i.e., a variant based on the minimum covariance determinant, a more robust process that is easy to implement. In [3], the interested reader can find a comparison of outlier detection methods.
Data replication is usually adopted to achieve two goals, i.e., the minimization of the latency in the provision of responses and the support of fault tolerant systems.
In the first axis, we avoid data or requests migration while in the second axis, we are able to perform the desired processing even if a node is not available. The replication of data is a technique adopted in Wireless Sensor Networks (WSNs) where we need to upload IoT data from a set of sensor gateways on distributed Cloud storage [26]. In this scenario, we can consider multiple mini-Clouds as the hosts of data, taking replication requirements for each data item into account. A distributed algorithm for the replication placement is proposed in [40]. The authors propose the use of distributed storages and a method for improving the efficiency of objects replication. The main subject of [39] is the reliability assessment of clustered and declustered replica placements. In [4], the authors present a self-managed key-value store which dynamically allocates the resources of a data Cloud to several applications, thus, it maintains a differentiated availability guarantee for different application requirements. Other efforts formulate the problem around what, when and where data replication should take place [11]. Such a modelling can assist in applying optimization schemes for concluding the best possible action. Data replication may also assist in eliminating the need for migrating huge volumes of data as bulk data transfer protocols aim to do [36]. We can perform a selective replication under constraints, avoiding to circulate all the collected data in the network. Such a selective approach can be realized online [37], however, under the goal of choosing the appropriate hosts to have data close to the appropriate users. In networks where nodes exhibit limited energy resources, we have to balance the number of replicas against the energy consumption [5]. We can achieve the discussed goal adopting compression and reduction techniques before the replication takes place.
This step may be adopted in the pre-processing stage where data are pre-processed before sending them to the storage node to reduce the energy consumption without affecting the data quality requirements [6]. The disadvantage of the discussed approaches is that they require a complex process and increased computational resources for compression/decompression. Modelling the energy footprint of nodes may also be adopted to create replication plans that aim to reduce the energy consumption [38]. In [28], the authors incorporate a load balancing approach when performing the desired replication process. They assume that every node has a local memory adopted to store neighbour nodes' memory contents. ProFlex [31] is proposed to deal with the communication requirements for transferring data from 'weak' to strong nodes. TinyDSM [34] exhibits a reactive replication method that distributes replicas based on a random strategy according to the number and the density of replications. Finally, in [35], the authors propose a low-complexity distributed mechanism to increase the capacity of WSN-based distributed storage, optimizing communication and decreasing energy usage. Data are collected periodically by the sink node, being removed from nodes' memories, while, based on a greedy distribution storage scheme, each node reports its memory condition to other nodes.

III. PROBLEM DESCRIPTION & PRELIMINARIES
The problem under consideration involves a set of IoT devices reporting data to an EC node being a member of a group of $N$ nodes, i.e., $n_1, n_2, \ldots, n_N$. EC nodes receive the data and perform an initial pre-processing for deciding their storage or rejection. A rejection corresponds to the transfer of data to Cloud, while a decision related to storage fires another round of processing to deliver the appropriate node where the incoming data should be replicated. EC nodes formulate local datasets, i.e., $\{D_1, D_2, \ldots, D_N\}$, upon which they perform the requested processing activities. The discussed datasets are updated upon the arrival of new multivariate data vectors, i.e., $\mathbf{x} = [x_1, x_2, \ldots, x_M]$. Additionally, due to the proposed replication process, data vectors can be exchanged by peer nodes as the result of the proposed decision mechanism. $D_i$s exhibit specific statistical information and data synopses can be extracted upon them to be disseminated to peers for supporting the envisioned decision making. For instance, the synopsis for the $i$th dataset can refer to a vectorial space via an $M$-dimensional vector, i.e., $\mathbf{s}_i = [s_{i1}, s_{i2}, \ldots, s_{iM}]$, conveying statistical information for each of the adopted $M$ dimensions. Synopses may represent linear multivariate regression coefficients, clusters centroids (in a clustering scheme), the first Principal Components (PCs) (in linear data compression) or very simple statistical measures like the mean or standard deviation. We target to keep EC nodes informed about the data present at their peers, thus, support decision making upon fresh information for the collected data. Obviously, there is a trade-off between a frequent data synopses calculation at the burden of network performance (a high number of messages is required to transfer synopses to peers) compared to a less frequent synopses extraction at the burden of decision making upon 'obsolete' data synopses.
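As an illustration of the simplest synopsis mentioned above (per-dimension mean and standard deviation), the following Python sketch shows how a node could summarize its local dataset before disseminating the synopsis to peers. The function name and the toy dataset are our own illustrative choices, not part of the original model.

```python
import statistics

def extract_synopsis(dataset):
    """Compute a simple per-dimension synopsis for a local dataset D_i.

    `dataset` is a list of M-dimensional data vectors; the synopsis here is
    the M-dimensional mean vector plus the per-dimension standard deviation.
    Other choices (cluster centroids, principal components) fit the same
    interface.
    """
    dims = list(zip(*dataset))  # transpose: one tuple of values per dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

# Example: a node summarizes a tiny 2-dimensional dataset.
D_i = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
s_i, sigma_i = extract_synopsis(D_i)
```

A node would recompute and disseminate `s_i` at every pre-defined epoch, as described above.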
The study of the discussed trade-off is beyond the scope of this paper.
Initially, every node $n_i$ should detect if $\mathbf{x}$ is an outlier not only compared with the local dataset $D_i$ but also with the remaining repositories present at peer nodes. Our aim is to detect if $\mathbf{x}$ significantly deviates from the 'ecosystem' of datasets, thus, no dataset could become the host of $\mathbf{x}$. For this, we rely on an ensemble approach and study the involvement of multiple outlier detection methods. $\mathbf{x}$ is rejected when multiple indicators depict an outlier judgement in multiple datasets. The discussed synopses 'participate' in the delivery of the outlier indication as represented by the indicators outcome. We consider that $V$ outlier indicator functions are available, i.e.,
$$I_j(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is an outlier, with probability } \beta_j \\ 0 & \text{otherwise} \end{cases}$$
for all $j \in \{1, 2, \ldots, V\}$. Having calculated $V$ indicators for a dataset $D_i$, we propose a consensus model for deciding if $\mathbf{x}$ is an outlier or not. Our aim is to check if $\mathbf{x}$ is an outlier for a high number of datasets, thus, it significantly deviates from the data ecosystem currently present at the EC. The final decision is based on a double majority voting model, i.e., an ensemble scheme taking into consideration not only an individual dataset but the entire ecosystem. When $\mathbf{x}$ is not an outlier, $\mathbf{x}$ is stored at the node $n_i$, where it is initially reported, if it is highly correlated with $D_i$. Apart from the local dataset, it should also be replicated to peers exhibiting a significant correlation with $\mathbf{x}$. The correlation is detected upon the latest $W$ synopses reports and the similarity between them and $\mathbf{x}$. We define the concept of the similarity quantum, i.e., the magnitude of the similarity (as exposed by the numeric difference) between $\mathbf{x}$ and every synopsis reported by peers. Upon the latest $W$ similarity/difference quanta, we expose their distribution and deliver the probability of having quanta over a threshold.
This probability is adopted to rank the available datasets and decide where $\mathbf{x}$ will be replicated.

IV. THE PROPOSED MECHANISM
A. Probabilistic Outliers Detection
The preliminaries section defines the basis for our outliers detection model upon $V$ outliers detection schemes and $N$ datasets. We focus on the 'collection' of our indicators and define a two-dimensional matrix $I$, i.e.,
$$I_{ij}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is an outlier, with probability } \beta_{ij} \\ 0 & \text{otherwise} \end{cases}$$
$I$ is realized upon the immediate processing of $\mathbf{x}$ and the available synopses and becomes the basis for the subsequent decision making. Every $I_{ij}$ can be the outcome of any outliers detection technique (e.g., $\chi^2$, Grubbs' test, clustering based) depending on the type of synopsis received by peers. The simplest outliers detection technique is based on the identification of $\mathbf{x}$ as being 'produced' by the distributions depicted by the statistics of every dataset $D_i$ [25]. Let this probability be $P(\mathbf{x}, D_i)$. In the group of $N$ EC nodes, we can consider $N \cdot M$ distributions, i.e., an individual distribution for each dimension of $\mathbf{x}$. Let $\Theta_{ij}$, $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M$, represent the distribution for the $j$th dimension in $D_i$ with a probability density function (pdf) $f_{\Theta_{ij}}(x)$. Hence, $P(\mathbf{x}, D_i)$ can be defined as follows: $P(\mathbf{x}, D_i) = \prod_{\forall j} f_{\Theta_{ij}}(x_j)$. Given the distributions $\Theta_{ij}$ and constant weights $w_i > 0$, the pdf of the mixture is $f(\mathbf{x}) = \sum_{\forall i} w_i \prod_{\forall j} f_{\Theta_{ij}}(x_j)$.
In general, $I$ is a matrix hosting the outcomes of $V \cdot N$ Bernoulli trials with different success probabilities. Our aim, before we decide that $\mathbf{x}$ is an outlier, is to detect if multiple indicators agree upon this event for multiple datasets. Actually, we want to perform a double majority voting, i.e., the first per column (multiple indicators for the same dataset) and the second per row (multiple aggregated indicators for the ecosystem of datasets). For the envisioned double majority voting, we adopt the $\delta$-majority function upon $V$ binary variables [10]. $\delta$ is the threshold over which we consider $\mathbf{x}$ as an outlier for a specific dataset.
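The simplest indicator described above, i.e., $P(\mathbf{x}, D_i)$ as the product of per-dimension densities, can be sketched as follows. We assume (purely for illustration) that each per-dimension distribution $\Theta_{ij}$ is Gaussian and summarized by a (mean, std) synopsis; the function names are our own.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def prob_generated_by(x, synopsis):
    """P(x, D_i): density of vector x under the per-dimension distributions
    of dataset D_i, i.e., the product of the M per-dimension pdfs."""
    means, stds = synopsis
    p = 1.0
    for xj, mu, sigma in zip(x, means, stds):
        p *= gaussian_pdf(xj, mu, sigma)
    return p

# Example: density of the zero vector under standard-normal dimensions.
p0 = prob_generated_by([0.0, 0.0], ([0.0, 0.0], [1.0, 1.0]))
```

A low `prob_generated_by` value across many datasets is what the indicator matrix $I$ turns into binary outlier votes.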
The following equation holds true:
$$B(I_{ij}, j) = 1 \iff \sum_{i=1}^{V} I_{ij} \geq \delta \quad (1)$$
where $B(\cdot)$ is the indicators aggregation function. In the above equation, we can also incorporate the confidence that an outlier detector has in the reported result. This way, we focus on a 'fuzzy' outliers detection methodology where, apart from the binary indication, every detector reports a real value which is the probability of $\mathbf{x}$ being an outlier or not. In that case, the final outcome is delivered by the equation $B(I_{ij}, j) = 1 \iff \sum_{i=1}^{V} w_i I_{ij} \geq \delta$, where $w_i$ is the probability/confidence reported by the $i$th detector. The second axis of aggregation is performed upon the $B(I_{ij}, j) = B_j$ realizations. Again, a $\delta'$-majority function is adopted, i.e.,
$$B'(B) = 1 \iff \sum_{j=1}^{N} B_j \geq \delta' \quad (2)$$
$B'$ represents the view of our model on whether $\mathbf{x}$ should be characterized as an outlier or not.
As $I$ is the set of $V \cdot N$ Bernoulli trials with different success probabilities, we can easily deliver the final success probability for any incoming data vector. The sum of the outcomes of the aforementioned Bernoulli trials can be adopted to define the variable $Z = \sum_{i=1}^{V} \sum_{j=1}^{N} I_{ij}$ which follows a Poisson Binomial distribution with probability mass function (pmf)
$$P(Z = z) = \sum_{A \in F_m} \prod_{i \in A} \beta_{ij} \prod_{j \in A^c} (1 - \beta_{ij}) \quad (3)$$
with $A$ being a set in $F_m$, the set of all subsets of $m$ node indexes selected from $\{1, 2, \ldots, N\}$, and $A^c$ the complement of the set $A$. When all $\beta_{ij}$s are identical, $Z$ follows the binomial distribution. Now, we can easily define the success probability that $\mathbf{x}$ will be identified as an outlier in the entire group of detectors and nodes. The following equation holds true [25]:
$$F(z) = \sum_{m=z}^{N} \sum_{A \in F_m} \prod_{i \in A} \beta_{ij} \prod_{j \in A^c} (1 - \beta_{ij}) \quad (4)$$
$F(z)$ indicates the probability of having at least $z$ outlier identification results out of $N$.
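The double majority vote of Eqs. (1)-(2) and the Poisson Binomial tail of Eq. (4) can be sketched as follows. The pmf is computed here with a simple dynamic program rather than the subset enumeration of Eq. (3), which is one way to sidestep the combinatorial cost noted in the text; the function names and the single $\beta$ vector (flattening the $V \cdot N$ trials) are our own illustrative choices.

```python
def double_majority(I, delta, delta_prime):
    """Double majority voting over a V x N indicator matrix I.

    I[i][j] = 1 when detector i flags x as an outlier for dataset j.
    Per column: dataset j votes 'outlier' when at least delta of the V
    detectors agree (Eq. 1). Per row: x is an outlier when at least
    delta_prime datasets vote 'outlier' (Eq. 2).
    """
    V, N = len(I), len(I[0])
    B = [1 if sum(I[i][j] for i in range(V)) >= delta else 0 for j in range(N)]
    return 1 if sum(B) >= delta_prime else 0

def poisson_binomial_pmf(betas):
    """pmf of Z = sum of independent Bernoulli trials with success
    probabilities `betas`, via dynamic programming (no subset enumeration)."""
    pmf = [1.0]
    for b in betas:
        nxt = [0.0] * (len(pmf) + 1)
        for z, p in enumerate(pmf):
            nxt[z] += p * (1 - b)   # this trial fails
            nxt[z + 1] += p * b     # this trial succeeds
        pmf = nxt
    return pmf

def tail_probability(betas, z):
    """F(z): probability of at least z successes among the trials."""
    return sum(poisson_binomial_pmf(betas)[z:])
```

The DP runs in $O(n^2)$ for $n$ trials, which stays practical where enumerating $F_m$ does not; the Poisson/Normal approximations mentioned in the text are an alternative for very large $n$.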
In the above equation, the lowest value for $z$ is
$$z = \begin{cases} \frac{N}{2} + 1 & \text{if } N \text{ is even} \\ \frac{N+1}{2} & \text{if } N \text{ is odd} \end{cases} \quad (5)$$
When $N$ is high, it is not easy to calculate all the necessary subsets for $F_m$, thus, we have to rely on a computationally efficient method, i.e., one of the methodologies presented in [16]. For instance, we could adopt an approximation model (e.g., Poisson or Normal) to quickly deliver the final probability of having $\mathbf{x}$ as an outlier, thus, rejected from further local processing. As far as the 'individual' outlier identification process is concerned, we can rely on widely adopted techniques, i.e., statistical-based (parametric or non-parametric approaches), nearest neighbor-based, clustering-based, classification-based (Bayesian network-based and support vector machine-based approaches), and spectral decomposition-based approaches. The proposed model can incorporate any number of outlier detectors from the aforementioned categories.

B. Statistical Management of Data Vectors
Assume that $\mathbf{x}$ arrives at the EC node $n_i$. If $\mathbf{x}$ is not an outlier, $n_i$ should decide to replicate the vector to a sub-set of nodes. Our target is to apply the replication process only for nodes that are highly correlated with $\mathbf{x}$. The replication of $\mathbf{x}$ in the entire ecosystem would flood the network with messages in addition to the 'disturbance' of the statistics of uncorrelated datasets 'invited' to host $\mathbf{x}$. In our scheme, we select the top-$k$ similar datasets where $\mathbf{x}$ will be replicated. The parameter $k$ is lower than $N$ to reduce any negative effects on network communications.
Our replication process is realized upon the collected synopses $\mathbf{s}_i, \forall i$, instead of relying on a voting process adopting two correlation metrics like the model presented in [25]. In this paper, the novelty is that we detect the trend of the correlation between $\mathbf{x}$ and the available synopses $\mathbf{s}_i, \forall i$, estimating the unknown distribution that depicts their similarity. The latest $W$ synopses reports $\{\mathbf{s}_i^t\}_{t=1}^{W}$ become the basis for our mechanism. We want to detect if $\mathbf{x}$ is similar to the latest $W$ 'views' of each dataset. We rely on the similarity between $\mathbf{x}$ and each $\mathbf{s}_i$ depicted by the function $g(\cdot)$. Let $g(\cdot)$ be the distance function, i.e., $g(\mathbf{x}, \mathbf{s}_i) = \|\mathbf{x} - \mathbf{s}_i\|$. For instance, if synopses are depicted by the mean of each dimension, $g(\cdot)$ can be an $L_p$ norm, $p = 1, 2, \ldots$, between $\mathbf{x}$ and $\mathbf{s}_i$. When $\mathbf{s}_i$ is depicted by the centroids of a set of clusters, $g(\cdot)$ can be the distance to the closest centroid or the accumulated distance to the set of the reported centroids. For simplicity in our notation, we consider $g_i$ as the distance calculated by $g(\mathbf{x}, \mathbf{s}_i)$. Based on the collected synopses, we can have a time series of distances for $\mathbf{x}$, i.e., $g_i^t$, $t = 1, 2, \ldots, W$. Upon these distances, we can expose their unknown pdf, targeting to extract the probability of having the similarity between $\mathbf{x}$ and $\mathbf{s}_i$ over a pre-defined threshold.
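The replication pipeline described here (distance quanta over the latest $W$ synopsis reports, the exceedance probability $\gamma_i$ estimated with the Gaussian-kernel cdf detailed in the KDE discussion that follows, and top-$k$ selection) can be sketched as below. The Euclidean distance, the fixed bandwidth `h`, and the convention that the best hosts are the peers with the lowest exceedance probability (i.e., the smallest chance that the distance exceeds the threshold) are our own illustrative assumptions.

```python
import math

def distance_quanta(x, synopses_history):
    """Time series g_i^t = ||x - s_i^t|| over the latest W synopsis
    reports of peer i (Euclidean distance; any L_p norm also fits)."""
    return [math.dist(x, s) for s in synopses_history]

def exceedance_probability(quanta, eps, h=0.1):
    """gamma_i = P(g_i > eps) ~ 1 - F_G(eps): the cdf F_G is estimated
    with a Gaussian kernel of bandwidth h over the W observed quanta."""
    W = len(quanta)
    cdf = sum(0.5 * (1 + math.erf((eps - g) / (h * math.sqrt(2)))) for g in quanta) / W
    return 1.0 - cdf

def top_k_hosts(x, histories, eps, k):
    """Rank peers by gamma_i and return the k most similar ones
    (lowest distance-exceedance probability) as replication targets."""
    gammas = {i: exceedance_probability(distance_quanta(x, hist), eps)
              for i, hist in histories.items()}
    return sorted(gammas, key=gammas.get)[:k]
```

For example, a peer whose recent synopses coincide with `x` yields quanta near zero and hence a near-zero `gamma`, so it is ranked first as a replication host.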
We rely on the widely known Kernel Density Estimation (KDE) [2] to derive the pdf of $g_i$. Based on $g_i^t$, we estimate $F_G(g_i)$ via estimating the pdf $f_G(g_i)$. Having $W$ recent samples for $g_i$, $f_G(g_i)$ is estimated as
$$\hat{f}_G(g_i, W) = \frac{1}{W h} \sum_{t=1}^{W} K\left(\frac{|g_i - g_i^{W-t+1}|}{h}\right),$$
where $h > 0$ is the bandwidth of the symmetric kernel function $K(u)$ (integrating to unity). One of the most frequently adopted kernel functions is the Gaussian, i.e., $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$. To save time and alleviate the complexity of the proposed model, we rely on an incremental estimation of $\hat{f}_G$. The pdf $\hat{f}_G(g_i, t)$ for $t = 1, \ldots, W$ is incrementally estimated by its previous estimate $\hat{f}_G(g_i, t-1)$ and the current value $g_i^t$, that is, we recursively obtain for $t = 1, \ldots, W$ that:
$$\hat{f}_G(g_i, t) = \frac{t-1}{t} \hat{f}_G(g_i, t-1) + \frac{1}{t h} K\left(\frac{|g_i - g_i^{W-t+1}|}{h}\right) \quad (6)$$
If we apply the Gaussian function in the KDE, we obtain an estimation of the cdf $\hat{F}_G(g_i, W) = \int_{g_{\min}}^{g_i} \hat{f}_G(u, W) \, du$ using the $W$ values $\{g_i^{W-t+1}\}_{t=1}^{W}$:
$$\hat{F}_G(g_i, W) = \frac{1}{W} \sum_{t=1}^{W} \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{g_i - g_i^{W-t+1}}{h\sqrt{2}}\right)\right) \quad (7)$$
where $\mathrm{erf}(\cdot)$ is the error function. Hence, at time $t$, we obtain the estimation of $\gamma_i = P(g_i^t > \epsilon) \approx 1 - \hat{F}_G(\epsilon, W)$. In the first place of our future research agenda is the incorporation of an estimation model for multiple time steps beyond $W$ in the decision making mechanism.
After the calculation of $\gamma_i$, we rely on an ordered list of the available peer nodes. We rank peers upon their $\gamma_i$ result to conclude the optimal solution for the specific setup. The Probability Ranking Principle [17] dictates that if peers are ordered by decreasing $\gamma_i$ over the available datasets and $\mathbf{x}$, the effectiveness of the model is the best to be gotten for those instances. From the ranked list, we select the top-$k$ outcomes and the corresponding nodes to host $\mathbf{x}$. Every selected $n_j$ should, then, update the corresponding statistics of their datasets. We adopt a delay mechanism in the delivery of messages related to the new synopses, taking into consideration the difference between the new synopsis and the previous one and the remaining time till the expiration of the pre-defined interval when nodes should report their synopses. A time-optimized process for delivering synopses while delaying the final decision is studied in [24] and is not a subject of the current effort.

V. EXPERIMENTAL EVALUATION
A. Performance Metrics & Setup
The evaluation of the proposed model involves a set of experimental scenarios upon a real trace. The aim is to investigate the performance of our mechanism concerning the number of the detected outliers as well as the number of the stored data vectors that deviate from the statistics of every dataset. We also focus on the evaluation of the proposed model concerning its ability of replicating the incoming data to the appropriate datasets, keeping their solidity at high levels. Our simulations involve a high number of data vectors generated in various nodes of the network. When a data vector is produced, we adopt different distributions for producing the corresponding values for each dimension. Our dataset is retrieved from [12]. It contains 9358 instances of hourly averaged responses from an array of five (5) metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. This device was placed in a highly polluted area in an Italian city. All values in the dataset are normalized to the unit interval.
The performance of the proposed mechanism is evaluated by a set of metrics as follows: (i) the percentage of the detected outliers $\omega$. As the adopted dataset does not contain outlier vectors, we randomly 'produce' fake outliers by changing the values of the dataset. The aim is to identify if the proposed ensemble outliers detection scheme is capable of rejecting vectors that may jeopardize the solidity of datasets. We have to notice that $\omega$ refers to the detection of vectors that are considered as outliers for the entire ecosystem.
For this, we incorporate into our ensemble outliers detection model the following individual detection methods: (a) the probabilistic approach discussed in Section IV, i.e., we consider the probability of the incoming vector to be produced by the distribution characterizing every dimension of the available datasets; (b) a statistical approach, i.e., we detect if the incoming vector significantly deviates from the mean of each dimension; and (c) the $\chi^2$ metric. In our experimental evaluation, we produce a number of outliers equal to 1% of the retrieved values; (ii) the percentage of data vectors that deviate from the statistics of each dataset $\tau$. $\tau$ depicts the average number of 'local outliers' stored in every dataset after the application of our processing model compared to the total number of vectors. As a 'local outlier', we define a data vector deviating three times the standard deviation from the mean (under the assumption that data follow a Gaussian distribution). We try to detect if the proposed approach is capable of storing similar data vectors in the same datasets. When $\tau$ is high, multiple vectors do not match the remaining vectors in the dataset. When $\tau \to 0$, our datasets do not contain any 'outlier' data; (iii) the solidity of the formulated datasets as depicted by the mean $\mu$ and standard deviation $\sigma$. We target a low $\sigma$, i.e., to deliver solid datasets. When $\sigma$ is low, the datasets are concentrated around the mean, thus, we have a clear view of their dispersion. Such a result, as mentioned above, can assist in the efficient assignment of queries to the appropriate datasets. We perform a set of experiments for different $N$, $k$, $M$ and $W$, taking their values as follows: $N \in \{10, 50, 100\}$, $k \in \{2, 5\}$, $M \in \{2, 10\}$ and $W \in \{10, 50\}$. In total, we consider that 1,000 data vectors are received by the group of nodes and report our results for the aforementioned metrics.

TABLE I
PERFORMANCE OUTCOMES FOR VARIOUS EXPERIMENTAL SCENARIOS AND W = 10

              k = 2                     k = 5
        M = 2       M = 10        M = 2       M = 10
   N    ω     τ     ω     τ       ω     τ     ω     τ
  10   1.00  0.005  0.90  0.10   1.00  0.005  0.85  0.11
  50   1.00  0.08   1.00  0.28   1.00  0.10   1.00  0.30
 100   1.00  0.10   1.00  0.30   1.00  0.20   1.00  0.40

B. Performance Assessment
In Tables I ($W = 10$) and II ($W = 50$), we present our results for the $\omega$ and $\tau$ metrics. We observe that the proposed model is capable of detecting the generated outliers ($\omega \in [0.70, 1.00]$) while keeping similar data in the same datasets, especially when $N$ is low. $\omega$ is generally retrieved to be equal to unity except when the number of EC nodes is low and the number of dimensions is high. Additionally, an increased $k$ leads to an increased $\tau$ due to the fact that we replicate the incoming data vectors to multiple EC nodes. For instance, if $W = 10$, we observe that the number of 'local outliers' is 0.5% when $N = 10$ and reaches 30% and 40% when $N = 100$. This means that in the scenario where we consider a high number of nodes, the dispersion of datasets increases. This situation is also affected by the increased number of nodes selected to replicate every accepted data vector ($k = 5$). In the case where $W = 50$, we get similar results that lead us to understand that the size of the adopted window for performing the KDE processing does not affect the performance of the proposed model.
In Figure 1, we provide experimental outcomes for $N = 10$, taking into consideration the mean and the standard deviation of the retrieved datasets. We confirm our findings as discussed above, i.e., the number of dimensions affects the dispersion of data. We observe that $\sigma$ is low when $M$ is low as well.

TABLE II
PERFORMANCE OUTCOMES FOR VARIOUS EXPERIMENTAL SCENARIOS AND W = 50

              k = 2                     k = 5
        M = 2       M = 10        M = 2       M = 10
   N    ω     τ     ω     τ       ω     τ     ω     τ
  10   0.87  0.002  0.70  0.004  0.95  0.05   0.70  0.02
  50   1.00  0.10   1.00  0.10   1.00  0.06   1.00  0.12
 100   1.00  0.20   1.00  0.30   1.00  0.18   1.00  0.33
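The 'local outlier' criterion behind the $\tau$ metric (a vector deviating more than three standard deviations from the per-dimension mean, under the Gaussian assumption stated above) can be sketched as follows; the function names and the toy dataset are our own illustrative choices.

```python
import statistics

def is_local_outlier(x, dataset, n_sigma=3.0):
    """True when vector x deviates more than n_sigma standard deviations
    from the per-dimension mean of its hosting dataset (3-sigma rule)."""
    dims = list(zip(*dataset))
    for xj, d in zip(x, dims):
        mu, sigma = statistics.fmean(d), statistics.pstdev(d)
        if sigma > 0 and abs(xj - mu) > n_sigma * sigma:
            return True
    return False

def tau(dataset, n_sigma=3.0):
    """Fraction of stored vectors that are 'local outliers' for the
    dataset hosting them (the per-dataset quantity averaged into tau)."""
    return sum(is_local_outlier(x, dataset, n_sigma) for x in dataset) / len(dataset)
```

A replication scheme that places each vector in well-matching datasets keeps this fraction, and hence $\tau$, close to zero.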
We also observe local outliers in the case where multiple dimensions are adopted by our model. We conclude that the aggregation of the difference between the available synopses and data vectors when $M$ is high may jeopardise the solidity of datasets. However, the majority of $\sigma$ realizations is retrieved below the median for a high number of datasets.

Fig. 1. Performance outcomes related to the solidity of the retrieved datasets for N = 10 and k = 2 (up: M = 2, down: M = 10)

In Figure 2, we present our performance outcomes for $N = 100$. A low $M$ leads to a very low $\sigma$. Again, an increased $M$ may lead to an increased $\sigma$ as well. Now, $\sigma$ is lower than in the previous experimental scenario when $M = 10$; however, we can observe some outlier values over the maximum $\sigma$ realization. Similar outcomes are retrieved in the scenario where $k = 5$ (see Figure 3).
The number of messages sent to the network depends on $k$. The lower the $k$, the lower the number of the required messages. In our experiments, the total number of messages is, on average, 1980 ($k = 2$) or 4940 ($k = 5$). If we adopt the baseline model for replication purposes, i.e., we replicate every accepted data vector to the entire network, we require 9900 messages when $N = 10$ up to 99000 when $N = 100$ (on average). Finally, the time required for delivering the final result (in seconds) is in the interval [0.004, 0.006] when $N \in \{10, 50, 100\}$. The lowest value in the aforementioned interval is retrieved when all the adopted parameters get their lowest realizations and the maximum value of the interval corresponds to the maximum value for all parameters. The retrieved average time requirements refer to every incoming data vector and depict the ability of the proposed mechanism to deliver the final outcome in real time. On average, our model can serve, approximately, 160 to 250 data vectors per second.

Fig. 2. Performance outcomes related to the solidity of the retrieved datasets for N = 100 and k = 2 (up: M = 2, down: M = 10)

Fig. 3. Performance outcomes related to the solidity of the retrieved datasets for N = 100 and k = 5 (up: M = 2, down: M = 10)

VI. CONCLUSION