An Intelligent Edge-Centric Queries Allocation Scheme based on Ensemble Models
Kostas Kolomvatsos, Department of Informatics and Telecommunications, University of Thessaly, 35131 Lamia, Greece
Christos Anagnostopoulos, School of Computing Science, University of Glasgow, G12 8RZ Glasgow, UK
Abstract
The combination of the Internet of Things (IoT) and Edge Computing (EC) can assist in the delivery of novel applications that facilitate end users' activities. Data collected by the numerous devices present in the IoT infrastructure can be hosted in a set of EC nodes and become the subject of processing tasks for the provision of analytics. Analytics are derived as the result of various queries defined by end users or applications. Such queries can be executed in the available EC nodes to limit the latency in the provision of responses. In this paper, we propose a meta-ensemble learning scheme that supports the decision making for the allocation of queries to the appropriate EC nodes. Our learning model decides over queries' and nodes' characteristics. We describe a matching process between queries and nodes after concluding the contextual information for each characteristic adopted in our meta-ensemble scheme. We rely on widely known ensemble models, combine them, and offer an additional processing layer to increase the performance. The aim is to identify a subset of EC nodes that will host each incoming query. Apart from the description of the proposed model, we report on its evaluation and the corresponding results. Through a large set of experiments and a numerical analysis, we reveal the pros and cons of the proposed scheme.
1 Introduction

Currently, we are in the middle of a data management revolution brought by the numerous devices present in the Internet of Things (IoT) infrastructure. These devices are capable of interacting with their environment, performing some processing activities and reporting to the upper layers either the collected data or the produced knowledge. On top of the discussed devices, the Edge Computing (EC) structure, the Fog layer and, finally, the Cloud provide multiple storage and processing locations where knowledge can be delivered [52]. In this architecture, finding efficient techniques for data management becomes significant due to the high volumes of data produced by numerous devices [66]. Recent studies show that data offloading to the Cloud requires 100 to 200 ms of additional latency compared to EC technologies [15]. Hence, to reduce the latency, we could store and process the collected data at EC nodes. However, due to their limited computational capabilities, only a part of the data can be stored and processed locally while the remainder is sent to the Cloud. Currently, local data processing gains more attention as any solution has to build over the limited delay in the provision of the requested analytics. Pushing as much computing workload as close to the edge as possible can bring serious benefits, particularly where communication costs are high or where instant action is needed.

Edge Nodes (ENs) can maintain a local dataset, forming a network of distributed data repositories. Data may also be replicated for supporting fault-tolerant applications. Queries/tasks generated by end users or applications could request analytics over the collected data. As data are distributed among ENs, it is imperative to define a mechanism that allocates the incoming queries to the appropriate nodes. Such a decision should be made not only based on the queries' demand for data but also on various characteristics of the ENs themselves.
Motivating Example. If a query asks for temperature recordings between 10 and 20 degrees in the Celsius scale, it is useless to allocate the query to a dataset whose statistics indicate temperature recordings above 20 degrees or below 10 degrees. Allocating the query to the specific dataset, we waste resources and time only to receive an empty set. In addition, when multiple datasets exhibit statistics 'similar' to the incoming query, a load balancing aspect should be taken into consideration to avoid the creation of bottlenecks in the processing activities of each EN.

Traditionally, the demand for a low latency in the provision of responses is met by employing more computing resources and parallelizing the application logic over the datacenter infrastructure [63]. In our case, the incoming queries can be distributed to the available ENs; however, such nodes are equipped with different resources and exhibit a different status. It is worth noticing that a query/task could be allocated to multiple ENs, imposing new requirements related to the aggregation of the final responses. We should also keep in mind that the number of ENs is growing while the need for increased decentralization and autonomy is also imperative. To reduce the latency in the provision of responses, it is not enough to perform a parallel execution on top of the available ENs; we should also allocate the incoming queries/tasks to appropriate ENs that will conclude the envisioned calculations in a short time. The turnaround time is affected by multiple parameters like the load of ENs (the higher the load is, the higher the response time becomes) and the number and the statistics of data present in them (the size of the datasets affects the response time).

To handle the incoming requests, we consider a Query Controller (QC) that is responsible for receiving queries (from this point forward, we refer to queries when we want to depict the execution of a query/task) and proceeding with their allocation. The QC can be present at the Cloud, being directly connected with a set of Query Processors (QPs) placed in every EN.
QPs execute the incoming queries and return the final response to the interested QC. Our scheme enhances the behaviour and the decision making capabilities of QCs. We adopt the contextual information of queries' and ENs' characteristics at the time when the decision should be concluded. We propose the use of multiple contextual vectors, i.e., our basis for 'matching' queries with the available nodes. Our mechanism adopts a classification model over the complexity class of any incoming query and the estimation of the distance between the query constraints and the data present at every EN. In the final decision making, we use a meta-ensemble model that builds on top of multiple ensemble schemes. This way, we build a 'hierarchy' of classification models adopted to deliver better results than individual classification modules.

The efficient allocation of queries to a number of nodes is also the subject of our previous efforts presented in [39], [41] and [43]. In [39], every allocation is concluded based on an optimal stopping scheme that examines all the available ENs concerning their ability to host and execute a query. The model proposed in [43] aims at the adoption of a learning scheme that will, finally, decide each allocation when a query arrives. In [41], we adopt reinforcement learning and clustering to deliver the final allocation. Our motivation for extending our previous work is related to our desire to avoid the drawbacks of adopting a single solution. For instance, when relying on a specific learning model, we can meet the following disadvantages [19]: (i) a single model can be viewed as searching a space for detecting the best possible hypothesis. A statistical problem arises when the amount of training data is too small compared to the size of the hypotheses space; (ii) many algorithms are trapped in local optima. Even if there are enough training data, it may be computationally expensive to find the best hypothesis in the available space; (iii) in most applications, the true function cannot be represented by any of the hypotheses in the search space. We depart from our previous work and provide a scheme that avoids the drawbacks of individual models. The contributions of this paper depict the differences from the previous efforts as described in the following list:

• we propose a meta-ensemble learning scheme for queries allocation to a set of ENs. The meta-ensemble learning scheme adopts multiple ensemble learning (sub-)models to provide a powerful and efficient allocation model.

• we propose a decision making mechanism that builds over the contextual data related to queries' and ENs' characteristics. Through the formulated contextual vectors, our decision making model is capable of 'matching' the incoming queries with the appropriate ENs.

• we propose a model for delivering the complexity of any incoming query. The complexity class depicts the 'burden' that a query will add to the ENs where it will be executed.

• we adopt a model for estimating the distance of a query from the datasets present in ENs. Such information is significant for the final allocation as we aim at allocating queries to datasets with which they exhibit the minimum distance.

The paper is organized as follows. Section 2 reports on the prior work in the field by presenting important activities related to our problem. Section 3 presents the problem under consideration and some introductory information.
Section 4 discusses how to model the incoming queries and the envisioned ENs while Section 5 presents the proposed meta-ensemble learning scheme. In Section 6, we proceed with our experimental evaluation and, in Section 7, we conclude our paper by presenting our future research plans.

2 Related Work

With the advent of IoT, numerous devices can interact with their environment and the network in order to collect and process data. In many application domains, data play a central role towards the generation and the provision of knowledge that will support efficient decision making. Example domains are financial services, life sciences, mobile services, etc. Apart from the discussed devices, end users may also generate data like tweets, social networking interactions or photos [59]. Through analytics, one can support efficient decision making having a view on the 'meaning' of the collected data. Analytics aim to discover patterns, especially in the case of unstructured data. Various tools for large scale data analytics have been proposed in the literature. The majority of them concern batch oriented systems and they build over Hadoop [28]. A number of research efforts try to provide performance insights into the discussed framework [1], [20], [37]. Researchers provide new functionalities to increase the performance of legacy systems. For instance, the authors of [33] propose Starfish, a self-tuning tool for large scale data analytics.

The aforementioned processing can be either centralized or distributed, i.e., the processing takes place where data are initially collected. We can also identify the need for streams processing to facilitate the online, (near) real time provision of responses to a set of analytics queries. Queries/tasks allocation and scheduling are important research subjects in multiple domains. Both subjects have a significant impact in the IoT and EC. IoT/EC nodes have limited computational capabilities while being restricted by various energy constraints. It seems that streams processing is the appropriate methodology for delivering real time analytics. ENs can apply a decision making mechanism to process the incoming tasks/queries. They should take into consideration tasks'/queries' specific characteristics in combination with their current internal status for any decision making. Additionally, nodes may train and update machine learning models locally to serve the incoming tasks/queries. These local models can be aggregated in an upper layer [18]. This approach is appropriate to detect local data shifts and create collaborative learning and model-sharing environments in which local models can quickly adapt to any changes in the collected data.

A widely studied research subject is task scheduling in
Wireless Sensor Networks (WSNs). Mapping and scheduling should take into consideration energy constraints to secure an efficient execution [8], [61]. A pre-processor and a scheduler can be responsible for the final allocation. The pre-processor tries to identify the energy requirements of the incoming tasks/queries and, based on energy monitoring activities, decides on the final scheduling. In any case, taking into consideration only a single parameter (e.g., remaining energy) in decision making can limit the reasoning capabilities. Another approach is to study a fair energy balance among sensors while minimizing the delay using a market-based architecture [23]. Taking into consideration the defined constraints, nodes may cooperate to conclude the final allocations [7]. Example algorithms involve tasks/queries clustering and node assignment mechanisms based on tasks/queries duplication and migration schemes. The aim is to minimize the execution time, thus, to deliver the final response with minimized latency. A model that could be adopted for such purposes is to cluster the network and build intra-cluster and inter-cluster scheduling relations. An
Integer Linear Programming (ILP) formulation and a 3-phase heuristic are also adopted to solve the allocation problem in [67]. In [65], the authors propose a modified version of the binary
Particle Swarm Optimization (PSO). The method adopts a different transfer function, a new position updating process and mutation for the task/query allocation problem. Another PSO-based solution is presented in [55] which allocates tasks/queries to a number of robots trying to decrease the communication cost. It is worth noticing that PSO-based models suffer from a low convergence rate while they can be trapped in a local optimum, especially in complex problems. In [35], the authors present a task/query allocation mechanism of a dynamic alliance based on a Genetic Algorithm to acquire the balance between energy consumption and accuracy. In [16], the authors discuss three algorithms to solve the discussed problem: a centralized, an auction-based, and a distributed algorithm. The distributed algorithm adopts a spanning tree over the static sensors to assign tasks/queries.

The continuous reporting of queries demanding immediate responses is also the subject of various research efforts. When large scale data applications involve continuous queries, for having near real-time responses, such applications usually involve a large number of data partitions. Obtaining a response in near real-time could be very difficult due to limitations defined by the amount of data and the underlying hardware performance. Querying data samples and the provision of progressive analytics is an efficient solution for the described problem [3]. Specific sampling techniques have been proposed [14], [32]. In progressive analytics, the
Approximate Query Processing (AQP) technique can secure the accuracy of early results and provide the corresponding confidence [14], [21], [54]. Users defining queries are not involved in the process; however, based on the retrieved confidence, they could rely on an intelligent mechanism for handling early results. When accuracy is at acceptable levels (according to the specific application domain), the process could be stopped.

An analytics provisioning system is presented in [13]. The system is based on the Prism framework and allows users to communicate samples to the system. Queries are processed over the defined samples. The authors propose Now!, a progressive data-parallel computation framework for Windows Azure. Now! mainly works with streaming engines to support progressive SQL over large scale data. It is worth noticing that the selection of samples affects the statistical error of the final (sub-)dataset, thus, it may have negative consequences on the final decision making. In [17], the authors present an online MapReduce scheme that supports online aggregation and continuous queries. For decreasing the latency of the system, the authors propose to have the Map tasks sending early results to the Reduce tasks. This mechanism enables the generation of approximate results, which is particularly useful for interactive analytics. In [47], the authors present a continuous MapReduce model. The execution of Map and Reduce functions is coordinated by a data stream processing platform. Latency is improved through a model where mappers are continually fed by data and the retrieved results are transferred to reducers. CONTROL [32] is an AQP system intended to support progressive analytics. Users have the opportunity to refine answers and have online control of processing. DBO [36] is another AQP system capable of calculating the exact answer to queries over a large relational database. DBO can have an insight on the final response together with specific bounds for the accuracy of early results. As more information is processed, DBO can provide more accurate results. Users can stop the process at any time, if the accuracy level is sufficient.
3 Preliminaries

In this section, we provide a description of the interacting entities towards the efficient allocation of the incoming queries and present the problem under consideration. We also set up the basis for solving the problem and applying the envisioned meta-ensemble classification scheme.
Consider the setting presented in Figure 1 where N ENs collect contextual data and locally process them in light of producing knowledge; each EN is indexed in the set N = {n_1, n_2, ..., n_N}. Contextual data are either collected by various sensors/end devices (e.g., users' smartphones) or are generated through local processing.

Definition 3.1
Contextual data are depicted by the information that provides context (a value for a specific attribute) to an event, entity or a processing activity.
Every EN has specific characteristics related to its computational power and a limited storage capacity. Hence, only a subset of the contextual data could be stored locally. The remaining data are sent to the back end infrastructure present at the Cloud. In this context, supporting local data processing at the EN facilitates the minimization of latency in the provision of responses. Local data processing involves statistical reasoning, inferential analytics, and real-time data management, e.g., the estimation of top-k lists over the incoming data streams [42]. A dataset DS_i is available over which the local data processing takes place. This dataset is continuously updated as fresh data arrive through streams. The i-th EN stores multivariate vectors in DS_i, i.e., x = [x_1, x_2, ..., x_L]^T ∈ R^L (L is the number of dimensions, i.e., contextual attributes). As multiple devices may report vectorial data to multiple ENs, replicates among DS_i and DS_j, with i ≠ j, may be present. The management of potential replicas is beyond the scope of this paper. The whole data at the network edge form the set DS = {DS_1, ..., DS_N}. In each EN, we introduce a local QP responsible to (i) receive a stream of analytics queries (e.g., estimating the regression plane among contextual variables within a time frame) from end users (e.g., applications, data analysts); (ii) execute them over DS_i; and (iii) send the results back to the requestor.

Definition 3.2
An analytics query is a request for information responded by meaningful patterns found in the available dataset while being extracted based on a scientific process.
We associate the i-th QP with the i-th EN, thus, we have N ENs/QPs in the set QP = {qp_1, qp_2, ..., qp_N}. We adopt a queue in every QP which can handle a maximum number of queries. Without loss of generality, we consider that the queues adopted in QPs have the same length. Each QP has specific characteristics encoded in the set C_i = {c_{i,1}, c_{i,2}, ..., c_{i,m}}. For instance, C_i = {l, s} with l representing the current load and s depicting the speed of the corresponding QP.

Definition 3.3
The load is the quantity of analytics queries that can be carried at one time by an EN.
Definition 3.4
The processing speed of an EN is its rate of execution of analytics queries.

l can be easily estimated through the current number of queries waiting in the corresponding queue, while s indicates the throughput of a QP, i.e., the number of queries responded in a given time unit.

In the upper layer, shown in Figure 1 (i.e., Fog/Cloud), there is a federation of QCs. QCs play the role of endpoints for end users or applications and achieve efficient provision of responses to incoming queries. QCs have direct access to the ENs, thus, to their corresponding QPs, and they interact with them to get responses for the incoming queries. QCs, after receiving a query, should be able to find the most appropriate subset of QPs for allocating the query. Afterwards, QCs should aggregate the partial results to derive the final aggregated response, which will be delivered to end-users/applications. With the term appropriate, we refer to the subset of those QPs that, at the specific time the query is issued, exhibit characteristics that will facilitate its efficient execution and return the result in the minimum turnaround time. The efficient execution is realized through various parameters that should be optimal for any allocation. For instance, the response time should be limited, the outcomes should match the query constraints, the response time should fulfil the pre-defined time constraints (if set), and so on and so forth. It should be noted that QCs, through their continuous interaction with the ENs and their QPs, can maintain historical performance data as well as the statistics of data present in each EN. Based on this context, QCs obtain a holistic overview of the current status of QPs and the data that are stored locally in ENs.

The incoming queries are represented via a stream: Q = {q_1, q_2, ..., q_j}. At time instance t ∈ T = {1, 2, ...}, a query q_t arrives at a QC. q_t belongs to a specific query class, exhibiting specific characteristics, i.e., C^q = {c^q_1, c^q_2, ..., c^q_{|C^q|}}. Suppose that all queries have the same number and types of characteristics. For instance, C^q = {p, a} where p stands for the computational complexity and a depicts a deadline for delivering the final result.

Definition 3.5
The computational complexity of an analytics query represents the amount of the required resources to execute it.
If we focus on a database management system, we can identify the following query classes: (i)
Selection Queries: The main representative of such queries is the SELECT command. This type of query aims to provide data (tuples) that fulfil specific conditions; (ii)
Modification Queries: Such queries are adopted to perform changes/modifications in the underlying data, e.g., UPDATE. In most of the cases, these queries are demanding and require significant computational time and resources; (iii)
Aggregation Queries: Usually, they are executed over other queries and apply algebraic operators, e.g., AVG, SUM, MIN, MAX, etc. According to [30], the generic characteristics of queries are: (i) the type of the query, e.g., repetitive, ad-hoc; (ii) the query shape; and (iii) the size of the query, e.g., simple, medium, complex. Based on the characteristics of each query class, specific execution plans could be defined in the form of processing trees [30]. In [62], one can find a study on the calculation of queries' complexity, however, with the focus being on the underlying databases. In the database community, the complexity of a query is usually measured through the resources required by the database server for executing it. The most significant parameters in doing so are the space and time required for executing the query during the optimization phase. The query optimizer compares different execution strategies and selects the one with the least expected cost (or the maximum expected reward). In our context, we want to adopt additional parameters not bounded to the internal processes of any data management system but aligned with the needs of an EC setting. We focus on the upper layer of the aforementioned architecture and try to limit the time required for initiating the execution of a query. We extend the aforementioned characteristics' list and incorporate new, 'high level', parameters that depict the complexity and the need for instant response.

Figure 1: The connection of query controllers and edge nodes.
3.2 Problem Definition

For the description of our problem, let us focus on an individual QC attached to a stream of queries. Suppose that at time t, a query q_t arrives at the QC demanding immediate processing. The QC should rely on q_t's and the ENs' characteristics to take the appropriate decision and find the best subset of nodes to engage for the execution of q_t. Among others, we are interested in the data constraints as dictated by q_t. For instance, in the case of an SQL query, we focus on the WHERE clause. These constraints define the conditions that should be met when we retrieve data to construct the final response for q_t. To easily depict the discussed constraints, we represent q_t as a 2L-dimensional vector

w = [{min_1, max_1}, ..., {min_L, max_L}]^T ∈ R^{2L}    (1)

such that {min_i, max_i} are the minimum and the maximum values defined for the i-th dimension (attribute). Constraints should be matched against the data present in the ENs where, potentially, q_t is directed to be executed. After the reception of q_t, the QC creates N context vectors (one for each QP) referring to q_t's characteristics and the current status of ENs/QPs. Context vectors have the following form:

v_i = ⟨o, a, r_i, l_i, s_i⟩    (2)

Here, o is q_t's expected complexity (elaborated later), a is the deadline set for q_t, r_i is the information relevance of q_t with DS_i (the dataset present at the i-th EN) and l_i and s_i are the load and the speed of the i-th QP, respectively. The aforementioned vectors represent the minimum sufficient statistics for both an incoming query and each EN/QP based on which the QC should decide the final allocation. It should be noted that context vectors are easily concluded in short time using either: (i) pre-defined values (e.g., a is set during the reception of q_t, s_i is defined for each EN beforehand - it depends on the internal characteristics of QPs); (ii) simple calculations (e.g., l_i can be extracted by the number of queries waiting for processing in the corresponding queue); or (iii) our proposed models (e.g., for concluding o and r_i). Overall, context vectors are fed into our proposed ensemble scheme for predicting the appropriateness of each EN/QP to be the host of each query.

Problem:
Given an incoming query q_t represented by w and the associated context vectors {v_i}_{i=1}^N, predict the most appropriate subset of QPs that should be engaged for the execution of that query.

In the remainder, we elaborate on the creation of the context vectors. It is worth mentioning that QCs can have: (i) access to the statistical synopses (digests) of data present in ENs (i.e., the QC exploits the minimum sufficient statistics of each dataset); (ii) access to statistical data related to the performance of ENs. Based on the above, our mechanism is both performance-aware and data-aware. We deal with the estimation of the ability of a QP to efficiently execute a query in the minimum time while delivering outcomes that perfectly match the queries' constraints.
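To make these representations concrete, the following Python sketch builds the query vector w and the per-node context vectors of Eq. (2). It is only an illustration: the way o, a, r_i, l_i and s_i are obtained (the node interface with relevance(), load() and speed()) is an assumption of this sketch, standing in for the models described in the following sections.

from dataclasses import dataclass
from typing import List, Tuple

# w: one (min, max) constraint per contextual attribute, i.e., a 2L-dimensional representation
QueryVector = List[Tuple[float, float]]

@dataclass
class ContextVector:
    o: float   # expected complexity of the query
    a: float   # deadline set for the query
    r: float   # information relevance of the query with the node's dataset DS_i
    l: float   # current load of the node's QP, in [0, 1]
    s: float   # processing speed of the node's QP, in [0, 1]

def build_contexts(w: QueryVector, o: float, a: float, nodes) -> List[ContextVector]:
    # 'nodes' is assumed to expose relevance(w), load() and speed() per EN/QP
    return [ContextVector(o=o, a=a, r=n.relevance(w), l=n.load(), s=n.speed())
            for n in nodes]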
4 Processors Characteristics and Queries Complexity
As noted, ENs maintain a queue where the incoming queries wait for processing. The size of the queue is adopted such that it can deliver up to l_i queries per time unit, i.e., the percentage of the maximum load that can be afforded by the i-th EN. Without loss of generality, we get l_i ∈ [0, 1] given that a maximum queue size is adopted for such purposes. When l_i → 1, the EN experiences a high load. The load is also connected with the throughput and the queries reporting rate. Additionally, s_i depicts the speed of processing. A resource demanding query, e.g., a join query, potentially requires more execution time and resources compared with a less resource demanding query like a simple select command. s_i, then, directly affects the throughput, i.e., the number of queries executed in a time unit, and assumes values in [0, 1]. When s_i → 1, the i-th EN exhibits the best possible speed, approaching the maximum theoretical performance. When s_i → 0, the i-th EN delays in delivering responses to the incoming queries.
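As a small illustration of how l_i and s_i can be mapped into [0, 1], consider the following sketch; the maximum queue size and the reference (maximum theoretical) throughput are assumptions of this example rather than values prescribed by the paper.

def load(queue_length: int, max_queue_size: int) -> float:
    # l_i: fraction of the maximum load currently afforded by the i-th EN
    return min(1.0, queue_length / max_queue_size)

def speed(answered_queries: int, interval_ms: float, max_throughput: float) -> float:
    # s_i: observed throughput (queries per time unit) normalized by the best theoretical one
    return min(1.0, (answered_queries / interval_ms) / max_throughput)

print(load(12, 50), speed(30, 1000.0, 0.05))   # e.g., 0.24 and 0.6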
Forinstance, consider the vector q s = [0 . , . , . (cid:62) given the complexity classes:Θ = (cid:8) θ = O ( n log n ) , θ = O ( n ) , θ = O ( n ) (cid:9) . In this example, q s shows that q t belongs with 20% to θ , 80% to θ and 30% to θ .For calculating every component of q s , we adopt a set of metrics that deliverthe similarity between vectors. The interested reader can refer in [45] for moredetails. We propose an ensemble scheme for evaluating the similarity of q t withevery tuple (cid:104) p k , θ k (cid:105) ∈ DS Q . Recall that tuples present in DS Q depict a setof training queries along with their complexity classes. Our aim is to find thesimilarity of q t with every training query merging the results in a subsequent stepto deliver the final q t ’s similarity with each of the pre-defined complexity classes.The main theoretical foundation behind the adoption of an ensemble similarityscheme is related to the special requirements of the problem and the need foravoiding any wrong decisions when relying on a single metric. The ensemblescheme ‘combines’ the ‘opinions’ of multiple metrics while the aggregated resultis the one that will support the final outcome. Our ensemble scheme utilizes a set E = (cid:8) e , e , . . . , e | E | (cid:9) of similarity metrics. E can involve the Hamming distance[53], the Jaccard coefficient [51], and the Cosine similarity [53]. Any distancemetric available in the literature could be transformed to depict the similaritybetween q t and the p k from the tuple (cid:104) p k , θ k (cid:105) . For instance, if e d is the Euclideandistance between q t and p k , their similarity can be defined as e d . The adoptedsimilarity metrics are applied on each tuple ‘pre-classified’ to θ k aggregated todetermine the k -th component q sk of the complexity similarity vector. Formallythe ‘2D aggregation’ (see Figure 2 - retrieved by [40]) is calculated as follows: q sk = Ω( ω { e i ( q t , (cid:104) s k , θ k (cid:105) ) } , ∀ i , ∀ (cid:104) p k , θ k (cid:105) . ω realizes the envisioned ensemble12imilarity scheme while the aggregation operator Ω produces the q sk over multiple ω values. Figure 2: The envisioned similarity aggregation process.For ω , we consider that every individual result (i.e., e i ( q t , (cid:104) p k , θ k (cid:105) ) representsthe membership of q t to a ‘virtual’ fuzzy set. We have |E| membership degreescombined to get the final similarity for each individual tuple. For instance, ifwe get e = 0 . e = 0 . e = 0 . q t ‘belongs’ to the e fuzzy set by 0.2, tothe e by 0.5 and to the e by 0.3. ω is a fuzzy aggregation operator ; an |E| -placefunction ω : [0 , |E| → [0 , α ≥
0. The final ω value is defined as: ω = ˙ e · ¨ ea + (1 − a )( ˙ e + ¨ e − ˙ e · ¨ e )where ˙ e and ¨ e are two similarity values. As similarity metrics may ‘disagree’,we propose the use of the top- n similarity values based on their significancelevel. The Significance Level (SL) depicts if a similarity value is ‘representa-tive’ for many other outcomes. We borrow the idea from the density basedclustering [31] where centroids of the detected clusters are points that ‘at-tract’ many other data features around them in close distance. We proposethe use of the radius γ and calculate the SL for each similarity result as follows: SL e i = e − ( δ | d ( ei,ek ) ≤ γ |− δ ) , ∀ i , where δ and δ are parameters adopted tosmooth the sigmoid function. With the sigmoid function, we eliminate the SLof similarity values with a low number of ‘neighbors’ in the radius γ . The finalresults are sorted in a descending order of the SL and the top- n of them areprocessed with the Hamacher product to deliver the final ω .The Ω operator builds over ω values produced for each tuple in Q D classifiedin θ k . Let ω , ω , . . . , ω m are those values. For their aggregation, we rely on a13uasi-Arithmetic mean, i.e., q sk = (cid:2) m (cid:80) mi =1 ω αi (cid:3) α where α is a parameter that‘tunes’ the function. When α = 1, the function is the arithmetic mean, when α = 2, it is the quadratic mean and so on and so forth. After calculating thefinal values for each θ k (i.e., realized by Ω k ), we get q s = (cid:10) Ω , Ω , . . . , Ω | Θ | (cid:11) . We propose a distance model aiming at concluding r i adopted in the contextvector v . Recall that r i is adopted to depict the similarity between w anddata present in ENs. We consider that the dimensions of the collected/storedvectors are not correlated. At pre-defined intervals, ENs send to QCs thestatistics of local data expressed by two vectors, i.e., the vector of means µ = (cid:104) µ , µ , . . . , µ L (cid:105) and the vector containing the standard deviation for eachdimension, i.e., σ = (cid:104) σ , σ , . . . , σ L (cid:105) . Again, we have to recall that w repre-sents the intervals depicted by the constraints of q t . Our intention is to findthe overlapping between two sets of intervals, i.e., intervals defined by w andintervals represented by data statistics (the combination of µ and σ ). Actually, µ and σ can define the confidence interval for each dataset as follows. We have µ ± z · σ | DS i | for all the adopted dimensions with | DS i | depicting the cardinalityof the corresponding dataset. z represents the z-value retrieved by the standardnormal (Z-) distribution for our desired confidence level. The area between − z and + z is, approximately, the confidence percentage. For instance, for z = 1 . − .
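A minimal sketch of the two-level aggregation could look as follows; the similarity metrics, the radius γ, δ_1, δ_2, the top-n value and the Hamacher parameter are placeholders, and the helper names are not taken from the paper's implementation. Any set of similarity functions bounded in [0, 1] fits the scheme.

import math
from collections import defaultdict

def significance(values, gamma=0.1, d1=1.0, d2=2.0):
    # SL_i: sigmoid over the number of 'neighbors' of value i within radius gamma
    return [1.0 / (1.0 + math.exp(-(d1 * sum(abs(v - u) <= gamma for u in values) - d2)))
            for v in values]

def hamacher(x, y, a=1.0):
    # Hamacher product of two membership degrees (a >= 0)
    return (x * y) / (a + (1 - a) * (x + y - x * y))

def omega(similarities, n=2):
    # keep the top-n most 'significant' similarity values and fuse them with the Hamacher product
    sl = significance(similarities)
    top = [v for _, v in sorted(zip(sl, similarities), reverse=True)[:n]]
    out = top[0]
    for v in top[1:]:
        out = hamacher(out, v)
    return out

def quasi_arithmetic_mean(values, alpha=1.0):
    return (sum(v ** alpha for v in values) / len(values)) ** (1.0 / alpha)

def fcp(query, training_set, metrics):
    # training_set: list of (statement, complexity_class); metrics: similarity functions in [0, 1]
    per_class = defaultdict(list)
    for stmt, cls in training_set:
        per_class[cls].append(omega([m(query, stmt) for m in metrics]))
    # q_s: membership of the query to every complexity class (Omega aggregation per class)
    return {cls: quasi_arithmetic_mean(vs) for cls, vs in per_class.items()}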
We propose a distance model aiming at concluding r_i adopted in the context vector v_i. Recall that r_i is adopted to depict the similarity between w and the data present in ENs. We consider that the dimensions of the collected/stored vectors are not correlated. At pre-defined intervals, ENs send to QCs the statistics of their local data expressed by two vectors, i.e., the vector of means µ = ⟨µ_1, µ_2, ..., µ_L⟩ and the vector containing the standard deviation for each dimension, i.e., σ = ⟨σ_1, σ_2, ..., σ_L⟩. Again, we have to recall that w represents the intervals depicted by the constraints of q_t. Our intention is to find the overlap between two sets of intervals, i.e., the intervals defined by w and the intervals represented by the data statistics (the combination of µ and σ). Actually, µ and σ can define the confidence interval for each dataset as follows. We have µ ± z · σ/√|DS_i| for all the adopted dimensions, with |DS_i| depicting the cardinality of the corresponding dataset. z represents the z-value retrieved by the standard normal (Z-) distribution for our desired confidence level. The area between −z and +z is, approximately, the confidence percentage. For instance, for z = 1.28, the area between −1.28 and +1.28 is, approximately, 0.80.

Based on the above analysis, we proceed to calculations related to the similarity between w and N vectors, i.e., f_i = ⟨{µ_{i1} − z·σ_{i1}, µ_{i1} + z·σ_{i1}}, ..., {µ_{iL} − z·σ_{iL}, µ_{iL} + z·σ_{iL}}⟩, ∀i. This way, we conclude N similarity values, one for each dataset, thus, we deliver N context vectors v_i. For deriving the final r_i, we have to find the final similarity between L intervals, i.e., w and f_i. Typical distance/similarity measures (e.g., the Euclidean distance) cannot efficiently manage cases where w is completely contained in f_i. We rely on the research performed in [29], where a study on calculating the distance over interval data is provided. Based on the discussed metrics, we propose the use of the overlapping metric ψ_{ik} to finally deliver r_i as follows: r_i(w, f_i) = h(ψ_{ik}), ∀k ∈ {1, 2, ..., L}. In this equation, we propose the use of the aggregation function h responsible for aggregating the L distance results. Our calculations are exposed by the following equation (applied for each dimension of the adopted dataset):

ψ_{ik} = 1 − ‖w_k ∩ f_{ik}‖ / min{‖(min_k, max_k)‖, ‖(µ_{ik} − σ_{ik}, µ_{ik} + σ_{ik})‖}    (4)

where w_k ∩ f_{ik} = (max{min_k, µ_{ik} − σ_{ik}}, min{max_k, µ_{ik} + σ_{ik}}) for max{min_k, µ_{ik} − σ_{ik}} < min{max_k, µ_{ik} + σ_{ik}}, and the empty interval otherwise. The previous expression defines the interval depicting the 'area' between max{min_k, µ_{ik} − σ_{ik}} and min{max_k, µ_{ik} + σ_{ik}}. We consider that every σ_i vector depicts the result z · σ/√|DS_i|. For h, we adopt the Quasi-arithmetic mean [12], i.e.,

r_i = ((1/L) Σ_{k=1}^L (ψ_{ik})^α)^{1/α}

Based on α, we can alter the results derived by the Quasi-arithmetic mean, e.g., if α = 1, r_i is calculated based on a 'simple' mean.
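The data-awareness computation can be illustrated with the sketch below, assuming that each EN ships per-dimension synopses (µ, σ, |DS_i|) and that query constraints arrive as (min, max) pairs; the function names and the toy values are illustrative only, not the paper's implementation.

import math

def node_intervals(mu, sigma, n, z=1.28):
    # per-dimension confidence intervals built from the node's synopses
    half = [z * s / math.sqrt(n) for s in sigma]
    return [(m - h, m + h) for m, h in zip(mu, half)]

def psi(q_int, f_int):
    # overlap-based distance between a query constraint and a node interval (Eq. 4)
    lo = max(q_int[0], f_int[0])
    hi = min(q_int[1], f_int[1])
    overlap = max(0.0, hi - lo)
    denom = min(q_int[1] - q_int[0], f_int[1] - f_int[0]) or 1e-9
    return 1.0 - overlap / denom

def relevance(w, f, alpha=1.0):
    # r_i: quasi-arithmetic mean of the per-dimension overlap distances
    vals = [psi(q, fi) ** alpha for q, fi in zip(w, f)]
    return (sum(vals) / len(vals)) ** (1.0 / alpha)

# toy usage: a 2-dimensional query against one node's synopses
w = [(10.0, 20.0), (0.0, 5.0)]
f = node_intervals(mu=[25.0, 2.0], sigma=[40.0, 10.0], n=100)
print(relevance(w, f))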
We propose the use of a meta-ensemble learning scheme that will deliver the final allocation of the incoming queries on top of multiple other ensemble schemes. Our ensemble scheme tries to 'optimize' the allocation of every q_t to the available ENs. Let g() be a function that returns the matching degree between q_t and an EN. The higher the g() realization is, the higher the matching becomes. Our selection mechanism deals with the maximization of g() based on queries' and ENs' characteristics. In terms of the description of an optimization problem that fits our scenario, we can denote that our ensemble mechanism tries to maximize g(w, {v_i}_{i=1}^N) subject to our models' results for delivering l_i, s_i, q^s and r_i. Actually, our ensemble model is responsible for detecting the most efficient allocations over a complicated 'reasoning' mechanism that adopts multiple ML schemes.

Any ensemble method aims at the creation of multiple learning models combined to deliver the final decision. The ensemble approach manages to provide better results when adopted for prediction compared with other incremental techniques [2]. Some approaches in ensemble learning are [9]: (i) Voting; (ii) Stacking; (iii) Bagging; (iv) Boosting. For the successful application of an ensemble learning model, we have to secure the diversity of the obtained results. The required diversity is achieved by [9]: (i) the diversity of the models, i.e., we have to use different algorithms or the same algorithms with different parameters; (ii) the diversity of data, i.e., the training data should be different for each of the adopted models. The most critical issue is the combination of the results.
Ensemble schemes attract the attention of researchers working in the machine learning domain based on the view that ensembles are often much more accurate than the individual classifiers [22]. In our research, we rely on a wide range of classifiers and build on the results of already defined ensemble schemes to provide our meta-ensemble model. We adopt widely known ensemble learning models to provide a more powerful learner. We adopt the three 'basic' ensemble models, i.e., (i) the AdaBoost model Y_1; (ii) the Stacking model Y_2; (iii) the Bagging method Y_3. Our meta-ensemble scheme receives the results of the aforementioned 'sub-ensemble' models (i.e., Y_1, Y_2, Y_3) and delivers the final allocation decision Y. It should be noted that the aforementioned individual ensemble schemes already focus on combining individual, 'single' learners, thus, we deliver another, additional, layer of processing. As the aforementioned schemes are 'high-level' ensemble models, they can be applied over any individual learning algorithm. Below, we present the list of the adopted individual learners, incorporating into our 'reasoning' mechanism different types of schemes, e.g., decision trees, probabilistic models, neural networks. This exhibits the ability of our mechanism to be unbounded from the type of individual learners and its strength to combine various outcomes for the discussed allocations. Our meta-ensemble model performs a two-level aggregation of the outputs retrieved by the individual schemes when applied over the characteristics of queries and ENs, trying to derive the most efficient allocations. We have to notice that the outputs of the adopted models should lie in the same interval to be easily aggregated for concluding the final decision.

In [56], the interested reader can study the theoretical background behind the adoption of ensemble machine learning schemes. In short, there are three main theories that explain the effectiveness of ensemble models. The first considers ensemble models under the perspective of large margin classifiers [50]. Characteristics of such schemes are the enlargement of margins and the enhancement of the generalization capabilities of Output Coding [5] and boosting based ensemble algorithms [58]. The second theoretical approach deals with the study of the classical bias/variance decomposition of the error [27], showing that ensembles can reduce variance [10], [46] or both bias and variance [44], [11]. Finally, the third theory is a stochastic discrimination theory focusing on a set-theoretic abstraction to remove all the algorithmic details [38]. Adopting such an abstraction, we are able to 'see' the classifiers as a combination of subsets of points of the feature space and their decisions are also point sets. The set of classifiers is, then, just a sample of the power set of the feature space.

The Adaptive Boosting (AdaBoost) model [25] conveys the basic processing of the Boosting model while combining, in an adaptive way, multiple base learners. Initially, the Boosting model builds a first, weak classifier and, accordingly, a succession of models are built iteratively, each one being trained on a dataset in which points misclassified by the previous model are given more weight. All the adopted models are weighted by their success in providing an accurate classification result and their outputs are combined through voting. In any case, the same training dataset is used over and over again.
The Stacked Generalization (Stacking) model [64] is another meta-learner that aims at combining models of different types. Initially, it splits the training dataset into two disjoint sets and, then, trains several base learners on the first part of the dataset. Accordingly, it tests the base learners with the second part of the training dataset. Afterwards, it uses the predictions obtained in the testing phase as the input and the correct responses as the output to train a higher level learner. The Bootstrap Aggregation (Bagging) model [10] aims at incorporating multiple versions of the training set through the use of sampling with replacement. Every produced dataset is adopted to train a different learning model. The final output is delivered by averaging the individual outputs or by voting when the final result involves a classification process. Bagging is efficient only when using unstable non-linear models because a small change in the training set can cause a significant change in the model.

In the aforementioned meta-learning schemes, we incorporate the following learning algorithms/base learners (to have the necessary diversity in the selection of algorithms): (i) C4.5 decision tree; (ii) Random tree; (iii) Naive Bayes model; (iv) Bayesian Network; (v) Multinomial Naive Bayes model; (vi) Random Forest; (vii) Logistic Model Tree; (viii) REPTree model; (ix) JRip algorithm; (x) Multilayer Perceptron.

Our meta-ensemble model is adopted to realize the aforementioned function g(), i.e., to deliver the matching and a ranking of the available ENs for a query q_t. The proposed model gets as input the ENs' and q_t's characteristics exposed by w and {v_i}_{i=1}^N and results in the most efficient allocation. The efficiency of the allocation is realized by the 'appropriateness' of ENs to host q_t and provide the outcome in the minimum time with the best possible performance. The performance is depicted by the matching between q_t's constraints (i.e., w) and the available datasets as well as the ability of ENs to minimize the response time (i.e., the execution of q_t should start immediately - we target a low load - while being completed as soon as possible - we target a high processing speed). For detecting the most efficient assignment, we adopt the aforementioned ensemble schemes, get their 'opinion' for every potential allocation and combine them through the proposed meta-ensemble model to conclude the final decision.

The meta-ensemble learning scheme is based on a training dataset TD where tuples of context vectors are present. Each training tuple is accompanied by the appropriate decision for the 'virtual' node that it represents. More formally, every training tuple has the form v^D = ⟨o^{TD}, a^{TD}, r^{TD}, l^{TD}, s^{TD}, B⟩. The first part of a training tuple is related to the aforementioned context vectors while the second part (i.e., B) is related to the final classification result. We consider a binary classification setup with two classes, i.e., B = 1: Allocate and B = 0: Do not allocate.
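As an illustration of how the three sub-ensembles could be trained on TD, the following sketch uses scikit-learn (version 1.2 or later) estimators as stand-ins for the base learners listed above (not all of which are available in scikit-learn); the synthetic data and the toy labelling rule are placeholders.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# TD: rows of [o, a, r, l, s]; B: 1 = allocate, 0 = do not allocate (synthetic placeholder data)
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
B = (X[:, 3] < 0.5).astype(int)  # toy label: allocate to lightly loaded nodes

Y1 = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50).fit(X, B)
Y2 = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB()),
                ("mlp", MLPClassifier(max_iter=500))],
    final_estimator=LogisticRegression()).fit(X, B)
Y3 = BaggingClassifier(estimator=RandomForestClassifier(n_estimators=20), n_estimators=10).fit(X, B)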
The three aforementioned ensemble schemes, i.e., the AdaBoost, the Stacking and the Bagging methods, are trained based on TD and are adopted to generate the envisioned results Y_1, Y_2, Y_3. When Y_1, Y_2, Y_3 are produced, we have to combine them to get the final result Y based on which the final decision is concluded. In this effort, we adopt two approaches. The first is a very 'strict' scheme to produce Y relying on the Boolean model originated in Information Retrieval [49]. Based on this model, the final result is delivered through a simple conjunction, i.e., Y = ∏_{i=1}^{3} Y_i. This means that if at least one of the aforementioned schemes 'votes' against the allocation of q_t to a specific EN/QP, the final result will be the class B = 0. To conclude the class B = 1, all the classifiers should agree upon this decision. The second approach is a majority voting scheme where the majority of the classifiers conclude the final result. Hence, when count{Y_i = 1} is at least 2, Y = 1 as well. Our future research plans involve the definition of a more complex methodology for combining the 'opinion' of the adopted ensemble classification schemes. In [26], the interested reader can find a set of aggregation techniques for classifiers' outputs. It should be noted that our decision to incorporate the 'strict' technique together with the majority voting scheme deals with our intention to use two representative schemes that are capable of delivering the final result in limited time requiring limited resources (no additional data structures and variables that consume memory).

Each tuple in TD represents a combination of the adopted parameters involving queries' and processors' characteristics. For the delivery of the ENs where q_t will be allocated, we apply the one-over-all (OVA) methodology [31]. It is a widely adopted technique for multiclass classification with high performance. In [57], a number of optimization algorithms are compared with the OVA model. Although these approaches may have a theoretical interest, it does not appear that they offer any advantage over a simple OVA scheme. In addition, OVA compared to other techniques does not require any additional time for training. In our scenario, N classes are available, one for each EN/QP. Hence, we adopt N binary classifiers built by the above described meta-ensemble model. Actually, we apply N times the same meta-ensemble binary classifier for deciding in which nodes q_t could be allocated. The classification of an unknown vector v^D concerns a voting scheme. If the adopted classifier positively predicts the i-th node for q_t, the i-th EN gets a vote. If the result is negative (i.e., class B = 0), all the remaining nodes except i get a vote. The node(s) with the most votes is selected to host q_t. In the case of ties, we rank the nodes based on their load.
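The two combination rules and the OVA voting can be sketched as follows, reusing the Y1, Y2, Y3 estimators from the previous snippet; the construction of the per-node feature rows is a placeholder of this illustration.

import numpy as np

def combine(preds, mode="CS"):
    # preds: binary outputs of Y1, Y2, Y3 for one EN/QP
    if mode == "CS":                       # conjunction: allocate only if all sub-ensembles agree
        return int(np.prod(preds))
    return int(np.sum(preds) >= 2)         # MVS: simple majority

def allocate(context_rows, models, mode="CS"):
    # context_rows: one context vector per EN/QP; OVA-style voting over N binary decisions
    votes = np.zeros(len(context_rows), dtype=int)
    for i, v in enumerate(context_rows):
        preds = np.array([m.predict(np.asarray(v).reshape(1, -1))[0] for m in models])
        if combine(preds, mode) == 1:
            votes[i] += 1                  # positive prediction: the i-th EN gets a vote
        else:
            votes += 1                     # negative prediction: every other node gets a vote
            votes[i] -= 1
    return np.flatnonzero(votes == votes.max())   # candidate host(s); ties broken by load elsewhere

# usage: contexts = np.random.random((10, 5)); allocate(contexts, [Y1, Y2, Y3], mode="MVS")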
We perform a high number of simulations adopting synthetic traces and datasets found in the literature. In addition, we compare the proposed scheme with our previous work. The aim is to reveal the advantages and the drawbacks of our model when adopted for the allocation of analytics queries. In the upcoming sub-sections, we describe our experimental setup and the assessment of our model.

We perform our experiments with custom software written in Java that contains a number of classes for the simulation of the adopted nodes and the datasets present in them. Initially, we deal with the accuracy of the proposed FCP in detecting the correct complexity class of a query. For this, we define the metric υ as follows:

υ = Q_C / |Q|    (5)

where Q_C is the number of correct predictions. Obviously, we want υ → 1. We define the metric R that represents the discussed throughput, i.e.,

R = |Q| / Σ_{i=1}^{|Q|} T_{c_i}    (6)

where T_{c_i} is the conclusion time (in ms) for the i-th query. The conclusion time is the time required to allocate a query to one of the available nodes just after its reception. R depicts the amount of queries handled by a QC in a time unit (i.e., ms). The higher R is, the higher the performance of the proposed model becomes.

We also adopt the parameters l* and s* depicting the load and the speed of the selected node, respectively. Based on these parameters, we define the metrics D_l and D_s. Both of them represent the 'distance' of l and s of the selected node from the optimal load and speed in the entire group. As the optimal load, we define the lowest load (l_min) observed in the available nodes at the decision time. In a similar way, as the optimal speed, we define the highest speed (s_max) among the available nodes at the decision time. Through the adopted performance metrics, we try to reveal if the proposed model is capable of selecting the best possible node at the decision time. If this is true, our scheme allocates the incoming queries to nodes that will return the final response in the minimum time. The following equations hold true:

D_l = l* − l_min    (7)
D_s = s_max − s*    (8)

We compare the proposed Conjunctive Scheme (CS) and the Majority Voting Scheme (MVS) with our previous work [39], [41], [43] (Model 1 - M1, Model 2 - M2, Model 3 - M3, respectively). We have already discussed the novelty of the current framework in contrast to our previous schemes, thus, for providing a holistic comparison in our experimental evaluation, we deliver numerical results. We evaluate our model for different realizations of N, getting N ∈ {10, 50, 100, 500}. For each experiment, we consider the execution of 1,000 queries and take results for the aforementioned models. For the type of the incoming queries and the delivery of their complexity class, we rely on the dataset and the methodology presented in [40]. For the remaining parameters, i.e., query vectors, a, l, s and the data reported in nodes, we rely on a set of traces. Our data are retrieved from the following traces:

• Dataset 1. A synthetic trace based on the Uniform distribution defining a for each query. For the generation of each value, we consider a maximum a equal to 10.

• Dataset 2. A synthetic trace based on the Uniform distribution delivering values for query vectors, the data present in each node, l and s.

• Dataset 3. A synthetic trace based on the Gaussian distribution delivering values for query vectors, the data present in each node, l and s.

• Dataset 4. The trace reported by [4]. From this dataset, we adopt the processor utilization dimension that depicts the percentage of time that threads are running in a processor. The data are incorporated into our simulator for providing the realizations of l. The remaining parameters take values as described for Datasets 2 & 3.
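For completeness, a small sketch of how the above metrics could be computed from simulation logs follows; the log record structure (per-query conclusion time plus the load/speed snapshot at decision time) is an assumption of this illustration.

def evaluate(log):
    # log: list of dicts with keys 't_c' (ms), 'l_sel', 'l_min', 's_sel', 's_max'
    n = len(log)
    R = n / sum(e["t_c"] for e in log)                   # throughput, queries per ms (Eq. 6)
    D_l = sum(e["l_sel"] - e["l_min"] for e in log) / n  # average distance from the optimal load (Eq. 7)
    D_s = sum(e["s_max"] - e["s_sel"] for e in log) / n  # average distance from the optimal speed (Eq. 8)
    return R, D_l, D_s

# usage with two toy log entries
print(evaluate([{"t_c": 2.0, "l_sel": 0.3, "l_min": 0.1, "s_sel": 0.7, "s_max": 0.9},
                {"t_c": 4.0, "l_sel": 0.2, "l_min": 0.2, "s_sel": 0.8, "s_max": 0.8}]))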
Initially, we report on the accuracy of the proposed FCP in delivering q^s. Recall that the values present in q^s depict the 'membership' of each query to the available complexity classes. Executing the FCP for the queries dataset reported in [40], we get υ = 1.0 when the class with the maximum membership value in q^s is adopted as the prediction. As already described, it is difficult to assign an individual complexity class to each query, thus, the maximum value of q^s represents the class that exhibits the highest similarity with q_t. If we set the membership threshold equal to 0.9, υ → 0.55.
It is worth noticing that the maximum value in q^s in the entire set of our experiments is over 0.8.

We report on the probability density estimate (pde) of the adopted performance metrics achieved by the proposed model. In Figure 3, we see our results for T_c (the time required, in ms, for the allocation of a query). We observe that the uniformity of the data (upper plot) leads to the uniformity of the discussed performance metric. In contrast, the use of the Gaussian distribution (lower plot) leads to results affected by the number of the available ENs. In the latter case, a low number of ENs leads to low T_c values. In these scenarios, the throughput of QCs is high, managing to allocate multiple queries in a limited time interval. It should be noted that the discussed results refer to the CS. If we pay attention to the MVS (see Figure 4), we observe the uniformity of T_c as well. However, the higher N is, the higher T_c becomes. This is natural, as QCs should process multiple context vectors as inputs into the matching process. In any case, if we consider that the provided results are in the scale of ms, we can conclude that the proposed mechanism, even if it involves multiple classification models, manages to deliver the final allocation in a limited time interval.

Figure 3: Pde of the allocation time for CS (up: Uniform distribution, down: Gaussian distribution).

Figure 4: Pde of the allocation time for MVS (up: Uniform distribution, down: Gaussian distribution).

In Figures 5 & 6, we present our results related to l* for both CS and MVS. The higher N is, the lower l* becomes. This is more intense in the MVS. A high number of ENs gives us more opportunities to seek the best possible node. The provided classification scheme and the OVA approach manage to conclude nodes that exhibit low load, thus, they can deliver the final result in a limited time (however, this finally depends on the complexity of the process demanded by the query). A low number of ENs may lead to a higher load compared to the previous scenario in both models, i.e., CS and MVS.

Figure 5: Pde of l* for CS (up: Uniform distribution, down: Gaussian distribution).

Figure 6: Pde of l* for MVS (up: Uniform distribution, down: Gaussian distribution).

In Figure 7, we plot the throughput of the proposed mechanisms. The throughput is minimized when we have to deal with multiple nodes; a natural consequence of the involvement of multiple processes until the final decision. The CS exhibits the best performance among the models; however, the difference between them is limited as N increases. When N = 10, the CS exhibits the worst performance when the Uniform distribution is adopted to produce the values of our parameters. In terms of the number of queries allocated in a second, we get that the CS can allocate 420 to 35 queries while the MVS can allocate 430 to 16 queries.
In Figure 8, we plot our results related to the optimality of the proposed model as far as D_l is concerned. The CS outperforms the MVS except in one experimental scenario (N = 10). Recall that if at least one of the adopted ensemble schemes rejects the allocation, the query is not classified to the corresponding ENs. This means that the CS tries to take a decision under unanimity, which is not the case in the MVS. The significant observation is that, as N → 500, the CS results in a distance equal to 0.016 & 0.022 for the Uniform and the Gaussian distributions, respectively. The MVS results in a distance equal to 0.038 & 0.110 for the Uniform and the Gaussian distributions, respectively. In any case, these numbers exhibit the efficiency of the proposed scheme, keeping in mind that we plot the average D_l throughout our experiments.
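The behavioural difference described above can be summarized with two tiny decision rules; here we assume that the CS corresponds to the unanimity-based aggregation and the MVS to simple majority voting over the ensemble schemes, in line with the discussion, while the classifiers themselves and the remaining steps of the matching process are those defined earlier in the paper.

    def cs_accepts(votes):
        # Unanimity: a single rejection by any ensemble scheme blocks the allocation.
        return all(votes)

    def mvs_accepts(votes):
        # Majority voting: the allocation is kept if more than half of the schemes accept it.
        return sum(votes) > len(votes) / 2

    votes = [True, True, False]                  # hypothetical outputs of three ensemble schemes
    print(cs_accepts(votes), mvs_accepts(votes)) # False True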
Figure 8: The load of the selected node as delivered by the proposed model compared to the lowest available load.

The next set of our experiments deals with the speed of the selected node. We should note that the speed of processors is randomly selected as depicted by the aforementioned traces, with a maximum value equal to 10. In Figure 9, we present the relevant results. The distance from the optimal speed in the group increases as N increases. Especially when N = 500, the distance is high. We have to compare these results with the results derived for D_l. The increased number of nodes positively affects the load of the selected node while it negatively affects the speed of the selected EN. The proposed model pays more attention to the minimization of the load; however, we should note that this is affected by the training set adopted to prepare the ensemble classifiers for decision making. This dependence on the training set is 'typical' for any supervised machine learning algorithm; thus, in our future research plans we want to build an ensemble model based on unsupervised techniques and compare the performance.
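For clarity, the two optimality metrics can be read as distances from the best value available in the group of candidate ENs, as the captions of Figures 8 and 9 suggest. The sketch below reflects that reading; any normalization applied in the actual evaluation is not reproduced here, and the example loads and speeds are hypothetical.

    import numpy as np

    def load_distance(selected_load, node_loads):
        # D_l: gap between the load of the selected node and the lowest load in the group.
        return float(selected_load - np.min(node_loads))

    def speed_distance(selected_speed, node_speeds):
        # D_s: gap between the highest speed in the group and the speed of the selected node.
        return float(np.max(node_speeds) - selected_speed)

    loads = [0.12, 0.45, 0.10, 0.33]   # hypothetical EN loads
    speeds = [4.0, 9.5, 6.2, 8.1]      # hypothetical EN speeds (maximum value 10 in our traces)
    print(load_distance(0.12, loads), speed_distance(6.2, speeds))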
Figure 9: The speed of the selected node as delivered by the proposed model compared with the highest available speed.

In Figure 10, we plot the l∗ realizations, i.e., the average load of the selected nodes for each of the 1,000 queries. The load is limited for a high number of nodes, which means that the burden added to ENs is minimized. The interesting observation is that the statistical error in our measurements is also limited (for the majority of the scenarios), exhibiting the stability of our model. In this set of our experiments, the CS exhibits the best performance compared to the MVS.
Figure 10: The load of the selected node as delivered by the proposed model.

The next set of experiments deals with the dataset provided by [4]. Our results are presented in Figures 11 & 12. The outcomes confirm our previous observations made when adopting the synthetic traces. An increment in the number of ENs positively affects the performance of the model and leads to the selection of nodes with a low load. The average load of the finally selected node is similar for the CS and the MVS.

Figure 11: Pde for the load of the selected node based on Dataset 4 (up: CS, down: MVS).
Figure 12: The load of the selected node as delivered by the proposed model based on the real trace.

We also report on our model's sensitivity to the parameters L and α. Recall that L depicts the number of dimensions in the available datasets and α is adopted to 'smooth' the result of the quasi-arithmetic mean when we calculate q_s and r_i. In Figure 13, we consider Dataset 4 and L ∈ {10, 50} to present our results for D_l. We observe that the difference with the optimal (minimum) load increases as N increases; however, D_l is below 0.015 for both, i.e., the CS and the MVS. Apart from that, we can conclude that L does not heavily affect the performance of the proposed approach. Our mechanism can be adopted for a high number of dimensions to conclude on the best possible node to host the incoming queries. It is worth noticing that an increased L will lead to an increased processing time, affecting R. In Figure 14, we present our results for three α realizations and for Dataset 4. Again, the proposed scheme is not heavily affected by the α realization, exhibiting efficiency in the detection of the best possible node to host the incoming queries. The majority of results are below 0.01 (the selected node is very close to the node exhibiting the lowest load).
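To illustrate the smoothing role of α, the following sketch computes a quasi-arithmetic mean with an exponential generator; the generator actually adopted for q_s and r_i is the one defined earlier in the paper, and the α values used in the example are arbitrary.

    import numpy as np

    def quasi_arithmetic_mean(values, alpha):
        # M = f^{-1}( mean( f(x_i) ) ) with the exponential generator f(x) = exp(alpha * x).
        # Alpha close to 0 behaves like the plain arithmetic mean; larger alpha stresses
        # the higher values of the aggregated vector.
        x = np.asarray(values, dtype=float)
        if abs(alpha) < 1e-12:
            return float(x.mean())
        return float(np.log(np.mean(np.exp(alpha * x))) / alpha)

    memberships = [0.7, 0.2, 0.1]      # e.g., a q_s-like vector
    for a in (0.5, 1.0, 5.0):
        print(a, round(quasi_arithmetic_mean(memberships, a), 4))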
Figure 13: Our sensitivity results for various L realizations (L ∈ {10, 50}).
Figure 14: Our sensitivity results for various α realizations.

We also compare the proposed model with our previous efforts in the domain. The comparison refers to the throughput of QCs as well as the load of the selected node. Our model results in an R varying from 0.42 to 0.022 for N ∈ {10, 50, 100, 500} and an average l∗ that remains low for both the CS and the MVS. The model M1 discussed in [39] adopts an optimal stopping scheme that sequentially examines a set of processors before it delivers the final allocation. M1 results in R values between 0.02 (for N = 50) and 0.27 (for N = 10). Hence, for a low number of nodes the CS and the MVS outperform M1, while in the remaining scenarios we observe a similar performance. The model M2 presented in [41] adopts a learning mechanism accompanied by a load balancing module. The discussed scheme manages to allocate 58.48 queries for N = 100 up to 454.55 queries for N = 20. The average load of the selected node (the learning scheme) is around 0.10 for the adopted N realizations. The average load for the clustering scheme is between 0.160 and 0.670. Our current models outperform the clustering scheme and exhibit a similar performance with the learning model. Finally, the model M3 discussed in [43] adopts a 'simple' learning scheme for the decision making. The performance of M3 related to the throughput is 50-100 queries per time unit when N → 500. As far as the l∗ parameter is concerned, the current model manages to conclude on nodes with a low load and, in many cases, performs better than our previous efforts.

Conclusions
The present status of the Web, IoT and pervasive computing deals with the presence of numerous devices able to collect and process data. All these data are transferred to the Cloud to be the subject of further processing and knowledge production. Usually, such knowledge has the form of analytics that are the response to queries defined by users or applications. In this paper, we focus on an edge computing scenario where the collected data could be processed in edge nodes to reduce the latency in the provision of responses. Knowledge can be produced locally, close to the source of data and end users. A number of edge nodes can serve as aggregation points of data reported by the surrounding devices. As data are geo-distributed, a critical question is raised: in which edge node should a query be executed? Our paper tries to respond to this question and presents a model that efficiently allocates the incoming queries to the appropriate nodes. We provide a decision making mechanism, i.e., a meta-ensemble learning model responsible for selecting the best possible node to which each query is allocated. The decision is made on top of multiple parameters related to the queries as well as to the available nodes. Our model takes into consideration the form of queries, their complexity, and the deadline for having the final response, together with the load and the speed of nodes. More importantly, we propose the use of a parameter that represents the distance between queries and the datasets present in nodes. The aim is to exclude nodes that do not own data related to the conditions raised by queries, as these nodes cannot provide responses that 'match' the requirements of the incoming queries. Our meta-ensemble learning scheme adopts multiple ensemble learning models to support a powerful decision making mechanism. We describe the performance of our framework through a large set of simulations adopting synthetic traces and traces found in the literature. Our evaluation results show that our model is capable of efficiently allocating the incoming queries towards the support of (near) real time responses. At the top of our future research plans is the definition of an adaptive learning model that will be fully aligned with the environment of the edge nodes. This is important as it will save resources devoted to the training process through an 'incremental' training approach.
Acknowledgment
This research received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 745829.