An Algebra to Merge Heterogeneous Classifiers
arXiv [cs.DM]
Philippe J. Giabbanelli (a,*), Joseph G. Peters (b)
(a) University of Cambridge, Cambridge, CB2 0QQ, United Kingdom
(b) School of Computing Science, Simon Fraser University, Burnaby, British Columbia, V5A 1S6, Canada
Abstract
In distributed classification, each learner observes its environment and deduces a classifier. As a learner has only a local view of its environment, classifiers can be exchanged among the learners and integrated, or merged, to improve accuracy. However, the operation of merging is not defined for most classifiers. Furthermore, the classifiers that have to be merged may be of different types in settings such as ad-hoc networks in which several generations of sensors may be creating classifiers. We introduce decision spaces as a framework for merging possibly different classifiers. We formally study the merging operation as an algebra, and prove that it satisfies a desirable set of properties. The impact of time is discussed for the two main data mining settings. Firstly, decision spaces can naturally be used with non-stationary distributions, such as the data collected by sensor networks, as the impact of a model decays over time. Secondly, we introduce an approach for stationary distributions, such as homogeneous databases partitioned over different learners, which ensures that all models have the same impact. We also present a method that uses storage flexibly to achieve different types of decay for non-stationary distributions. Finally, we show that the algebraic approach developed for merging can also be used to analyze the behaviour of other operators.
Keywords:
Model combination; Non-stationary distributions; Unsupervised meta-learning
1. Introduction
A wide variety of systems, such as sensor and peer-to-peer networks, are composed of independent entities. These entities record observations from their local environments, for example as databases of observed tuples. The entities often have to adapt to future changes, but their local views may be too restricted to compute classifiers that can make accurate predictions. Therefore, a global classifier that can be used to predict global trends is often desirable. Entities might not compute a global classifier by exchanging their datasets directly, because of security concerns, or because the large quantity of information could overwhelm limited resources such as local storage or battery power. Thus, the global classifier has to be realized by combining the classifiers deduced by the entities. The need to combine classifiers also arises in other situations. Several examples illustrate the need to combine classifiers in Section 2, and others can be found in the series of annual International Workshops on Multiple Classifier Systems [1].

Different ways have been proposed to realize a global classifier (e.g., ensemble classifiers, meta-learning). These ways have been grouped differently, but a key distinction can be found between fusion and selection ([17], p. 106). Intuitively, classifiers are specialized when they are restricted to instances with specific features: when a new instance comes in, the appropriate classifier is thus selected. Alternatively, classifiers may be designed for instances with the same features, and they are then fused. In this paper, we are concerned with the intermediate case: classifiers can have an overlapping feature space, but some may be more competent in parts of that space than others. This is typically reflected in approaches that weight the outputs of the classifiers depending on the instance being classified [17].
To emphasize that our approach is intermediate, we say that we merge classifiers instead of selecting or fusing them.

Merging classifiers is a complicated operation for several reasons. Firstly, classifiers can be of different structures, such as decision trees and support vector machines. Secondly, conflicts arise when classifiers differ on their predictions. Solving these conflicts either uses heuristics (e.g., pure meta-learning [6, 28, 12]), or requires access to the dataset used to train the classifiers, which is undesirable in settings such as sensor networks due to the high cost of transmission. In this paper, we consider the case of meta-learning in which the classifiers are potentially of different types. The concept of meta-learning and the precise types are described in Section 3.

* Corresponding author. Phone: +44 (0)1223 330315, Fax: +44 (0)1223 330316. Email addresses: [email protected] (Philippe J. Giabbanelli), [email protected] (Joseph G. Peters)

Most of the previously proposed solutions have been application-specific and were validated through experiments, leading researchers to suggest that questions about combining classifiers are (still) open “because we do not yet have a scientific understanding of the classifier combination mechanisms” [16]. Consequently, our approach emphasizes the formal aspects. In Section 4, we introduce decision spaces as an algebraic framework to investigate the behaviour of distributed data mining algorithms in which learners propagate only local models and not local observations. This framework can be used in a large variety of cases since it allows the combination of different models (e.g., decision trees, rule sets, support vector machines with linear kernels) and does not rely on homogeneous observations. In Section 5, we present a heuristic merge operator that solves all conflicts between models without using their observations. We use the algebra to prove that the operator satisfies a set of desirable properties, such as commutativity: when two models are merged, the result does not depend on which model comes first. However, the operator is not associative: if three or more models have to be merged, then the order in which the operator is applied affects the result.
This can be a desirable behaviour in settings such as data streams [20], in which the distribution changes over time and the most recently received data is considered to be the most representative of current trends.

In other situations, such as homogeneous distributions, this behaviour is not appropriate. For example, a massive homogeneous database could be partitioned among computation units for the sake of efficiency, and the contribution of each unit to the final result should not depend on the time at which it sends its model. In Section 6, we develop two protocols, or schemes. The first one uses storage flexibly to achieve several types of decay in a data stream setting, and can be used for the case of a homogeneous database. The second scheme uses no storage space and has no decay.

Our formal framework can also be used to characterize other common, yet difficult, problems of data mining. For example, a learner can generate a sequence of models and analyze this sequence to find patterns of changes in the underlying system. In a data stream setting, this is referred to as a blind method operating over a sliding window. In Section 7, we define a reduction operator that reduces a sequence of models to a single model to permit easier analysis of the sequence, and we characterize its properties using an algebraic approach. We also briefly discuss how more complex operators can be defined using compositions of the reduction and merge operators.
2. Applications
Let us consider that we are monitoring the incidence of a chronic disease in a city. For each part of the city, we want to know whether the incidence is high enough to prompt a specific action. Different groups provide maps of the disease, but these groups might not use the same spatial unit. One group may have divided the city into blocks while another one uses administrative borders. Furthermore, reports may have been released in different months, and administrative borders may have evolved. Thus, we need to merge maps reporting on the incidence with different units. Specifically, we do not have access to individual data (i.e., the data points upon which the maps are based), and we must guarantee that all maps have the same impact in our composite picture. The decision space framework that we propose is able to handle these requirements, and the algebra can be used to prove properties of the composite map. Our framework can also handle the situation in which the monitoring would be done over a long period of time, and more recent maps need to have a higher impact in the composite map. Note that, in our framework, each element of the map has to be a polygon; if a region is delimited by curves then it will have to be approximated.

The problem of combining maps is classically faced by geographical information systems (GIS). In the early 1980s, Tomlin proposed to consider maps as two-dimensional arrays and to design script-like languages to manipulate them in GIS [26]. These languages allow operations such as the selection of areas (e.g., School area = distance(SCHOOLS) < m), as well as the combination of selected areas using numerical operations. Thus, map algebras are specialized programming languages for manipulating the cells of grids, which now offer the possibility to process maps within GIS using cellular automata or image-processing techniques [7, 22]. While these algebras can be used to combine maps [9], there is no (mathematically proven) control of the impact that each map has on the final combination.

2.3. Wireless sensor networks

Wireless sensor networks [2] are a natural application for our approach. They can be found in applications such as forest fire detection, where temperature sensors measure the current temperature and its rate of change. They are also used to study earthquake activity by measuring the strength and duration of seismic waves in the earth’s crust. The measurements are forwarded to collection nodes, which forward the data to a base station for processing. A typical sensor has a modest battery life which can be quickly drained by the transceiver when sending large amounts of data. Thus it is infeasible to forward all of the collected data to the base station. However, sensors have the computing capabilities to perform summaries such as averages and standard deviations which can then be sent to the base station. In our approach, each sensor can construct a classifier based on its local observations, and send it as a summary. This summary is richer than a simple average when the goal is to make predictions. This richness may come at a cost in some classifiers: for example, a model derived using random forest classifiers [5] may be heavier than its training data, and would not be appropriate for wireless sensor networks. However, the model is usually lighter when it simplifies the data (e.g., using pruning in decision trees), and we focus on classifiers that lead to such models.
Once a classifier is sent by each sensor, a collection node can merge the classifiers that it receives to form a model of a region before forwarding it toward the base station. The merging of classifiers provides a trade-off between battery consumption and prediction accuracy. The merging improves the accuracy of predictions compared to sending simple averages in exchange for an increase in battery consumption. The merging also provides a transparent means to give greater weight to the most recent observations when the network is deployed over a long time.
3. Background
Our work focuses on classifiers. A classifier learns from a dataset of example instances how target attributes depend on the values of predictive attributes. Then, it can address the classification problem of predicting the target attributes of a new instance based on the values of its predictive attributes. Without loss of generality, we explain our framework by focussing on one target attribute, but the operations can be applied for any number of target attributes.

Formally, an instance has the form $(a_1, \ldots, a_m, y)$ where $a_i$ is the value of the $i$-th attribute, and $y$ is a class label for this data. For example, suppose that there are two instances, one with label $y = \text{No}$ and one with label $y = \text{Yes}$, where the single attribute $a_1$ is an age (in a range starting at 0) and the label is either Yes or No, corresponding to an individual being classified as old or not. The goal is to find the best approximation to the function $f(a_1, \ldots, a_m) = y$ that determines the label of an unclassified instance given the values of its attributes. Three of the most commonly used techniques to train a classifier, i.e., to deduce an approximation of $f$ given a set of instances, are [15, 14]:
• A Support Vector Machine (SVM) classifies the instances into two classes by separating them with an $(m-1)$-dimensional hyperplane.
• One of the most popular classifiers is the decision tree. A decision tree learner [23] applies a divide-and-conquer technique to recursively split the data using the value of an attribute while maximizing a metric such as the information gain or Gini index. Each split is represented as a node and the recursive procedure yields a tree. Thus, a path in the tree corresponds to a set of conditions on the values of the attributes, and leads to a class distribution vector expressing the percentage of instances in each class.
• A classifier can also be a set of rules based on the values of the attributes. This is commonly referred to as a rule set [8].
A rule is a conjunction of conditions on the attributes that results in a class distribution vector, as in a decision tree. An attribute can appear at most twice in a rule, to specify a lower bound and an upper bound of an interval.

In meta-learning, each learner builds a classifier using only its own data. For example, each unit in a network of radar sensors can collect data about the movements of targets within its radius. Then, each sensor uses its own data to deduce a classifier such as a decision tree. In such an environment, all sensors may want to have a classifier that is as accurate as possible and thus they exchange their classifiers and merge them. In another setting, one may want to achieve a speedup by using parallel algorithms and thus a homogeneous database is partitioned among different computation units, either by providing them with subsets of observations or subsets of attributes (vertical partitioning [24]). Then, the units send their classifiers to common sources in charge of merging. In both homogeneous databases and distributed environments, the problem is to merge a set of classifiers to create a single classifier.

A number of researchers have focussed on merging decision trees: it was noted in [3] that “a kind of decision tree induction [that is] efficient in a wide area system employs meta-learning, [in which] each computer induces a decision tree based on its local data and then the different models are merged to form the final tree”. For example, it has been proposed [13] to transform decision trees into the sets of rules that they represent and to merge those sets. A rule not in conflict with other rules is kept intact and otherwise the conflicts are resolved using a heuristic. However, there are two potential problems with the design of the heuristic proposed in [13].
Firstly, the authors argue that conflicts that are not handled by the heuristic are “unlikely if the training sets contain similar distributions of examples from a coherent larger training set”, thus the approach is limited to homogeneous databases. Secondly, their algorithm could ask a learner to send all of the data on which there is a conflict in order to perform data mining again, and this prevents uses in settings such as sensor networks.

The requirement that the data needs to be examined in case of conflicts is found in several other approaches. For example, arbiter meta-learning [27] is a technique that merges classifiers in a hierarchical way. This technique has been designed for homogeneous databases, and parts of the data must be propagated during the combination process. Ganti and colleagues [11] focused on a different problem: they examined how one could quantify the difference between two datasets by analyzing their models, which can be used to determine whether a model should be updated. Their concepts are similar to ours in that they considered a model as a geometrical space divided into units, each of which is assigned a value, and combining two models requires scanning the data. The key difference of our framework, introduced in the next section, is its ability to merge potentially heterogeneous classifiers without requiring any data besides the models.
4. A framework: decision spaces
Intuitively, a decision space is an $m$-dimensional space, in which each dimension corresponds to one of the $m$ attributes. It contains a set of non-overlapping elements which, if they cover all the space, form a partition of the space. A geometrical interpretation of an element is a subspace defined by an $m$-polytope. Thus, the only requirement for the classifiers considered in this paper is that their elements have to be polytopes, which is the case for SVMs with linear kernels, decision trees, and rule sets. Our study of polytopes is a first step to investigating the theoretical behaviour of pure meta-learning. Further research could remove the restriction to polytopes to generalize our approach to handle other classifiers such as SVMs with non-linear kernels [4]. Informally, non-linear kernels can define shapes in terms of curves whereas polytopes can only use lines.

Each element, or polytope, has a class distribution vector that specifies the percentage of instances within the covered space that are assigned to each class. Determining the percentages by enumerating the instances is straightforward for the three classifiers considered here. An example of a decision space is shown in Figure 1(a): it has two attributes, degree and age, and three elements, each with a class distribution vector of size 2 (with classes Yes and No). Any classifier is capable of producing a label for a given instance, and can be repeatedly prompted for labels over a discrete space. Therefore, any classifier can provide a class distribution vector. In Xu’s categorization of the outputs used to combine classifiers, our requirement is the most universal (Type 1) [29]. These concepts are formalized in Definitions 1 and 2.

Definition 1. A decision space is an $m$-dimensional (product) space $D_{attr_1} \times \cdots \times D_{attr_i} \times \cdots \times D_{attr_m}$ where $m$ is the number of attributes and $D_{attr_i}$ is the space covered by the $i$-th attribute specified as a bounded poset (i.e., a partially ordered set with a least and a greatest element).

Definition 2. An element of an $m$-dimensional decision space $D$ is a subspace of $D$, i.e., an $m$-dimensional polytope ($m$-polytope). It is identified by a set of coordinates for each attribute, and has a class distribution vector $V$ with $c$ components, where $c$ is the number of classes. The $i$-th component of the vector is the percentage of instances in class $i$, which is obtained by counting all instances that have class $i$ and are within the element’s space. Thus, the content of the vector sums to 100%. We will refer to the vector $V$ as the value of the element.

Figure 1: Decision tree (a) and decision space (b) representations.

The constraints on the structure of a classifier are used by a data mining algorithm to guide the search. A decision space is a framework and does not result directly from a data mining algorithm, thus its structure is less constrained than classifiers such as decision trees. This ensures that elements from several kinds of classifiers can be converted to elements of a decision space with no loss of information. To convert a classifier into a decision space, the only data required besides the classifier itself are the ranges of attributes for the dataset on which the classifier was trained. These ranges can be deduced in one pass over the dataset by scanning for the maximum and minimum values. The ranges can also be user-supplied, but should not be smaller than what is found in the dataset, for consistency’s sake. Thus, given a classifier and the ranges of the attributes, the main task of the conversion is to extract the individual elements, or polytopes.
The polytopes for elements of the three classifiers defined in Section 3, in order of increasing constraints on the shapes, are as follows:
(1) The elements in an SVM can have the most general shapes because the data can be assigned into regions.
(2) A rule in a rule set defines an axis-parallel rectangle.
(3) Each path of a decision tree (Figure 1(b)) can be converted to a rule, and this rule defines an axis-parallel rectangle (Figure 1(a)).

A decision tree cannot generate all axis-parallel rectangles. Indeed, a decision tree belongs to the data mining family of divide-and-conquer algorithms that imposes constraints on the search. Intuitively, a cut in the space along the border of an element, either vertical or horizontal, should not cut any element [10]. An example of a set of rules that violates this constraint is given below, and shown in Figure 2.
IF age ≥ and age < and degree ≥ and degree < A
IF age ≥ and age < and degree ≥ and degree < B
IF age ≥ and age < and degree ≥ and degree < C
IF age ≥ and age < and degree ≥ and degree < D
IF age ≥ and age < and degree ≥ and degree < E

Figure 2: A partition of space not allowed by decision trees but allowed by decision spaces.

The elements of rule sets and decision trees can be converted to elements of a decision space using Algorithm 1. The algorithm uses pattern-matching. For example, in line 4, $attr_k \; OP = \{<, \le\} \; val_k^{up}$ is a pattern for which the value of an attribute has to be lower or strictly lower than the value $val_k^{up}$. If the pattern is found, then $attr_k$, $OP$, and $val_k^{up}$ are bound to the actual values. For example, matching a condition of the form age < $v$ binds $attr_k$ = age, $OP$ = “<”, and $val_k^{up} = v$. If no lower bound is found for $attr_k$ (line 15), then we use the lower bound of the attribute’s range, which we denote $\min(D_{attr_k})$. Finally, if no upper bound is found for $attr_k$ (line 20), we use the upper bound of the attribute’s range, which we denote $\max(D_{attr_k})$.
Converting elements from a support vector machine follows a simpler process: each element of the decision space corresponds to exactly one element of the SVM with the same distribution vector and the same coordinates specifying the space spanned.
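The conversion of a single rule into an axis-parallel element can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Interval`, `Element`, and `rule_to_element`, the representation of a rule as a list of `(attribute, operator, bound)` triples, and the numbers in the usage lines are all assumptions introduced here. As a further assumption, an attribute that the rule does not mention is given its whole range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One-dimensional space covered by an element for one attribute."""
    low: float
    high: float
    low_closed: bool = True
    high_closed: bool = True


@dataclass
class Element:
    """An axis-parallel element: one interval per attribute plus its value."""
    space: dict   # attribute name -> Interval
    value: tuple  # class distribution vector (percentages summing to 100)


def rule_to_element(conditions, value, ranges):
    """Convert one rule into an element (hypothetical helper).

    conditions: list of (attribute, operator, bound) with operator in
                '<', '<=', '>', '>='.
    value:      class distribution vector predicted by the rule.
    ranges:     attribute name -> (min, max) of the training data.
    """
    space = {}
    for attr, (range_min, range_max) in ranges.items():
        # A missing bound defaults to the corresponding end of the range.
        low, high = range_min, range_max
        low_closed = high_closed = True
        for name, op, bound in conditions:
            if name != attr:
                continue
            if op in ("<", "<="):
                high, high_closed = bound, op == "<="
            elif op in (">", ">="):
                low, low_closed = bound, op == ">="
            else:
                raise ValueError(f"unsupported operator: {op}")
        space[attr] = Interval(low, high, low_closed, high_closed)
    return Element(space, value)


# Hypothetical rule: IF age >= 7 AND degree < 3 THEN (20% Yes, 80% No)
ranges = {"age": (0, 9), "degree": (0, 11)}
element = rule_to_element(
    [("age", ">=", 7), ("degree", "<", 3)], (20.0, 80.0), ranges
)
```

Storing the open or closed status of each endpoint keeps the conversion lossless for rules that mix strict and non-strict comparisons.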
Algorithm 1 RulesToSpaces(Ruleset R, Attribute ranges D_attr1 × · · · × D_attrM)
Require: Rules are expressed in the following form: r := IF attr_1 ≍ val_1 AND · · · AND attr_m ≍ val_m THEN class = X
1. Decision space S ← ∅ // Initially, the decision space is empty
2. for r ∈ R do // Each rule of the rule set is converted to one element
3.   element e.value ← r.X // The element's value is the one predicted by the rule
4.   P ← all patterns attr_k OP={<, ≤} val_k^up in r // A rule is a conjunction of patterns, each delimiting a bound
5.   for p ∈ P do // Use each upper bound to find the element's space
6.     if ∃ a pattern attr_k OP′={>, ≥} val_k^low in r then // Check if there is a lower bound on the same variable
7.       if OP is < and OP′ is > then // The two bounds do not include the endpoints
8.         e.space ← e.space ∪ (val_k^low, val_k^up)
9.       else if OP is < and OP′ is ≥ then // Only the lower endpoint of the bound is included
10.        e.space ← e.space ∪ [val_k^low, val_k^up)
11.      else if OP is ≤ and OP′ is > then // Only the upper endpoint of the bound is included
12.        e.space ← e.space ∪ (val_k^low, val_k^up]
13.      else // Both endpoints of the bound are included
14.        e.space ← e.space ∪ [val_k^low, val_k^up]
15.    else // We only have an upper bound on the element
16.      if OP is < then
17.        e.space ← e.space ∪ [min(D_attr_k), val_k^up)
18.      else
19.        e.space ← e.space ∪ [min(D_attr_k), val_k^up]
20.  if P = ∅ then // If there was no upper bound
21.    P ← all patterns attr_k OP′={>, ≥} val_k^low in r // Then access the lower bound
22.    for p ∈ P do
23.      if OP′ is > then // The lower bound is excluded
24.        e.space ← e.space ∪ (val_k^low, max(D_attr_k)]
25.      else
26.        e.space ← e.space ∪ [val_k^low, max(D_attr_k)] // The lower bound is included
27.  S ← S ∪ e // Add the element to the decision space

5. Merge operator
The most fundamental operation on decision spaces is merge. In the following, we use the term “element” to refer to a geometrical space spanned, and “value” to refer to a class distribution vector. Given two decision spaces $X$ and $Y$, we merge them into $Z$ using the following principles:

Merge principles
(1) If an element (subspace) $x \in X$ does not intersect with any element $y \in Y$, then the prediction represented by $x$ has no conflicts and can be added to $Z$. An element $y \in Y$ with no conflicts is treated similarly.
(2) If an element $y \in Y$ is strictly contained within an element $x \in X$ with the same value (i.e., the same class distribution vector $V$), then we consider $y$ to be too specialized (as explained in Definition 5) and delete it.
(3) If neither of the first two conditions is satisfied, then the element $x \in X$ intersects with at least one element of $Y$ and conflicts must be resolved.

Prior to establishing the algorithm to merge elements, we introduce the formal notation on which it relies, and specify the ways that elements can intersect. From here on, we use the following notation for an element $x \in X$:
• $x$ has a set of attributes $A(x)$.
• Each attribute $a \in A(x)$ covers a (one-dimensional) space $S(x, a)$. It can be an interval, for example.
• The size of the space covered by $x$ for the attribute $a$ is denoted $|S(x, a)|$.
• $x$ has a class distribution vector $V(x)$.

Definition 3. An element $x \in X$ subsumes an element $y \in Y$, denoted $y \preceq x$, if $A(x) = A(y)$ and $\forall a \in A(x)$, $S(y, a) \subseteq S(x, a)$.

In other words, an element $y$ is subsumed by an element $x$ when the space covered by $y$ is included in the space covered by $x$. We denote strict subsumption by $\prec$ when $\forall a \in A(x)$, $S(y, a) \subset S(x, a)$. The main property of subsumption is established by Theorem 1.

Theorem 1. Let $X$ and $Y$ be two decision spaces. For each $y \in Y$, there is at most one $x \in X$ such that $y \preceq x$.

Proof. The elements of $X$ and $Y$ partition the spaces formed by $X$ and $Y$, respectively. Thus, one element (i.e., component of a partition) can be completely included in at most one other element.

Definition 4. The intersection of an element $x \in X$ with a decision space $Y$ is denoted $x \uplus Y$ and is the largest subset of elements $I = \{y_1, \ldots, y_n\} \subseteq Y$ such that for each $y_i \in I$, $\exists a \in (A(x) \cap A(y_i))$ such that $S(x, a) \cap S(y_i, a) \neq \emptyset$. Subsumption is a special case of intersection.

The first two principles of merging can be handled by the notions in Definitions 3 and 4. For the third principle, we resolve each conflict between two elements $x \in X$ and $y \in Y$ by creating a new element $z$ for each intersection. A simple approach would be to assign to $z$ a value that is the average of the values of $x$ and $y$, but it would not take into consideration the spaces covered by $x$ and $y$ (i.e., their regions of competence). Consequently, Definition 5 specifies how to measure the space covered by an element, which we call its specialization (also known as competence). Then, Definition 6 calculates the value of $z$ as a weighted average based on the spaces covered by $x$ and $y$. It should be noted that Definition 6 is a classical combination scheme intermediate between classifier fusion and selection (cf. Introduction). In these schemes, all classifiers contribute to the outcome of a given space with weights based on their competence for that space [17].

Definition 5. The specialization of an element $x \in X$ is
$$M(x) = \frac{\sum_{a \in A(X)} |S(x, a)|}{|A(X)|}.$$
That is, we sum the sizes of the spaces covered by $x$ over the attributes and we normalize by the number of attributes. Small values of $M$ indicate specialized predictions based on small spaces. Other possible metrics could take into account the number of instances, or the class distribution vector. However, in the case of pure meta-learning investigated in this paper, we cannot use the number of instances or the class distribution vector in a specific part of an element. This could only be done by requiring the original classifier to perform an exhaustive search in its dataset, and repeated use of such an operation could lead to significant overhead. Therefore, such alternative metrics are not considered here.

Definition 6.
The value of the intersection of two elements $x \in X$ and $y \in Y$ based on their specializations is the class distribution vector
$$V(x \otimes y) = V(x) \times \frac{M(x)}{M(x) + M(y)} + V(y) \times \frac{M(y)}{M(x) + M(y)}.$$
If the elements predict several attributes, then the class distribution vector for each attribute is computed using the above formula. Other formulae for $V(x \otimes y)$ could be designed to suit application-specific needs. Indeed, both the element’s space and its value must be taken into account, but specific applications may require that the specialization be normalized differently. However, care should be taken to ensure that custom formulae satisfy the commutativity, idempotency, and unique identity properties. Indeed, such properties are critical for proving that the merge algorithm behaves appropriately, as will be shown in Section 5.3.

The merge operator $\otimes : (X, Y) \to Z$ is defined in an algorithmic way by Algorithm 2 and is illustrated in Figure 3.

Algorithm 2 ⊗ : (X, Y) → Z
1. Z ← new decision space
2. H ← ∅ // Hash map: set (y, y′) of keys y and associated values y′
3. for x ∈ X do // For each element x
4.   if ∄ y ∈ Y such that x ≺ y and V(x) = V(y) then // If there is no y that subsumes it with the same value, then process it
5.     if x ⊎ Y = ∅ then // If x does not conflict with any element of Y, then add it to the result
6.       Z ← Z ∪ x
7.     else // Otherwise, resolve the conflicts
8.       tmp ← x.space // The initial space of x is saved
9.       for y ∈ x ⊎ Y do // Create an element z to handle each conflict
10.        z ← new element
11.        z.space ← tmp ∩ y.space // z's space is the one initially shared by x and the conflicting element
12.        z.value ← V(x ⊗ y) // z's value (class distribution vector) is based on both x's and y's
13.        Z ← Z ∪ z // z is added to the resulting decision space
14.        if ∄ (y, y′) ∈ H then // If we never registered a conflict with y then register it
15.          H ← H ∪ (y, y.space \ z.space) // Register that the space of y that conflicted has been handled
16.        else // Otherwise update the previous record
17.          H ← H \ (y, y′)
18.          H ← H ∪ (y, y′.space \ z.space)
19.        x.space ← x.space \ z.space // Also update that some of x's conflicting space was handled
20.      if x.space ≠ ∅ then // If, after solving the conflicts, x has (non conflicting) space then we keep it
21.        Z ← Z ∪ x
22. for y ∈ Y such that ∄ (y, y′) ∈ H do // If some elements y have never been found to be in a conflict . . .
23.   if ∄ x ∈ X such that y ≺ x and V(y) = V(x) then // . . . and they are not subsumed by an element x with the same value
24.     Z ← Z ∪ y // Then we add them to the result
25. for (y, y′) ∈ H do // For each element y that was found to be in a conflict
26.   if y′ ≠ ∅ then // If some of this element was free of conflict
27.     Z ← Z ∪ y′ // Then we add it
28. return Z

First, we apply Merge principle (1): all elements with no intersection are added. Then, we apply principle (2): each element $x \in X$ that is strictly contained in an element $y \in Y$ with the same value is deleted. Finally, principle (3) is applied as follows:
• We consider all possible intersections $x \uplus Y$ and handle them sequentially.
• For each $y \in (x \uplus Y)$, we create an element $z$. The space covered by $z$ is the intersection of the spaces covered by $x$ and $y$, and the value is $V(x \otimes y)$ (Definition 6).
• There will be a “remainder” when the process is finished if $x$ only has a partial intersection with $Y$. Thus, we remove from $x$ the space of each $z$ resulting from the intersection, and we add the remainder, if any, to the result.

After applying all three merge principles for each element $x \in X$, we process the elements $y \in Y$. Merge principles (1) and (2) have to be applied to these elements, but principle (3) can be partially avoided. Indeed, the elements in the intersection of $X$ and $Y$ have already been computed when processing $X$.
We use a hash map H as a cache, in which the element y is used as a key and its corresponding value y′ in the hash map is the space that is updated throughout the process. When an intersection between x and y is found, the part in the intersection is virtually removed from y by changing the value for y in H. After all of the intersections have been computed, H contains the remainder of each element of Y and thus it can be added directly to the result.

Example.
In Figure 3, the left decision tree was trained on a dataset with the degree ranging from 0 to 15, and the age ranging from 0 to 8. The right decision tree was trained on a dataset with the degree ranging from 0 to 11, and the age ranging from 0 to 9. The light grey element in the left decision space and the element with the checkerboard pattern in the right decision space intersect. As two elements converted from decision trees only intersect in one contiguous space, the result is the new space z whose age ranges from 7 to 8, and whose degree ranges from 3 to 10. Using Definition 6, the percentage for the class Yes is approximately 21, and since there are only two classes in this example, the percentage for the class No is approximately 100 − 21 = 79.

Figure 3: Merging two decision trees by converting them into decision spaces and creating a union decision space.

Intersections in m dimensions (for m attributes) can result in spaces that are not rectangles. For example, Figure 4 shows the intersection of two decision spaces X and Y. Decision space X contains four rectangular elements and Y contains one rectangular element, but the intersection of the element of Y with any element of X leaves a non-rectangular remainder of Y. Constraining elements to be rectangles requires a heuristic that can bias the result since rectangles are only an approximation of the element's actual specialization. We avoid this problem by using polytopes instead of rectangles to provide an exact algebra. The possibility that elements may have to be constrained to simple polytopes (e.g., rectangles) for computational reasons is discussed in the Conclusions.

Figure 4: Intersection of decision spaces X (four rectangular elements) and Y (one rectangular element).

The goal of a merge operator ⊗ is to merge the information of two decision spaces, resolving any conflicts that arise. Thus, it has to obey a set of algebraic properties in order to be consistent:

• Commutativity: merging a decision space with another one should not depend on which one is first but only on the information.
• Identity element: merging a decision space with a decision space that does not contain any elements should not change anything because there is no new information.
• Idempotence: merging a decision space with itself should not change anything, as there are neither conflicts nor new information.

Theorems 2, 3, and 4 prove that our merge operator satisfies these three algebraic properties. Intuitively, the algorithmic construction of the merge operator is based on unions of geometric spaces and the resolution of conflicts encountered for non-empty intersections. Both the union of spaces (i.e., the intersection of geometrical elements) and the resolution of conflicts (i.e., the weighted average of values) satisfy the three algebraic properties. Therefore, we show that they are combined “appropriately” by the merge operator. We then show in Theorem 6 that the operator is not associative, which is the subject of the next section.
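The claim that the weighted average of values is commutative and idempotent can be checked numerically. The Python sketch below is illustrative only: the function name merge_value, the two elements, and their specializations are hypothetical, and the specialization M is passed in as a plain positive number rather than being computed from an element's space.

```python
from fractions import Fraction


def merge_value(v_x, m_x, v_y, m_y):
    """Value of x ⊗ y per Definition 6: average weighted by specialization.

    v_x, v_y are class distribution vectors (percentages per class);
    m_x, m_y are the specializations M(x) and M(y), assumed positive.
    """
    total = m_x + m_y
    return tuple(
        Fraction(a) * m_x / total + Fraction(b) * m_y / total
        for a, b in zip(v_x, v_y)
    )


# Hypothetical elements: (percent Yes, percent No) and a specialization.
x_val, x_spec = (30, 70), 2
y_val, y_spec = (12, 88), 6

merged = merge_value(x_val, x_spec, y_val, y_spec)
print([float(p) for p in merged])  # Yes and No percentages of the new element

# Swapping the operands (commutativity) or merging x with itself
# (idempotence) leaves the value unchanged.
print(merge_value(y_val, y_spec, x_val, x_spec) == merged)
print(merge_value(x_val, x_spec, x_val, x_spec) == x_val)
```

Exact fractions are used so that the equality checks are not disturbed by floating-point rounding.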
Definition 7. The set of all decision spaces is denoted by D.

Theorem 2. The ⊗ operator is commutative, i.e., ∀ X, Y ∈ D, X ⊗ Y = Y ⊗ X.

Proof. Suppose that there is a z ∈ X ⊗ Y. We prove that it also has to be in Y ⊗ X. We consider all possible cases from which such a z could result:
(1) z results from an x ∈ X that has no intersection. We show that if a y ∈ Y has no intersection it will also be kept in the result Z.
(2) z results from the intersection of an x ∈ X with a y ∈ Y. We will show that there are no changes if we consider it as the intersection of a y ∈ Y with an x ∈ X.
(3) z is the remainder of an element x ∈ X. We show that the remainder of an element y ∈ Y will also be kept in the result Z.

(1) As x ∈ X has no intersection, it will be added to the result (line 6). If a y ∈ Y has no intersection, it will not be in any x ⊎ Y and thus we will not create a (y, y′) ∈ H in line 15. As (y, y′) ∉ H in line 22, y will be considered unchanged. Given that y has no intersection, it will be added to the result in line 24.
(2) As x intersects with a y, the operator ⊎ relates x to y (line 9) and an element z will be created, its space being the intersection of the spaces of x and y, and its value being V(x ⊗ y). Each operation involved is commutative, thus the overall process is commutative.
(3) For an x ∈ X, we consider each y ∈ Y with which it intersects (line 9): an element z is created for each intersection, and its space is taken out of the space of x (line 19); once the spaces of all intersections have been taken out of x, the remainder is added to the result if it is not empty (line 21). The same process takes place for the remainder of a y ∈ Y: all the x ∈ X with which it intersects are considered (lines 3 and 9), an element z is created for each intersection, and we keep track of the remainder of y by updating its associated value in H or creating it for the first intersection (lines 14–18). An element y will not be considered anymore if it intersected with some x ∈ X (line 25), and instead its remainder is added to the result (line 27).

Figure 5: Illustration of Theorem 6.

Definition 8.
A decision space $E \in \mathcal{D}$ is called an identity of $\mathcal{D}$ with respect to the ⊗ operator if and only if $\forall D \in \mathcal{D}$, E ⊗ D = D ⊗ E = D.

Theorem 3. There exists a unique identity $E \in \mathcal{D}$ with respect to the ⊗ operator, and it is an empty set of polytopes (i.e., the empty space).

Proof. Let X and E be two decision spaces and E = ∅. We first prove that $\forall X \in \mathcal{D}$, X ⊗ E = X, which leads to E ⊗ X = X using Theorem 2, hence E is an identity element. No element x is subsumed by an element e ∈ E, and x ⊎ E = ∅, thus all elements x ∈ X are added to the result (line 6). As there is no e ∈ E, the loops in lines 22–27 are not executed and so the final result is equal to X.

We complete the proof by showing that $Y \in \mathcal{D}$ cannot be an identity for any $X \in \mathcal{D}$ if Y ≠ E. As Y ≠ E = ∅, there is at least one y ∈ Y. Let us consider $X \in \mathcal{D}$ such that y ⊎ X = ∅. As y has no intersection with any x ∈ X, it will be ignored by the main for loop (lines 3–21), thus ∄ (y, y′) ∈ H. As a consequence, it will be considered by the second loop (line 22) and, as it is not subsumed by any x ∈ X, it will be added to the result in line 24; thus, the result is X ∪ y ≠ X because y ∉ X. Similarly, for the case y ⊎ X ≠ ∅, the element y is processed by the main loop (lines 8–21). For any such y, $\exists X \in \mathcal{D}$ with an element x ∈ X intersecting with y such that x and y cover a different space. This results in dividing x into new elements (line 11) and thus the result of X ⊗ Y is different from X.

Theorem 4.
The ⊗ operator is idempotent as X ⊗ X = X.

Proof. Let X and Y be two decision spaces such that X = Y:

• There is no x ∈ X strictly contained in a y ∈ Y. The use of strict subsumption ≺ instead of ⪯, in the definition of ⊗, is particularly important at this point. Indeed, if X = Y and we were using ⪯, then all elements in X would be discarded because they are subsumed by the same value in Y, and similarly for Y. Thus, X ⊗ X would lead to the incorrect result ∅.
• Each element x intersects with exactly one y. As they are the same, we will add one element to the result with the following value for each component of the vector (lines 8–13):

\[ V(x) \times \frac{M(x)}{M(x) + M(y)} + V(y) \times \frac{M(y)}{M(x) + M(y)} = V(x) \times \frac{M(x)}{2 \times M(x)} + V(x) \times \frac{M(x)}{2 \times M(x)} = V(x). \]

Thus, x will be added to the result for all x ∈ X (line 13). For all y ∈ Y, we added a pair (y, z) ∈ H such that z.space = x.space \ y.space = ∅ (line 15). Thus, all elements y ∈ Y are skipped because they are in H (line 22), and because the associated value is empty (line 26). Therefore the resulting decision space contains each x ∈ X exactly once.

Corollary 5.
The binary operator $\otimes : (\mathcal{D}, \mathcal{D}) \to \mathcal{D}$ is idempotent, commutative, and contains a unique identity element for the set $\mathcal{D}$ of all decision spaces. Therefore, $(\mathcal{D}, \otimes)$ is a unital, idempotent, and commutative magma (see [19] for a brief review of algebraic structures such as magmas).

Theorem 6. The ⊗ operator is not associative: $\exists X, Y, Z \in \mathcal{D}$ such that (X ⊗ Y) ⊗ Z ≠ X ⊗ (Y ⊗ Z).

Proof. In Figure 5, if we first merge x with y, then the value of the intersection will depend on all of x and y, and we get the shaded remainder of y. Then, if we merge with z, the value of the resulting intersection will depend on the shaded remainder and z. However, if we first merge y with z, then the part of y that intersects with z is removed from y, resulting in a remainder y′. Then, if we merge with x, the value depends on x and y′ which covers a smaller space than y. It follows that its intersection with x will have a different value. Thus, the values change with the order in which elements are merged, while the spaces covered do not (as they result from the intersection of geometrical spaces which is an associative operation).
6. The impact of merge order
According to Theorem 6, the ⊗ operator is not associative, so the result of merging more than two decision spaces depends on the order in which they are merged. We refer to a merge order as a merging scheme. Note that a scheme is not necessarily a simple sequence: a merging scheme can specify that large groupings of decision spaces have to be merged first, and then merged together. Its representation is defined below.

Definition 9. A merging scheme specifies the order in which a set of decision spaces is merged. It can be represented as a tree, in which a leaf represents a decision space, an intermediate node represents the application of the ⊗ operator (hence an intermediate decision space), and the root represents the final result.

Two merging schemes are shown in Figure 6. In the merging scheme in Figure 6(a), the decision spaces are merged pairwise (bottom layer of the tree), then the resulting decision spaces are merged pairwise, and so on, until a single decision space W is obtained. We can describe this merging scheme as $(((X_1 \otimes X_2) \otimes (X_3 \otimes X_4)) \cdots ((X_{m-3} \otimes X_{m-2}) \otimes (X_{m-1} \otimes X_m)))$, where $m = 2^k$. As we will demonstrate in Corollary 8, this merge order does not introduce any bias into the calculation of the value of W. Any differences in the impacts of the decision spaces on the value of W are based on their relative specializations. In contrast, the merging scheme in Figure 6(b) will lead to a bias in the result. In this scheme, the first two decision spaces are merged, then the result is merged with the third decision space, and so on, until W is obtained. We can describe this merging scheme as $(((X_1 \otimes X_2) \otimes X_3) \cdots \otimes X_m)$. This merge order does introduce bias into the calculation of the value of W. Because each decision space is merged with the result of merging its predecessors, its impact is the same as that of all of its predecessors combined. So, the impact of earlier decision spaces is reduced each time that a new decision space arrives. This effect can be desirable. If the underlying distribution is changing, then we often want recent decision spaces to have a larger impact on the result because they represent recent trends.

Both merging schemes can be desirable depending on the setting. The merging scheme shown in Figure 6(a) can be used when a homogeneous database has been partitioned among several computation units. Indeed, the model produced by each unit should have the same impact on the final result as the underlying distribution is homogeneous. The merging scheme shown in Figure 6(b) can be used for time-changing distributions, for example data streams. In this setting, learners can deduce models at different times, and we may want to favour the most recent models as they represent recent trends. The situation in which decision spaces are merged as soon as they are received is depicted in Figure 6(b): the decision space received at time t is merged with everything that was received up to time t − 1. Note that we could also have an online setting such as in data streams, but we might not want the time at which models are received to have an impact. In this case, the scheme in Figure 6(a) can also be used, and it specifies that some decision spaces will have to be stored temporarily: large groupings of decision spaces must be formed before the merging operation can take place to guarantee an equal impact on the final result.

Figure 6: Unbiased (a) and biased (b) binary merging schemes.

However, the framework developed so far has limitations that can prevent unbiased results. Indeed, the merging operator is binary and thus an unbiased result can only be achieved if the number of decision spaces to merge is a power of 2. Furthermore, using powers of 2, the impact of decision spaces over time can only decay exponentially, whereas other types of decays may be needed. We will generalize the merge operator to overcome these limitations in the next two subsections.
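The difference between the two schemes of Figure 6 can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not part of the framework itself: every decision space covers the same space with the same specialization, so each binary merge reduces to a plain average, and a value is a single number rather than a class distribution vector. The nested-tuple encoding of a scheme and the names impacts and evaluate are hypothetical.

```python
from fractions import Fraction


def impacts(scheme, weight=Fraction(1)):
    """Return {leaf name: share of the final value} for a merging scheme.

    A scheme is a leaf name (str) or a tuple of sub-schemes, one tuple per
    application of the merge operator. Leaf names are assumed distinct.
    """
    if isinstance(scheme, str):
        return {scheme: weight}
    shares = {}
    for operand in scheme:
        # With equal specializations, each operand gets an equal share.
        shares.update(impacts(operand, weight / len(scheme)))
    return shares


def evaluate(scheme, values):
    """Carry out the merges of a scheme, each merge being a plain average."""
    if isinstance(scheme, str):
        return Fraction(values[scheme])
    parts = [evaluate(operand, values) for operand in scheme]
    return sum(parts) / len(parts)


balanced = (("X1", "X2"), ("X3", "X4"))   # pairwise scheme, as in Figure 6(a)
chain = ((("X1", "X2"), "X3"), "X4")      # sequential scheme, as in Figure 6(b)

# Hypothetical percentages for one class: only the latest model predicts it.
values = {"X1": 0, "X2": 0, "X3": 0, "X4": 80}

for name, scheme in (("balanced", balanced), ("chain", chain)):
    print(name, impacts(scheme), evaluate(scheme, values))
```

In the balanced scheme every decision space receives the same share, whereas in the chain the most recent decision space accounts for half of the result and the share halves with each step back in time.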
Consider a set $\{X_1, \ldots, X_m\}$ of decision spaces. We extend Algorithm 2 from a binary operator to an m-ary operator. Increasing the number of geometrical objects that are intersected does not affect the operator, since the intersection of geometrical objects is associative. Thus, the extension only involves the computation of the value, which is extended in Definition 10.

Definition 10. The value of the intersection of m elements $x_1 \in X_1, x_2 \in X_2, \ldots, x_m \in X_m$ based on their specializations is a vector

\[ V(x_1 \otimes \cdots \otimes x_m) = V(x_1) \times \frac{M(x_1)}{M(x_1) + \cdots + M(x_m)} + \cdots + V(x_m) \times \frac{M(x_m)}{M(x_1) + \cdots + M(x_m)}. \]

The combination of the handling of geometrical spaces and the computation of the value is similar to the merging algorithm introduced in Section 5.2:
(1) If an element is strictly included within another one and has the same value, discard it.
(2) Otherwise, for each intersection, create a new element z with a value that is computed using the formula in Definition 10.

We assume in Theorem 7 and its corollaries that the decision spaces to be merged all cover the same space so that we can concentrate on the impact of merge order.

Theorem 7.
In a merging scheme for m decision spaces $X_1, \ldots, X_m$, all of which cover the same space, the impact of a decision space $X_i$ on the value of the final result is inversely proportional to the product of the numbers of operands of the ⊗ operators on the path from the leaf representing $X_i$ to the root of the merging scheme.

Proof. In a merging scheme, each internal node is an m-ary ⊗ operator and each leaf is a decision space. Consider an internal node u with k operands $T_1, T_2, \ldots, T_k$. Each $T_j$ is either a decision space or a subtree of the merging scheme and each $T_j$ accounts for a fraction 1/k of the value of the subtree rooted at u. Consider two internal nodes $u_1$ and $u_2$ with $k_1$ and $k_2$ operands, respectively, and suppose that $u_2$ is an operand of $u_1$ in the merging scheme. Then each operand of $u_2$ accounts for $1/k_2$ of its value and $u_2$ accounts for $1/k_1$ of the value of $u_1$, so each operand of $u_2$ accounts for a fraction $1/(k_1 \times k_2)$ of the value of $u_1$. If the numbers of operands of $u_1$ and $u_2$ are exchanged, then each operand of $u_2$ contributes the same fraction $1/(k_1 \times k_2)$ of the value of $u_1$. Thus, the contribution of a decision space to the value of the root of the merging scheme depends only on the final product of the numbers of operands of the m-ary ⊗ operators on the leaf to root path.

Figure 7 shows an example of a merging scheme for twelve decision spaces in which the impacts of all of the decision spaces on the value of W are the same.

Figure 7: Merging scheme from Theorem 7.

Corollary 8. In the merging scheme $(((X_1 \otimes X_2) \otimes (X_3 \otimes X_4)) \cdots ((X_{m-3} \otimes X_{m-2}) \otimes (X_{m-1} \otimes X_m)))$, where $m = 2^k$ (Figure 6(a)), each decision space accounts for $2^{-k}$ of the value of W.

Corollary 9. In the merging scheme $(((X_1 \otimes X_2) \otimes X_3) \cdots \otimes X_m)$ (Figure 6(b)), decision space $X_i$ accounts for $2^{-m+i-1}$ of the value of W if i > 1, and $2^{-m+1}$ if i = 1.

6.2. Adjusting the bias

As explained earlier in this section, when decision spaces are merged over time as in a data stream setting, their impacts decay exponentially. In the previous sub-section, we used extra storage to change the impacts. In this sub-section, we introduce an approach that uses no extra storage and provides decision spaces with the same impact. Furthermore, this approach does not require prior knowledge of the number m of decision spaces to be merged. In contrast, the approach in the previous sub-section requires m to design a merging scheme.

Definition 11.
The β-weighted value of the intersection of two elements x ∈ X and y ∈ Y is the class distribution vector

\[ V(x \otimes y) = V(x) \times \beta + V(y) \times (1 - \beta), \qquad 0 < \beta < 1. \]

The value produced by the formula in Definition 6 is a β-weighted value with $\beta = \frac{M(x)}{M(x) + M(y)}$, and V(x ⊗ y) depends only on the specializations of x and y. If x and y have the same specialization, then each accounts for half of the value of the result. The bias in a merging scheme like the one in Figure 6(b) is entirely the result of the merge order when $\beta = \frac{M(x)}{M(x) + M(y)}$ is used. We can use different values of β with this scheme to produce the final value

\[ V(((x_1 \otimes x_2) \otimes x_3) \cdots \otimes x_m) = V(x_1) \times \frac{M(x_1)}{M(x_1) + \cdots + M(x_m)} + \cdots + V(x_m) \times \frac{M(x_m)}{M(x_1) + \cdots + M(x_m)}, \]

the same as an m-ary merge.

Theorem 10. Let

\[ \beta_i = \frac{M(x_1) + \cdots + M(x_i)}{M(x_1) + \cdots + M(x_{i+1})} \]

be the weight used to compute the value of the i-th merge, $i = 1, 2, \ldots, m-1$, in the merging scheme $(((X_1 \otimes X_2) \otimes X_3) \cdots \otimes X_m)$. Then the resulting value of the intersection of m elements $x_1 \in X_1, x_2 \in X_2, \ldots, x_m \in X_m$ based on their specializations is

\[ V(((x_1 \otimes x_2) \otimes x_3) \cdots \otimes x_m) = V(x_1) \times \frac{M(x_1)}{M(x_1) + \cdots + M(x_m)} + \cdots + V(x_m) \times \frac{M(x_m)}{M(x_1) + \cdots + M(x_m)}. \]

Proof. We prove this result by induction on the number of merges. If there is a single merge, then two decision spaces are merged using $\beta_1 = \frac{M(x_1)}{M(x_1) + M(x_2)}$ and

\[ V(x_1 \otimes x_2) = V(x_1) \times \frac{M(x_1)}{M(x_1) + M(x_2)} + V(x_2) \times \left(1 - \frac{M(x_1)}{M(x_1) + M(x_2)}\right) = V(x_1) \times \frac{M(x_1)}{M(x_1) + M(x_2)} + V(x_2) \times \frac{M(x_2)}{M(x_1) + M(x_2)}. \]

Assume that the theorem is true for j − 1 merges, so that

\[ V(((x_1 \otimes x_2) \otimes x_3) \cdots \otimes x_j) = V(x_1) \times \frac{M(x_1)}{M(x_1) + \cdots + M(x_j)} + \cdots + V(x_j) \times \frac{M(x_j)}{M(x_1) + \cdots + M(x_j)}. \]

The j-th merge uses $\beta_j = \frac{M(x_1) + \cdots + M(x_j)}{M(x_1) + \cdots + M(x_{j+1})}$. Then

\[ V((x_1 \otimes \cdots \otimes x_j) \otimes x_{j+1}) = \left( V(x_1) \times \frac{M(x_1)}{M(x_1) + \cdots + M(x_j)} + \cdots + V(x_j) \times \frac{M(x_j)}{M(x_1) + \cdots + M(x_j)} \right) \times \frac{M(x_1) + \cdots + M(x_j)}{M(x_1) + \cdots + M(x_{j+1})} + V(x_{j+1}) \times \left(1 - \frac{M(x_1) + \cdots + M(x_j)}{M(x_1) + \cdots + M(x_{j+1})}\right) \]
\[ = V(x_1) \times \frac{M(x_1)}{M(x_1) + \cdots + M(x_{j+1})} + \cdots + V(x_{j+1}) \times \frac{M(x_{j+1})}{M(x_1) + \cdots + M(x_{j+1})}. \]
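Theorem 10 can also be checked numerically. The following Python sketch is an illustration under stated assumptions rather than part of the framework: the function names, the three class distribution vectors, and the specializations are hypothetical, and each specialization is supplied as a plain number. It merges the values one at a time with the weights $\beta_i$ while storing only the running value and the running sum of specializations, and compares the result with the m-ary formula of Definition 10.

```python
from fractions import Fraction


def beta_merge(v_x, v_y, beta):
    """Beta-weighted value of Definition 11."""
    return tuple(a * beta + b * (1 - beta) for a, b in zip(v_x, v_y))


def sequential_merge(values, specs):
    """Merge values one by one using the weights beta_i of Theorem 10."""
    merged = tuple(Fraction(v) for v in values[0])
    seen = Fraction(specs[0])            # M(x_1) + ... + M(x_i) so far
    for value, spec in zip(values[1:], specs[1:]):
        beta = seen / (seen + spec)      # beta_i
        merged = beta_merge(merged, value, beta)
        seen += spec
    return merged


def m_ary_merge(values, specs):
    """Value of the m-ary merge of Definition 10."""
    total = sum(specs)
    return tuple(
        sum(Fraction(v[c]) * s for v, s in zip(values, specs)) / total
        for c in range(len(values[0]))
    )


# Hypothetical (percent Yes, percent No) vectors and specializations.
values = [(30, 70), (12, 88), (50, 50)]
specs = [2, 6, 4]

print(sequential_merge(values, specs) == m_ary_merge(values, specs))
```

Only the running sum of specializations has to be kept between merges, which matches the observation that the approach needs neither extra storage nor prior knowledge of m.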
7. Developing an algebra
In a time-varying environment, the accuracy of the predictions decreases over time, so new decision spaces must be created regularly. A sequence of decision spaces carries information about changes in the environment over time, so techniques such as time series analysis potentially could be used. While time series analysis considers a sequence of vectors of fixed size, decision spaces can be of varying size: for example the values of attributes can evolve over time to cover a broader space. A restriction operator can be used to simplify the decision spaces of a time series so that they all have the same size. We introduce this operator in Definition 12, and define it formally in Algorithm 3. Then, we apply the algebraic approach in a similar way to the merging operator: we study identity elements (Theorem 11), idempotency (Theorem 12), and associativity (Theorem 13).
Definition 12. The restriction of a decision space X by a decision space Y is the decision space Z that only retains elements of X that intersect with elements of Y. The corresponding operator is denoted ⊙.

Algorithm 3 ⊙ : (X, Y) → Z
1. Z ← new decision space
2. for x ∈ X such that x ⊎ Y ≠ ∅ do
3.   z ← new element
4.   for y ∈ x ⊎ Y do
5.     z.space ← z.space ∪ (x.space ∩ y.space)
6.   z.value ← x.value
7.   Z ← Z ∪ z
8. return Z

By definition, we only consider the elements x ∈ X that intersect with some elements y ∈ Y (line 2). For each such element x, we create an element z with the same value (line 6) but with a space restricted to the intersection between x and all y ∈ Y (lines 4–5). Intuitively, restriction is the intersection of geometrical spaces, and it leaves the values unchanged. Thus, the algebraic properties of restriction are derived only from the properties of intersection of spaces.

Theorem 11.
Let $F_v$ be a family of decision spaces such that for every $X \in F_v$, there is exactly one element x ∈ X such that V(x) = v and ∀ a ∈ A(x), S(x, a) = ∞. Then, every identity element $E \in \mathcal{D}$ such that X ⊙ E = E ⊙ X = X is generated by $F_v$. The number of elements of $F_v$ is infinite.

Proof. A decision space X is not restricted by a decision space Y only if all elements in Y cover a space at least as large as the space covered by X, so a trivial identity is the decision space with only one element covering an infinite space. As the value of this element does not matter, we can define a family taking its value v as a parameter. The set of all possible values v can be infinite because the values are taken from a continuous range; thus, $F_v$ contains an unbounded number of identity decision spaces.

Theorem 12. The ⊙ operator is idempotent as X ⊙ X = X.

Proof. Let X and Y be two decision spaces. If X = Y then ∀ x ∈ X, x ⊎ Y = y such that ∀ a ∈ A(x), low(a, x) = low(a, y) and up(a, x) = up(a, y). Thus, each element z is created with exactly the space and value of an x (lines 4–7), and only one z is created for each x (line 2). Hence the result Z is the same as X.

Theorem 13. The operator $\odot : (\mathcal{D}, \mathcal{D}) \to \mathcal{D}$ is associative: (X ⊙ Y) ⊙ Z = X ⊙ (Y ⊙ Z) for all decision spaces $X, Y, Z \in \mathcal{D}$.

Proof. As the values do not matter, the restriction can be considered to be an intersection of spaces, which is associative.
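Algorithm 3 and the properties stated in Theorems 12 and 13 can be explored with a one-dimensional Python sketch. The representation is an assumption made for illustration: a space is a sorted tuple of disjoint half-open intervals over a single attribute rather than a polytope, and the names Element, normalize, intersection, and restrict, as well as the example decision spaces, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    space: tuple   # sorted, disjoint half-open intervals ((lo, hi), ...)
    value: tuple   # class distribution vector


def normalize(intervals):
    """Sort intervals and fuse those that overlap or touch."""
    fused = []
    for lo, hi in sorted(intervals):
        if fused and lo <= fused[-1][1]:
            fused[-1] = (fused[-1][0], max(fused[-1][1], hi))
        else:
            fused.append((lo, hi))
    return tuple(fused)


def intersection(space_a, space_b):
    """Intersection of two interval unions; empty pieces are dropped."""
    pieces = []
    for a_lo, a_hi in space_a:
        for b_lo, b_hi in space_b:
            lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
            if lo < hi:
                pieces.append((lo, hi))
    return normalize(pieces)


def restrict(X, Y):
    """X ⊙ Y following Algorithm 3: keep the parts of X that Y also covers."""
    Z = []
    for x in X:
        pieces = []
        for y in Y:
            pieces.extend(intersection(x.space, y.space))
        if pieces:  # x intersects at least one element of Y
            Z.append(Element(normalize(pieces), x.value))  # value unchanged
    return Z


# Hypothetical decision spaces over one attribute.
X = [Element(((0, 5),), (20, 80)), Element(((5, 10),), (60, 40))]
Y = [Element(((3, 7),), (50, 50))]
W = [Element(((4, 6),), (10, 90)), Element(((6, 12),), (1, 99))]

print(restrict(X, Y))
print(restrict(X, X) == X)                                            # idempotence
print(restrict(restrict(X, Y), W) == restrict(X, restrict(Y, W)))     # associativity
```

Fusing touching intervals in normalize gives each space a single canonical form, so that equal spaces compare equal regardless of the order in which the pieces were produced.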
The merge (⊗) and restriction (⊙) operators can be composed to create a variety of useful operators. When the merge operator creates an element z ∈ Z = X ⊗ Y, z can be created from an x ∈ X, a y ∈ Y, or both an x ∈ X and a y ∈ Y. The ⊙ operator can be used to restrict the results of a merge operation to include only elements z ∈ Z that are created from both an x ∈ X and a y ∈ Y. This change has a significant impact: an element derived from only one element is not as precise as an element derived from two, because in the latter case a consensus is obtained through a weighted formula. Thus, this restricted form of merging is less sensitive to noise and, as we know that all elements in Z are derived from exactly two elements, we can have the same confidence in the predictions of all elements in Z. A merge that provides the same confidence in the prediction of each element is obtained through the following composite operator ⊕.

Definition 13. The operator ⊕ : (X, Y) → Z is defined by X ⊕ Y = (X ⊗ Y) ⊙ X ⊙ Y.

This definition ensures that the values are correctly computed based on the specialization of each element of X and Y, and then the overall space is reduced to the intersections with X and Y. A stricter definition is given by the following composite operator $\bar{\odot}$ that not only restricts the merging to the elements that intersect, but computes the values using only common spaces of attributes. Any part of an element that lies outside the intersection will be ignored by $\bar{\odot}$ when measuring the specialization, and the values are computed based on the same space.

Definition 14. The operator $\bar{\odot}$ : (X, Y) → Z is defined by $X \mathbin{\bar{\odot}} Y = (X \odot Y) \otimes (Y \odot X)$.

The choice between $\bar{\odot}$ and ⊕ can be based on the application. Both $\bar{\odot}$ and ⊕ have the properties of idempotency, associativity, non-commutativity, and unique identity element.
8. Conclusions
Classifiers are found in a variety of settings, such as sensor networks. In these settings, each entity (e.g., a sensor) deduces a classifier from its observations (i.e., a dataset of example instances). These classifiers must be merged to obtain a global view. However, merging has not been defined for several types of classifiers, such as decision trees. Furthermore, the classifiers that have to be merged can be of different types, for example, when several generations of sensors are creating classifiers.

In this paper, we introduced decision spaces as a framework for examining the properties of operations over classifiers, and we defined a merge operation that handles possibly different classifiers. Our methods are designed for a pure meta-learning framework in which classifiers are merged instead of being stored separately, and observations are never exchanged. We use algebraic methods to prove several desirable properties of the merge operator. We also show that the operator is not associative: the result of merging several classifiers depends on the order in which the merge operator is applied. This can be the desired behaviour in a situation such as a data stream: if the operator is applied as soon as a classifier is received, then the result will be biased in favour of recent trends. However, other types of biases may be needed, and no bias should also be possible for an application such as a distributed database in which the underlying data is homogeneous. We introduced two approaches to achieve different biases. The first one uses storage space flexibly to customize the bias. The second approach results in no bias, thereby making the operator associative, and uses no space.

We showed that other operators can be defined in our framework, and used algebraic methods to establish their behaviours. In particular, we defined a restriction operator, which can be used for time series to analyze a sequence of classifiers, and can be used in combination with the merge operator for application-specific needs.

The decision space framework relies on the intersection of geometrical elements. However, intricate shapes can result from the intersections of large numbers of attributes and/or classifiers. This can affect the space complexity since the shapes may require more coordinates to be described, and it can affect the time complexity as the shapes become harder to intersect. One direction for further research is the use of simplified shapes to reduce the time and space complexities. This suggests several interesting questions: how do the time and space complexities depend on the type of geometrical elements that are allowed? How does the accuracy of the results depend on the simplification that is chosen and when it is applied?

Further questions arise when considering that uncertainty is found at many levels in real-world scenarios. We distinguish three broad types of uncertainty. Firstly, there can be uncertainty about the concept that the elements are predicting. For example, consider the task of predicting the literary genre of an author based on dimensions measuring cultural features. Since the definition of a genre may not be clear, fuzzy values may have to be assigned to elements, and the problem becomes one of combining fuzzy values.

Secondly, there can be uncertainty about the value contained in one element (i.e., its class distribution vector). Consider the case in which two elements A and B are merged: element A predicts classes Yes and No with the same probability, and element B predicts the class Yes with 80% probability and No with 20% probability. A mechanism is needed to detect that element A cannot make a reliable prediction, so the contribution of element B should be increased. Confidence transformations provide such a mechanism by transforming a class distribution vector into a measure representing its confidence [18]. There are several approaches to using these transformations when combining two elements in this framework. In Figure 8(a), two values (i.e., class distribution vectors) are combined based solely on the space of the elements to which they belong, ignoring their confidence measures. In Figure 8(b), the values are transformed to confidence measures which are then used to generate the combined value. Another possibility is to use the confidence measures to supplement the raw values (Figure 8(c)).

Figure 8: Including confidence in a combined value.

A confidence transformation can be specific to a given classifier (i.e., we interpret the reliability of a value differently depending on the type of model that generated it). However, that specificity is lost when we combine a set of classifiers: how should we interpret the result coming from a classifier that combines a C4.5 decision tree with a support vector machine? If one general confidence transformation is used for such models, then how does the performance change when a large number of models are combined based on initially large groupings (so that specific transformations can be used)? Is it possible to not only combine models, but also combine the transformations associated with them?

Thirdly, there can be uncertainty about the space covered by an element. For example, consider the classification of soil types on a map. In one area, samples have been taken and the category has been determined. It is known that beyond that area, there are no geological faults and it can thus be speculated that the category does not change for a few dozen meters beyond the sampled area. This type of spatial vagueness is essential in geographical information systems, and frameworks have been proposed to address it. We can adopt the idea of an algebra for vague spatial data from [21] (which further highlights the connection of our approach with spatial algebras), by tagging each element of a decision space as being either guaranteed or conjectured. The tag for elements resulting from the union, intersection, or difference of two elements can be determined by using the same tables as in [21]. However, what should we do when merging an uncertain element with a guaranteed element? Should we omit the value from the uncertain element, or should we produce a weighted combination giving more weight to the value from the guaranteed element? The answers to these questions are application-specific, and the important advantage of our framework lies in its ability to mathematically express the consequences of each choice.

Acknowledgements

We would like to thank Martin Ester, Binay Bhattacharya, and Oliver Schulte for helpful discussions. This research was supported by NSERC of Canada.
References

[1] in: C. Sansone, J. Kittler, F. Roli (Eds.), Proc. 10th International Workshop on Multiple Classifier Systems, volume 6713 of Lecture Notes in Computer Science, Springer, 2011.
[2] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, Wireless sensor networks: A survey, Computer Networks 38 (2002) 393–422.
[3] A. Bar-Or, D. Keren, A. Schuster, R. Wolff, Hierarchical decision tree induction in distributed genomic databases, IEEE Trans. Knowl. Data Eng. 17 (2005) 1138–1151.
[4] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: D. Haussler (Ed.), 5th Annual ACM Workshop on COLT, ACM Press, Pittsburgh, PA, 1992, pp. 144–152.
[5] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[6] P. Chan, S.J. Stolfo, Toward parallel and distributed learning by meta-learning - working notes, in: AAAI Workshop on Knowledge Discovery in Databases, AAAI, 1993, pp. 227–240.
[7] J.P.C. Cordeiro, G. Camara, U.M. de Freitas, F. Almeida, Yet another map algebra, Geoinformatica 13 (2009) 183–202.
[8] F.J. Ferrer-Troyano, J.S. Aguilar-Ruiz, J.C.R. Santos, Incremental rule learning and border examples selection from numerical data streams, J. UCS 11 (2005) 1426–1439.
[9] A.U. Frank, Map algebra extended with functors for temporal data, Perspectives in Conceptual Modeling (Lecture Notes in Computer Science) 3770 (2005) 194–207.
[10] J. Fürnkranz, Separate-and-conquer rule learning, Artificial Intelligence Review 13 (1999) 3–54.
[11] V. Ganti, J. Gehrke, R. Ramakrishnan, W.Y. Loh, A framework for measuring differences in data characteristics, Journal of Computer and System Sciences 64 (2002) 542–578.
[12] C. Giraud-Carrier, R. Vilalta, P. Brazdil, Introduction to the special issue on meta-learning, Machine Learning 54 (2004) 187–193.
[13] L.O. Hall, N. Chawla, K.W. Bowyer, Decision tree learning on very large data sets, in: Proc. of the International Conference on Systems, Man, and Cybernetics, volume 3, IEEE, 1998, pp. 2579–2584.
[14] L.H. Hamel, Knowledge Discovery with Support Vector Machines, Wiley-Interscience, 2009.
[15] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006.
[16] T.K. Ho, Multiple classifier combination: Lessons and the next steps, World Scientific Publishing, 2002, pp. 171–198.
[17] L.I. Kuncheva, Combining pattern classifiers: methods and algorithms, John Wiley and Sons, 2004.
[18] C.L. Liu, H. Hao, H. Sako, Confidence transformation for combining classifiers, Pattern Anal Applic 7 (2004) 2–17.
[19] A. Meier, V. Sorge, Symbolic computation and automated reasoning, A. K. Peters, Ltd., 2001, pp. 175–190.
[20] S. Muthukrishnan, Data Streams: Algorithms and Applications, volume 1:2, Now Publishers, 2005.
[21] A. Pauly, M. Schneider, Vasa: An algebra for vague spatial data in databases, Information Systems 35 (2010) 111–138.
[22] D. Pullar, Mapscript: a map algebra programming language incorporating neighborhood analysis, Geoinformatica 5 (2001) 145–163.
[23] J. Quinlan, Improved use of continuous attributes in c4.5, Journal of Artificial Intelligence Research 4 (1996) 77–90.
[24] D.B. Skillicorn, S.M. McConnell, Distributed prediction from vertically partitioned data, J. Parallel. Distrib. Comput. 68 (2008) 16–36.
[25] N.A. Syed, H. Liu, K.K. Sung, Incremental learning with support vector machines, in: Proceedings of the Workshop on Support Vector Machines at the International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden.
[26] C.D. Tomlin, A map algebra, in: Proceedings of the 1983 Harvard Computer Graphics Conference, pp. 127–150.
[27] A. Tsymbal, S. Puuronen, V. Terziyan, Arbiter meta-learning with dynamic selection of classifiers and its experimental investigation, in: J. Eder, I. Rozman, T. Welzer (Eds.), Advances in Databases and Information Systems, Third East European Conference, volume 1691 of Lecture Notes in Computer Science, Springer, 1999, pp. 205–217.
[28] R. Vilalta, Y. Drissi, A perspective view and survey of metalearning, Artificial Intelligence Review 18 (2002) 77–95.
[29] L. Xu, A. Krzyzak, C.Y. Suen, Methods of combining multiple classifiers and their application to handwriting recognition, IEEE Transactions on Systems, Man, and Cybernetics 22 (1992) 418–435.