Drift anticipation with forgetting to improve evolving fuzzy system
Clement Leroy
Intuidoc Team, Univ Rennes, CNRS, IRISA
Rennes, France
[email protected]

Eric Anquetil
Intuidoc Team, Univ Rennes, CNRS, IRISA
Rennes, France
[email protected]

Nathalie Girard
Intuidoc Team, Univ Rennes, CNRS, IRISA
Rennes, France
[email protected]
Abstract - Working with a non-stationary stream of data requires the analysis system to evolve its model (the parameters as well as the structure) over time. In particular, concept drifts can occur, making it necessary to forget knowledge that has become obsolete. However, forgetting is subject to the stability-plasticity dilemma: increasing forgetting improves the reactivity of adapting to new data while reducing the robustness of the system. Based on a set of inference rules, Evolving Fuzzy Systems (EFS) have proven to be effective in solving the data stream learning problem. However, tackling the stability-plasticity dilemma is still an open question. This paper proposes a coherent method to integrate forgetting in an Evolving Fuzzy System, based on the recently introduced notion of concept drift anticipation. The forgetting is applied with two methods: an exponential forgetting of the premise part and a deferred directional forgetting of the conclusion part of the EFS, to preserve the coherence between both parts. The originality of the approach consists in applying the forgetting only in the anticipation module and in keeping the EFS (called the principal system) learned without any forgetting. Then, when a drift is detected in the stream, a selection mechanism is proposed to replace the obsolete parameters of the principal system with more suitable parameters from the anticipation module. An evaluation of the proposed methods is carried out on benchmark online datasets, with a comparison with state-of-the-art online classifiers (Learn++.NSE, pENsemble, pClass) as well as with the original system using different forgetting strategies.
Index Terms - Evolving Fuzzy System (EFS), Learning with Forgetting, Non-Stationary Data Stream, Anticipation.
I. INTRODUCTION
The data stream learning problem has become a new topic of interest that breaks with the classical batch learning model for several reasons:
• The learning algorithm must process one instance at a time without requiring access to previously seen data (one-shot learning).
• The stream is potentially infinite, thus instances should only be saved for a short time.
• Knowledge contained in the data stream can change over time, which is called concept drift.
In addition, most of the applications using data streams have real-time constraints, requiring fast processing (take, for example, the monitoring of network traffic and credit fraud identification [1], recommendation systems which take into account the most recent context to propose more interesting content [2], or even a customized-command-gesture recognition system [3]). To cope with these new constraints, new incremental learning algorithms have been designed, inspired by classic batch approaches. For example, we can cite decision trees (CVFDT [4], Hoeffding tree [5]), ensemble classifiers (DWM [6], Learn++.NSE [7]) or Evolving Fuzzy Systems, EFS (pClass [8], ANYA [9]). For each of them, the adaptation of the model to the stream implies the incremental adaptation of the model parameters (such as the updating of sufficient statistics) and the evolution of the structure (addition/replacement of subtrees, classifiers or fuzzy rules). Thus, the data stream learning problem can be reduced to two if ... then ... cases:
• First case: If new data are close to the concepts already seen, then update the model parameters on these new data, perhaps using a forgetting factor in case of smooth drift (called incremental drift).
• Second case: If new data come from the appearance of a new concept or a new class, then update the model structure, perhaps using a forgetting factor in case of brutal drift.
The two main goals of a learning model are to guess in which case the system is after receiving new data, and what is the extent of the adaptation required. Evolving fuzzy systems are suitable to address the data stream learning problem. They are granular models composed of fuzzy rules which locally adapt to the distribution of points with a premise part, and discriminate the classes in a conclusion part. In the recently introduced ParaFIS [10], a new learning model for evolving fuzzy systems has been proposed. It considers two systems in parallel. The principal system is updated assuming the first case, i.e. no drift. But at the same time, the second system, the anticipation module, presupposes the need for a structural update to tackle the second case. Then, with a posteriori information, the model can decide which assumption was the right one and can update the principal system from the anticipation module if a structural update is necessary to obtain the most suitable structure.

However, questions still arise: when and how to forget data. The first question concerns the well-known stability-plasticity dilemma, which says that if a forgetting factor is applied continuously with a high magnitude, the system will be reactive to drift at all times but will be less stable (i.e. less efficient in the long run). Conversely, if no or little forgetting is applied, then the system will be more efficient in a stationary phase because it learns on more points, but it will be less reactive in the event of drift. The second question concerns the way of forgetting data, i.e. what information, previously learned on a point, should be forgotten. Two approaches are possible: either all the past information is forgotten whatever its relevance for the present (blind adaptation [6]), or the learning model selects the information to forget.

This paper addresses these issues on two levels, from a general perspective on data streams to a specific application in evolving fuzzy systems. To the best of our knowledge, no stable approach for forgetting in the conclusion part has been proposed in the state of the art. The main difficulty is the wind-up problem [11], which leads to the collapse of the conclusion part if forgetting is applied continuously. To address this problem, we propose to take advantage of the anticipation module introduced in ParaFIS to integrate forgetting in the conclusion part of the EFS. The application of forgetting only in the anticipation module makes it possible to respect the trade-off between stability and reactivity. Indeed, as long as the system does not detect any change in the data stream, no forgetting is necessary and the principal system adapts better to the stationary environment. But once a drift is detected, the anticipation module, learned with forgetting, will update the principal system to be more reactive to the drift. In addition, thanks to the granularity of the EFS, only the parameters affected by the drift will be updated. The contributions of the paper are as follows:
• Integration of forgetting in the conclusion part,
• Selection of the conclusion parameters to update in case of drift.
The paper is organized as follows. Section II recalls the ParaFIS model with the concept of anticipation and its current limits.
Section III begins with a discussion of the problem encountered with forgetting in the conclusion part and presents our contribution (forgetting in the conclusion part of the anticipation module and a method of selecting parameters during an update of the principal system). Section IV presents two experiments to evaluate the contribution step by step, with, at the end, a comparison with state-of-the-art data stream classifiers.

II. PARAFIS MODEL
Before detailing our contribution, this section recalls the ParaFIS system [10] on which it is based. Subsection II-A is a brief overview of related works on evolving fuzzy systems. Subsection II-B describes the architecture of a Fuzzy Inference System (FIS), subsection II-C presents the learning step of such a system, and subsection II-D details the anticipation module. Last, the advantages and problems remaining in ParaFIS are discussed in subsection II-E.
A. Related works
For two decades, many fuzzy inference systems have been designed to be learned in an incremental manner: FlexFIS [12], eTS [13], ANYA [9]. Most of them are based on the Takagi-Sugeno fuzzy system composed of a set of fuzzy inference rules $R = \{r_i, 1 \le i \le N\}$ with an antecedent part (also called premise) and a conclusion part. The difference between models lies in the choice of the structure and the choice of the criteria used to evolve the structure. The premise can be a prototype with spherical shape [12], elliptical shape [14] or cloud [9]. The conclusion can have a multiple input single output (MISO) or a multiple input multiple output (MIMO) structure [13]. The criterion used to add/remove a rule can be a distance-based criterion [12], a split condition based on the error and the rule's volume [15], or a density-based condition [9]. The ParaFIS system [10] is built from the generalized evolving fuzzy system used in many papers such as [16], [17]. The model is described below.

B. Model Architecture
The architecture of ParaFIS is based on the Takagi-Sugeno fuzzy system. The strength of such a system is to combine the adaptation to the data distribution in the antecedent part (a generative model) with the discrimination of classes in the conclusion part to better fit the decision boundaries. Each rule's antecedent is defined with a prototype that is set by a cluster with a center $\mu_i$ and a covariance matrix $A_i$. The rule's conclusion is defined with $c$ polynomial functions $l_i^j$ (for rule $i$, class $j$), $c$ being the number of classes. Finally, the structure of a rule $r_i$ is as follows:

IF $x$ is close to $\mu_i$ THEN $y_i^1 = l_i^1(x)$ ... $y_i^c = l_i^c(x)$   (1)

The degree of the polynomial function is set to 1, with $\pi_{ik}^j$ the polynomial coefficients (see Eq. (2)):

$y_i^j = l_i^j(x) = \pi_{i0}^j + \pi_{i1}^j x_1 + .. + \pi_{in}^j x_n = \Pi_i^j x$   (2)

The membership of $x$ to a rule $r_i$, denoted $\beta_i(x)$, is given by a multivariate Cauchy function of the Mahalanobis distance from $x$ to $\mu_i$ (see Eq. (3)):

$\beta_i(x) = \frac{1}{1 + (x - \mu_i) A_i^{-1} (x - \mu_i)^T}$   (3)

Finally, the predicted class for $x$ is given by Eqs. (4)-(6):

$class(x) = \hat{y} = \arg\max_j y^j(x)$   (4)

$y^j(x) = \sum_{i=1}^{N} \bar{\beta}_i(x)\, y_i^j$   (5)

$\bar{\beta}_i(x) = \beta_i(x) / \sum_{l=1}^{N} \beta_l(x)$   (6)

The architecture of a FIS is illustrated in Figure 1, block (A).

Fig. 1: Architecture of ParaFIS [10]
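For concreteness, the inference step of Eqs. (2)-(6) can be sketched as follows. This is a minimal illustrative reading of the equations, not the authors' implementation; it assumes each rule is stored as a tuple $(\mu_i, A_i, \Pi_i)$ with $\Pi_i$ of shape $(c, n+1)$, the input being extended with a bias term for the affine conclusions.

```python
import numpy as np

def membership(x, mu, A):
    """Multivariate Cauchy membership of Eq. (3), based on the Mahalanobis distance."""
    d = x - mu
    return 1.0 / (1.0 + d @ np.linalg.inv(A) @ d)

def predict(x, rules):
    """Weighted vote of the rule conclusions, Eqs. (4)-(6).

    rules: list of (mu, A, Pi) tuples, Pi of shape (c, n+1).
    """
    betas = np.array([membership(x, mu, A) for mu, A, _ in rules])
    betas_norm = betas / betas.sum()              # Eq. (6): normalized memberships
    x_ext = np.append(1.0, x)                     # [1, x_1, ..., x_n] for the bias term
    y = sum(b * (Pi @ x_ext)                      # Eq. (5): y^j(x) per class
            for b, (_, _, Pi) in zip(betas_norm, rules))
    return int(np.argmax(y))                      # Eq. (4): predicted class
```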
C. Rule's adaptation - Parameters adaptation

Each new incoming data point $x_t$ is used to adapt the model parameters. In the premise part, only the most activated rule adapts its center and covariance matrix according to Eqs. (7)-(8), in which $\alpha = 1/t$ is the fading factor, where $t = \min(k, t_{max})$ (see [14]), with $t_{max}$ the threshold defining the forgetting capacity, and $k$ the number of samples that activated the rule the most:

$\mu_t = (1 - \alpha)\, \mu_{t-1} + \alpha\, x_t$   (7)

$A_t = (1 - \alpha)\, A_{t-1} + \alpha\, (x_t - \mu_t)(x_t - \mu_t)^T$   (8)

The conclusion part is learned using a Weighted Recursive Least Squares (WRLS) method. In this optimisation problem, the weights, here the membership functions $\beta$, are assumed to be almost constant to converge to the optimal solution. To reduce the computation time, the local learning of the conclusion part is often preferred [18]. Thus, the rules are assumed to be independent in order to apply an RLS optimization on each one. The conclusion matrix $\Pi_i(t) = [\Pi_i^1(t), .., \Pi_i^c(t)]$ of the rule $r_i$ at time $t$ (i.e. after $t$ data points) is recursively computed according to:

$\Pi_i(t) = \Pi_i(t-1) + \beta_i(x)\, C_i(t)\, x\, (Y_t - x^T \Pi_i(t-1))$   (9)

where

$C_i(t) = C_i(t-1) - \frac{\beta_i(x)\, C_i(t-1)\, x x^T\, C_i(t-1)}{1 + \beta_i(x)\, x^T C_i(t-1)\, x}$   (10)

with $C_i$ a correlation matrix initialized by $C_i(t{=}0) = \Omega\, Id$, where $Id$ is the identity matrix and $\Omega$ a constant, often fixed to a large value (see [19], [13]).
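The two updates can be sketched as below. This is an illustrative transcription of Eqs. (7)-(10) under our reading of the notation: `y_target` stands for the target vector $Y_t$ (e.g. a one-hot class encoding), and $\Pi_i$ is stored row-per-class, so each class hyperplane receives the same RLS gain.

```python
import numpy as np

def update_premise(mu, A, x, k, t_max):
    """Premise adaptation of the most activated rule, Eqs. (7)-(8)."""
    alpha = 1.0 / min(k, t_max)                   # fading factor alpha = 1/t
    mu_new = (1 - alpha) * mu + alpha * x         # Eq. (7)
    d = (x - mu_new)[:, None]
    A_new = (1 - alpha) * A + alpha * (d @ d.T)   # Eq. (8)
    return mu_new, A_new

def update_conclusion(Pi, C, x_ext, y_target, beta):
    """One WRLS step, Eqs. (9)-(10).

    Pi: (c, n+1) conclusion matrix; C: (n+1, n+1) correlation matrix;
    x_ext: extended input [1, x]; beta: membership of the rule.
    """
    Cx = C @ x_ext
    C_new = C - (beta * np.outer(Cx, Cx)) / (1 + beta * x_ext @ Cx)  # Eq. (10)
    err = y_target - Pi @ x_ext                   # residual of each class hyperplane
    Pi_new = Pi + beta * np.outer(err, C_new @ x_ext)                # Eq. (9)
    return Pi_new, C_new
```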
D. Anticipation Module

In ParaFIS, as shown in Figure 1, block (B), an anticipation module (AM) is added to the fuzzy inference system (called in this case the principal system, PS). The goal of the anticipation module is to foresee the occurrence of a drift near each rule premise by anticipating the need for a structural update. Indeed, if a drift occurs in the vicinity of a rule, the distribution of points changes and a single rule is no longer sufficient to model the data. The idea of the anticipation is to consider, for each rule $i$ of the FIS, an anticipated system $S_i$ where the rule $i$ is represented by two sub-rules $i.1$, $i.2$ modeling the same distribution of points. In this way, if a drift occurs, the anticipated system will already be effective in adapting to the drift before even detecting it. To do this, the anticipation module is learned in parallel with the principal system. Only the anticipated system $S_i$ of the most activated rule $i$ is learned synchronously with the rule $i$.

In ParaFIS, the premise of the principal system does not have forgetting (there is no $t_{max}$), while the premises of the two sub-rules $i.1$, $i.2$ of $S_i$ have two different fading factors $\alpha_1$, $\alpha_2$. One sub-rule is learned with a low forgetting factor to capture information over a long period of time for stability, while the other is learned with a high forgetting factor to capture the most recent information in order to be reactive in case of drift.

In addition, the detection of a change in the distribution is also done in the anticipation module. The ParaFIS system detects that a rule $i$ is no longer sufficient to model the data (i.e. a drift has occurred) if the premises of the two sub-rules $i.1$, $i.2$ in the anticipated system $S_i$ are sufficiently separated according to the clustering separability criterion given in Eqs. (11)-(12):

Condition 1: $\|\mu_i - \mu_j\| > k_s\, (\sigma_i + \sigma_j)$   (11)

Condition 2: $k_i > n_{min}$   (12)

where $\sigma_i$ (resp. $\sigma_j$) is the distance between $\mu_i$ (resp. $\mu_j$) and the hyper-ellipsoid envelope of the cluster $i$ (resp. $j$), along the $(\mu_i, \mu_j)$ axis, and $k_s$ is a coefficient related to the separation between clusters. When the conditions are met for the two sub-rules $i.1$, $i.2$, the principal system is replaced by the anticipated system $S_i$.
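A sketch of this detection test follows. Two points are our assumptions, not specified above: the "envelope" is taken as the one-Mahalanobis-unit ellipsoid of each sub-rule covariance (the exact envelope definition follows [14]), and Condition 2 is checked for both sub-rules.

```python
import numpy as np

def ellipsoid_radius(A, u):
    """Radius of the ellipsoid {x : x^T A^{-1} x = 1} along unit direction u."""
    return 1.0 / np.sqrt(u @ np.linalg.inv(A) @ u)

def drift_detected(mu_1, A_1, k_1, mu_2, A_2, k_2, k_s, n_min):
    """Separability test of Eqs. (11)-(12) between the two sub-rule clusters."""
    u = mu_2 - mu_1
    dist = np.linalg.norm(u)                      # ||mu_i - mu_j||
    u = u / dist                                  # unit vector along the (mu_i, mu_j) axis
    sigma_1 = ellipsoid_radius(A_1, u)            # envelope distance of cluster 1
    sigma_2 = ellipsoid_radius(A_2, u)            # envelope distance of cluster 2
    separated = dist > k_s * (sigma_1 + sigma_2)  # Condition 1, Eq. (11)
    mature = min(k_1, k_2) > n_min                # Condition 2, Eq. (12), for both sub-rules
    return separated and mature
```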
E. Discussion

The ParaFIS system has two main advantages. First of all, the rule creation condition (i.e. the detector) is based on a separability criterion between the clusters, which is robust enough to noise, contrary to distance-based conditions. Second, the anticipation of the premise part makes it possible to maintain an efficient and stable principal system in the absence of drift, while remaining reactive and better fitted when a drift occurs, thanks to the structural update carried out from the anticipation module. However, in ParaFIS as in any fuzzy inference system, the conclusion part has no forgetting capacity, which raises two concerns. First, the system could be more reactive and more efficient after a drift if the conclusion were learned with forgetting. Second, forgetting only the premise part but not the conclusion part leads to an inconsistency in the system. Indeed, the two parts are learned with different information if forgetting is applied differently, which could be damaging.

III. OUR CONTRIBUTION
The paper's contribution is divided into three parts. The first part is a discussion on the difficulty of introducing forgetting in the conclusion part. The second part is our proposal to integrate forgetting in the conclusion part of the anticipation module. The third part proposes two strategies to replace the conclusion of the principal system from the anticipation module when a drift is detected.
A. Discussion on forgetting and rules’ conclusion
The conclusion part of an evolving fuzzy system aims to discriminate between classes. Its learning assumes that the underlying distribution does not evolve over time. Typically, the $\beta(x)$ function is assumed constant over time, which is no longer true when drifts occur. To maintain an efficient and accurate discrimination between classes over time, and to maintain consistency between the premise part and the consequent part, it is necessary to introduce a forgetting capacity in the conclusion. The common approach is to exponentially weight the data over time [20]. However, introducing forgetting in the RLS learning method without introducing instability is still an open question. Indeed, if old data are forgotten whatever their significance, as in classical methods, this causes the unbounded growth of the estimator (known as estimator blowup or covariance windup problem), leading to noise sensitivity and numerical difficulties [11]. Several ad-hoc approaches have been proposed, mainly based on regularization methods with assumptions or on the introduction of an upper bound [11]. To the best of our knowledge, these methods are still not used in evolving fuzzy systems, as they do not prevent the collapse of the conclusion matrix over a long time. Recently, in [14], a new forgetting method, called deferred directional forgetting (DDF), has been proposed to forget the RLS parameters. The main idea is based on the concept of "directional forgetting", i.e. limiting the windup phenomenon by directing the forgetting in the most excited dimensions in a sliding window. This offers a good compromise between forgetting data and maintaining the stability of the parameters. However, as discussed earlier, the stability-plasticity dilemma tells us that it is damaging to forget data when there is no change in the data distribution. Starting from these problems, the next part introduces our contribution with a strategy to anticipate the forgetting in the conclusion part, in order to maintain consistency with the premise part and the current data stream in the ParaFIS model.
To integrate forgetting in the conclusion part, we propose to use the DDF method [14]. In DDF, the data points are saved in a sequential sliding window of fixed size. Once the window is full, the oldest data point is used to recursively decrement the correlation matrix $C_i(t)$. The correlation matrix represents the directional forgetting matrix that is used to update the RLS parameters in the conclusion matrix $\Pi(t)$. In this way, the correlation matrix is learned only on the data points from time $s$ to time $t$, with $t - s$ the window size. The decremental equation is given in Eq. (13), where $C_i(s \to t)$ is the correlation matrix of the rule $i$ learned on the data points from $s$ to $t$. On the contrary, the conclusion matrix is not decremented, to preserve the robustness of the consequent part and avoid windup problems:

$C_i(s{+}1 \to t) = C_i(s \to t) + \frac{\beta_i(x_s)\, C_i(s \to t)\, x_s x_s^T\, C_i(s \to t)}{1 - \beta_i(x_s)\, x_s^T C_i(s \to t)\, x_s}$   (13)

Fig. 2: Illustration of the two replacement strategies in the event that a drift is detected for rule 2. In the principal system, the 3 rules are represented with their respective conclusions. In the anticipation module, the 3 anticipated systems are represented with their own conclusion matrices learned with DDF. Replacement strategies are illustrated by the arrows.

However, there are a few pitfalls to avoid. Unlike the premise part, where the center and the covariance matrix are computed for each rule independently of the other rules, the learning of the conclusion matrix of a rule depends on the others through the normalized $\bar{\beta}(x)$ function. Indeed, if a drift occurs near the premise part of the rule $i$, then the conclusion parts of all rules in the system will be impacted through $\bar{\beta}(x)$. As a result, the conclusion part of a rule cannot be anticipated without taking into account the other rules.

To compute the normalized beta for the anticipated conclusions, it is necessary to virtually build the $r$ anticipated systems $S_i$, as illustrated in Figure 1, block (B), with $r$ the number of rules. The $r$ different assumptions lead to $r$ different scalar fields of the normalized membership function used to compute the RLS parameters, which lead to $r$ different conclusion matrices. As an example, let us consider the RLS update under the hypothesis that the local distribution fitted by the premise of rule 1 has drifted (system $S_1$). In $S_1$, all conclusion matrices $\Pi_j$, $j \in [1, r]$, are updated using the last sample $x_t$ according to Eq. (9). The normalized membership function $\bar{\beta}_i(x)$ for the rule $i$ is $\bar{\beta}_i(x) = \beta_i(x) / \sum_j \beta_j(x)$ with $j \in \{1.1, 1.2, 2, 3, .., r\}$. In the end, each of the $r$ anticipated systems requires the update of $(r+1) \cdot C$ hyperplanes. Updating the $r \cdot (r+1) \cdot C$ hyperplanes on each point is time-consuming, so it does not satisfy the real-time constraint. The following section presents two strategies to reduce the complexity of the system.
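The DDF window mechanism can be sketched as follows. This is an illustrative reading of Eq. (13); the class name and storage layout are ours, not taken from [14].

```python
import numpy as np
from collections import deque

class DDFWindow:
    """Sliding window of the deferred directional forgetting (DDF), Eq. (13).

    Once the window is full, the oldest sample decrements the correlation
    matrix C only; the conclusion matrix Pi is never decremented.
    """

    def __init__(self, size):
        self.window = deque(maxlen=size)          # stores (beta, x_ext) pairs

    def push(self, beta, x_ext, C):
        """Store the newest sample and, if the window overflows, forget the
        oldest sample's direction in C; returns the updated C."""
        if len(self.window) == self.window.maxlen:
            beta_s, x_s = self.window.popleft()   # oldest sample (time s)
            Cx = C @ x_s
            C = C + (beta_s * np.outer(Cx, Cx)) / (1 - beta_s * x_s @ Cx)  # Eq. (13)
        self.window.append((beta, x_ext))
        return C
```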
C. Strategies to update the conclusion part from anticipation

To decrease the computation complexity, the naive idea of considering only the conclusion matrices of the sub-rules $i.1$, $i.2$, regardless of the others, can be explored. In this naive approach, only the conclusions $\Pi_{i.1}$, $\Pi_{i.2}$ are computed for the $S_i$ system ($i \in [1, r]$). Once a drift is detected for a rule $i$, it is replaced by the two sub-rules $r_{i.1}$, $r_{i.2}$ without replacing the conclusion matrices of the other rules. This naive strategy is illustrated in Figure 2. The naive approach assumes that a drift in the local area of a rule will have no impact on the conclusions of the others. However, all rules contribute to the decision to classify a point for any class. If a drift occurs for one class, then all rules' conclusions will be impacted.

Another assumption can be made. The hyperplanes $\Pi_j/(S_k)$ of the rules $j \neq k$ in the virtually built $k$-th anticipated system are assumed identical and equal to $\Pi_{j.1}/(S_j)$. In this way, for each of the anticipated systems, all the conclusion matrices are known. Then, once a drift is detected near the premise of the rule $i$, the conclusion matrices of all rules are replaced by the conclusion matrices of the anticipated systems, as illustrated in Figure 2. Under this assumption, the other rules have also been learned with forgetting in the premise and conclusion parts. It also assumes that the second sub-rule in each anticipated system will not impact the hyperplanes too much (i.e. considering the rule $j \neq i$ is the same as considering rule $j.1$ and rule $j.2$).

In the end, the final system with the contribution can be illustrated with Figure 3. Block (A) is a classical FIS that receives the data and gives the recognition label. Block (B) contains the $r$ anticipated systems, which are learned in parallel with forgetting (premise part + conclusion part). Block (C) is the drift detector, based on a separability criterion applied on the premise parts of the two sub-rules of each anticipated system. Once a drift is detected in $S_i$, the principal system is replaced by the anticipated one.

An important point is that the choice of the evolving fuzzy system used as the principal system is free. The choice made in this paper may not be the best one; for example, an evolving fuzzy system with an anticipated cloud structure [9] as premise could be better. But the main suggestion is that using the anticipation concept and forgetting in the conclusion can help any fuzzy system to be more reactive in case of drift while keeping stability otherwise, as it is with our choice of fuzzy system.
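The two replacement strategies of Fig. 2 can be summarized by the following sketch. The container types and attribute names are hypothetical: `rules` is the rule list of the principal system, and `S[i]` holds the two sub-rules of the anticipated system $S_i$; the mapping used for the other rules' conclusions in the global strategy is our reading of the assumption above.

```python
def replace_naive(rules, S, i):
    """Naive strategy: only rule i is replaced by its two sub-rules;
    the conclusions of the other rules are kept unchanged."""
    rules[i:i + 1] = [S[i].sub_rule_1, S[i].sub_rule_2]

def replace_global(rules, S, i):
    """Global strategy: rule i is split, and every other rule j also takes
    the conclusion learned with forgetting in its own anticipated system."""
    for j, rule in enumerate(rules):
        if j != i:
            # Assumption of the text: Pi_j/(S_k) is taken as Pi_{j.1}/(S_j)
            rule.Pi = S[j].sub_rule_1.Pi
    rules[i:i + 1] = [S[i].sub_rule_1, S[i].sub_rule_2]
```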
IV. EXPERIMENTS

Experiments are conducted to evaluate the advantages of forgetting the conclusion part with anticipation. The first and second sections introduce the benchmark datasets and the protocol taken from the recent paper [17]. The third section compares both the naive and the global strategies to update the conclusion part of the principal system after a drift.

Fig. 3: Final system obtained from ParaFIS with the addition of forgetting in the conclusion part of the anticipation module.

The fourth section compares the different strategies to apply forgetting in the conclusion part, between [No Forget, forgetting only in the anticipation module (Forget AM Naive/Global), forgetting continuously in the principal system (Forget PS)]. The last section compares our contribution with state-of-the-art evolving fuzzy systems and ensemble classifiers obtained from [17], and shows improvements on 8 among 10 datasets containing different kinds of drifts. The final mean accuracy scores are given in Table I.
A. Evaluation protocol
Evaluating the performance of a streaming algorithm requires different protocols from those used for evaluating classical learning algorithms. Many of them are discussed in [21]. This paper is only concerned with the classification performance of a system. The simulation follows the periodic hold-out process, where the stream of data is generated chunk by chunk. One chunk is used to train the system and then one chunk is used to test the system in an online mode. Thus, the system is evaluated every two chunks to build performance criteria over time. The performance is measured using the mean accuracy score and the standard deviation computed on all the chunks. However, the existence of drift in a dataset naturally induces important fluctuations of the score, independently of the classifier. To compensate for this, McNemar's significance test is also presented [22]. The McNemar test is used to compare two classifiers evaluated only once over the same dataset [22], as is the case here. It consists in computing the statistic $K$ given in Eq. (14), with $n_{0,1}$ the examples misclassified by the first classifier and not by the second, and $n_{1,0}$ the examples well-classified by the first classifier and not by the second. The $K$ distribution converges to a $\chi^2$ distribution with 1 degree of freedom. The null hypothesis of a non-significant difference between the two classifiers is rejected with a confidence score $\alpha$ if $K$ is greater than a given threshold. A statistic $K$ that rejects the null hypothesis with a confidence above 99% ($K > 6.63$) is noted by a +, between [90%, 99%] by a ≈, and below 90% ($K < 2.71$) by a −. If the number of contingent errors $n_{0,1} + n_{1,0}$ is below the recommended value of 25 to converge toward a chi-square distribution, a (x) is added. The McNemar test is applied on the classifier using the best strategy, "Forget AM Global", against the other strategies, to measure a significant difference between the strategies. The proposed test cannot be extended to the state-of-the-art benchmark results obtained from [17], due to the unavailability of the classifiers (the classification score of the benchmark classifiers over each data point would be required).

$K = \frac{(n_{0,1} - n_{1,0})^2}{n_{0,1} + n_{1,0}}$   (14)

TABLE I: Final Results - Mean Accuracy Score (± standard deviation) per dataset. Rows: ParaFIS [No Forget, Forget PS, Forget AM Naive, Forget AM Global], Learn++ CDE, Learn++ NSE, pENsemble [AxisParallel, Multivariate], pClass. Columns: Electricity Pricing, Hyperplane, Iris+, Car, 10dplane, Weather, Sea, SinH, Line, Sin, Mean.
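For reference, the statistic of Eq. (14) reduces to a one-line computation; the helper below is illustrative, with a worked example of the decision thresholds used in the tables.

```python
def mcnemar_k(n01: int, n10: int) -> float:
    """McNemar statistic K of Eq. (14).

    n01: samples misclassified by classifier A but not by B;
    n10: samples misclassified by B but not by A.
    K converges to a chi-square distribution with 1 degree of freedom.
    """
    return (n01 - n10) ** 2 / (n01 + n10)

# Example: n01 = 40, n10 = 18 gives K ≈ 8.3 > 6.63, so the difference
# is significant at the 99% confidence level (noted "+" in the tables);
# n01 + n10 = 58 >= 25, so no "(x)" flag is needed.
```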
B. Datasets

In order to evaluate the algorithm over different types of drifts (incremental/brutal/gradual/recurrent), the datasets Sin, SinH, 10dplane, Line, Car, Iris+ have been chosen. They are generated with simulated drifts, often using the mathematical equations described in [23]. The SEA dataset [24], in its extended version [25], proposes to mix several types of drift with noise and imbalanced data. In addition, the Weather dataset from [26], with incremental drifts, is studied. The Hyperplane dataset, obtained from the supplemental material of paper [17] and generated with the MOA framework [27], and the real-world dataset Electricity Pricing [28] are also investigated. Table II gives the information on the datasets and presents the test parameters, with in the columns: IA: Input Attributes, C: Classes, DP: Data Points, TS: Time Stamps, TRS: Training Samples, TES: Testing Samples. Thus, all of these datasets cover a wide variety of data streams with different shapes of data distributions and different drifts.

TABLE II: Datasets description

Data stream          IA  C  DP       TS   TRS   TES
SEA                  3   2  100 000  200  250   250
Weather              4   2  18159    400  30    30
Line                 2   2  2500     10   200   50
Sin                  2   2  2500     10   200   50
SinH                 2   2  2500     10   200   50
10dplane             10  2  1200     10   100   20
Iris+                4   4  450      10   34    11
Car                  6   2  1728     10   130   42
Hyperplane           4   2  120K     96   1000  250
Electricity pricing  8   2  45312    199  150   77

The ParaFIS model contains 4 parameters to define:
• $\alpha_1$, $\alpha_2$: the forgetting factors of the sub-rules $i.1$, $i.2$;
• $k_s$: the separation measure between clusters;
• $ws$: the window size for DDF.
The $\alpha_1$, $\alpha_2$ parameters depend on the dimensions of the feature space. $\alpha_1$ is fixed to 200 and $\alpha_2$ to 10 or 30 (the best score is chosen). The $k_s$ parameter depends not only on the dimension of the space but also on the choice of $\alpha_1$, $\alpha_2$ and the data distribution. There is no rule to define it, so several values are tested on a validation dataset (20% of the total dataset) and the value that creates rules in a "good" proportion is chosen. The window size of the DDF method depends on the dimensions of the space but also on the number of classes. To set it, window size values are tested over a range starting at 10 and the window with the best score is chosen. The set of parameters of a dataset is the same for all the configurations tested (No Forget, Forget PS, Forget AM Naive, Forget AM Global).

C. Comparison of Naive and Global update strategies
In the previous part, two updating strategies have been proposed: the naive approach and the global replacement of the conclusion matrices under an assumption. These are two strategies used to satisfy real-time constraints, based on handcrafted assumptions. The first considers that it is more important to maintain the stability of the rules far from the drift, even if the reactivity of the adaptation of their hyperplanes is reduced. On the contrary, the second assumes that damaging the stability of the other rules is preferable to improve the reactivity of the hyperplanes when a drift occurs. Both strategies are tested on all datasets, and the results of the models [Forget AM Naive, Forget AM Global] are presented in Table I. We can see that replacing all the conclusion parts of all rules from the anticipation module is often the better solution. This means that using hyperplanes learned with directional forgetting, just after a drift, makes it possible to react better to the drift without damaging the stability of old knowledge too much. This can be explained by the fact that the conclusion matrix is not decremented, only the correlation matrix is, to guide the learning. Thus, the old knowledge is still contained in the conclusion matrices and, after the switch, the decremented directional matrices will help the system to react more quickly to the drift by directing the conclusion learning on the new concept.

Fig. 4: Example of hold-out test with three datasets, (a) Sea, (b) Hyperplane, (c) 10dplane, for the three strategies (No Forget, forget in principal system PS, forget in anticipation module AM). S is the mean score.
D. Different strategies to apply forgetting in the conclusion part
In order to measure the interest of applying forgetting with DDF only in the anticipation module, two other strategies are tested: the "No Forget" strategy and the "Forget Principal System" strategy (Forget PS). The first one is just the ParaFIS system described in Sec. II, where no forgetting capacity is applied in the consequent part. In the second one, forgetting is applied in the anticipation module as described in our contribution (Sec. III). In addition, forgetting is also applied in the principal system to get a system that continuously forgets. When a drift is detected and the conclusions are replaced from the anticipation module to the principal system, the windows used in the anticipation module also replace the ones used in the principal system. This keeps the memorized normalized activations in the DDF windows consistent with the learning of the conclusion matrices. Results of the hold-out test on all the datasets are presented in Table I, in the rows [ParaFIS No Forget, Forget PS, Forget AM Global]. Examples of plots of the mean score by chunk are given in Figure 4 for the Sea, Hyperplane and 10dplane datasets. By comparing the rows "No Forget" and "Forget PS", we see that there is no dominant strategy. Sometimes it is preferable to forget the conclusion, sometimes not, depending on the stability-plasticity dilemma. However, our strategy "Forget AM Global" allows a trade-off between the two strategies by forgetting just at the time of the drift, which results in a better accuracy score on all datasets except SinH. To validate the significance of the results, a McNemar test is carried out by comparing the best strategy, "Forget AM Global", with all the other strategies. Results are presented in Table III. We can see that for the Electricity Pricing, Iris+, Car and SinH datasets, there is no difference in the proportion of errors. For most of these datasets, the number of contingent errors is less than the recommended value of 25, due to the small size of the dataset. For 10dplane, SEA, and Line, where the difference in the mean accuracy score is the most important, there is a significant difference in the proportions of errors. The strategy "Forget AM Global" is significantly better for most of the datasets where a difference in mean accuracy score is observed in Table I.

TABLE III: McNemar test between "Forget AM Global" and the other strategies. A statistic K that rejects the null hypothesis with a confidence above 99% (K > 6.63) is noted by a +, between [90%, 99%] by a ≈, and below 90% (K < 2.71) by a −. If the number of contingency errors is below the recommended value of 25, a (x) is added.

                     Strategies (compared to Forget AM Global)
Dataset              No Forget  Forget PS  Forget AM Naive
Electricity Pricing  −          −          −
Hyperplane           +          ≈          ≈
Iris+                − (x)      − (x)      − (x)
Car                  −          −          − (x)
10dplane             + (x)      +          + (x)
Weather              +          +          +
Sea                  +          +          +
SinH                 ≈ (x)      − (x)      −
Line                 +          − (x)      +
Sin                  + (x)      − (x)      ≈

E. Comparison with state-of-the-art
Finally, to validate the proposed method, a comparison with state-of-the-art streaming classifiers is carried out, based on the results published in [17]. Results are shown in Table I. The pClass classifier [29] is an evolving fuzzy system based on a generalized TSK fuzzy inference system, as ParaFIS is, with a rule creation process based on a recursive "density" function and with an online feature selection. Its ensemble version, namely pENsemble [17], is an ensemble classifier using pClass as base learner. It is combined with online drift detection, ensemble pruning and online feature scenarios. Two other ensemble classifiers are compared: Learn++.NSE [26] and Learn++.CDE [25]. They are designed to deal with non-stationary streams thanks to a dynamic voting scenario which reflects the current data stream. We can see in the results that, for 6 among the 10 benchmark datasets, the ParaFIS system outperforms the state-of-the-art models. The average mean score over all datasets reaches 85% accuracy, against 82% for pENsemble, the best from the state-of-the-art.

V. CONCLUSION & OUTLOOKS

This paper introduced a new strategy to include a forgetting capacity in the conclusion part of an evolving fuzzy system, based on the deferred directional forgetting. The strategy relies on the anticipation concept recently introduced in the ParaFIS system. The results of the paper highlight that the consequent part of the EFS plays an important, interdependent role in the classification performance of such a system. Consequently, it is necessary to integrate forgetting to adapt the conclusion part to non-stationary streams. The proposed method consists in updating the consequent part of all rules when a drift is detected. The update is based on the anticipation of the forgetting of the conclusion part. It gives a convincing trade-off for the plasticity-stability dilemma. The performances obtained by the proposed approach, for a classification task, are superior to those of the state-of-the-art approaches, for most of the tested datasets. However, these systems still have the thorny problem of parameter setting. Indeed, how to set the parameters in a data stream context (with no data available) is an important question to be resolved, in particular with the ParaFIS system, which has 4 parameters. Future work will investigate how to define or adapt these parameters with the data stream.

REFERENCES
[1] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, pages 1–16, New York, NY, USA, 2002. Association for Computing Machinery.
[2] L. Song, C. Tekin, and M. van der Schaar. Online learning in large-scale contextual recommender systems. IEEE Transactions on Services Computing, 9(3):433–445, 2016.
[3] Manuel Bouillon and Eric Anquetil. Online active supervision of an evolving classifier for customized-gesture-command learning. Neurocomputing, 262:77–89, November 2017.
[4] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 97–106, New York, NY, USA, 2001. ACM.
[5] Elena Ikonomovska, João Gama, and Sašo Džeroski. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23(1):128–168, Jul 2011.
[6] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: a new ensemble method for tracking concept drift. In Third IEEE International Conference on Data Mining, pages 123–130, Nov 2003.
[7] Ryan Elwell and Robi Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, pages 1517–1528, October 2011.
[8] Mahardhika Pratama, Sreenatha G. Anavatti, and Edwin Lughofer. An incremental classifier from data streams. In Aristidis Likas, Konstantinos Blekas, and Dimitris Kalles, editors, Artificial Intelligence: Methods and Applications, pages 15–28, Cham, 2014. Springer International Publishing.
[9] Plamen Angelov and Ronald Yager. Simplified fuzzy rule-based systems using non-parametric antecedents and relative data density. pages 62–69, April 2011.
[10] Clement Leroy, Eric Anquetil, and Nathalie Girard. ParaFIS: A new online fuzzy inference system based on parallel drift anticipation. FUZZ-IEEE, New Orleans, Jul 2019.
[11] Janusz Milek and Frantisek J. Kraus. Time-varying stabilized forgetting for recursive least squares identification. IFAC Proceedings Volumes, 28(13):137–142, 1995. 5th IFAC Symposium on Adaptive Systems in Control and Signal Processing 1995, Budapest, Hungary, 14-16 June, 1995.
[12] E. D. Lughofer. FlexFIS: A robust incremental learning approach for evolving Takagi–Sugeno fuzzy models. IEEE Transactions on Fuzzy Systems, 16(6):1393–1410, 2008.
[13] Plamen Angelov and Dimitar P. Filev. An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 34(1):484–498, Feb 2004.
[14] M. Bouillon, E. Anquetil, and A. A. Almaksour. Decremental learning of evolving fuzzy inference systems using a sliding window. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications - Volume 01, ICMLA '12, pages 598–601, Washington, DC, USA, 2012. IEEE Computer Society.
[15] E. Lughofer, M. Pratama, and I. Skrjanc. Incremental rule splitting in generalized evolving fuzzy systems for autonomous drift compensation. IEEE Transactions on Fuzzy Systems, 26(4):1854–1865, Aug 2018.
[16] Andre Lemos, Walmir Caminhas, and Fernando Gomide. Multivariable Gaussian evolving fuzzy modeling system. IEEE Transactions on Fuzzy Systems, 19(1):91–104, Feb 2011.
[17] Mahardhika Pratama, Witold Pedrycz, and Edwin Lughofer. Evolving ensemble fuzzy classifier. IEEE Transactions on Fuzzy Systems, 26(5):2552–2567, Oct 2018.
[18] J. Yen, Liang Wang, and C. W. Gillespie. Improving the interpretability of TSK fuzzy models by combining global learning and local learning. IEEE Transactions on Fuzzy Systems, 6(4):530–537, Nov 1998.
[19] Abdullah Almaksour and Eric Anquetil. Improving premise structure in evolving Takagi–Sugeno neuro-fuzzy classifiers. Evolving Systems, 2(1):25–33, Mar 2011.
[20] Chi Sing Leung, G. H. Young, J. Sum, and Wing-Kay Kan. On the regularization of forgetting recursive least square. IEEE Transactions on Neural Networks, 10(6):1482–1486, 1999.
[21] João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317–346, March 2013.
[22] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
[23] Leandro L. Minku, Allan P. White, and Xin Yao. The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Transactions on Knowledge and Data Engineering, 22(5):730–742, May 2010.
[24] Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. pages 377–382, 07 2001.
[25] Gregory Ditzler and Robi Polikar. Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 25(10):2283–2301, Oct 2013.
[26] Ryan Elwell and Robi Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517–1531, Oct 2011.
[27] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA: Massive online analysis. Journal of Machine Learning Research, 11:1601–1604, 2010.
[28] Michael Harries. SPLICE-2 comparative evaluation: Electricity pricing. Technical report, UNSW CSE TR, New South Wales, 1999.
[29] Edwin Lughofer, Carlos Cernuda, Stefan Kindermann, and Mahardhika Pratama. Generalized smart evolving fuzzy systems.