Drift-Aware Multi-Memory Model for Imbalanced Data Streams
Amir Abolfazli
L3S Research Center, Leibniz University of Hanover
Eirini Ntoutsi
L3S Research Center, Leibniz University of Hanover
Abstract—Online class imbalance learning deals with data streams that are affected by both concept drift and class imbalance. Online learning tries to find a trade-off between exploiting previously learned information and incorporating new information into the model. This requires both the incremental update of the model and the ability to unlearn outdated information. The improper use of unlearning, however, can lead to the retroactive interference problem, a phenomenon that occurs when newly learned information interferes with the old information and impedes the recall of previously learned information. The problem becomes more severe when the classes are not equally represented, resulting in the removal of minority information from the model. In this work, we propose the Drift-Aware Multi-Memory Model (DAM3), which addresses the class imbalance problem in online learning for memory-based models. DAM3 mitigates class imbalance by incorporating an imbalance-sensitive drift detector, preserving a balanced representation of classes in the model, and resolving retroactive interference using a working memory that prevents the forgetting of old information. We show through experiments on real-world and synthetic datasets that the proposed method mitigates class imbalance and outperforms the state-of-the-art methods.
Index Terms—online learning, class imbalance, concept drift, retroactive interference, multi-memory model.
I. INTRODUCTION
The challenge of learning from imbalanced data streams with concept drift has attracted a lot of attention from both academia and industry in recent years. The term concept drift refers to changes in the underlying data distribution over time.
Class imbalance occurs when the classes are not represented equally. Online class imbalance learning deals with data streams that are affected by both concept drift and class imbalance and exists in many real-world applications such as anomaly detection, risk management, and social media.

Online learning algorithms dealing with imbalanced streams not only try to better represent the minority class for the learning model (e.g., by oversampling the minority class) but also try to find a trade-off between retaining the previously learned information and adapting to new information from the stream, known as the stability-plasticity dilemma [1]. The basic idea is that a learning model requires plasticity for the integration of new information, but also stability in order to prevent the forgetting of old information [2]. A too adaptive model forgets previously learned information and a too stable model cannot learn new information; hence, finding a trade-off between plasticity and stability is required.

In recent years, many methods have been proposed to deal with online class imbalance learning (e.g., [3]-[5]). Some methods employ unlearning to deal with concept drift (e.g., [6]-[8]) by removing the information that is inconsistent with the incoming data from the stream, i.e., has contradicting class labels. SAM-kNN [7] is one such kNN-based model that makes use of unlearning in the neighborhood of the instances. SAM-kNN is a dual-memory model [7] that partitions the knowledge between a short-term memory (STM) and a long-term memory (LTM), containing the information of the current and former concepts, respectively. Preserving consistency in SAM-kNN is based on a cleaning operation that unlearns the information of former concepts in the LTM that contradicts the information of the most recent concept, stored in the STM. Although unlearning is a desired property for model adaptation, if not applied carefully, it can lead to the retroactive interference problem [9], which occurs when new information interferes with previously learned information, causing the (unintentional) forgetting of old information. Figure 1 illustrates the problem of retroactive interference in a dual-memory model, where old information in the LTM is removed because the model adapts to new data in the STM. The problem becomes more severe when the classes are not equally represented in the stream, as it can lead to the removal of minority instances, which are of higher interest than majority instances in many real-world applications.

In this work, we propose a multi-memory model which deals with class imbalance by 1) incorporating an imbalance-sensitive drift detector, 2) preserving a balanced representation of classes in the model, and 3) resolving retroactive interference by means of a working memory (WM) [10] that manipulates information in the LTM for every incoming instance to the STM. We also contribute new synthetic benchmarks with different drift types and class imbalance ratios.

II. PRELIMINARIES AND BASIC CONCEPTS
A data stream $D$ is a potentially infinite sequence of instances arriving at distinct time points $t_1, \dots, t, \dots$, where $t$ is the current time point. Each instance $\mathbf{x} \in D$ is described in a $d$-dimensional feature space, i.e., $\mathbf{x} \in \mathbb{R}^d$. Without loss of generality, we assume a binary classification problem, i.e., $Y = \{+, -\}$. We follow the first-test-then-train or prequential evaluation setup [11]. Assuming a probability distribution $P(\mathbf{x}, y)$ generating the instances of $D$, the characteristics of $P$ might change over time, i.e., for two time points $i, j$ it might hold that $P_i(\mathbf{x}, y) \neq P_j(\mathbf{x}, y)$, a phenomenon called concept drift [11]. The drift type can be characterized by the rate at which drift occurs. Sudden drift results in a severe change in the distribution of the data. Incremental drift occurs when the concept changes incrementally. Gradual drift occurs when instances belonging to two different concepts are interleaved for a certain period of time. Recurring drift describes a case in which a concept that has already been observed reoccurs.

Figure 1. Illustration of the retroactive interference problem in a dual-memory model. (a) The red star instance is the most recent instance in the STM; its neighborhood (green dashed circle) is defined by the maximum distance of consistent instances (i.e., instances with the same class: red) in its neighborhood ($k = 5$). (b) State of the LTM at time $t$, with the affected area (green dashed circle) centered at the new point's location. (c) State of the LTM at time $t + 1$, where (the majority of) the previously stored instances have been removed due to inconsistency with the new information of the STM.

Apart from the occurrence of concept drifts, we also assume that the stream is imbalanced, with the majority class (assumed to be $-$) occurring more often than the minority class (assumed to be $+$). To express the degree of imbalance, we use the imbalance ratio (IR) [12], defined as the number of minority instances over majority instances. IR is commonly denoted by 1:r (r is a value corresponding to the majority class), which specifies the ratio between the minority and majority class [13]. In a streaming data environment, imbalance can be either static, assuming a fixed class ratio, or dynamic, assuming a varying class ratio over the stream. Learning under imbalance is harder in a stream environment as there is no prior knowledge about the IR, and often the roles of minority and majority change over the stream [14].
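To make the IR notation concrete, the following sketch tracks the cumulative imbalance ratio of a binary label stream; the function name and the streaming interface are illustrative, not part of DAM3.

```python
from collections import Counter

def cumulative_imbalance_ratio(stream_labels):
    """Track the cumulative IR of a binary stream: at each time point,
    the ratio r of majority to minority instances seen so far (IR = 1:r)."""
    counts, ratios = Counter(), []
    for y in stream_labels:
        counts[y] += 1
        n_min, n_maj = min(counts.values()), max(counts.values())
        ratios.append(n_maj / n_min)
    return ratios

# a static 1:4 stream: the cumulative IR converges to 4
print(cumulative_imbalance_ratio(["+", "-", "-", "-", "-"] * 100)[-1])  # 4.0
```

Under a dynamic IR, the same counters would be maintained over a sliding window rather than cumulatively.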
III. RELATED WORK

Online class imbalance learning methods deal with data streams affected by both concept drift and class imbalance. Bifet et al. [15] proposed OBA, an online ensemble method that improves the Online Bagging algorithm [16] by adding the ADWIN change detector: when a change is detected, the worst-performing base learner of the ensemble is replaced with a new one. Bifet et al. [17] proposed Leveraging Bagging (LB), an online ensemble method that leverages the performance of bagging by increasing the weights of the resampling, using a larger value λ to compute the value of the Poisson distribution and thereby increase the diversity of the ensemble. LB uses the ADWIN change detector to deal with concept drifts; when a concept drift is detected, the worst base learner is reset. Both OBA and LB indirectly deal with class imbalance, as Díez-Pastor et al. [18] showed that diversity-increasing techniques such as bagging improve the performance of ensemble methods for imbalanced problems.

Melidis et al. [19] assume that the likelihood of different features with respect to the class follows different trends and propose an ensemble method that predicts the best trend detector.

Wang and Pineau [3] proposed online AdaC2, an online boosting algorithm that considers the different misclassification costs when calculating the weights of the base learners and updates the weights of instances accordingly; more precisely, AdaC2 puts more weight on misclassified positive instances than on misclassified negative instances. The same authors [3] proposed online RUSBoost, an online boosting algorithm that removes instances from the majority class by randomly undersampling the majority-class instances in each boosting round. The original versions of AdaC2 and RUSBoost do not deal with concept drift; the improved versions of these methods deal with concept drifts using an ADWIN change detector.

The most relevant work to ours is the Self Adjusting Memory model for the k Nearest Neighbor algorithm (SAM-kNN) [7]. SAM-kNN builds an ensemble of classifiers induced on different memories: the short-term memory (STM) for the current concept, the long-term memory (LTM) for former concepts, and the combined memory (CM), which is the union of STM and LTM. The authors propose a cleaning operation during the transfer that deletes instances of the LTM that are inconsistent with the transferred instances of the STM. The original SAM-kNN model does not consider class imbalance. In case of imbalance, the memories, and in particular the LTM, are increasingly dominated by majority-class instances (see Figure 5 (b)); as a result, the performance of the model on minority instances drops. As we show in our experiments, this is not only because of the reduced representation of the minority class in the input stream but also because of the cleaning operation, which deletes more instances of the minority class (see Figure 6 (e)).

Unnikrishnan et al. [20] proposed specialized kNN models (for each entity) and a global kNN (for the whole stream) to ensure adequate representation of the entities in the learning models, independent of their volume.
Moreover, their method leverages the global model to deal with the cold-start problem.

Recently, the problem of fairness-aware learning in the online setting and under class imbalance was introduced [21]; the proposed solution changes the training distribution to take into account the evolving imbalance and the discriminatory behavior of the model, both of which are evaluated over the historical stream.

IV. DRIFT-AWARE MULTI-MEMORY MODEL
In the research field of human memory, multi-memory models [22] have been proposed to overcome the limitations of dual-memory models and better represent the human memory. Such models consist of the sensory register (SR), short-term memory (STM), and long-term memory (LTM). The basic idea is that the sensory information first enters the SR, which keeps the information for a very short time. The sensory information is transferred into the STM for temporary storage and is encoded visually, acoustically, or semantically. The information is then transferred to the LTM after getting enough attention through processes such as active rehearsal [22]. The information that enters the STM is joined by context-relevant information in the LTM, which requires the retrieval of information from the LTM. Sometimes, the information of the LTM cannot be retrieved due to retroactive interference (RI) [9]. RI is the interference occurring when newly learned information impedes the recall of previously learned information. A theoretical concept proposed in the field of cognitive psychology is the working memory [23], a memory that temporarily stores information relevant to the current task.

In a dual-memory model (see Figure 1), the problem of retroactive interference occurs when inconsistent information of the LTM is replaced with new information of the STM. Such a replacement, intended for dealing with concept drifts in SAM-kNN [7], can result in loss of information that happens 1) when information is transferred from the STM to the LTM, and 2) when the LTM is cleaned with respect to the STM for every incoming instance from the stream. As we will see in the experiments (Figure 5 (b)), such a replacement greatly affects the minority class, and therefore the RI problem becomes more severe for the minority class.

To deal with this problem, we incorporate a working memory (WM) into our model, preventing the model from removing inconsistent instances. Because the stream contains more instances from the majority class, and because the LTM is cleaned with respect to every incoming instance from the stream, more minority instances become inconsistent and are thus removed from the LTM. Therefore, the minority class benefits more from the use of the WM.

The proposed model, DAM3, is a multi-memory model for data streams with class imbalance and concept drifts that: i) introduces a working memory to deal with retroactive interference (Section IV-A), ii) incorporates an imbalance-sensitive drift detector (Section IV-C) to take into account the inherent imbalance, iii) preserves a balanced representation of classes using oversampling (Section IV-D), and iv) removes noisy instances in the working memory that are generated after exchanging information (Section IV-F). The architecture of DAM3 is shown in Figure 2.
A. Model memories
DAM3 consists of four memories: STM, WM, LTM, and CM, each of which is represented by a set of labeled instances.

Short-term memory (STM) is dedicated to the current concept and is a dynamic sliding window containing the most recent $m$ instances from the stream ($t$ is the current time point):
$STM = \{(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \{+1, -1\} \mid i = t - m + 1, \dots, t\}$

Figure 2. Architecture of DAM3.
Long-term memory (LTM) maintains information ($p$ points) of former concepts that is consistent with the current concept:
$LTM = \{(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \{+1, -1\} \mid i = 1, \dots, p\}$

Combined memory (CM) represents the combination of the short-term and long-term memories. It is simply the union of STM and LTM and has size $m + p$.

Working memory (WM) lends itself to resolving the retroactive interference problem. It preserves inconsistent information of the LTM and also transfers back (to the LTM) information that becomes consistent with the most recently stored information in the STM. In this way, the WM makes its consistent information available to the LTM for current predictions, made by the LTM and CM classifiers, and also retains valuable information for later predictions. The WM is a set of $q$ points:
$WM = \{(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \{+1, -1\} \mid i = 1, \dots, q\}$
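A minimal sketch of how the four memories could be organized as data structures, assuming list- and deque-based stores; this illustrates the definitions above, not the authors' implementation (note, e.g., that the real STM is resized dynamically by the drift detector of Section IV-C).

```python
from collections import deque

class DAM3Memories:
    """Simplified container for the four DAM3 memories."""
    def __init__(self, max_stm=5000):
        self.stm = deque(maxlen=max_stm)  # sliding window, current concept
        self.ltm = []                     # consistent former-concept points
        self.wm = []                      # inconsistent LTM points, kept for later

    @property
    def cm(self):
        # combined memory: simply the union of STM and LTM (size m + p)
        return list(self.stm) + self.ltm

    def receive(self, x, y):
        self.stm.append((x, y))           # every incoming instance enters the STM
```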
B. DAM3 training, weighting, and prediction

Each memory induces a classifier; therefore, DAM3 can be considered an ensemble method.
1) DAM3 training:
In the SAM-kNN [7] model, weighted kNN classifiers are employed for all memories. In this work, we use the weighted kNN for the LTM and CM, as it allows the seamless implementation of the cleaning operation. kNN assigns a label to an instance $\mathbf{x}$ based on the memory instances:

$\mathrm{kNN}_M(\mathbf{x}) = \arg\max_{\hat{y}} \sum_{\mathbf{x}_i \in N_k(\mathbf{x}, M) \mid y_i = \hat{y}} \frac{1}{d(\mathbf{x}_i, \mathbf{x})}$  (1)

where $d(\mathbf{x}_i, \mathbf{x})$ is the Euclidean distance between two points, $N_k(\mathbf{x}, M)$ is the set of $k$ nearest neighbors of $\mathbf{x}$ in $M$, and $M \in \{LTM, CM\}$.

We use the full Bayes classifier [24] for the STM instead of the weighted kNN, as instance-based learning classifiers are quite sensitive to noisy data [25]. In our case, the use of the kNN as the STM classifier might result in incorrect predictions passed to the drift detector (cf. Section IV-C). The full Bayes classifier assumes that the distribution of the data can be modeled with a multivariate Gaussian distribution [24]. A new instance $\mathbf{x}$ is classified as follows:

$\mathrm{FB}_{STM}(\mathbf{x}) = \arg\max_{y} p(y) f(\mathbf{x} \mid y)$  (2)

where $p(y)$ is the class prior and $f(\mathbf{x} \mid y)$ is the multivariate Gaussian density function [24].
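The two base classifiers can be sketched as follows, assuming inverse-distance weights for the kNN vote (as in SAM-kNN) and per-class Gaussian estimates for the full Bayes classifier; the function names and interfaces are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_knn_predict(x, memory_X, memory_y, k=5):
    """Distance-weighted kNN vote over a memory (cf. Eq. (1)).
    memory_X: (n, d) array; memory_y: length-n label array."""
    dists = np.linalg.norm(memory_X - x, axis=1)
    scores = {}
    for i in np.argsort(dists)[:k]:
        # inverse-distance weighting (assumed, as in SAM-kNN)
        scores[memory_y[i]] = scores.get(memory_y[i], 0.0) + 1.0 / max(dists[i], 1e-12)
    return max(scores, key=scores.get)

def full_bayes_predict(x, class_stats, priors):
    """Full Bayes classifier (cf. Eq. (2)): one multivariate Gaussian
    per class. class_stats maps label -> (mean vector, covariance matrix)."""
    posteriors = {y: priors[y] * multivariate_normal.pdf(x, mean=mu, cov=cov)
                  for y, (mu, cov) in class_stats.items()}
    return max(posteriors, key=posteriors.get)
```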
2) DAM3 weighting and prediction:
Each of the base classifiers STM, LTM, and CM is weighted based on its balanced accuracy on the most recent $ms$ instances of the stream, where $ms$ is equal to the minimum size of the STM. The best-performing model is chosen to predict the class label of the current instance.
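A sketch of this selection step, assuming scikit-learn's balanced_accuracy_score and a hypothetical dictionary of per-memory predictions over the most recent window:

```python
from sklearn.metrics import balanced_accuracy_score

def select_best_memory(y_recent, preds_per_memory):
    """Weight each memory classifier by its balanced accuracy on the most
    recent ms instances and return the winner.

    preds_per_memory: dict mapping "STM"/"LTM"/"CM" to the predictions each
    classifier made on the same recent window as y_recent."""
    scores = {name: balanced_accuracy_score(y_recent, preds)
              for name, preds in preds_per_memory.items()}
    return max(scores, key=scores.get)  # this memory predicts the current label
```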
C. Imbalance-sensitive drift detection

Information is transferred from the STM to the LTM when the concept changes. A change in concept is signaled by a significant degradation of the performance of the STM classifier, which corresponds to the model learned on the current concept. Due to the inherent class imbalance of the stream, the performance of the model on both classes should be taken into account. Hence, we propose a drift detector that relies on balanced accuracy. The detector takes as inputs the incoming instances and their corresponding predictions made by the STM classifier.

Our drift detector divides all the balanced accuracy values into two windows (a reference window and a test window) and performs the non-parametric Kolmogorov-Smirnov test. If the balanced accuracy values of the reference window are significantly different from those of the test window, a drift is detected and the STM size is reduced from $m$ to $ws$, where $m$ is the current size of the STM and $ws$ is the window size of the drift detector (each of the reference and test windows has size $ws$). In this way, the test window becomes the new STM ($STM_{t+1}$, with size $ws$) and the remaining set (denoted by $\Delta$), with size $m - ws$, is first oversampled and then transferred to the LTM.
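A sketch of this detector using SciPy's two-sample Kolmogorov-Smirnov test; the window size and significance level are placeholder values, not the paper's exact settings.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift_detected(bacc_values, ws=100, sig_level=0.01):
    """Drift check on the stream of balanced-accuracy values (Section IV-C)."""
    if len(bacc_values) < 2 * ws:
        return False                       # not enough history yet
    reference = np.asarray(bacc_values[-2 * ws:-ws])
    test = np.asarray(bacc_values[-ws:])
    _, p_value = ks_2samp(reference, test)
    # a significant difference between the two windows signals a drift;
    # the test window then becomes the new STM, and the remaining set
    # Delta is oversampled and transferred to the LTM
    return p_value < sig_level
```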
D. Balanced representation of remaining set

Before transferring the information from the STM to the LTM, we perform oversampling on the remaining set $\Delta$ to ensure a balanced representation of both classes. We denote the oversampled remaining set by $O_\Delta$. We use BorderlineSMOTE [26], which selects instances of the minority class that are misclassified by a kNN classifier and oversamples only those difficult instances, which are more important for classification (we used the parameter values k = 5, m = 5 for BorderlineSMOTE). The oversampled set is then transferred to the LTM.
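This step maps naturally onto the imbalanced-learn implementation of Borderline-SMOTE; a minimal sketch, assuming $\Delta$ contains enough minority instances for the neighborhood computations:

```python
from imblearn.over_sampling import BorderlineSMOTE

def oversample_remaining_set(X_delta, y_delta):
    """Balance the remaining set Delta before transferring it to the LTM,
    using Borderline-SMOTE with k = 5, m = 5 as in the paper."""
    sampler = BorderlineSMOTE(k_neighbors=5, m_neighbors=5)
    return sampler.fit_resample(X_delta, y_delta)  # (X_balanced, y_balanced)
```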
E. Transfer and cleaning

DAM3 keeps the LTM consistent with the STM, similar to SAM-kNN [7]. The LTM is cleaned with respect to every incoming instance from the stream. However, DAM3 does not perform any cleaning on the remaining set (transferred from the STM to the LTM), thanks to its designed drift detector; SAM-kNN performs this cleaning because not all instances of the remaining set might belong to the previous concept. In both cases (transfer of information from the STM to the LTM and cleaning of the LTM), the deletion of inconsistent instances might lead to the retroactive interference problem (cf. Figure 1). To allow for the deletion of outdated information and also to prevent the deletion of older information due to interference with newer information, we use the working memory. This memory resolves the retroactive interference by exchanging its consistent information with the inconsistent information of the LTM for every incoming instance from the stream. Figure 3 illustrates how this problem is resolved.

We use the basic cleaning operation proposed by SAM-kNN [7], which deletes inconsistent instances of different classes in the neighborhood of an instance. The main idea is that the most recent instances of the stream convey the correct class label, and instances with different labels in their neighborhood should be considered outdated and thus deleted.

Transfer of information from STM to LTM:
The instances of $O_\Delta$ (the oversampled remaining set) are transferred to the LTM, and the LTM is updated as follows:
$LTM_{t_d+1} = LTM_{t_d} \cup O_{\Delta_{t_d}}$
where $t_d$ is the time at which the drift occurs.

Transfer of inconsistent information from LTM to WM:
The $k$ nearest neighbors of $\mathbf{x}_i$ in $STM \setminus (\mathbf{x}_i, y_i)$ at time $t$ are determined, and the ones with label $y_i$ are selected. The distance threshold $\theta$ at time $t$ is then defined as:
$\theta_t = \max\{d(\mathbf{x}_i, \mathbf{x}) \mid \mathbf{x} \in N_k(\mathbf{x}_i, STM_t \setminus (\mathbf{x}_i, y_i)),\ y(\mathbf{x}) = y_i\}$

On the basis of the found distance threshold of the $STM_t$, we define the inconsistent set of the LTM at time $t$ ($IS_{LTM_t}$) with respect to the instance $(\mathbf{x}_i, y_i)$ in the $STM_t$:
$IS_{LTM_t} = LTM_t \cap \{(\mathbf{x}_j, y(\mathbf{x}_j)) \mid \mathbf{x}_j \in N_k(\mathbf{x}_i, LTM_t),\ d(\mathbf{x}_j, \mathbf{x}_i) \leq \theta,\ y(\mathbf{x}_j) \neq y_i\}$

Similarly, we define the consistent set of the WM at time $t$ ($CS_{WM_t}$) with respect to the set $LTM_t$ and the instance $(\mathbf{x}_i, y_i)$ in the STM:
$CS_{WM_t} = WM_t \cap \{(\mathbf{x}_j, y(\mathbf{x}_j)) \mid \mathbf{x}_j \in N_k(\mathbf{x}_i, WM_t),\ d(\mathbf{x}_j, \mathbf{x}_i) \leq \theta,\ y(\mathbf{x}_j) = y_i\}$

Based on $IS_{LTM_t}$ and $CS_{WM_t}$, the inconsistent instances of the LTM are transferred to the WM, and the WM is updated as follows:
$WM_{t+1} = (WM_t \setminus CS_{WM_t}) \cup IS_{LTM_t}$

Transfer of consistent information from WM to LTM: Consistent instances of the WM are transferred back to the LTM, and the LTM is updated as follows:
$LTM_{t+1} = (LTM_t \setminus IS_{LTM_t}) \cup CS_{WM_t}$

The WM preserves inconsistent information of the LTM, as it might be useful for later predictions, and makes its consistent information available to the LTM for current predictions.
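Putting the above update rules together, one cleaning step could look as follows; a simplified sketch with memories as lists of (x, y) pairs, not the authors' code.

```python
import numpy as np

def exchange_ltm_wm(x_new, y_new, stm, ltm, wm, k=5):
    """One cleaning step for an incoming STM instance (x_new, y_new):
    inconsistent LTM neighbors move to the WM, and consistent WM
    neighbors move back to the LTM."""
    def knn(points, x, k):
        dists = [np.linalg.norm(p[0] - x) for p in points]
        return [(points[i], dists[i]) for i in np.argsort(dists)[:k]]

    # threshold theta: farthest same-class instance among the k nearest
    # STM neighbors of x_new (x_new itself is excluded by identity)
    same_class = [d for p, d in knn([p for p in stm if p[0] is not x_new], x_new, k)
                  if p[1] == y_new]
    if not same_class:
        return ltm, wm
    theta = max(same_class)

    # inconsistent LTM set: within theta but with a different label
    is_ltm = [p for p, d in knn(ltm, x_new, k) if d <= theta and p[1] != y_new]
    # consistent WM set: within theta and with the same label
    cs_wm = [p for p, d in knn(wm, x_new, k) if d <= theta and p[1] == y_new]

    # exchange: LTM loses its inconsistent points and gains the consistent
    # WM points; the WM keeps the inconsistent points for later predictions
    ltm = [p for p in ltm if all(p is not q for q in is_ltm)] + cs_wm
    wm = [p for p in wm if all(p is not q for q in cs_wm)] + is_ltm
    return ltm, wm
```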
Compression of the LTM and WM. The information of the LTM and WM is not discarded when their size exceeds the maximum threshold. Instead, as in SAM-kNN [7], we compress the data through class-wise K-Means++ clustering, such that the number of instances per class is reduced by half.
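A sketch of this compression using scikit-learn's KMeans with k-means++ initialization; each class is replaced by half as many cluster centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_memory(X, y):
    """Halve a memory class-wise: each class is replaced by len(class)//2
    cluster centers, as in the SAM-kNN compression scheme."""
    X_out, y_out = [], []
    for label in np.unique(y):
        X_c = X[y == label]
        n_clusters = max(1, len(X_c) // 2)
        km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10).fit(X_c)
        X_out.append(km.cluster_centers_)
        y_out.append(np.full(n_clusters, label))
    return np.vstack(X_out), np.concatenate(y_out)
```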
F. Noise removal
The exchange of inconsistent instances of the LTM with consistent instances of the WM might result in the generation of noisy instances, i.e., some instances in the WM might become consistent with the LTM. At each time point, we remove those instances of the WM which could be correctly classified based on the information of the LTM alone.
Figure 3. Resolving the retroactive interference using a working memory incorporated into a multi-memory model. (a) State of the STM at time point $t$; the red star point denotes the most recent instance in the STM. (b) State of the LTM and WM at time point $t$; there are four inconsistent instances in the LTM and four consistent instances in the WM. (c) Consistent instances (in red) and inconsistent instances (in blue) with respect to the most recent instance. (d) State of the LTM and WM at time point $t + 1$; the inconsistent instances in the LTM are exchanged with the consistent instances in the WM.

V. EXPERIMENTS
First, we compare DAM3 with state-of-the-art competitors on a variety of datasets (Section V-A) using appropriate measures and evaluation setup (Section V-B). The results are discussed in Section V-C. Then, we focus on the behavior of our proposed model, namely the class-imbalance ratio of each memory (Section V-D1), the interaction between the different memories, i.e., the cleaning and transfer operations (Section V-D2), and the size of the memories with respect to the minority and majority classes (Section V-D3).

All the experiments were implemented and evaluated in Python using the scikit-multiflow framework [27]. The code and datasets are available on GitHub (https://amir-abolfazli.github.io/DAM3/). For all the competitors, we used the default parameter values reported in the corresponding papers. The parameter values of the competitors, including DAM3, are given in Table I. We used pretrainSize = 200 for all classifiers.

A. Datasets
We experimented with a variety of synthetic and real datasets, summarized in terms of their cardinality, dimensionality, class ratio, and drift type in Table II, and described below.

Table I. Parameter values for DAM3 and the compared classifiers.
Synthetic datasets: Synthetic datasets have the advantage that any desired drift behavior can be explicitly simulated. We used the MOA framework [28] to generate 4 synthetic streams with different types of concept drift. The SEA generator [29] was used to generate two streams: one stream with three sudden drifts and a constant 1:10 IR (SEA_S), and one stream with three gradual drifts where each concept has a different IR, 1:4/1:5/1:2/1:10 (SEA_G). Similarly, the Hyperplane generator [30] was used to simulate two streams with incremental drifts: one with a constant IR of 1:10 (HyperFast), and one with a dynamic IR starting at 1:1 (HyperSlow). For the streams SEA_S and HyperFast (constant IR = 1:10), we included 10% noise, and for the streams SEA_G and HyperSlow (having different IRs and a dynamic IR, respectively), we included 5% noise.
Real-world datasets: Real-world datasets are used to show how well the stream classifiers perform in practice. A few real-world drift benchmarks are available for binary classification, of which we considered Weather, Electricity, and PIMA.

The Weather dataset [31] contains 18,159 instances and 8 features corresponding to measurements such as temperature and wind speed. The goal is to predict whether it will be a rainy day (minority class) or not.

The Electricity dataset [32] contains 45,312 instances and 8 features such as date, demand, and price. The goal is to predict whether the price will increase (minority class) or not, according to the moving average of the last 24 hours.

The PIMA Indian dataset [33] contains 768 instances and 8 features, such as blood pressure, insulin, and age. The goal is to diagnostically predict whether a patient will have diabetes mellitus (minority class) or not, in 1-5 years.
Table II. The characteristics of the datasets used in the experiments.

B. Evaluation setup

1) Performance metrics: An appropriate performance metric takes into account the performance on all classes, rather than the overall performance, which is heavily affected by the majority class. The balanced accuracy is an appropriate performance metric for imbalanced data; for binary classification, it is defined as the arithmetic mean of the sensitivity and specificity [13]. Another performance metric often used for imbalanced data is the geometric mean (G-Mean); for binary classification, G-Mean is defined as the square root of the product of the sensitivity and specificity. G-Mean punishes those models for which there is a big disparity between the sensitivity and specificity, unlike the balanced accuracy, which treats both classes equally [34].
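The two metrics differ only in the mean they take over the same two rates; a small sketch from confusion-matrix counts:

```python
import math

def balanced_accuracy(tp, fn, tn, fp):
    """Arithmetic mean of sensitivity and specificity."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# sensitivity 0.4, specificity 1.0: balanced accuracy is 0.70,
# but G-Mean drops to about 0.63, punishing the disparity
print(balanced_accuracy(40, 60, 100, 0), g_mean(40, 60, 100, 0))
```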
2) Evaluation method:
In data stream classification, the most commonly used evaluation method is the prequential evaluation [35]. The prequential evaluation is specifically designed for streaming settings, where instances arrive in sequential order. The idea is to first test the model on an instance, and then use that instance to update the model. In this way, the model is always tested on instances it has not yet seen. The prequential evaluation is preferred over the traditional holdout evaluation as it makes maximum use of the available data (i.e., no separate test set is needed) [36]. For the experiments, we evaluate the performance of the classifiers using the prequential evaluation and report sensitivity, specificity, G-Mean, and balanced accuracy.
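A generic prequential loop, assuming a hypothetical model object exposing predict and partial_fit methods in the scikit-multiflow style:

```python
def prequential_evaluation(stream, model, pretrain_size=200):
    """First-test-then-train loop: each instance is used for testing
    before it updates the model, so the model is always evaluated on
    instances it has not yet seen."""
    y_true, y_pred = [], []
    for i, (x, y) in enumerate(stream):
        if i >= pretrain_size:            # test first ...
            y_pred.append(model.predict(x))
            y_true.append(y)
        model.partial_fit(x, y)           # ... then train on the same instance
    return y_true, y_pred
```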
C. Predictive performance
In Table III, the predictive performance of DAM3 and the compared classifiers on the different datasets is shown. The proposed method DAM3 outperforms all the compared methods in terms of G-Mean and balanced accuracy. This indicates that DAM3 finds a better trade-off between the sensitivity and specificity than the other methods. To further investigate the differences in the average G-Mean and balanced accuracy (i.e., average ranks) of the compared methods on the considered datasets, we used the post-hoc Bonferroni-Dunn test [37] to compute the critical difference (CD). The results are shown in Figure 4: the performance of DAM3 is significantly better than AdaC2, LB, and SAM-kNN in terms of G-Mean, and significantly better than AdaC2, RUSBoost, and LB in terms of balanced accuracy.
Figure 4. Critical difference diagrams for the post-hoc Bonferroni-Dunn test: (a) G-Mean; (b) balanced accuracy.
Table III. Predictive performance of the classifiers. Classifiers with the best and second-best performance are marked in bold and underlined, respectively.

Ablation study. In DAM3, all the components (drift detector, oversampling, working memory, and weighting of classifiers based on balanced accuracy) work together to mitigate the class imbalance; therefore, showing the impact of each component alone does not make sense. Apart from these components, one of the main differences between DAM3 and SAM-kNN is the use of full Bayes as the STM classifier (instead of the weighted kNN).

In Table IV, we compare the performance of DAM3 in terms of G-Mean and balanced accuracy with the performance of SAM-kNN using both kNN and full Bayes as the STM classifier, and show the difference in performance. The results indicate that the use of full Bayes as the STM classifier in SAM-kNN slightly improves the performance on all datasets (except SEA_S and Weather). The results also show that for all datasets (except Electricity), the superior performance of DAM3 is mainly due to the considered components and not to the use of the full Bayes classifier.
Table IV. Impact of the full Bayes classifier used as the STM classifier for SAM-kNN, compared with the original SAM-kNN and DAM3.

D. Model behavior

The goal of this section is to shed light on the internal mechanisms of DAM3 and their contribution towards tackling both class imbalance and concept drifts.
1) Imbalance perception by the model:
In this section, we examine the IR of the memories of our model, DAM3, compared with the cumulative IR of the stream. We also examine the IR of the memories of SAM-kNN [7], which is the most similar model to ours in terms of architecture. Figure 5 (a) shows the IR of all memories for our model DAM3, compared with SAM-kNN, on the Weather dataset. The drift detector helps the model avoid filling the STM with majority instances; as a result, the IR of the STM in DAM3 is lower than that of SAM-kNN. In (b), the green and red lines correspond to the IRs of the LTM for DAM3 and SAM-kNN, respectively. The lines reveal that the IR reflected by SAM-kNN is significantly higher than the actual IR shown in (d). Moreover, for SAM-kNN, the IR of the LTM gradually increases over time, implying that the majority class becomes increasingly dominant in the LTM. Unlike SAM-kNN, DAM3 reflects a balanced representation of classes (IR ≈ 1). In (c), the purple line shows the IR of the WM, which corresponds to the instances that are inconsistent with the LTM.

Figure 5. The IR of the memories of DAM3 compared with that of SAM-kNN on the Weather dataset.
2) Removal and transfer of instances in LTM and WM:
In this section, we demonstrate the "removal" and "transfer" of information within the long-term and working memories. In Figure 6, (a) and (b) show the number of inconsistent minority and majority instances removed from the LTM and transferred to the WM in DAM3. (c) and (d) show the number of consistent minority and majority instances removed from the WM and transferred to the LTM in DAM3. (e) and (f) show the number of inconsistent minority and majority instances which are removed from the LTM in SAM-kNN. Since we used kNN with k = 5, there could be at most 5 inconsistent instances to be removed and transferred at each time point. Comparing subfigures (a) and (b) with (e) and (f) shows that DAM3 removes fewer minority instances compared with SAM-kNN. This is supported by subfigure (b) in Figure 5, showing the IR of the LTM (in red), where the IR increases gradually, implying that SAM-kNN removes more minority instances over time. Subfigures (c) and (d) show the number of removed and transferred consistent instances of the WM for both the minority and majority classes. Both (c) and (d) show a similar behavior, revealing that there are some consistent instances that could be transferred back almost all the time. This means that DAM3 resolves the problem of retroactive interference, which impedes the model's ability to retrieve the old minority instances.
Figure 6. The number of removed and transferred instances from the LTM to the WM and vice versa for DAM3, compared with the number of removed inconsistent instances from the LTM for SAM-kNN, on the Weather dataset.
3) Size of memories:
In Figure 7, (a) and (b) correspond to the size of the STM with respect to the minority and majority classes for the models DAM3 and SAM-kNN, respectively. Both subfigures show a similar behavior; however, the size of the STM in DAM3 is, on average, 32% smaller than that of SAM-kNN, due to the use of the drift detector. In (c), the number of majority instances in the LTM for SAM-kNN (blue line) gradually increases while the number of minority instances (red line) remains almost the same. The behavior is completely different for DAM3, where the number of minority instances is almost equal to the number of majority instances. The sudden drops in the size of the LTM correspond to the times at which compression occurs. In (d), the green and red lines correspond to the number of minority and majority instances, respectively, in the WM.

Figure 7. The size of all memories with respect to the minority and majority classes for both models, DAM3 and SAM-kNN, on the Weather dataset.
VI. CONCLUSION
In this paper, we proposed the Drift-Aware Multi-Memory Model (DAM3), designed to mitigate the class imbalance in dual-memory models that dedicate short-term and long-term memories to the current and former concepts, respectively. DAM3 mitigates the class imbalance by incorporating an imbalance-sensitive drift detector, preserving a balanced representation of classes in the long-term memory, resolving the retroactive interference using a working memory that prevents the removal of old information, and weighting the classifiers induced on the different memories based on their balanced accuracy. Our experimental results showed that the proposed method outperforms the state-of-the-art methods in terms of G-Mean and balanced accuracy. For future work, we intend to design a multi-memory model that deals with recurring drifts.

ACKNOWLEDGMENT
The work of the first author was supported by the German Research Foundation (DFG) within the projects OSCAR (Opinion Stream Classification with Ensembles and Active leaRners) and HEPHAESTUS (Machine learning methods for adaptive process planning of 5-axis milling), for both of which the second author is a principal investigator.

REFERENCES
[1] G. A. Carpenter and S. Grossberg, "ART 2: Self-organization of stable category recognition codes for analog input patterns," Applied Optics, vol. 26, no. 23, pp. 4919-4930, 1987.
[2] M. Mermillod, A. Bugaiska, and P. Bonin, "The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects," Frontiers in Psychology, vol. 4, p. 504, 2013.
[3] B. Wang and J. Pineau, "Online bagging and boosting for imbalanced data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3353-3366, 2016.
[4] Y. Lu, Y.-M. Cheung, and Y. Y. Tang, "Dynamic weighted majority for incremental learning of imbalanced data streams with concept drift," in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 2393-2399.
[5] H. Zhang, W. Liu, S. Wang, J. Shan, and Q. Liu, "Resample-based ensemble framework for drifting imbalanced data streams," IEEE Access, vol. 7, pp. 65103-65115, 2019.
[6] B. Krawczyk and M. Wozniak, "Weighted naive Bayes classifier with forgetting for drifting data streams." IEEE, 2015, pp. 2147-2152.
[7] V. Losing, B. Hammer, and H. Wersing, "KNN classifier with self adjusting memory for heterogeneous concept drift." IEEE, 2016, pp. 291-300.
[8] H. Yu and G. I. Webb, "Adaptive online extreme learning machine by regulating forgetting factor by concept drift map," Neurocomputing, vol. 343, pp. 141-153, 2019.
[9] Z. Sosic-Vasic, K. Hille, J. Kröner, M. Spitzer, and J. Kornmeier, "When learning disturbs memory: temporal profile of retroactive interference of learning on memory formation," Frontiers in Psychology, vol. 9, p. 82, 2018.
[10] A. Diamond, "Executive functions," Annual Review of Psychology, vol. 64, pp. 135-168, 2013.
[11] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys (CSUR), vol. 46, no. 4, pp. 1-37, 2014.
[12] A. Orriols-Puig and E. Bernadó-Mansilla, "Evolutionary rule-based systems for imbalanced data sets," Soft Computing, vol. 13, no. 3, p. 213, 2009.
[13] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from Imbalanced Data Sets. Springer, 2018.
[14] E. Ntoutsi, A. Zimek, T. Palpanas, P. Kröger, and H.-P. Kriegel, "Density-based projected clustering over high dimensional data streams," in SIAM SDM. SIAM, 2012, pp. 987-998.
[15] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in Proceedings of the 15th ACM SIGKDD, 2009, pp. 139-148.
[16] N. C. Oza, "Online bagging and boosting," vol. 3. IEEE, 2005, pp. 2340-2345.
[17] A. Bifet, G. Holmes, and B. Pfahringer, "Leveraging bagging for evolving data streams," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 135-150.
[18] J. F. Díez-Pastor, J. J. Rodríguez, C. I. García-Osorio, and L. I. Kuncheva, "Diversity techniques improve the performance of the best imbalance learning ensembles," Information Sciences, vol. 325, pp. 98-117, 2015.
[19] D. P. Melidis, M. Spiliopoulou, and E. Ntoutsi, "Learning under feature drifts in textual streams," in ACM CIKM, 2018, pp. 527-536.
[20] V. Unnikrishnan, C. Beyer, P. Matuszyk, U. Niemann, R. Pryss, W. Schlee, E. Ntoutsi, and M. Spiliopoulou, "Entity-level stream classification: exploiting entity similarity to label the future observations referring to an entity," International Journal of Data Science and Analytics, vol. 9, no. 1, pp. 1-15, 2020.
[21] V. Iosifidis and E. Ntoutsi, "FABBOO: online fairness-aware learning under class imbalance," in Discovery Science, 2020.
[22] R. C. Atkinson and R. M. Shiffrin, "The control of short-term memory," Scientific American, vol. 225, no. 2, pp. 82-91, 1971.
[23] A. D. Baddeley and G. Hitch, "Working memory," in Psychology of Learning and Motivation. Elsevier, 1974, vol. 8, pp. 47-89.
[24] M. R. Berthold, C. Borgelt, F. Höppner, and F. Klawonn, Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data. Springer Science & Business Media, 2010.
[25] Z. Qin, A. T. Wang, C. Zhang, and S. Zhang, "Cost-sensitive classification with k-nearest neighbors," in International Conference on Knowledge Science, Engineering and Management. Springer, 2013, pp. 112-131.
[26] H. Han, W.-Y. Wang, and B.-H. Mao, "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning," in International Conference on Intelligent Computing. Springer, 2005, pp. 878-887.
[27] J. Montiel, J. Read, A. Bifet, and T. Abdessalem, "Scikit-multiflow: A multi-output streaming framework," The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2915-2924, 2018.
[28] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive online analysis," Journal of Machine Learning Research, vol. 11, pp. 1601-1604, 2010.
[29] W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," in ACM SIGKDD, 2001, pp. 377-382.
[30] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in Proceedings of the Ninth ACM SIGKDD, 2003, pp. 226-235.
[31] R. Elwell and R. Polikar, "Incremental learning of concept drift in nonstationary environments," IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517-1531, 2011.
[32] M. Harries and N. S. Wales, "Splice-2 comparative evaluation: Electricity pricing," 1999.
[33] J. W. Smith, J. Everhart, W. Dickson, W. Knowler, and R. Johannes, "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus," in Proceedings of the Annual Symposium on Computer Application in Medical Care. American Medical Informatics Association, 1988, p. 261.
[34] V. García, R. A. Mollineda, and J. S. Sánchez, "Index of balanced accuracy: A performance measure for skewed class distributions," in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2009, pp. 441-448.
[35] J. Gama, R. Sebastião, and P. P. Rodrigues, "On evaluating stream learning algorithms," Machine Learning, vol. 90, no. 3, pp. 317-346, 2013.
[36] A. Bifet, R. Gavaldà, G. Holmes, and B. Pfahringer, Machine Learning for Data Streams: with Practical Examples in MOA. MIT Press, 2018.
[37] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.