An Incremental Construction of Deep Neuro Fuzzy System for Continual Learning of Non-stationary Data Streams
Mahardhika Pratama
School of Computer Science and Engineering, Nanyang Technological University, Singapore
[email protected]

Witold Pedrycz
Department of Electrical and Computer Engineering, University of Alberta, Canada
[email protected]

Geoffrey I. Webb
Faculty of Information Technology, Monash University, Australia
[email protected]

December 10, 2019

Abstract
Existing fuzzy neural networks (FNNs) are mostly developed under a shallow network configuration having lower generalization power than deep structures. This paper proposes a novel self-organizing deep fuzzy neural network, namely the deep evolving fuzzy neural network (DEVFNN). Fuzzy rules can be automatically extracted from data streams or removed if they play a limited role during their lifespan. The structure of the network can be deepened on demand by stacking additional layers using a drift detection method which not only detects the covariate drift, variations of the input space, but also accurately identifies the real drift, dynamic changes of both feature space and target space. DEVFNN is developed under the stacked generalization principle via the feature augmentation concept, where a recently developed algorithm, namely the Generic Classifier (gClass), drives the hidden layers. It is equipped with an automatic feature selection method which controls activation and deactivation of input attributes to induce varying subsets of input features. A deep network simplification procedure is put forward using the concept of hidden layer merging to prevent uncontrollable growth of input space dimensionality due to the nature of the feature augmentation approach in building a deep network structure. DEVFNN works in a sample-wise fashion and is compatible with data stream applications. The efficacy of DEVFNN has been thoroughly evaluated using seven datasets with non-stationary properties under the prequential test-then-train protocol. It has been compared with four popular continual learning algorithms and its shallow counterpart, where DEVFNN demonstrates improvement of classification accuracy. Moreover, it is also shown that the concept drift detection method is an effective tool to control the depth of the network structure, while the hidden layer merging scenario is capable of simplifying the network complexity of a deep network with negligible compromise of generalization performance.

Keywords: Deep Neural Networks · Data Streams · Online Learning · Fuzzy Neural Network

∗ This paper has been published in IEEE Transactions on Fuzzy Systems.
Deep neural networks (DNNs) have gained tremendous success in many real-world problems because their deep network structure enables them to learn complex feature representations [1]. The structure is constructed by stacking multiple hidden layers or classifiers to produce a high-level abstraction of input features, which improves a model's generalization. There exists a certain point where the introduction of extra hidden nodes or models in a wide configuration has negligible effect on generalization power. The power of depth has been theoretically proven [2] with examples of simple functions in d-dimensional feature space that can be modelled with ease by a simple three-layer feedforward neural network but cannot be approximated by a two-layer feedforward neural network up to a certain accuracy level unless the number of hidden nodes is exponential in the dimension. Despite these advantages, the success of DNNs relies on a static network structure blindly selected by trial-and-error approaches or search methods [3]. Although a very deep network architecture is capable of delivering satisfactory predictive accuracy, this approach incurs excessive computational cost. An over-complex DNN is also prone to the so-called vanishing gradient [4] and diminishing feature reuse [5] problems. In addition, it calls for a large number of training samples to ensure convergence of all network parameters.

To address the issue of a fixed and static model, model selection of DNNs has attracted growing research interest. It aims to develop DNNs with an elastic structure featuring stochastic depth which can be adjusted to suit the complexity of given problems [6]. In addition, a flexible structure paradigm also eases gradient computation and addresses the well-known issues of diminishing feature reuse and vanishing gradients. This approach starts from an over-complex network structure followed by a complexity reduction scenario via dropout, bypass, highway [5], hedging [7], merging [8], regularizers [9], etc. Another approach utilizes the idea of knowledge transfer or distillation [10]. That is, the training process is carried out using a very deep network structure while a shallow network structure is deployed during the testing phase. Nonetheless, this approach is not scalable for data stream applications because most such methods are built upon an iterative training process. Notwithstanding that [7] features an online working principle, it starts its training process with a very deep network architecture and utilizes the hedging concept, which opens a direct link between each hidden layer and the output layer. The final output is determined from an aggregation of each layer's output. This strategy has a strong relationship to the weighted voting concept.

The concept of DNNs was introduced into fuzzy systems in [11, 12, 13], making use of the stacked generalization principle [14]. Two different architectures, namely random shift [11] and feature augmentation [12, 13], are adopted to create a deep neuro-fuzzy structure. It is also claimed that fuzzy rule interpretability is not compromised under these two structures because intermediate features still carry the same physical meaning as the original input attributes. Similar work is done in [15], but the difference lies in the parameter learning aspect, adopting an online learning procedure instead of a batch learning module. These works, however, rely on a fixed and static network configuration which calls for prior domain knowledge to determine the network architecture.
In [16], a deep fuzzy rule-based system is proposed for image classification. It is built upon a four-layered network structure where the first three layers consist of a normalization layer, a scaling layer and a feature descriptor layer, while the final layer is a fuzzy rule layer. Unlike [11, 12, 13, 7], this algorithm is capable of self-organizing its fuzzy rules in the fuzzy rule layer, but the network structure still has a fixed depth (four layers). Although the area of deep learning has grown at a high pace, the issue of data stream processing or continual learning remains open in the existing deep learning literature.

A novel incremental DNN, namely the Deep Evolving Fuzzy Neural Network (DEVFNN), is proposed in this paper. DEVFNN features a fully elastic structure where not only its fuzzy rules can be autonomously evolved but also the depth of the network structure can be adapted in a fully automatic manner [17]. This property is capable of handling dynamic variations of data streams while also delivering continuous improvement of predictive performance. The deep structure of DEVFNN is built upon the stacked generalization principle via an augmented feature space, where each layer consists of a local learner and is inter-connected through augmentation of the feature space [12, 13]. That is, the output of the previous layer is fed as new input information to the next layer. A meta-cognitive Scaffolding learner, namely the Generic Classifier (gClass), is deployed as the local learner, the main driving force of each hidden layer, because it not only has an open structure and works in a single-pass fashion but also answers two key issues: what-to-learn and when-to-learn [18]. The what-to-learn module is driven by an active learning scenario which estimates the contribution of data points and selects important samples for model updates, while the when-to-learn module controls when the rule premise of the multivariate Gaussian rule is updated. gClass is structured under a generalized Takagi-Sugeno-Kang (TSK) fuzzy system which incorporates the multivariate Gaussian function as the rule premise and the concept of the functional link neural network (FLANN) [19] as the rule consequent. The multivariate Gaussian function generates non-axis-parallel ellipsoidal clusters, while the FLANN expands the degree of freedom (DoF) of the rule consequent via an up-to-second-order Chebyshev series, improving the mapping capability [20].

The major contribution of this paper is elaborated as follows:
• Elastic Deep Neural Network Structure: DEVFNN is structured as a deep stacked network architecture inspired by the stacked generalization principle. Unlike the original stacked generalization principle having two layers, DEVFNN's network structure is capable of being very deep with the use of the feature augmentation concept. This approach adopts the stacked deep fuzzy neural network concept in [12], where the feature space grows from the bottom layer to the top one by incorporating the outputs of previous layers as extra input information. DEVFNN differs from [12, 13, 15] in that it characterizes a fully flexible network structure targeted to address the requirements of continual learning [1]. This property is capable of expanding the depth of the DNN whenever a drift is identified, to adapt to rapidly changing environments. The use of a drift detection method for the layer growing mechanism generates different concepts across layers, supposed to induce continuous refinement of generalization power. The elastic characteristic of DEVFNN is borne out by the introduction of a hidden layer merging mechanism as a deep structure simplification approach which shrinks the depth of the network structure on the fly. This mechanism focuses on redundant layers having high mutual information, which are coalesced with minor cost to predictive accuracy.

• Dynamic Feature Space Paradigm: an online feature selection scenario is integrated in the DEVFNN learning procedure and enables the use of different input combinations for each sample. This scenario enables flexible activation and deactivation of input attributes across different layers, which prevents the exponential increase of input dimension, the main drawback of the feature augmentation approach. As with [21, 22], the feature selection process is carried out with binary weights (0 or 1) determined from the relevance of input features to the target concept. Such a feature selection method opens the likelihood of previously deactivated features becoming active again whenever their relevance is substantiated by the current data trend. Moreover, the dynamic feature space paradigm is also realized using the concept of the hidden layer merging method, which functions as a complexity reduction approach. The use of the hidden layer merging approach has minor compression loss because one layer can be completely represented by another hidden layer.

• An Evolving Base Building Unit: DEVFNN is constructed from a collection of Generic Classifiers (gClass) [18] hierarchically connected in tandem. gClass functions as the underlying component of DEVFNN and operates in every layer. gClass features an evolving and adaptive trait where its structural construction process is fully automated. That is, its fuzzy rules can be automatically generated and pruned on the fly. This property handles local drift better than a non-evolving base building unit because its network structure can expand on the fly. The prominent trait of gClass lies in the use of an online active learning scenario as the what-to-learn part of the metacognitive learner, which supports reduction of training samples and labeling cost. This strategy differs from [23] since the sample selection process is undertaken in a decentralized manner and not in the main training process of DEVFNN. Moreover, the how-to-learn scenario is designed in accordance with the three learning pillars of Scaffolding theory [24]: fading, problematizing and complexity reduction, which processes data streams more efficiently than conventional self-evolving FNNs due to additional learning modules: a local forgetting mechanism, rule pruning and recall mechanisms, etc.

• Real Drift Detection Approach: a concept drift detection method is adopted to control the depth of the network structure. This idea is motivated by the fact that a hidden layer should produce an intermediate representation which reveals the hidden structure of data samples through multiple linear mappings. Based on the recent study [25], it is outlined from the hyperplane perspective that the number of response regions has a direct correlation to a model's generalization power, and a DNN is more expressive than a shallow network simply because it has a much higher number of response regions. In the realm of deep stacked networks, we interpret a response region as the amount of unique information a base building unit carries. In other words, the drift detection method paves a way to come up with a diverse collection of hidden layers or building units. Our drift detection method is a derivation of the Hoeffding-bound-based drift detection method in [26] but differs in that an accuracy matrix which corresponds to the prequential error is used in lieu of sample statistics [27]. This modification targets detection of real drift, which moves the shape of the decision boundary. Another salient aspect of DEVFNN's drift detector lies in the confidence level of the Hoeffding bound, which takes into account sample availability via an exponentially decreasing confidence level. The exponentially decreasing confidence parameter links the depth of the network structure and sample availability. This strategy reflects the fact that the depth of a DNN should be adjusted to the number of training samples, as is well known from the deep learning literature. A shallow network is generally preferred for small datasets because it ensures fast model convergence, whereas a deep structure is well-suited for large datasets.

• Adaptation of Voting Weight: a dynamic weighting scenario is implemented in the adaptive voting mechanism of DEVFNN and generates unique decaying factors for every hidden layer. Our innovation in DEVFNN lies in the use of dynamic penalty and reward factors which enable the voting weight to rise and decline at different rates. It is inspired by the fact that the voting weight of a relevant building unit should decrease slowly when making a misclassification, whereas that of a poor building unit should increase slowly when returning a correct prediction. This scenario improves on the static decreasing factor as actualized in pENsemble [28], pENsemble+ [23] and DWM [29], which imposes too violent fluctuations of the voting weights.

The novelty of this paper is summed up in four facets: 1) it contributes a methodology for building the structure of DNNs, which to the best of our knowledge remains a challenging and open issue; 2) it offers a DNN variant which can be applied for data stream processing; 3) it puts forward an online complexity reduction mechanism for DNNs based on the hidden layer merging strategy and the online feature selection method; 4) it introduces a dynamic weighting strategy with a dynamic decaying factor. The efficacy of DEVFNN has been numerically validated using seven synthetic and real-world datasets possessing non-stationary characteristics under the prequential test-then-train approach, a standard evaluation procedure for data stream algorithms. DEVFNN has also been compared with its static version and other online learning algorithms, where DEVFNN delivers more encouraging performance in terms of accuracy and sample consumption than its counterparts while imposing comparable computational and memory burden. This paper is organized as follows: Section 2 outlines the network architecture of DEVFNN and its hidden layer, gClass; Section 3 elaborates the learning policy of DEVFNN encompassing the hidden layer growing strategy, the hidden layer merging strategy and the online feature selection strategy; Section 4 offers a brief summary of the gClass learning policy; Section 5 describes the numerical study and comparison of DEVFNN; some concluding remarks are drawn in the last section of this paper.
DEVFNN is deployed in a continual learning environment to handle data streams C = [C^1, C^2, ..., C^T] which continuously arrive in the form of data batches over T time stamps. In practice, the number of time stamps is unknown and possibly infinite. A data batch C^t = [X_1, X_2, ..., X_P] ∈ R^(P×n) is set to be equal-sized in every time stamp, although the size of data batches may also vary across different time stamps; n stands for the number of input attributes. In the realm of online learning environments, it is impractical to assume direct access to the true class label vector Y^t = [y_1, y_2, ..., y_P] ∈ R^(P×m), and the labelling process often incurs some cost subject to the existence of ground truth or expert knowledge; m here stands for the number of target classes. This fact confirms the suitability of the prequential test-then-train scenario as the evaluation procedure for continual learning. The 1-0 encoding scheme is applied to form a multi-output target matrix. For instance, if a data sample lies in class 1 and there exist in total three classes, the 1-0 encoding scheme produces a target vector [1, 0, 0], while class 2 leads to [0, 1, 0]. The issue of labelling cost is tackled with the use of a metacognitive learner as the driving force of the hidden layer, having an online active learning scenario as a part of its what-to-learn phase.

Data streams are inherent to the problem of concept drift, where data batches do not follow static and predictable data distributions, P(Y|X)_t ≠ P(Y|X)_{t-1}, well known to come in two typical types: real and covariate. The real drift is generally more challenging to handle than the covariate drift because variations of input data change the shape of the decision boundary, usually resulting in classification errors. This problem cannot be solved by inspecting the statistics of input data as proposed in [26]. One approach to capture the presence of concept drift is by constructing an accuracy metric which summarizes the predictive performance of a classifier. A decreasing trend of a classifier's accuracy is a strong indicator of concept drift [27]. This issue is resolved here by the use of an accuracy matrix which advances the idea of the accuracy matrix in [27] with an adaptive windowing scheme based on the cutting point concept. The presence of concept drift also calls for innovation in the construction of the deep network structure because every layer is supposed to represent a different concept rather than a different level of data abstraction. DEVFNN handles this issue with a direct connection of each hidden layer to the output layer under the deep stacked network structure with the feature augmentation approach, which allows every layer to contribute toward the final classification decision. Specifically, a weighted voting scheme is implemented where the voting weight is derived from a reward and punishment approach with dynamic decaying rates. The dynamic decaying rate sets unique reward and penalty factors, giving added flexibility according to the relevance of every hidden layer. On the other hand, the feature augmentation method in building a deep network structure risks the curse of dimensionality due to the expansion of the input dimension with the network depth. The network simplification procedure is realized with the hidden layer merging procedure and the online feature selection scenario. Fig. 1 delineates the working principle of DEVFNN.
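For concreteness, the sketch below illustrates the prequential test-then-train loop and the 1-0 target encoding described above. It is a minimal illustration rather than the DEVFNN implementation; the model object with per-batch predict and fit methods is a hypothetical stand-in.

import numpy as np

def one_hot(y, m):
    """1-0 encoding: class index c (0-based) of m classes becomes a unit vector, e.g. class 1 of 3 -> [1, 0, 0]."""
    Y = np.zeros((len(y), m))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def prequential_run(model, batches, m):
    """Prequential test-then-train: every data batch is first used to test, then to train."""
    accs = []
    for X, y in batches:               # batches arrive as (X_t, y_t) chunks over T time stamps
        y_hat = model.predict(X)       # test first, before the labels are revealed
        accs.append(np.mean(y_hat == y))
        model.fit(X, one_hot(y, m))    # then train once the true class labels become available
    return np.mean(accs), accs         # report the average over all data batches

The returned average over batches matches how the numerical results are reported later in the paper.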
This section elaborates on the network architecture of DEVFNN, including both the deep network structure and the hidden layer structure.

Figure 1: Learning Architecture of DEVFNN
The hidden layers of DEVFNN are constructed by Generic Classifiers (gClass) [18] working cooperatively in tandem to produce the final output. gClass realizes a generalized TSK fuzzy inference procedure where a multivariate Gaussian function with a non-diagonal covariance matrix is used in the rule layer, while the consequent layer adopts the concept of the expansion block of the FLANN via a nonlinear mapping of the up-to-second-order Chebyshev series [20]. The multivariate Gaussian rule generates a better input space partition than classical rules using the dot product t-norm operator because it enables rotation of the ellipsoidal cluster [30]. This trait allows a reduction of fuzzy rule demand, especially when data samples are not spanned along the main axes. The use of the FLANN idea in the consequent layer [19] aims to improve the approximation power because it has a higher DoF than the first-order TSK rule consequent. Several nonlinear mapping functions such as polynomial power functions, trigonometric functions, etc. can be applied as the expansion block, but the up-to-second-order Chebyshev function is utilized here because it incurs fewer free parameters than a trigonometric expansion yet exhibits better approximation power than other polynomial functions of the same order [31].

The fuzzy rule of gClass is formally expressed as follows:

R_i: IF X_k is N_i(X_k; C_i, A_i^{-1}) THEN y~_i = Φ W_i = W_{0,i} T_0 + W_{1,i} T_1(x_1) + W_{2,i} T_2(x_1) + ... + W_{2n-1,i} T_1(x_n) + W_{2n,i} T_2(x_n)

where X_k ∈ R^(1×n) denotes the input vector at the k-th time instant and N_i(X_k; C_i, A_i^{-1}) stands for the multivariate Gaussian function which corresponds to the rule premise of the i-th fuzzy rule. y~_i labels the rule consequent of the i-th rule constructed by the up-to-second-order Chebyshev polynomial expansion, where T_0(.), T_1(.), T_2(.) refer to the zero-, first- and second-order Chebyshev polynomials, respectively. The multivariate Gaussian function generates the firing strength of the i-th fuzzy rule, which reveals its degree of compatibility, as follows:

φ_i = exp(-(X - C_i) A_i^{-1} (X - C_i)^T)    (1)

where C_i ∈ R^(1×n) denotes the center of the multivariate Gaussian function, A_i^{-1} ∈ R^(n×n) stands for the non-diagonal inverse covariance matrix, and R, n are respectively the number of fuzzy rules and input features. The non-diagonal covariance matrix makes it possible to describe inter-relations among input variables, which vanish in the case of a diagonal covariance matrix, and steers the orientation of the ellipsoidal cluster. On the other hand, the rule consequent of gClass is built upon the nonlinear mapping of the up-to-second-order Chebyshev series defined as y~_i = Φ W_i, where Φ ∈ R^(1×(2n+1)) is the output of the functional expansion block and W_i ∈ R^((2n+1)×1) is the weight vector of the i-th rule. Φ is produced by the up-to-second-order Chebyshev function expressed as follows:

T_{n+1}(x_j) = 2 x_j T_n(x_j) - T_{n-1}(x_j)    (2)

Since the Chebyshev series is expanded up to the second order, T_0(x_j) = 1, T_1(x_j) = x_j, T_2(x_j) = 2 x_j^2 - 1.
Suppose that a predictive task is navigated by two input attributes; the functional expansion block Φ is crafted as follows:

Φ = [T_0, T_1(x_1), T_2(x_1), T_1(x_2), T_2(x_2)]    (3)

Φ = [1, x_1, 2 x_1^2 - 1, x_2, 2 x_2^2 - 1]    (4)

The concept of the FLANN in the framework of the TSK fuzzy system improves the approximation capability of the zero- or first-order TSK rule consequent because it enhances the degree of freedom of the rule consequent, although it does draw around n extra network parameters per rule. Under the MIMO structure, the rule consequent incurs (2n + 1) × R × m parameters, while the rule premise of gClass imposes (n × n) × R + (n × R).

The output of gClass is inferred using the weighted average operation of the rule firing strengths and the outputs of the functional expansion block as follows:

y_o = Σ_{i=1}^{R} φ_i y~_i / Σ_{i=1}^{R} φ_i = Σ_{i=1}^{R} exp(-(X - C_i) A_i^{-1} (X - C_i)^T) y~_i / Σ_{i=1}^{R} exp(-(X - C_i) A_i^{-1} (X - C_i)^T)    (5)

gClass is formed in the MIMO architecture possessing a class-specific rule consequent [32, 33], addressing the issue of class overlapping better than the popular one-versus-rest architecture because each class is handled separately. Because gClass generates multiple outputs, the final classification decision is obtained by applying the maximum operation to find the most confident prediction as follows:

y = max_{1≤o≤m} y_o    (6)

where m is the number of target classes. The MIMO structure is also more stable than the direct regression approach hampered by the class shifting issue [32, 33]. This problem is caused by the smooth transition of the TSK fuzzy system, which cannot cope with a dramatic class change.

Figure 2: Deep Stacked Network Structure with Feature Augmentation

DEVFNN is realized in a deep stacked network structure where the hidden layers are connected in series. Every hidden layer except the first accepts the predictive outputs of previous layers as input attributes in addition to the original input patterns [12]. This trait retains the physical meaning of the original input variables, which is often lost in conventional DNN architectures due to multiple nonlinear mappings. This aspect also implies that there exists a growth of the input space across layers. That is, the deeper the hidden layer position, the higher the input dimension, since the layer is supposed to receive all predictive outputs of preceding layers as extra input features. This structure is inspired by [12, 13], where the concept of feature augmentation is introduced in building a DNN architecture. Our approach extends it with the introduction of a structural learning process which fully automates the construction of the deep stacked network via online analysis of dynamic data streams. Moreover, our approach is equipped with the online feature selection and hidden layer merging scenarios which dynamically compress the input space of each layer. This mechanism replaces the random feature selection of the original work in [12, 13]. Our approach also operates in a one-pass learning scenario compatible with the online learning setting, whereas [12, 13] still apply an offline working scenario. The network architecture of DEVFNN is shown in Fig. 2.

DEVFNN works on a chunk-by-chunk basis where each chunk is discarded once learned and the chunk size may vary in different time stamps. Suppose that a data chunk of size P, C^t = (X_k, Y_k), k = 1, ..., P, is received at the t-th time stamp while DEVFNN is composed of D hidden layers built upon gClass. The first hidden layer receives the original input representation with no feature augmentation, while the second hidden layer is presented with the following input attributes:

X^_k = [X_k, Y^_k^1]    (7)

where X^_k ∈ R^(P×(n+m)) consists of the input feature vector and the predictive output of the first layer. The additional m input attributes are incurred because gClass is structured under the MIMO architecture deploying a class-specific rule consequent. For the d-th hidden layer, its input features are expressed as X^_k^d = [X_k, Y^_k^1, ..., Y^_k^{d-1}], k = 1, ..., P, with X^_k^d ∈ R^(P×(n+m×(d-1))). This situation also implies that the dimensionality of the input space increases linearly with the depth of the network structure. This issue is coped with by the self-organizing property of DEVFNN, featuring a controlled growth of the network depth by means of the drift detection method which assures the right network complexity for a given problem. In addition, the hidden layer merging and online feature selection mechanisms further contribute to compression of the input dimension to handle a possible curse of dimensionality.

The deep stacked network through feature augmentation is consistent with the stacked generalization principle [14], which is capable of achieving improved generalization performance since the manifold structure of the original feature space is constantly opened. In addition, each hidden layer functions as a discriminative layer producing a predicted class label. The predicted class label delivers an effective avenue to open the manifold structure of the original input space to the next hidden layer, supposed to improve generalization power.

The end output of DEVFNN is produced by a weighted voting scheme where the voting weight of a hidden layer increases and decreases dynamically with respect to its predictive accuracy. This strategy reflects the fact that each hidden layer is constructed with a different concept navigated by the drift detection scenario. That is, a hidden layer is rewarded when returning a correct prediction by enhancing its voting weight, while being punished by lowering its voting weight when incurring a misclassification. This reduces the influence of an irrelevant classifier, which is supposed to carry low voting power due to the penalty resulting from misclassification [28]. It is worth noting that the predictive performance of hidden layers is examined against the prequential error when true class labels are gathered. The prequential test-then-train scenario is simulated in our numerical study. It enables the creation of a real data stream environment where data streams are predicted first and used for model updates once the true class labels become available. This scenario is more plausible in capturing real data stream environments in practice, in which true class labels often cannot be immediately fed by the operator. Furthermore, DEVFNN is composed of independent building blocks interconnected with the feature augmentation approach, meaning that each hidden layer is trained independently with the same cost function and outputs the same target variables. The final output of DEVFNN is formalized as follows:

Class_final = max_{1≤o≤m} σ_o    (8)

where σ_o denotes the accumulated weight of a target class resulting from the predicted class labels of the hidden layers, accumulated as σ_o = σ_o + χ_d^o; χ_d^o stands for the voting weight of the d-th hidden layer predicting the o-th class label.
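The inference chain (1)-(6) and the feature augmentation rule (7) can be summarized compactly. The following sketch is a simplified single-sample illustration under the definitions above; the array shapes and helper names are assumptions for exposition, not the released MATLAB implementation.

import numpy as np

def chebyshev_expand(x):
    """Functional expansion block (2)-(4): Phi = [1, T1(x_1), T2(x_1), ..., T1(x_n), T2(x_n)]."""
    phi = [1.0]
    for xj in x:
        phi += [xj, 2.0 * xj**2 - 1.0]     # T1(x) = x, T2(x) = 2x^2 - 1
    return np.array(phi)                   # length 2n + 1

def firing_strengths(x, centers, inv_covs):
    """Multivariate Gaussian firing strength (1) for each of the R rules."""
    return np.array([np.exp(-(x - c) @ A_inv @ (x - c))
                     for c, A_inv in zip(centers, inv_covs)])

def gclass_output(x, centers, inv_covs, W):
    """Weighted-average defuzzification (5); W has shape (R, 2n+1, m) under the MIMO structure."""
    lam = firing_strengths(x, centers, inv_covs)
    ext = chebyshev_expand(x)
    y_tilde = np.array([ext @ W_i for W_i in W])          # rule consequents, shape (R, m)
    y = (lam[:, None] * y_tilde).sum(axis=0) / (lam.sum() + 1e-12)
    return y, int(np.argmax(y))            # class decision via the maximum operator (6)

def augment(x, prev_layer_outputs):
    """Feature augmentation (7): the d-th layer sees [x, y^1, ..., y^(d-1)]."""
    return np.concatenate([x] + prev_layer_outputs)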
The weighted voting procedure aims to extract the most relevant subset of hidden layers for the C^t data batch. That is, each hidden layer is set to carry a different concept and is triggered when the closest concept comes into the picture. In addition, the tuning phase is restricted to the winning hidden layer in order to induce a stable concept for every hidden layer. That is, irrelevant hidden layers are frozen from the tuning phase when they represent different concepts. The winning hidden layer can be determined from its voting weight, where a high voting weight indicates strong relevance of a hidden layer to the current concept.

This section elaborates the learning policy of DEVFNN at the deep structure level. An overview of the DEVFNN learning procedure is outlined in Algorithm 1. DEVFNN initiates its learning process with the adaptation of voting weights with dynamic decaying factors. The tuning of the voting weight results in a unique voting weight for each hidden layer with respect to its relevance to the current concept, measured from its prequential error. This method can also be interpreted as a soft structural simplification method, since it is capable of ruling out hidden layers from the training process by suppressing their voting weights to a small value. The training procedure continues with the online feature selection procedure, which aims to cope with the curse of dimensionality due to the growing-feature-space bottleneck of the deep stacked network structure. The feature selection is performed based on the idea of feature relevance, examining the sensitivity of input variables with respect to the target classes. The hidden layer merging mechanism is put forward to capture redundant hidden layers and reduces the depth of the network structure by coalescing highly correlated layers. Furthermore, the drift detection mechanism is applied to deepen the deep stacked network if changing system dynamics are observed in the data streams. The drift detection module is designed using the concept of the Hoeffding bound [26, 27], which determines a confidence level to flag a drift. There exist three conditions resulting from the drift detector: stable, warning and drift. The drift phase leads to the addition of a hidden layer, while the warning phase accumulates data samples into a buffer later used to create a new hidden layer if a drift is detected. The stable phase simply updates the winning hidden layer, determined from its voting weight. In what follows, we elaborate on the working principle of DEVFNN:
The voting weight is dynamically adjusted by a unique decaying factor for each layer, ρ_d ∈ [0, 1], which plays a major role in adapting to concept drift. A low value of ρ_d leads to slow adaptation to rapidly changing conditions but works very well in the presence of gradual or incremental drift, where the drift is not too obvious initially. A high value of ρ_d adapts to sudden drift more timely than a low value but compromises stability under gradual drift, where data samples are drawn from a mixture of two distributions during the transition period. Considering this issue, ρ_d is not kept fixed; rather, it is continuously adjusted to reflect the learning performance of a hidden layer. The decaying factor ρ_d is tuned using a step size π as follows:

ρ_d = ρ_d ± π    (9)

This mechanism underpins a flexible voting scenario where the decaying factor mirrors a hidden layer's compatibility with existing data streams. In other words, the voting weights of the hidden layers augment and diminish with different intensities. It is also evident that the voting weight of a strong hidden layer should be confirmed, whereas the influence of poor building units should be minimized in the voting phase. That is, ρ_d = ρ_d + π occurs if a hidden layer returns a correct prequential prediction, whereas ρ_d = ρ_d - π takes place if a sample leads to a prequential error. The step size π sets the rate of change, where a higher value increases the sensitivity to the hidden layer's performance.

The penalty and reward scenario is undertaken by decreasing and increasing the voting weight of a hidden layer. A penalty is imposed if a wrong prediction is produced by a hidden layer, as follows:

χ_d = χ_d × ρ_d    (10)

On the other hand, the reward scenario is performed if a correct prediction is returned, as follows:

χ_d = min(χ_d (1 + ρ_d), 1)    (11)

The reward scenario also functions to handle cyclic drift because it reactivates an inactive hidden layer carrying a minor voting weight. As observed from (11), and since the range of the decaying factor ρ_d is [0, 1], the voting weight is bounded in the range [0, 1]. Compared to similar approaches in [28, 23], this penalty and reward scenario emphasizes rewards to a strong hidden layer while discouraging penalties to such layers, whereas a poor hidden layer is penalized with high intensity while receiving little reward when returning a correct prediction. This strategy is adopted since every hidden layer has a direct connection to the output layer, outputting its own predicted class label. The final predicted class label should take into account the relevance of each hidden layer, reflected by its prequential error. Furthermore, this mechanism also aligns with the use of the drift detection method as an avenue to deepen the network structure, since the drift detection method evolves a different concept for each hidden layer.
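A compact sketch of the update rules (9)-(11) for one hidden layer follows; the clipping of ρ_d to [0, 1] reflects its stated range, and the function name is illustrative.

def update_voting_weight(chi_d, rho_d, correct, step):
    """Dynamic decaying factor (9) and penalty/reward updates (10)-(11) for one hidden layer."""
    if correct:
        rho_d = min(rho_d + step, 1.0)            # a well-performing layer decays more slowly
        chi_d = min(chi_d * (1.0 + rho_d), 1.0)   # reward (11), bounded by 1
    else:
        rho_d = max(rho_d - step, 0.0)
        chi_d = chi_d * rho_d                     # penalty (10)
    return chi_d, rho_d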
DEVFNN implements the feature weighting strategy based on the concept of feature relevance. That is, a relevant feature is defined as one showing high sensitivity to the target concept, whereas a low-sensitivity feature signifies a poor input attribute which should be assigned a low weight to reduce its influence on the final predictive outcome [34]. There exist several avenues to check the sensitivity of input attributes with respect to the target concept: the Fisher separability criterion [35], statistical contribution [30], etc. The correlation measure is considered the most plausible strategy here because it reveals the mutual information of input and target features [22], which signals the presence of changing system dynamics. The correlation between two variables, x_1, x_2, can be estimated using the Pearson correlation index (PCI) as follows:

ζ(x_1, x_2) = cov(x_1, x_2) / sqrt(var(x_1) var(x_2))    (12)

where cov(x_1, x_2) and var(x_1), var(x_2) respectively stand for the covariance and variances of x_1, x_2, which can be calculated in recursive manners. Although the PCI could be directly integrated into the feature weighting scope without any transformation, because the highest correlation is attained when it returns either -1 or 1, the PCI method is sensitive to rotation and translation of data samples [36]. The maximum information compression index (MICI) method [36] is applied to achieve a trade-off between accuracy of the correlation measure and computational simplicity. The MICI method works by estimating the amount of information loss if one of the two variables is discarded. It is expressed as follows:

γ(x_1, x_2) = (1/2) (v_1 + v_2 - sqrt((v_1 + v_2)^2 - 4 v_1 v_2 (1 - ζ(x_1, x_2)^2)))    (13)

where ζ(x_1, x_2) denotes the Pearson correlation index of the two variables and v_1, v_2 represent var(x_1), var(x_2), respectively. Unlike the PCI method, where -1 and 1 signify maximum correlation, the maximum similarity is attained at γ(x_1, x_2) = 0. This method is also insensitive to rotation and translation of data points [36]. Once the correlations of the j-th input variable and all m target classes are calculated, the score of the j-th input feature is defined by its relevance to all m target classes. This aspect is realized by taking the average correlation across the m target classes as follows:

Score_j = mean_{o=1,...,m} γ(x_j, y_o)    (14)

where γ(x_j, y_o) denotes the maximum information compression index between the j-th input feature and the o-th target class. The use of the average operator is to assign equal importance to each target class and to embrace the fact that an input feature may be highly needed to identify only one target class. This strategy adapts to the characteristic of gClass, formed in the MIMO architecture [33].

The feature selection mechanism is carried out by assigning binary weights λ_j, either 0 or 1, which change on demand in every time stamp. This strategy is applied to induce flexible activation and deactivation of input variables during the whole course of the training process, which avoids loss of information due to complete forgetting of particular input information. An input feature is switched off by assigning a 0 weight if its score rises above a given threshold δ_1, meaning that it shows a low relationship to any target class:

Score_j ≥ δ_1    (15)

where δ_1 ∈ [0, 1] stands for the predefined threshold. The higher the value of this threshold, the less aggressively the feature deactivation is performed, and vice versa. The feature selection process is carried out by setting the input weight to zero, with the likelihood of being set to one again in the future whenever the input attribute becomes relevant again. Moreover, the input weighting strategy is committed in a centralized manner where the similarities of all input attributes are analyzed at once. That is, all input attributes are put together. The online feature weighting scenario only analyses the n-dimensional original feature space rather than the (n + m × (D - 1))-dimensional augmented feature space. The extra input features are instead subject to the hidden layer merging mechanism, which studies the redundancy level of the D hidden layers.
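The scoring pipeline (12)-(15) translates to a few lines of code. The sketch below assumes batch estimates of variance and correlation rather than the recursive ones used online; X is the P × n original feature matrix and Y the P × m 1-0 target matrix.

import numpy as np

def mici(x1, x2):
    """Maximum information compression index (13); 0 indicates maximum similarity."""
    v1, v2 = np.var(x1), np.var(x2)
    zeta = np.corrcoef(x1, x2)[0, 1]   # Pearson correlation index (12)
    return 0.5 * (v1 + v2 - np.sqrt((v1 + v2)**2 - 4.0 * v1 * v2 * (1.0 - zeta**2)))

def feature_weights(X, Y, delta1):
    """Binary input weights lambda_j via the average score (14) and threshold (15)."""
    n, m = X.shape[1], Y.shape[1]
    scores = np.array([np.mean([mici(X[:, j], Y[:, o]) for o in range(m)])
                       for j in range(n)])
    return (scores < delta1).astype(float)   # deactivate features whose score reaches delta1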
DEVFNN implements the hidden layer merging mechanism to cope with redundancy across different base building units. This scenario is realized by analyzing the correlation of the outputs of different layers [23]. From the manifold learning viewpoint, a redundant layer containing a similar concept is expected not to inform the salient structure of the given problem, because it does not open the manifold of the learning problem to a unique representation, at least one not already covered by previous layers. Suppose that the MICI method is applied to explore the correlation of two hidden layers, γ(y_i, y_j), i ≠ j; the hidden layer merging condition is formulated as follows:

γ(y_i, y_j) < δ_2    (16)

where δ_2 ∈ [0, 1] is a user-defined threshold. This threshold is linearly proportional to the maximum correlation index: the lower its value, the less merging is undertaken. The merging procedure is carried out by setting the voting weight of the weaker layer to zero; its output is thus treated as a "don't care" input attribute by the next layers. This strategy expedites model updates because redundant layers can be bypassed without being revisited in both the inference and training procedures. A hard pruning mechanism is not implemented in the merging process because it would cause a reduction of the input dimension, which undermines the stability of the next layers unless a retraining mechanism from scratch is carried out; it would cause a dimensional reduction of the output covariance matrices of the next layers. This strategy is deemed similar to the dropout scenario [37] in the deep learning literature, but the weight setting is derived from similarity analysis rather than a purely probabilistic approach. The dominance between two hidden layers is simply determined from their voting weights. The voting weight is deemed a proper indicator of hidden layer dominance because it is derived from a dynamic penalty and reward scenario with unique and adaptive decaying factors. A redundancy-based approach such as the merging scenario is more stable than a relevance-based approach because the information of one layer can be perfectly represented by another layer. In addition, the parameters of the two hidden layers are not fused, because the similarity of the two hidden layers is observed at the output level rather than the fuzzy rule level.
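A sketch of the redundancy check (16) follows; it reuses the mici() helper from the previous sketch and silences the weaker of two correlated layers by zeroing its voting weight, with dominance judged from the voting weights as described above.

def merge_redundant_layers(layer_outputs, chi, delta2):
    """Soft merge: zero the voting weight of the weaker of any two layers whose output MICI falls below delta2."""
    D = len(layer_outputs)
    for i in range(D):
        for j in range(i + 1, D):
            if chi[i] > 0 and chi[j] > 0 and mici(layer_outputs[i], layer_outputs[j]) < delta2:
                weaker = i if chi[i] < chi[j] else j   # dominance judged from the voting weight
                chi[weaker] = 0.0                      # the layer becomes a "don't care" unit
    return chi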
DEVFNN realizes an evolving deep stacked network structure which is capable of introducing new hidden layers on demand. This strategy aims to embrace changing training patterns of data streams and to enhance generalization performance by increasing the level of abstraction of the training data. This is done by utilizing a drift detection module which vets the status of the data stream for the presence of concept change. Notwithstanding that the idea of adding a new component to handle concept drift is well established in the ensemble learning literature, there each ensemble member has no interaction at all with its neighbors.

The drift detection scenario makes use of the Hoeffding-bound drift detection mechanism proposed in [26], where the evaluation window is determined from the switching point rather than a fixed window size. Nevertheless, the original method is developed based on the increase of the population mean, which ignores the change of data distribution in the target space. In other words, it is only capable of detecting covariate drift. Instead of looking at the increase of the population mean, the error index is applied here. This modification is based on the fact that a change of the feature space does not necessarily induce concept drift in the target domain, P(Y|X)_k ≠ P(Y|X)_{k-1}. This concept is implemented by constructing a binary accuracy vector where an element 0 represents a correct prediction while an element 1 is inserted for a false prediction. This scenario is inspired by the fast Hoeffding drift detection method (FHDDM) [27], but our approach differs from [26] in that the window size is set fully adaptively according to the switching point. Moreover, each data sample is treated with equal importance, in the absence of any weights, which detects sudden drift rapidly, although it is rather inaccurate in picking up gradual drift [26]. The advantage of the Hoeffding method is its freedom from the normal data distribution assumption, which is too restrictive in many applications. Moreover, it is statistically sound because a Hoeffding bound corresponds to a particular confidence level.

Assuming that P denotes the chunk size, the data chunk is partitioned into three groups, F ∈ R^(P×(n+m)), G ∈ R^(cut×(n+m)), H ∈ R^((P-cut)×(n+m)), where cut is the switching point. F, G, H record the error index instead of the original data points, in which only two values, namely 0 or 1, are present: 0 for a true prediction, 1 for a false prediction. Note that cut is elicited by evaluating the data samples from the first sample up to the switching sample. Each data partition F, G, H is assigned an error bound ε_F, ε_G, ε_H calculated as follows:

ε_{F,G,H} = (b - a) sqrt( (1 / (2 × size)) ln(1/α) )    (17)

where size denotes the size of the data partition and α labels the significance level of the Hoeffding bound; a, b denote the minimum and maximum values in the data partition. The significance level has a clear statistical interpretation because it corresponds to the confidence level of the Hoeffding bound, 1 - α. In the realm of DNNs, the model's complexity must consider the availability of training samples to ensure parameter convergence, especially in the continual learning situation, where a retraining process over a number of epochs is prohibited. A shallow model is generally preferred over a deep model to handle a small dataset. Considering the aspect of sample availability, a dynamic significance level is put forward where the significance level exponentially rises with the number of time stamps, up to a limit. A limit is required here to avoid loss of detection accuracy because the significance level is inversely proportional to the confidence level:

α_D = min(1 - e^{-k/T}, α_{Dmin}),   α_W = min(1 - e^{-k/T}, α_{Wmin})    (18)

where k, T respectively denote the number of time stamps seen thus far and the total number of time stamps, while α_{Dmin}, α_{Wmin} stand for the minimum significance levels of the drift and warning phases. The minimum significance level is capped at 0.1 to induce an above-90% confidence level.

Once the confidence level is calculated, the next step is to find the switching point, cut, indicating the horizon of the drift detection problem, or the time window in which a drift is likely to be present. The switching point is found if the following condition is met:

F^ + ε_F ≤ G^ + ε_G    (19)

where F^, G^, H^ denote the statistics (empirical means) of the three data partitions. It is observed that the switching point targets a transition point between two concepts. The switching point portrays a time index where a drift starts to come into the picture. It is worth noting that the statistic of the data partition G^ is expected to be constant or to decrease during the stable phase. The drift phase, therefore, refers to the opposite case, where the empirical mean of the accuracy vector G^ increases. Condition (19) aims to find the cutting point cut where the accuracy vector is no longer in a decreasing trend. Once cut is located, the three error index vectors, F, G, H, can be formed.

Our drift detector returns two alarm conditions, warning and drift, tailored from the two significance levels α_warning, α_drift. The two significance levels can be set according to the desired confidence of drift detection: the smaller the significance level, the more conservative the drift detection. The warning and drift conditions are signalled if the following two situations are encountered:

|H^ - G^| > ε_drift    (20)

|H^ - G^| > ε_warning    (21)

These two conditions present a case where the null hypothesis H_0: E[G] ≤ E[H] is rejected. The warning condition is meant to capture gradual drift. It pinpoints a situation where a drift is not obvious enough to be declared, or further instances are needed to confirm a drift. No action is taken during the warning phase; only data samples of the warning condition are accumulated in the data matrix φ = [X_warning, Y_warning]. Once the drift condition is satisfied, the structure of the deep network is deepened by appending a new hidden layer stacked at the top level. The new hidden layer is created using the incoming data chunk and the data samples accumulated during the warning phase, Φ = [X_new, Y_new; φ]. The opposite, normal case portrays the alternative H_1: E[G] > E[H]. That is, the stable phase only activates an adjustment of the winning hidden layer to improve the generalization power of DEVFNN. The winning hidden layer is selected for the adaptation scenario, instead of all hidden layers, to expedite the model update. Moreover, DEVFNN adopts the concept of a different-depth network structure which opens room for each layer to produce the end output of DEVFNN, because each layer is trained to solve the same optimization problem. Because the dynamic voting mechanism is implemented, the winning hidden layer is simply selected based on its voting weight.

This section provides a brief recap of the gClass working principle [18], constructed under the three learning pillars of meta-cognitive learning: what-to-learn, how-to-learn and when-to-learn. The what-to-learn component functions as the sample selection module implemented under the online active learning scenario, while the when-to-learn component sets the sample update condition. The how-to-learn module realizes a self-adaptive learning principle developed under the framework of Scaffolding theory [24]: problematizing, fading and complexity reduction. This procedure encompasses rule generation, pruning, forgetting and tuning mechanisms. A flowchart of the gClass learning policy is placed in the supplemental document.

• What-To-Learn Phase - Online Active Learning Scenario: The online active learning scenario of gClass is driven by the extension of the extended conflict ignorance (ECI) principle [38], which relies on two sample contribution measures. The first concept is developed from the idea of extended recursive density estimation (ERDE), applied as the rule growing scenario in [39]. Our approach distinguishes itself from [39] in that the RDE concept is modified for the multivariate Gaussian rule and integrates the sample weighting concept, overcoming the outlier bottleneck. Unlike the rule growing concept, which finds salient samples as those having maximum and minimum densities, the samples of interest for deletion purposes are redundant samples, defined as those violating the maximum and minimum conditions. The second approach is designed using the distance-to-boundary concept. A classifier is said to be confident if it safely classifies a sample into one of the classes, that is, the sample is far from the decision boundary. The distance-to-boundary measure is set as a ratio of the first and second dominant classes examined from the classifier's outputs. An uncertain sample is indicated if the ratio returns a value around 0.5, whereas a high ratio signifies a confident case.

• How-To-Learn - Sample Learning Strategy: Once a sample is accepted by the online active learning scenario, it is passed to the how-to-learn scenario, evolving the parameters and structure of gClass. The problematizing part concerns the issue of concept drift handling, the complexity reduction part relieves the problem's complexity, and the fading part is devised for reduction of the model's complexity.
The three components are integrated in a single dedicated learning process and executed in the one-pass learning scenario. The problematizing phase consists of two learning modules: the rule growing and forgetting scenarios. The rule growing phase of gClass adopts the same criteria as pClass [32], in which three rule growing conditions, namely data quality (DQ), datum significance (DS) and volume check, are consolidated to pinpoint an ideal observation for expanding the rule base. The data quality method is a derivation of the RDE method in [39] involving the sample weighting concept. Moreover, it is tailored to accommodate the multivariate Gaussian rule. The DS concept estimates the statistical contribution of a data sample to judge whether it deserves to be a candidate for a new rule [40], while the volume check is integrated to prevent over-sized rules which undermine the model's generalization. The DS concept extends the concept of neuron significance [41] to be compatible with the non-axis-parallel ellipsoidal rule, while the volume check examines the volume of the winning rule, and a new rule is introduced given that its volume exceeds a pre-specified limit. Once the three rule growing conditions are satisfied, a new fuzzy rule is initialized considering the class overlapping situation. The class overlapping method arranges three initialization strategies with respect to the spatial proximity of a data sample to inter-class and intra-class data samples. This strategy is carried out using the quality-per-class method, which studies the relationship of the current sample to the target classes. The rule forgetting scenario is carried out in the local mode, which deploys unique forgetting levels for each rule [42]. It is derived using the local DQ (LDQ) method performing recursive local density estimation. The gradient of the LDQ method for each rule is calculated and used to define the forgetting level in the rule premise and consequent. Moreover, the rule consequent tuning scenario is driven by the fuzzily weighted generalized recursive least squares (FWGRLS) method, inspired by the generalized recursive least squares (GRLS) method [43], putting forward a weight decay term in the cost function of the RLS method. This method can also be perceived as a derivation of the FWRLS method in [39], in which the tuning scenario is improved with a weight decay term to improve the model's generalization.
Algorithm 1: Learning Policy of DEVFNN

Input: (X_n, Y_n) ∈ R^(P×(n+m)), π, δ_1, δ_2, V_ig and k_prune
Output: Class_final, the final predicted class
Testing phase:
  Predict: the class labels of the current data batch.
Training phase:
  Step 1: Tuning of χ:
  for k = 1 to P do
    for d = 1 to D do
      if (y^_k^d ≠ y_k) then
        Execute: ρ_d = ρ_d - π and χ_d = χ_d × ρ_d
        Update: the accuracy matrix Acc(k) = 1
      else
        Execute: ρ_d = ρ_d + π, χ_d = min(χ_d (1 + ρ_d), 1)
        Update: the accuracy matrix Acc(k) = 0
      end if
    end for
  end for
  Step 2: Online Feature Selection Mechanism:
  for j = 1 to n do
    for o = 1 to m do
      Calculate: the input-target score via (12)-(14)
      if (Score_j ≥ δ_1) then
        Set: the input weight to λ_j = 0
      end if
    end for
  end for
  Step 3: Hidden Layer Merging Mechanism:
  for d_1 = 1 to D do
    for d_2 = 1 to D do
      Calculate: the correlation coefficient (13)
      if (γ(y_{d_1}, y_{d_2}) < δ_2, for d_1 ≠ d_2) then
        Set: the voting weight of the weaker layer to 0
      end if
    end for
  end for
  Step 4: Drift Detection Mechanism:
  for k = 1 to P do
    Construct: F = Acc, G = Acc(1 : k)
    Calculate: ε_F, ε_G, F^, G^ via (17)
    if (G^ + ε_G ≥ F^ + ε_F) then
      Set: cut = k
      Construct: H = Acc(cut + 1 : P) and H^
    end if
  end for
  Calculate: the drift and warning levels ε_drift, ε_warning
  if (|H^ - G^| ≥ ε_drift) then
    Deepen: the network structure using (X_n, Y_n) ∈ R^(P×(n+m)) and φ
  else if (ε_warning ≤ |H^ - G^| < ε_drift) then
    Create: φ
  else
    Update: the winning layer using (X_n, Y_n) ∈ R^(P×(n+m))
  end if
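To illustrate Step 4, the sketch below runs the drift detector on a binary error-index vector. The bound's exact form follows the reconstruction in (17), and the choice of partition size inside each bound is an assumption of the sketch; the returned string maps to the stable, warning and drift phases.

import numpy as np

def hoeffding_eps(size, alpha, rng=1.0):
    """Per-partition Hoeffding bound (17); rng is (b - a), the range of the error index."""
    return rng * np.sqrt(np.log(1.0 / alpha) / (2.0 * size))

def detect_drift(acc, alpha_drift, alpha_warning):
    """Step 4 of Algorithm 1 on a binary error-index vector (0 = correct, 1 = wrong)."""
    F = np.asarray(acc, dtype=float)
    P, cut = len(F), None
    for k in range(1, P):                   # search the switching point via (19)
        G = F[:k]
        if G.mean() + hoeffding_eps(k, alpha_drift) >= F.mean() + hoeffding_eps(P, alpha_drift):
            cut = k
    if cut is None or cut >= P:
        return "stable"
    G, H = F[:cut], F[cut:]
    gap = abs(H.mean() - G.mean())
    if gap > hoeffding_eps(len(H), alpha_drift):     # (20): append a new hidden layer
        return "drift"
    if gap > hoeffding_eps(len(H), alpha_warning):   # (21): buffer samples, await confirmation
        return "warning"
    return "stable"                                  # update only the winning layer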
The fading phase lowers the structural complexity of gClass to avoid the overfitting problem and to expedite the runtime. The fading phase of gClass is crafted under the same strategy as pClass [32], where two rule pruning strategies, namely the extended rule significance (ERS) and potential+ (P+) methods, are put forward. The ERS method shares the same principle as the DS method in that it approximates the statistical contribution of fuzzy rules. This approach can also be seen as an estimator of the expected outputs of gClass under a uniform distribution. The P+ method adopts the concept of rule potential [39] but converts this method to perform the rule pruning task. Unlike the ERS method, which captures superfluous rules playing little role during their lifespan, the P+ method discovers obsolete rules no longer relevant to represent the current data distribution. By extension, the P+ method is also applied in the rule recall scenario to cope with cyclic drift. That is, obsolete rules are only deactivated, with the possibility of being reactivated in the future if they become relevant again.

gClass implements an online feature weighting mechanism based on the Fisher separability criterion (FSC) in the empirical feature space using the kernel concept, as adopted in pClass [32], as a part of the complexity reduction method. Nevertheless, the online feature weighting scenario of gClass is switched into a sleep mode because DEVFNN is already equipped with an online feature selection and layer merging module at the top level.

• When-To-Learn - Sample Reserved Strategy: The when-to-learn strategy of gClass is built upon the standard sample reserved strategy of the meta-cognitive learner [44]. Our approach, however, differs from [44] in that the sample learning condition is designed under different criteria. The sample reserved strategy incorporates a condition for the tuning scenario of the rule premise, and data samples violating the rule growing and tuning conditions are accumulated in a data buffer reserved for the future training process. Our approach simply exempts those samples and exploits them merely for the rule consequent adaptation scenario, because such samples quickly become outdated in rapidly changing environments. In addition, this procedure is designated to reduce the computational cost. A flowchart of the DEVFNN learning policy is attached in the supplemental document.
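As an illustration of the distance-to-boundary test in the what-to-learn phase described above, the following sketch flags uncertain samples from the classifier's m outputs; the 0.55 acceptance band is a hypothetical value chosen for the sketch.

import numpy as np

def sample_is_uncertain(y_outputs, band=0.55):
    """Ratio of the two most dominant class outputs: values near 0.5 flag an uncertain, informative sample."""
    top2 = np.sort(np.asarray(y_outputs))[-2:]        # second-dominant and dominant class scores
    ratio = top2[1] / (top2[0] + top2[1] + 1e-12)     # in [0.5, 1]; high ratio means a confident case
    return ratio < band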
This section discusses the experimental study of DEVFNN on seven popular real-world and synthetic data stream problems: electricity pricing, weather, SEA, hyperplane, SUSY, kddCUP, and an indoor RFID localization problem from our own project. DEVFNN is compared against prominent continual learning algorithms in the literature: PNN [45], HAT [46], DEN [4], DSSCN [17], and static DEVFNN. PNN, HAT, and DEN are popular continual learning algorithms in the deep learning literature, also designed to prevent the catastrophic forgetting problem. They feature self-evolution of hidden nodes but still adopt a static network depth. DSSCN represents a deep algorithm having stochastic depth; the key difference from our approach is that it forms its deep stacked network structure via the random shift concept rather than the feature augmentation approach. Comparison with static DEVFNN is important to demonstrate the efficacy of the flexible structure; that is, the hidden layer expansion and merging modules are switched off in static DEVFNN. The MATLAB code of DEVFNN is made publicly available at https://bit.ly/ZfnNy. Because of the page limit, only the SEA problem is discussed in the paper, while the remainder of the numerical study is outlined in the supplemental document. Nonetheless, Table 1 displays the numerical results of all problems. Our numerical study follows the prequential test-then-train protocol. That is, a dataset is divided into a number of equal-sized data batches. Each data batch is streamed to DEVFNN, where it is first used to test DEVFNN's generalization performance before being used for the training process. This scenario aims to simulate a real data stream environment [47] in which numerical evaluation is performed independently per data batch. The numerical results in Table 1 are reported as the average of the per-batch results across all data batches. Five evaluation criteria are used here: Classification Rate (CR), Fuzzy Rules (FR), Precision (P), Recall (R), and Hidden Layers (HL). The SEA problem is one of the most prominent problems in the data stream area, where the underlying goal is to categorize data samples into two classes based on the summation of two input attributes [48]. That is, class 1 is returned if the condition f_1 + f_2 < θ holds, while the opposite case f_1 + f_2 > θ indicates class 2. The concept drift is induced by shifting the class threshold three times, θ = 4 → 7 → 4 → 7. This shift results in abrupt drift and, furthermore, cyclic drift, because the class boundary is changed in a recurring fashion. The SEA problem is built upon three input attributes, where the third input attribute is merely noise. We make use of the modified SEA problem in [49], which incorporates a 5 to 25% minority class proportion. Data points are generated in the range [0, 10], and the concept drift takes place in the target domain due to the drifting class boundary. The trace of the classification rate, hidden layers, and training samples is depicted in Fig. 4(a)-(d) of the supplemental document, and the numerical results are reported in Table 1 along with those of the other problems.
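For concreteness, the prequential test-then-train loop used throughout the experiments can be sketched as follows; the model interface (predict, partial_fit) is a hypothetical stand-in for DEVFNN, not its actual API.

```python
import numpy as np

def prequential_evaluation(model, batches):
    """Prequential test-then-train protocol: each incoming batch is first
    used to test the model, then used to train it; per-batch accuracies
    are averaged at the end, as reported in Table 1."""
    accs = []
    for X, y in batches:                      # equal-sized batches arrive as a stream
        acc = np.mean(model.predict(X) == y)  # 1) test on the unseen batch
        accs.append(acc)
        model.partial_fit(X, y)               # 2) then train on the same batch
    return float(np.mean(accs))               # average accuracy across batches
```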
Table 1: Numerical results of the consolidated algorithms. Columns: CR (classification rate), FR (number of fuzzy rules), P (precision), R (recall), HL (number of hidden layers); rows: DEVFNN, DSSCN, DEN, HAT, PNN, and SDEVFNN (static DEVFNN) per problem.

The advantage of DEVFNN is reported in Table 1, where it outperforms the other algorithms in terms of classification rate, precision, and recall. This result highlights the efficacy of a deep stacked network structure over a static-depth network in improving generalization power. The deep stacked network structure allows continuous refinement of predictive power, since the output of a preceding layer is fed as extra input information to the current layer, which is supposed to drive the predictive error toward zero. The main bottleneck of building the deep stacked network architecture through the feature augmentation approach lies in the linear increase of input dimensionality with the number of hidden layers. Although the hidden layer merging mechanism integrated into DEVFNN is supposed to lower the input dimension by removing extra input attributes, only a soft dimensionality reduction is applied: the voting weight of a redundant hidden layer is set to zero and that layer is excluded from any training and inference processes. This strategy prevents the instability issue caused by a discontinued training process, which would impose retraining from scratch. Another complexity reduction method is implemented as a dynamic voting weight scenario that governs the influence of each hidden layer on the final classification decision. This scenario generates unique reward and penalty weights, heavily rewarding a good hidden layer while strongly penalizing a poor one.
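For illustration, a minimal sketch of the feature-augmentation forward pass with soft layer exclusion is given below; the layer objects exposing predict_proba and n_classes are hypothetical stand-ins for gClass, and the single-sample interface is a simplification.

```python
import numpy as np

def stacked_forward(x, layers, voting_weights):
    """Sketch: layer d receives the original input concatenated with the
    class posterior of layer d-1, and the final label is a voting-weighted
    aggregation; layers whose voting weight was zeroed by the merging
    module are skipped in both inference and augmentation."""
    z = x
    votes = np.zeros(layers[0].n_classes)
    for layer, w in zip(layers, voting_weights):
        if w == 0.0:                      # merged-out layer: soft exclusion
            continue
        y_d = layer.predict_proba(z)      # posterior of hidden layer d
        votes += w * y_d                  # weighted-voting aggregation
        z = np.concatenate([x, y_d])      # augmented input for the next layer
    return int(np.argmax(votes))
```

Note how the augmented input z grows by one posterior vector per active layer, which is exactly the linear dimensionality growth that the merging module is meant to contain.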
Because of the page limit, our numerical results on the other datasets are presented in the supplemental document. A brief summary of our numerical findings is as follows: 1) the decentralized online active learning strategy at the layer level is less efficient than the centralized variant because it is classifier-dependent; 2) the proposed real drift detection scenario is more timely than the covariate drift detection scenario because it reacts when the classifier's performance is compromised; 3) the dynamic voting scenario with dynamic decaying factors leads to a more stable voting scheme because the effect of penalty and reward can be controlled with respect to the classifier's performance (see the sketch below); 4) the self-evolving depth enhances the learning performance over the static depth because it grows the network on demand, considering the variation and availability of data samples.
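One plausible reading of the dynamic decaying-factor scheme in point 3 is sketched below; the step size π, the weight cap and floor, and the symmetric growth and decay of the reward and penalty intensities are assumptions for illustration, not the paper's exact update rules.

```python
def update_voting_weight(chi, rho_r, rho_p, correct, pi=0.1,
                         chi_max=1.0, chi_min=1e-3):
    """Sketch of unique reward/penalty intensities: each factor strengthens
    while its own condition keeps firing and decays otherwise, so the voting
    weight chi tracks the layer's prequential performance."""
    if correct:
        rho_r = min(rho_r + pi, 1.0)              # reward intensity grows
        rho_p = max(rho_p - pi, 0.0)              # penalty intensity decays
        chi = min(chi * (1.0 + rho_r), chi_max)   # reward the layer's vote
    else:
        rho_p = min(rho_p + pi, 1.0)              # penalty intensity grows
        rho_r = max(rho_r - pi, 0.0)              # reward intensity decays
        chi = max(chi * (1.0 - rho_p), chi_min)   # penalize the layer's vote
    return chi, rho_r, rho_p
```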
A novel deep fuzzy neural network, namely the dynamic evolving fuzzy neural network (DEVFNN), is proposed for mining evolving and dynamic data streams in the lifelong fashion. DEVFNN is constructed as a deep stacked network via the feature space augmentation concept, where a hidden layer receives the original input features plus the output of the preceding layer as its input features. This strategy generates continuous refinement of predictive power across the hidden layers. DEVFNN features an autonomous working principle in which its structure and parameters are self-configured on the fly from data streams. A drift detection method is developed based on the principle of FDDM, but it integrates an adaptive windowing scheme using the idea of a cutting point. The drift detection mechanism is designed not only to monitor the dynamics of the input space, the covariate drift, but also to identify the nature of the output space, the real drift. Furthermore, DEVFNN is equipped with a hidden layer merging mechanism which measures the correlation between two hidden layers and combines two redundant hidden layers. This module plays a key role in the deep stacked network structure built via the feature augmentation concept, because it addresses the uncontrollable increase of input dimension in rapidly changing conditions. DEVFNN also incorporates an online feature weighting method which assigns a crisp weight to input features with respect to their relevance to the target concept. The dynamic voting concept is introduced with the underlying notion of a "unique penalty and reward intensity" governed by the relevance of the hidden layers.

DEVFNN is created with a local learner, termed gClass, as each hidden layer, interconnected in tandem. gClass is characterized by the meta-cognitive learning approach having three learning phases: what-to-learn, how-to-learn, and when-to-learn. The what-to-learn and when-to-learn schemes provide added flexibility to gClass by putting forward the online active learning scenario and the sample reserved strategy, while the how-to-learn scheme is devised according to the Scaffolding theory. The online active learning method selects relevant samples to be labeled and used to train the model, which increases learning efficiency and mitigates the overfitting risk. The sample reserved strategy sets conditions for the rule premise update, while the Scaffolding theory is followed to enhance the learning performance through several learning modules tailored according to its problematizing, fading, and complexity reduction concepts.

The efficacy of DEVFNN has been numerically validated using seven prominent data stream problems from the literature, where it produces more accurate classification rates, precision, and recall than the other benchmarked algorithms while incurring only a minor increase of computational and memory demand. It is found that the depth of the network structure possesses a linear correlation with generalization power, provided that every hidden layer is properly initialized and trained, and that the stochastic depth property improves learning performance compared to the static depth. Moreover, the dynamic adjustment of voting weights makes it possible to adapt each voting weight with a dynamic adjustment factor which augments and shrinks according to its prequential error. It is observed that the merging scenario dampens the network complexity without compromising classification accuracy. The concept drift detection method is applied to grow the hidden layers of the network structure, which also grants hidden layers direct access to the output layer.
This implies that the final output is inferred from a combination of every layer's output. In other words, DEVFNN actualizes a different-depth network paradigm where each level puts forward unique aspects of data streams. Our future work will investigate different approaches for self-generating the hidden layers of a deep fuzzy neural network, because it is admitted that the use of a concept drift detection method replaces the multiple nonlinear mappings of one concept, which generate a high-level feature description, with diverse concepts per layer.
References

[1] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials, pages 1–1, 2018.
[2] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. Journal of Machine Learning Research, 49:1–39, 2016.
[3] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012.
[4] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In ICLR, 2018.
[5] C. Yu, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S.-F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In International Conference on Computer Vision (ICCV), pages 2857–2865, 2015.
[6] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 646–661. Springer, 2016.
[7] D. Sahoo, Q. D. Pham, J. Lu, and S. C. Hoi. Online deep learning: Learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705, 2017.
[8] T. Chen, I. J. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. CoRR, abs/1511.05641, 2015.
[9] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems 29, pages 2270–2278. Curran Associates, Inc., 2016.
[10] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] T. Zhou, F.-L. Chung, and S. Wang. Deep TSK fuzzy classifier with stacked generalization and triplely concise interpretability guarantee for large data. IEEE Transactions on Fuzzy Systems, 2017.
[12] T. Zhou, H. Ishibuchi, and S. Wang. Stacked-structure-based hierarchical Takagi-Sugeno-Kang fuzzy classification through feature augmentation. IEEE Transactions on Emerging Topics in Computational Intelligence, 1(6):421–436, December 2017.
[13] Y. Zhang, H. Ishibuchi, and S. Wang. Deep Takagi-Sugeno-Kang fuzzy classifier with shared linguistic fuzzy rules. IEEE Transactions on Fuzzy Systems, 26(3):1–1, 2017.
[14] D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
[15] Y. Bodyanskiy, O. Vynokurova, I. Pliss, D. Peleshko, and Y. Rashkevych. Deep stacking convex neuro-fuzzy system and its on-line learning. In Advances in Dependability Engineering of Complex Systems. Springer International Publishing, 2018.
[16] X. Gu and P. P. Angelov. Semi-supervised deep rule-based approach for image classification. Applied Soft Computing, 68:53–68, 2018.
[17] M. Pratama and D. Wang. Deep stacked stochastic configuration networks for non-stationary data streams. arXiv preprint arXiv:1808.02234, 2018.
[18] M. Pratama, J. Lu, S. G. Anavatti, E. Lughofer, and C. P. Lim. An incremental meta-cognitive-based scaffolding fuzzy neural network. Neurocomputing, 171:89–105, 2016.
[19] Y.-H. Pao and Y. Takefuji. Functional-link net computing: Theory, system architecture, and functionalities. IEEE Computer (special issue on computer architectures for intelligent machines), 1991.
[20] J. C. Patra and A. C. Kot. Nonlinear dynamic system identification using Chebyshev functional link artificial neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2002.
[21] J. Wang, P. Zhao, S. C. H. Hoi, and R. Jin. Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3):698–710, 2014.
[22] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, December 2004.
[23] M. Pratama, E. Dimla, E. Lughofer, W. Pedrycz, and T. Tjahjowidodo. Online tool condition monitoring based on parsimonious ensemble+. IEEE Transactions on Cybernetics, pages 1–14, 2018.
[24] R. Elwell and R. Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517–1531, October 2011.
[25] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, 2014.
[26] I. Frias-Blanco, J. del Campo-Avila, G. Ramos-Jimenez, R. Morales-Bueno, A. Ortiz-Diaz, and Y. Caballero-Mota. Online and non-parametric drift detection methods based on Hoeffding's bounds. IEEE Transactions on Knowledge and Data Engineering, 2015.
[27] A. Pesaranghader, H. Viktor, and E. Paquet. Reservoir of diverse adaptive learners and stacking fast Hoeffding drift detection methods for evolving data streams. Machine Learning, June 2018.
[28] M. Pratama, W. Pedrycz, and E. Lughofer. Evolving ensemble fuzzy classifier. IEEE Transactions on Fuzzy Systems, pages 1–1, 2018.
[29] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM '03, pages 123–130, Washington, DC, USA, 2003. IEEE Computer Society.
[30] M. Pratama, S. G. Anavatti, and E. Lughofer. GENEFIS: Toward an effective localist network. IEEE Transactions on Fuzzy Systems, 2014.
[31] M. Pratama, P. P. Angelov, E. Lughofer, and M.-J. Er. Parsimonious random vector functional link network for data streams. Information Sciences, 2018.
[32] M. Pratama, S. G. Anavatti, M.-J. Er, and E. D. Lughofer. pClass: An effective classifier for streaming examples. IEEE Transactions on Fuzzy Systems, 23(2):369–386, 2015.
[33] P. P. Angelov, E. Lughofer, and X. Zhou. Evolving fuzzy classifiers using different model architectures. Fuzzy Sets and Systems, 159(23):3160–3182, 2008.
[34] E. Lughofer, C. Cernuda, S. Kindermann, and M. Pratama. Generalized smart evolving fuzzy systems. Evolving Systems, 6(4):269–292, 2015.
[35] E. Lughofer. On-line incremental feature weighting in evolving fuzzy classifiers. Fuzzy Sets and Systems, 163(1):1–23, 2011.
[36] P. Mitra, C. A. Murthy, and S. K. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, January 2014.
[38] E. Lughofer. Single-pass active learning with conflict and ignorance. Evolving Systems, 3(4):251–271, December 2012.
[39] P. Angelov and D. P. Filev. An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2004.
[40] M. Pratama, S. G. Anavatti, P. Angelov, and E. Lughofer. PANFIS: A novel incremental learning machine. IEEE Transactions on Neural Networks and Learning Systems, 2014.
[41] G.-B. Huang, P. Saratchandran, and N. Sundararajan. A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Transactions on Neural Networks, 16:57–67, 2005.
[42] A. Shaker and E. Lughofer. Self-adaptive and local strategies for a smooth treatment of drifts in data streams. Evolving Systems, 5(4):239–257, 2014.
[43] Y. Xu, K.-W. Wong, and C.-S. Leung. Generalized RLS approach to the training of neural networks. IEEE Transactions on Neural Networks, 17(1):19–34, 2006.
[44] A. K. Das, K. Subramanian, and S. Sundaram. An evolving interval type-2 neurofuzzy inference system and its metacognitive sequential learning algorithm. IEEE Transactions on Fuzzy Systems, 2015.
[45] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
[46] J. Serra, D. Suris, M. Miron, and A. Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[47] J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
[48] W. N. Street and Y.-S. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 377–382, New York, NY, USA, 2001. ACM.
[49] G. Ditzler and R. Polikar. Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 25(10):2283–2301, 2013.