[PDF] Curse of Dimensionality for TSK Fuzzy Neural Networks: Explanation and Solutions

Abstract

Takagi-Sugeno-Kang (TSK) fuzzy system with Gaussian membership functions (MFs) is one of the most widely used fuzzy systems in machine learning. However, it usually has difficulty handling high-dimensional datasets. This paper explores why TSK fuzzy systems with Gaussian MFs may fail on high-dimensional inputs. After transforming defuzzification to an equivalent form of softmax function, we find that the poor performance is due to the saturation of softmax. We show that two defuzzification operations, LogTSK and HTSK, the latter of which is first proposed in this paper, can avoid the saturation. Experimental results on datasets with various dimensionalities validated our analysis and demonstrated the effectiveness of LogTSK and HTSK.

Full PDF

aa r X i v : . [ c s . L G ] F e b Curse of Dimensionality for TSK Fuzzy NeuralNetworks: Explanation and Solutions

Yuqi Cui, Dongrui Wu and Yifan Xu

School of Artiﬁcial Intelligence and AutomationHuazhong University of Science and Technology

Wuhan, China { yqcui, drwu, yfxu } @hust.edu.cn Abstract —Takagi-Sugeno-Kang (TSK) fuzzy system withGaussian membership functions (MFs) is one of the most widelyused fuzzy systems in machine learning. However, it usuallyhas difﬁculty handling high-dimensional datasets. This paperexplores why TSK fuzzy systems with Gaussian MFs may failon high-dimensional inputs. After transforming defuzziﬁcationto an equivalent form of softmax function, we ﬁnd that the poorperformance is due to the saturation of softmax . We show thattwo defuzziﬁcation operations, LogTSK and HTSK, the latter ofwhich is ﬁrst proposed in this paper, can avoid the saturation.Experimental results on datasets with various dimensionalitiesvalidated our analysis and demonstrated the effectiveness ofLogTSK and HTSK.

Index Terms —Mini-batch gradient descent, fuzzy neural net-work, high-dimensional TSK fuzzy system, HTSK, LogTSK

I. I

NTRODUCTION

Takagi-Sugeno-Kang (TSK) fuzzy systems [1] haveachieved great success in numerous machine learning applica-tions, including both classiﬁcation and regression. Since a TSKfuzzy system is equivalent to a ﬁve layer neural network [2],[3], it is also known as TSK fuzzy neural network.Fuzzy clustering [4]–[6] and evolutionary algorithms [7],[8] have been used to determine the parameters of TSK fuzzysystems on small datasets. However, their computational costmay be too high for big data. Inspired by its great success indeep learning [9]–[11], mini-batch gradient descent (MBGD)based optimization was recently proposed for training TSKfuzzy systems [12], [13].Traditional optimization algorithms for TSK fuzzy systemsuse grid partition to partition the input space into differentfuzzy regions, whose number grows exponentially with theinput dimensionality. A more popular and ﬂexible way isclustering-based partition, e.g., fuzzy c-means (FCM) [14],EWFCM [15], ESSC [4], [16] and SESSC [6], in which thefuzzy sets in different rules are independent and optimizedseparately.Although the combination of MBGD-based optimizationand clustering-based rule partition can handle the problemof optimizing antecedent parameters on high-dimensionaldatasets, TSK fuzzy systems still have difﬁculty achievingacceptable performance. The main reason is the curse ofdimensionality, which affects all machine learning models.When the input dimensionality is high, the distances betweendata points become very similar [17]. TSK fuzzy systems usually use distance based approaches to compute membershipgrades, so when it comes to high-dimensional datasets, thefuzzy partitions may collapse. For instance, FCM is knownto have trouble handling high-dimensional datasets, becausethe membership grades of different clusters become similar,leading all centers to move to the center of the gravity [18].Most previous works used feature selection or dimensional-ity reduction to cope with high-dimensionality. Model-agnosticfeature selection or dimensionality reduction algorithms, suchas Relief [19] and principal component analysis (PCA) [20],[21], can ﬁlter the features before feeding them into TSKmodels. Neural networks pre-trained on large datasets can alsobe used as feature extractor to generate high-level features withlow dimensionality [22], [23].There are also approaches to select the fuzzy sets in eachrule so that rules may have different numbers of antecedents.For instance, Alcala-Fdez et al. proposed an association anal-ysis based algorithm to select the most representative patternsas rules [24]. C´ozar et al. further improved it by proposing alocal search algorithm to select the optimal fuzzy regions [25].Xu et al. proposed to use the attribute weights learned bysoft subspace fuzzy clustering to remove fuzzy sets with lowweights to build a concise TSK fuzzy system [4]. However,there were few approaches that directly train TSK models onhigh-dimensional datasets.Our previous experiments found that when using MBGD-based optimization, the initialization of the standard deviationof Gaussian membership functions (MFs), σ , is very importantfor high-dimensional datasets, and larger σ may lead tobetter performance. In this paper, we demonstrate that thisimprovement is due to the reduction of saturation caused bythe increase of dimensionality. Furthermore, we validate twoconvenient approaches to accommodate saturation.Our main contributions are: • To the best of our knowledge, we are the ﬁrst to discoverthat the curse of dimensionality in TSK modeling is dueto the saturation of the softmax function. As a result,there exists an upper bound on the number of rules thateach input can ﬁre. Furthermore, the loss landscape of asaturated TSK system is more rugged, leading to worsegeneralization. • We demonstrate that the initialization of δ should becorrelated with the input dimensionality to avoid satu-ation. Based on this, we propose a high-dimensionalTSK (HTSK) algorithm, which can be viewed as a newdefuzziﬁcation operation or initialization strategy. • We validate LogTSK [23] and our proposed HTSK ondatasets with a large range of dimensionality. The resultsindicate that HTSK and LogTSK can not only avoidsaturation, but also are more accurate and more robustthan the vanilla TSK algorithm with simple initialization.The remainder of this paper is organized as follows: Sec-tion II introduces TSK fuzzy systems and the saturation phe-nomenon on high-dimensional datasets. Section III introducesthe details of LogTSK and our proposed HTSK. Section IVvalidates the performances of LogTSK and HTSK on datasetswith various dimensionality. Section V draws conclusions.II. T

RADITIONAL

TSK F

UZZY S YSTEMS ON H IGH - DIMENSIONAL D ATASETS

This section introduces the details of TSK fuzzy system withGaussian MF [26], the equivalence between defuzziﬁcationand softmax function, and the saturation phenomenon of softmax on high-dimensional datasets.

A. TSK Fuzzy Systems

Let the training dataset be D = { x n , y n } Nn =1 , in which x n = [ x n, , ..., x n,D ] T ∈ R D × is a D -dimensional featurevector, and y n ∈ { , , ..., C } can be the corresponding classlabel for a C -class classiﬁcation problem, or y n ∈ R for aregression problem.Suppose a D -input single-output TSK fuzzy system has R rules. The r -th rule can be represented as:Rule r : IF x is X r, and · · · and x D is X r,D , Then y r ( x ) = b r, + D X d =1 b r,d x d , (1)where X r,d ( r = 1 , ..., R ; d = 1 , ..., D ) is the MF for the d -th attribute in the r -th rule, and b r,d , d = 0 , ..., D , are theconsequent parameters. Note that here we only take single-output TSK fuzzy systems as an example, but the phenomenonand conclusion can also be extended to multi-output TSKsystems.Consider Gaussian MFs. The membership degree µ of x d on X r,d is: µ X r,d ( x d ) = exp − ( x d − m r,d ) σ r,d ! , (2)where m r,d and σ r,d are the center and the standard deviationof the Gaussian MF X r,d , respectively.The ﬁnal output of the TSK fuzzy system is: y ( x ) = P Rr =1 f r ( x ) y r ( x ) P Ri =1 f i ( x ) , (3)where f r ( x ) = D Y d =1 µ X r,d ( x d ) = exp − D X d =1 ( x d − m r,d ) σ r,d ! (4) is the ﬁring level of Rule r . We can also re-write (3) as: y ( x ) = R X r =1 f r ( x ) y r ( x ) , (5)where f r ( x ) = f r ( x ) P Ri =1 f i ( x ) (6)is the normalized ﬁring level of Rule r . (5) is the defuzziﬁca-tion operation of TSK fuzzy systems.In this paper, we use k -means clustering to initialize theantecedent parameters m r,d , and MBGD to optimize theparameters b r,d , m r,d and σ r,d . More speciﬁcally, we run k -means clustering and assign the R cluster centers to m r,d asthe centers of the rules. We use different initializations of σ r,d to validate their inﬂuence on the performance of TSK modelson high-dimensional datasets. He initialization [27] is used forthe consequent parameters. B. TSK Fuzzy Systems on High-Dimensional Datasets

When using Gaussian MFs and the product t -norm, we canre-write (6) as: f r ( x ) = f r ( x ) P Ri =1 f i ( x )= exp (cid:16) − P Dd =1 ( x d − m r,d ) σ r,d (cid:17)P Ri =1 exp (cid:16) − P Dd =1 ( x d − m i,d ) σ i,d (cid:17) . (7)Replacing − P Dd =1 ( x d − m r,d ) σ r,d with Z r , we can observe that f r is a typical softmax function: f r ( x ) = exp( Z r ) P Ri =1 exp( Z i ) , (8)where Z r < ∀ x . We can also show that, as the dimension-ality increases, Z r decreases, which causes the saturation of softmax [28].Let Z = [ Z , ..., Z R ] and f = [ f , ..., f R ] . In a three-rule TSK fuzzy system for low-dimensional task, if Z =[ − . , − . , − . , then f = [0 . , . , . . As thedimensionality increases, Z may increase to, for example, [ − , − , − , and then f = [1 , × − , × − ] , whichmeans the ﬁnal prediction is dominated by one rule. In otherwords, f in (8) with high-dimensional inputs tends to onlygive non-zero ﬁring level to the rule with the maximum Z r .In order to avoid numeric underﬂow, we compute thenormalized ﬁring level by a common trick: f r ( x ) = exp( Z r − Z max ) P Ri =1 exp( Z i − Z max ) , (9)where Z max = max( Z , ..., Z R ) . In this paper, we considerthat a rule is ﬁred by x when the corresponding normalizedﬁring level f r ( x ) > − .We generate a two-class toy dataset following Gaussiandistribution x i ∼ N (0 , with the dimensionality varyingrom 5 to 2,000 for pilot experiments. The labels are alsogenerated randomly. We initialize σ following Gaussian distri-bution σ ∼ N ( h, . , h = 1 , , , , and train TSK modelswith different R for 30 epochs. The number of rules ﬁred byeach input with different dimensionality at different trainingepochs is shown in Fig. 1. The number of ﬁred rules decreasesrapidly as the dimensionality increases when h = 1 . For aparticular input dimensionality D , there exists an upper boundof the number of ﬁred rules, i.e., larger R would not alwaysincrease the number of ﬁred rules. Increasing h can mitigatethe saturation phenomenon to a certain extent and increase theupper bound of the number of ﬁred rules. ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V ,QLW ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V (SRFK h = 1 R = 200R = 150R = 100R = 50R = 10 (a) ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V ,QLW ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V (SRFK h = 5 R = 200R = 150R = 100R = 50R = 10 (b) ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V ,QLW ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V (SRFK h = 10 R = 200R = 150R = 100R = 50R = 10 (c) ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V ,QLW ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V (SRFK h = 50 R = 200R = 150R = 100R = 50R = 10 (d)Fig. 1. The average number of ﬁred rules versus the input dimensionalityon randomly generated datasets. σ of TSK models is initialized by Gaussiandistribution σ ∼ N ( h, . , h = 1 , , , . The ﬁrst and second columnsrepresent the model before training and after 30 epochs of training, respec-tively. Although each high-dimensional input feature vector canonly ﬁre a limited number of rules due to saturation, differentinputs may ﬁre different subsets of rules, which means evertrule is useful to the ﬁnal predictions. We compute the average normalized ﬁring level of the r -th rule during training by: A r = 1 N N X n =1 f r ( x n ) . (10)We train TSK models with 60 rules and computed the 5%-95% percentiles of A r , r = 1 , ..., R , during training on datasetBooks from Amazon product review datasets. The details ofthis dataset will be introduced in Section IV-A. We repeat theexperiments ten times and show the average results in Fig. 2.Except a small number of rules with high A r , most rules barelycontribute to the prediction. This phenomenon doesn’t changeas the training goes on. (SRFKV A r P P P P P Fig. 2. Different percentiles of A r , r = 1 , ..., R versus the training epochs. C. Enhance the Performance of TSK Fuzzy Systems on High-Dimensional Datasets

The easiest way to mitigate saturation is to increase thescale of σ . As indicated by (7) and (8), increasing the scale of σ also increases the value of Z r to avoid saturation. Similartricks have already been used for training TSK models withfuzzy clustering algorithms, such as FCM [14], ESSC [16] andSESSC [6]. The parameter σ is computed by: σ r,d = h " N X n =1 U n,r ( x n,d − V r,d ) , N X i =1 U i,r / , (11)where U n,r is the membership grade of x n in the r -th cluster, V r, · is the center of the r -th cluster, and h is used to adjust thescale of σ r,d . The larger h is, the smaller | Z r | is. For MBGD-based optimization, we can directly initialize σ with a propervalue to avoid saturation in training. However, the proper scaleparameter h for σ usually depends on the characteristics of thetask, which requires trial-and-error, or time-consuming cross-validation.A better way is to use a scaling factor depending on thedimensionality D to constrain the range of | Z r | . A similarapproach is used in the Transformer [29], in which a scalingfactor / √ d k is used to constrain the value of QK T . Whenthe distribution of the constrained Z r is no longer dependingon the dimensionality D , all we have to do is to choose oneproper initialization range of σ suitable for most datasets.Alternatively, we can use other normalization approacheshich are insensitive to the scale of Z r . For instance, we canreplace the defuzziﬁcation by f r ( Z r ) = Z r k [ Z , ..., Z R ] k (12)or f r ( Z r ) = Z r k [ Z , ..., Z R ] k , (13)so that f r ( Z r ) = f r ( hZ r ) ∀ h > .III. D EFUZZIFICATION FOR H IGH -D IMENSIONAL P ROBLEMS

This section introduces LogTSK proposed by Du etal. [23] and our proposed HTSK. Both are suitable for high-dimensional problems.

A. LogTSK

Recently, an algorithm called TCRFN was proposed forpredicting driver’s fatigue using the combination of con-volutional neural network (CNN) and recurrent TSK fuzzysystem [23]. Within TCRFN, a logarithm transformation of f r was proposed to “amplify the small differences on ﬁringlevels” . The ﬁring level and normalized ﬁring level of the r -th rule in TCRFN are: f logr = − f r = 1 P Dd =1 ( x d − m r,d ) σ r,d f logr = f logr P Ri =1 f logr . (14)The ﬁnal output is: y ( x ) = R X r =1 f logr ( x ) y r ( x ) . (15)We denote the TSK fuzzy system with this log-transformeddefuzziﬁcation LogTSK in this paper. Substituting Z r into (14)gives f logr = − /Z r − P Ri =1 /Z i = − /Z r k [1 /Z , ..., /Z R ] k , (16)i.e., LogTSK avoids the saturation by changing the normaliza-tion from softmax to L normalization. Since L normalizationis not affected by the scale of Z r , LogTSK can make TSKfuzzy systems trainable on high-dimensional datasets. B. Our Proposed HTSK

We propose a simple but very effective approach, HTSK(high-dimensional TSK), to enable TSK fuzzy systems todeal with datasets with any dimensionality, by avoiding thesaturation in (8). HTSK constrains the scale of | Z r | by simplychanging the sum operator in Z r to average: Z ′ r = − D D X d =1 ( x d − m r,d ) σ r,d . (17) We can understand this transformation from the perspective ofdefuzziﬁcation. (5) can be rewritten as: y ( x ) = R X r =1 f ′ r ( x ) y r ( x ) , (18)where f ′ r ( x ) = f r ( x ) /D P Ri =1 f i ( x ) /D = exp( Z ′ r ) P Ri =1 exp( Z ′ i ) . (19)In this way, the scale of | Z ′ r | no longer depends on thedimensionality D . Even in a very high dimensional space, ifthe input feature vectors are properly pre-processed (z-scoreor zero-one normalization, etc.), we can still guarantee thestability of HTSK.HTSK is equivalent to adaptively increasing σ √ D times inthe vanilla TSK, i.e., the initialization of σ should be correlatedwith the input dimensionality. The vanilla TSK fuzzy systemis a special case of HTSK when setting D = 1 .IV. R ESULTS

In this section, we validate the performances of LogTSKand our proposed HTSK on multiple datasets with varyingsize and input dimensionality.

A. Datasets

Fourteen datasets with the dimensionality D varying from10 to 4,955 were used. Their details are summarized in Table I.For FashionMNIST and MNIST, we used the ofﬁcial training-test partition. For other datasets, we randomly selected 70%samples for training and the remaining 30% for test. TABLE IS

UMMARY OF THE FOURTEEN DATASETS .Dataset Num. of features Num. of samples Num. of classesVowel

10 990 11Vehicle

18 596 4Biodeg

41 1,055 2Sensit

100 78,823 3Usps

256 7,291 10Books

400 2,000 2DVD

400 1,999 2ELEC

400 1,998 2Kitchen

400 1,999 2Isolet

617 1,560 26MNIST

784 60,000 10FashionMNIST

784 60,000 10Colon https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation https://jmcauley.ucsd.edu/data/amazon/ https://archive.ics.uci.edu/ml/datasets/isolet http://yann.lecun.com/exdb/mnist/ https://github.com/zalandoresearch/fashion-mnist . Algorithms We compared the following ﬁve algorithms: • PCA-TSK: We ﬁrst perform PCA, and keep only a fewcomponents that capture 95% of the variance, to reducethe dimensionality, and then train a vanilla TSK fuzzysystem introduced in Section II. The parameter σ is ini-tialized following Gaussian distribution σ ∼ N (1 , . . • TSK- h : This is the vanilla TSK fuzzy system introducedin Section II. The parameter σ is initialized followingGaussian distribution σ ∼ N ( h, . . We set h to { , , , } to validate the inﬂuence of saturation onthe generalization performance. • TSK-BN-UR: This is the TSK-BN-UR algorithm in [13].The weight for UR is selected by the validation set. Theparameter σ is initialized following Gaussian distribution σ ∼ N (1 , . . • LogTSK: TSK with the log-transformed defuzziﬁcationintroduced in Section III-A. The parameter σ is initializedfollowing Gaussian distribution σ ∼ N (1 , . . Otherparameters are initialized by the method described inSection II-A. • HTSK: This is our proposed HTSK in Section III-B. Theparameter σ is initialized following Gaussian distribution σ ∼ N (1 , . .All parameters except σ were initialized as described inSection II-A. All models were trained using MBGD-basedoptimization. We used Adam [10] optimizer. The learning ratewas set to 0.01, which was the best learning rate chosen bycross-validation on most datasets. The batch size was set to2,048 for MNIST and FashionMNIST, and 512 for all otherdatasets. If the batch size was larger than the total number ofsamples N t in the training set, then we set it to min( N t , .We randomly selected 10% samples from the training set asthe validation set for early-stopping. The maximum numberof epochs was set to 200, and the patience of early-stoppingwas 20. The best model on the validation set was used fortesting. We ran all TSK algorithms with the number of rules R = 30 . All algorithms were repeated ten times and theaverage performance was reported.Note that the aim of this paper is not to pursue the state-of-the-art performance on each dataset, so we didn’t use cross-validation to select the best hyper-parameters on each dataset,such as the number of rules. We only aim to demonstratewhy TSK fuzzy systems perform poorly on high-dimensionaldatasets, and the improvement of HTSK and LogTSK. C. Generalization Performances

The average test accuracies of the eight TSK algorithmswith 30 rules are shown in Table II. The best accuracy oneach dataset is marked in bold. We can observe that: • On average, HTSK and LogTSK had similar perfor-mance, and both outperformed other TSK algorithmson a large range of dimensionality. TSK- and TSK- performed well on datasets within a certain rangeof dimensionality, but they were not always optimal when the dimensionality changed. For instance, on Colon, h = 50 was better than h = 5 or , but on Vowel, h = 1 or were better than h = 10 or . However,HTSK and LogTSK always achieved optimal or close-to-optimal performance on those datasets. The results alsoindicate that the initialization of h should be correlatedwith D , and h = √ D is a robust initialization strategyfor datasets with a large range of dimensionality. • PCA-TSK performed the worst, which may be because ofthe loss of information during dimensionality reduction. Italso shows the necessity of training TSK models directlyon high-dimensional features. • In [13], TSK-BN-UR outperformed TSK- on low-dimensional datasets, but this paper shows that it doesnot cope well with high dimensional datasets.We also show the test accuracies of the eight TSK algorithmswith different number of rules in Fig. 3. HTSK and LogTSKoutperformed other TSK algorithms, regardless of R . D. Number of Fired Rules

We analyzed the number of ﬁred rules by each input onHTSK and LogTSK, and show the results in Fig. 4. Thedataset used here is same as the one in Fig. 1. Both ﬁguresshow that in HTSK and LogTSK, each high-dimensional inputﬁres almost all rules, even with a small initial σ . However,when the number of rules is large, for instance, R = 200 ,LogTSK’s number of ﬁred rules is less than 200, but HTSK’snumber of ﬁred rules is still 200. This may be caused by the L normalization of LogTSK, making the normalized ﬁringlevel sparser than HTSK. E. Gradient and Loss Landscape

Figs. 1 and 4 show that h ≥ can counteract mostof the inﬂuence caused by saturation when D < , .Therefore, the performances of TSK- , TSK- and TSK- are very similar to HTSK and LogTSK on datasets withdimensionalities in that range.To study if the limited number of ﬁred rules is the onlyreason causing the decrease of generalization performance,we further analyze the gradient and the loss landscape duringtraining. Because the scale of σ affects the gradients’ absolutevalues, we only compare the L norm of the gradients forTSK- , HTSK, and LogTSK. The parameter σ was initializedfollowing Gaussian distribution σ ∼ N (1 , . . We recordedthe L norm of the gradient of the antecedent parameters m and σ during training on the Books dataset. Fig. 5(a) and (b)show that the gradient of antecedent parameters from TSK- is signiﬁcantly larger than HTSK and LogTSK, especially inthe initial training phase.We also visualize the loss landscape on the gradient direc-tion using the approach in [30]. Speciﬁcally, for each updatestep, we compute the gradient w.r.t. the loss and take one stepfurther using a ﬁxed step: η × the gradient ( η = 1 ). Then, werecord the loss as the parameters move in that direction. Whenthe initial parameters from different runs are the same, theloss’ variation represents the smoothness of the loss landscape. ABLE IIA

VERAGE ACCURACIES OF THE EIGHT

TSK

ALGORITHMS WITH RULES ON THE FOURTEEN DATASETS .PCA-TSK TSK-BN-UR TSK- TSK- TSK- TSK- LogTSK HTSKVowel 80.81 87.21 87.91 87.58 55.49 49.83 85.42

Vehicle 70.28 71.73 72.68 75.31 73.80 72.07 75.25

Biodeg 84.86

Books 73.95 75.55 76.42 79.28 78.70 78.83 78.87

DVD 75.53 75.32 76.05 78.97 78.67 78.42

Elec 75.68 78.72 79.45 81.28 81.45

Gisette 93.66 94.14 95.80 95.38

TABLE IIIA

CCURACY RANKS OF THE EIGHT

TSK

ALGORITHMS WITH RULES ON THE FOURTEEN DATASETS .PCA-TSK TSK-BN-UR TSK- TSK- TSK- TSK- LogTSK HTSKVowel 6 4 2 3 7 8 5 1Vehicle 8 7 5 2 4 6 3 1Biodeg 6 1 5 3 8 7 4 2Sensit 7 5 8 4 3 6 1 2USPS 8 7 6 4 3 5 2 1Books 8 7 6 2 5 4 3 1DVD 7 8 6 3 4 5 1 1Elec 8 7 6 4 2 1 3 5Kitchen 8 7 6 1 5 4 3 2Isolet 6 7 8 5 1 4 2 3MNIST 7 6 8 4 2 5 1 3FashionMNIST 7 6 8 4 2 5 1 3Colon 8 7 5 4 6 1 3 1Gisette 8 7 5 6 1 3 2 4Average 7.3 6.1 6.0 3.5 3.8 4.6 2.4 1XPRIUXOHV 7 H V W $ FF X U DF \ 9RZHO 1XPRIUXOHV 7 H V W $ FF X U DF \ 9HKLFOH 1XPRIUXOHV 7 H V W $ FF X U DF \ %LRGHJ 1XPRIUXOHV 7 H V W $ FF X U DF \ 6HQVLW 1XPRIUXOHV 7 H V W $ FF X U DF \ 8636 1XPRIUXOHV 7 H V W $ FF X U DF \ %RRNV 1XPRIUXOHV 7 H V W $ FF X U DF \ '9' 1XPRIUXOHV 7 H V W $ FF X U DF \ (/(& 1XPRIUXOHV 7 H V W $ FF X U DF \ .LWFKHQ 1XPRIUXOHV 7 H V W $ FF X U DF \ ,VROHW 1XPRIUXOHV 7 H V W $ FF X U DF \ 01,67 1XPRIUXOHV 7 H V W $ FF X U DF \ )DVKLRQ01,67 1XPRIUXOHV 7 H V W $ FF X U DF \ &RORQ 1XPRIUXOHV 7 H V W $ FF X U DF \ *LVHWWH 1XPRIUXOHV 7 H V W $ FF X U DF \ $YHUDJH 3&$76.76.%18576.176.576.1076.50+76./RJ76. Fig. 3. Test accuracies of the eight TSK algorithms on the fourteen datasets with different number of rules. ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V ,QLW ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V (SRFK +76. R = 200R = 150R = 100R = 50R = 10 (a) ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V ,QLW ,QSXWGLPHQVLRQ 1 X P R I I L U H G U X O H V (SRFK /RJ76. R = 200R = 150R = 100R = 50R = 10 (b)Fig. 4. The number of ﬁred rules by each data point using (a) HTSK, and (b)LogTSK. The ﬁrst and second columns represent the model before trainingand after 30 epochs of training, respectively.

Smaller variation means that the loss landscape is ﬂatter, i.e.,the gradient would not oscillate when a large learning rate isused, and the gradient is more predictable.Fig. 5(c) and (d) show the smoothness of the loss landscapeand the corresponding test accuracies versus the number ofbatches. The loss landscapes of the algorithms that can miti-gate the saturation are similar, and all are signiﬁcantly ﬂatterthan TSK- . After the model converges on the test accuracy,the variations of the vanilla TSK methods become even larger.This indicates that the gradient of TSK methods is more likelyto oscillate during training, which means optimization is moredifﬁcult when saturation exists.The test accuracies of the six TSK algorithms in Fig. 5(d)also demonstrate that, when the loss landscape is more rugged,the generalization performance is worse. Besides, when h istoo large, for instance, h = 50 , the generalization performancealso decreases. This means ﬁnding the proper h is veryimportant, and HTSK and LogTSK should be better choicesfor training TSK models. F. Parameter Sensitivity of HTSK and LogTSK

In the above experiments, we directly set the scale parameter h = 1 for HTSK and LogTSK. We also compared the gener-alization performance of HTSK and LogTSK using different h . The test accuracies versus different h on ﬁve datasets areshown in Fig. 6. LogTSK is insensitive to h , and HTSK isinsensitive to h when h ≥ . . Smaller h for HTSK leads tolarger | Z r | , which may still cause saturation.In general, h = 1 is a good choice for both HTSK andLogTSK. V. C ONCLUSIONS

In this paper, we demonstrated that the poor performance ofTSK fuzzy systems with Gaussian MFs on high-dimensional (SRFKV L J U D G L H Q W 3DUDPHWHUm 76.1+76./RJ76. (a) %DWFK L J U D G L H Q W 3DUDPHWHUσ 76.1+76./RJ76. (b) %DWFK / R VV O D QG V FD S H 76.176.576.1076.50+76./RJ76. (c) %DWFK 7 H V W $ FF X U DF \ 76.176.576.1076.50+76./RJ76. (d)Fig. 5. (a)-(b) The L gradients of antecedent parameters m and σ versusthe training epochs. (c)-(d) The loss landscape and the corresponding testaccuracy versus the training batch. All experiments in (a)-(d) were conductedon the Books dataset and repeated ten times. datasets is due to the saturation of the softmax function. Higherdimensionality causes more severe saturation, making eachinput ﬁre only very few rules, the gradients of the antecedentparameters become larger and the loss landscape becomemore rugged. We pointed out that the initialization of σ inTSK should be correlated with the input dimensionality toavoid saturation, and proposed HTSK that can handle any-dimensional datasets. We analyzed the performance of twodefuzziﬁcation algorithms (LogTSK and our proposed HTSK)on datasets with a large range of dimensionality. Experimental h 7 H V W $ FF X U DF \ +76. %RRNV(/(& '9'.LWFKHQ ,VROHW (a) h 7 H V W $ FF X U DF \ /RJ76. %RRNV(/(& '9'.LWFKHQ ,VROHW (b)Fig. 6. Test accuracy versus different h of (a) HTSK and (b) LogTSK onﬁve datasets. results validated that HTSK and LogTSK can reduce thesaturation, and both have stable performance.R EFERENCES[1] T. Takagi and M. Sugeno, “Fuzzy identiﬁcation of systems and itsapplications to modeling and control,”

IEEE Trans. on Systems, Man,and Cybernetics , no. 1, pp. 116–132, 1985.[2] J. S. R. Jang, “ANFIS: adaptive-network-based fuzzy inference system,”

IEEE Trans. on Systems, Man, and Cybernetics , vol. 23, no. 3, pp. 665–685, 1993.[3] D. Wu, C.-T. Lin, J. Huang, and Z. Zeng, “On the functional equivalenceof TSK fuzzy systems to neural networks, mixture of experts, CART, andstacking ensemble regression,”

IEEE Trans. on Fuzzy Systems , vol. 28,no. 10, pp. 2570–2580, 2020.[4] P. Xu, Z. Deng, C. Cui, T. Zhang, K.-S. Choi, S. Gu, J. Wang, andS. Wang, “Concise fuzzy system modeling integrating soft subspaceclustering and sparse learning,”

IEEE Trans. on Fuzzy Systems , vol. 27,no. 11, pp. 2176–2189, 2019.[5] Z. Deng, K.-S. Choi, F.-L. Chung, and S. Wang, “Scalable TSKfuzzy modeling for very large datasets using minimal-enclosing-ballapproximation,”

IEEE Trans. on Fuzzy Systems , vol. 19, no. 2, pp. 210–226, 2010.[6] Y. Cui, H. Wang, and D. Wu, “Supervised enhanced soft subspaceclustering (SESSC) for TSK fuzzy classiﬁers,” arXiv:2002.12404 , 2020.[7] Y. Shi, R. Eberhart, and Y. Chen, “Implementation of evolutionary fuzzysystems,”

IEEE Trans. on Fuzzy Systems , vol. 7, no. 2, pp. 109–119,1999.[8] D. Wu and W. W. Tan, “Genetic learning and performance evaluationof interval type-2 fuzzy logic controllers,”

Engineering Applications ofArtiﬁcial Intelligence , vol. 19, no. 8, pp. 829–841, 2006.[9] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importanceof initialization and momentum in deep learning,” in

Proc. Int’l Conf.on Machine Learning , Atlanta, GA, Jun. 2013, pp. 1139–1147.[10] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”in

Proc. Int’l Conf. on Learning Representations , San Diego, CA, May2015.[11] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methodswith dynamic bound of learning rate,” in

Proc. Int’l Conf. on LearningRepresentations , New Orleans, LA, May 2019. [12] D. Wu, Y. Yuan, J. Huang, and Y. Tan, “Optimize TSK fuzzy systemsfor regression problems: Minibatch gradient descent with regularization,DropRule, and AdaBound (MBGD-RDA),”

IEEE Trans. on Fuzzy Sys-tems , vol. 28, no. 5, pp. 1003–1015, 2019.[13] Y. Cui, D. Wu, and J. Huang, “Optimize TSK fuzzy systems forclassiﬁcation problems: Mini-batch gradient descent with uniform reg-ularization and batch normalization,”

IEEE Trans. on Fuzzy Systems ,2020, early access.[14] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-meansclustering algorithm,”

Computers & Geosciences , vol. 10, no. 2-3, pp.191–203, 1984.[15] J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H.-X. Li, “Fuzzy clusteringwith the entropy of attribute weights,”

Neurocomputing , vol. 198, pp.125–134, 2016.[16] Z. Deng, K.-S. Choi, F.-L. Chung, and S. Wang, “Enhanced softsubspace clustering integrating within-cluster and between-cluster in-formation,”

Pattern Recognition , vol. 43, no. 3, pp. 767–781, 2010.[17] M. E. Houle, H.-P. Kriegel, P. Kr¨oger, E. Schubert, and A. Zimek,“Can shared-neighbor distances defeat the curse of dimensionality?” in

Proc. Int’l Conf. on Scientiﬁc and Statistical Database Management ,Heidelberg, Germany, Jul. 2010, pp. 482–500.[18] R. Winkler, F. Klawonn, and R. Kruse, “Fuzzy c-means in high dimen-sional spaces,”

Int’l Journal of Fuzzy System Applications , vol. 1, no. 1,pp. 1–16, 2011.[19] R. J. Urbanowicz, M. Meeker, W. La Cava, R. S. Olson, and J. H. Moore,“Relief-based feature selection: Introduction and review,”

Journal ofBiomedical Informatics , vol. 85, pp. 189–203, 2018.[20] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,”

Chemometrics and Intelligent Laboratory Systems , vol. 2, no. 1-3, pp.37–52, 1987.[21] X. Gu and X. Cheng, “Distilling a deep neural network into a Takagi-Sugeno-Kang fuzzy inference system,” arXiv:2010.04974 , 2020.[22] Y. Deng, Z. Ren, Y. Kong, F. Bao, and Q. Dai, “A hierarchical fusedfuzzy deep neural network for data classiﬁcation,”

IEEE Trans. on FuzzySystems , vol. 25, no. 4, pp. 1006–1012, 2016.[23] G. L. Du, Z. Wang, C. Li, and P. X. Liu, “A TSK-type convolutionalrecurrent fuzzy network for predicting driving fatigue,”

IEEE Trans. onFuzzy Systems , 2020, early access.[24] J. Alcala-Fdez, R. Alcala, and F. Herrera, “A fuzzy association rule-based classiﬁcation model for high-dimensional problems with geneticrule selection and lateral tuning,”

IEEE Trans. on Fuzzy systems , vol. 19,no. 5, pp. 857–872, 2011.[25] J. C´ozar, J. A. G´amez et al. , “Learning compact zero-order tsk fuzzyrule-based systems for high-dimensional problems using an apriori+local search approach,”

Information Sciences , vol. 433, pp. 1–16, 2018.[26] D. Wu and J. M. Mendel, “Recommendations on designing practicalinterval type-2 fuzzy systems,”

Engineering Applications of ArtiﬁcialIntelligence , vol. 95, pp. 182–193, 2019.[27] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectiﬁers:Surpassing human-level performance on imagenet classiﬁcation,” in

Proc. IEEE Int’l Conf. on Computer Vision , Santiago, Chile, Dec. 2015,pp. 1026–1034.[28] B. Chen, W. Deng, and J. Du, “Noisy softmax: Improving the generaliza-tion ability of dcnn via postponing the early softmax saturation,” in

Proc.IEEE Conf. on Computer Vision and Pattern Recognition , Honolulu, HI,Jul. 2017, pp. 5372–5381.[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in

Proc. Advances in Neural Information Processing Systems , vol. 30, LongBeach, CA, Dec. 2017, pp. 5998–6008.[30] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch nor-malization help optimization?” in