A Multiple Classifier Approach for Concatenate-Designed Neural Networks
KA-HOU CHAN*, SIO-KEI IM, AND WEI KE
Abstract.
This article introduces a multiple classifier method to improve the performance of concatenate-designed neural networks, such as ResNet and DenseNet, with the purpose of alleviating the pressure on the final classifier. We give the design of the classifiers, which collect the features produced between the network sets, and present the constituent layers and the activation function of the classifiers used to calculate the classification score of each classifier. We use the L2 normalization method to obtain the classifier scores instead of the Softmax normalization. We also determine the conditions that can enhance convergence. As a result, the proposed classifiers are able to improve the accuracy in the experimental cases significantly, and show that the method not only has better performance than the original models, but also produces faster convergence. Moreover, our classifiers are general and can be applied to all classification-related concatenate-designed network models.

1. Introduction
Image classification is one of the main topics of neural networks, starting from the success of AlexNet [1]. The availability of object classification can lay the foundation for advanced neural systems and is of great significance in the research of perceiving media data, such as face recognition [2–4], medical image analysis [5–7] and autonomous vehicles [8, 9].
Date: January 15, 2021.
Key words and phrases. Multiple classifier, convergence enhancement, concatenate-designed neural network, Softplus, L2 normalization.
*Corresponding author.

As we know, image classification using neural networks performs well in over 90.0% of cases, but the challenge remains of how to overcome the remaining cases with significant variations, such as poor illumination, image blurring and occlusions [10]. Therefore, the current methods are hardly applicable to cases where errors are strictly intolerable; for instance, autonomous vehicles often miss the traffic light when driving fast at midnight.

In order to increase the accuracy, studies on neural network architectures mostly focus on several aspects. The easiest way is to improve the accuracy by stacking numerous layers, but the gain of this method follows a logarithmic curve, so the later impact is very small. Therefore, even if we increase the number of AlexNet's layers, the result is only slightly better than nothing [11, 12]. Another aspect considers images as perceptual data, for which a learned model may describe random error or noise with redundancy, instead of the strict underlying data distribution [13]. Many data augmentation and regularization approaches have been proposed for preprocessing, such as random cropping [1], flipping [14] and random erasing [15]. Data augmentation is closely related to oversampling in data analysis, which can reduce overfitting when training machine learning models; by applying data augmentation, one can achieve a predictable accuracy improvement [16], but it cannot overcome the shortcomings of the training model itself. The most challenging aspect is to develop a new network model by combining layers strategically. In order to obtain better performance, the state-of-the-art deeper network models, such as GoogleNet [17], ResNet [18] and VGGNet [14], are designed with a large network architecture by stacking convolutional layers, and these models have performed well for image classification. With in-depth study, it can be found that they all concatenate sets of similar convolutional layers together, and connect one classifier at the end of the model output; see Figure 1. As they are realized by feedforward neural networks, some features of similar images are always diluted during several convolutions. For example, to distinguish the number '0' and the letter 'O', their main features are so similar that they can only be identified by the entire shape. Considering Convolutional Neural Networks (CNNs), the convolved results retain only the local features and discard the global information when pooling is performed with decreasing resolutions, so the decision becomes more difficult after convolution and pooling. However, no matter how good the fitting of the network design, the final classifier will be under great pressure to make decisions.

The main contributions of this paper are listed below.

• In order to alleviate the pressure on the final classifier, we introduce multiple classifiers in this process, and then make decisions based on their results.
• We present the constituent layers and activation function of the proposed classifiers, with the purpose of calculating the classification scores of each classifier.

• We introduce the L2 normalization method to obtain the classifier scores instead of using the Softmax normalization, and determine the conditions under which the convergence can be improved.

• The proposed method is applicable to the state-of-the-art network models and achieves comparable results.

In particular, our approach is compatible with all existing classification neural networks, and can further be extended to deeper concatenate-designed models. We conducted a comprehensive experiment on the CIFAR dataset [19] to show the accuracy of VGGNet with and without our approach. We show that the accuracy is clearly better, with faster convergence. Similar phenomena are also shown on ResNet, which indicates that the optimization is effective and that our approach is not just suitable for a particular network model.

The rest of this paper is organized as follows. Section 2 goes through the related work. Section 3 presents the design details and justification of the multiple classifier performance. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.

2. Related Work
The multiple classifier approach has attracted a lot of research attention and has been widely used for many kinds of perceived media, such as image classification and speech recognition. Researchers have studied multiple classifiers from different aspects, including the classifier design and the concurrent rules. Our work relates to three of these aspects: the classifier design, the type of classifier outputs, and the architectures.
Figure 1. The brief architecture of VGG-16 and ResNet-18: both series concatenate five sets of stacked layers (of Conv and ResBlock, respectively), with one classifier connected at the end of the model output.

For the traditional design in neural networks, by combining linear layers and activation functions, a classifier can have multiple output units and categorize a sample according to the class whose corresponding output gives the highest value among the multiple outputs [20]. Also, [21] proposed an error-correcting output code method to provide redundancy. Later, [22] proposed the idea of using an
additional single-layer perceptron neural network to enhance the error-correcting capabilities. In particular, a CNN always includes a number of convolutional and pooling layers, optionally accompanied by a fully connected linear design. [23] proved the robustness of their classifier by constructing seven CNNs, taking the average error rate obtained as the best result. Moreover, the Support Vector Machine (SVM), considered one of the strongest and most robust algorithms in machine learning, was created by [24] and made practical by [25]. It has become a well-known approach exploited in many domains [26, 27], such as pattern recognition, classification and image processing, obtaining the best performance in all of them. Later, [28] modified the CNN structure by replacing the fully connected output layer with an SVM classifier. In recent years, [29] involved multiple classifiers in their proposed model, then averaged all the sets of classifier outputs. In addition, in order to better realize the function of the classifier in a CNN, two state-of-the-art classifiers, Random Forest (RF) and Gradient Boosting (GB), have also been applied to deep learning [30]. Meanwhile, for the types of classifier outputs, researchers considered how to calculate the confidence of the classifier for each image category, where each sample was represented by multiple images [31]. Each classifier must be trained to increase the confidence of the corresponding category, so that the category can be determined using the confidence of each classifier [32]. Further, the results of the multiple classifiers can also be used in late-fusion multi-modal [33] or multi-voting [34] classification. Specifically, the above methods have a set of individual classifiers, each of which makes a decision on the input individually; the method then combines their decisions to form a composite result [35].
In order to invoke the classifier in a concatenate-designed model, [36] proposed convolutions as classifiers, instead of linear classifiers at the end of ResNet. Further, [37] combined "long-term dependencies" and ResNet networks in one classifier, showing that the accuracy was improved significantly while maintaining a suitable inference time.

Meanwhile, the deeper convolutional architecture was the most important work demonstrating the power of concatenate-designed neural networks, showing that building a deeper network with tiny convolution kernels is effective in increasing the performance of CNN-based network models. After VGGNet [14], ResNet was first proposed by [18]. It greatly alleviated the optimization difficulty and increased the depth to some hundreds of layers by using skip connections within the convolutional sets. Since then, different kinds of inner structures have been proposed, concentrating on various tasks and consistently achieving better performance in different areas [38, 39]. Further, [40] introduced DenseNet, which passes the input features to the output through a densely connected path, concatenating the input features with the convolved output as the DenseBlock result. Regarding these network designs, they aim to retain more of the original information for the classifier, so they tend to connect the input features directly to the output. However, the width of the connection path increases linearly as the depth rises, causing the number of training parameters to increase severely. This limits the building of deeper and wider networks that might further improve the accuracy.

In this work, inspired by the connection from the original information to the classifier, we adopt a proposed classifier following each layer set. These classifiers make interim decisions instead of the single decision of the last classifier. Based on these decisions, we then propose a novel combination method and add it to state-of-the-art concatenate-designed networks to achieve higher accuracy.

3. Multiple Classifier Strategy
Looking into concatenate-designed networks, there are several sets connected in order, as shown in Figure 1, composed of multiple convolutional layers and various types of blocks. We use $h_t$ to denote the output feature of $\mathrm{Set}_t$ at the $t$-th step, with attributes [batch, channel, width, height], and $h_0$ is the original image. For each step, $\mathrm{Set}_t$ refers to the feature extraction function that takes the previous feature as input and outputs the extracted information,

(3.1) $h_t = \mathrm{Set}_t(h_{t-1})$.

Then, the classification function Classifier transforms the last output feature $h_{-1}$ into a vector $\vec{c}$ of a specific dimension,

(3.2) $\vec{c} = \mathrm{Classifier}(h_{-1})$.

We further transform $\vec{c}$ into a probability vector through the Softmax function. In Equation 3.1 and Equation 3.2, we encapsulate the network rule of various concatenate-designed architectures in a generalized way. This observation shows that the connection path is essentially an extensible higher-order function that extracts information from the previous states. However, the feature size is reduced and the number of channels is increased as a feature passes through the sets. Although this can capture the major part of the classification target, the overall structure of the image is somewhat lost. In addition, some useful information in the later sets may also be discarded during the extraction from the earlier sets. These problems become more obvious as the number of connections increases. Therefore, the concatenation usually connects up to five sets, and the final classifier will be under great pressure to make decisions.

In order to address this issue, let us revisit the network and divide it by the pooling layers. In this view, the architecture can be considered a simple CNN network if $t = 1$, similar to LeNet-5 [41] if $t = 2$, and similar to AlexNet [1] if $t = 3$, etc. In particular, the most advanced CNN networks also satisfy the $t = 5$ case [14, 18, 40]. Thus, with the advancement of hardware performance, CNN models of different eras can be summarized as this series of neural networks with different $t$ values. From the above analysis, we observe that the number of sets can increase indefinitely. In practical applications, we only consider the part before the classifier, i.e., $\exists t \bullet \mathrm{Classifier}_t(h_t) \equiv \vec{c}$. Based on this idea, we propose to employ $t$ classifiers as

(3.3) $\forall t \bullet \sum_t \mathrm{Classifier}_t(h_t) = \sum_t \vec{c}_t \equiv \vec{c}$.

Meanwhile, such a decision-contributing strategy makes it possible to compromise between the global structure and the local patterns of an image: the former carries more of the original information, while the latter contains the extracted information.
All classifiers make decisions independently, based on their recently acquired features. This strategy provides more references for decision making, especially in controversial cases such as '0' and 'O', or '1' and 'l'. Multiple classifiers can alleviate this perplexity by providing high redundancy.
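The strategy of Equations 3.1–3.3 can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the classifier follows the structure later detailed in Figure 3 (a 3×3 channel-matching convolution, pooling over the whole feature, BatchNorm, a linear layer with Softplus, and the sqrt(e^x)/L2 score), and all hyperparameters here are assumptions.

```python
import torch
from torch import nn

class SetClassifier(nn.Module):
    """Sketch of Classifier_t: a 3x3 convolution lifts h_t to the channel
    width of the last feature h_{-1}, pooling and BatchNorm collapse the
    spatial size, and an FC layer with Softplus yields positive scores,
    which are then normalized as L2(sqrt(e^x))."""
    def __init__(self, in_ch, last_ch, num_classes):
        super().__init__()
        self.downsample = nn.Sequential(
            nn.Conv2d(in_ch, last_ch, kernel_size=3, padding=1),
            nn.AdaptiveMaxPool2d(1),      # pool over the whole h_t.size
            nn.BatchNorm2d(last_ch),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(last_ch, num_classes),
            nn.Softplus(),                # scores stay positive
        )

    def forward(self, h_t):
        s = self.fc(self.downsample(h_t))
        f = torch.exp(s / 2)              # f(x) = sqrt(e^x) = e^{x/2}
        return f / f.pow(2).sum(dim=1, keepdim=True).sqrt()  # L2 score

class MultiClassifierNet(nn.Module):
    """Equation 3.3: every Set_t feeds its own Classifier_t, and the
    score vectors c_t are summed into the final vector c."""
    def __init__(self, sets, classifiers):
        super().__init__()
        self.sets = nn.ModuleList(sets)
        self.classifiers = nn.ModuleList(classifiers)

    def forward(self, h):
        c = 0
        for set_t, clf_t in zip(self.sets, self.classifiers):
            h = set_t(h)        # h_t = Set_t(h_{t-1})   (Eq. 3.1)
            c = c + clf_t(h)    # accumulate c_t         (Eq. 3.3)
        return c
```

Wrapping, for example, the five convolutional sets of a VGG-style backbone with such classifiers reproduces the overall structure shown later in Figure 4.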
Figure 2. The general procedure of a classifier: it often consists of one Pooling layer and multiple Linear layers; these Linear layers are often collectively referred to as Fully-Connected (FC) layers.

3.1. Classifier Design.
Before we present our classifier design, we first analyze the classifiers used in current neural networks. It can be found that the procedure of their stacked layers is the same (see Figure 2). The classifier always receives the features produced by the last set, and must consider the following dynamic attributes:

width × height: the size of the image varies with different applications, but the Fully-Connected (FC) layer only accepts a fixed size. Classifiers have to pool to a fixed size in order to facilitate feature extraction from all sizes of $h_{-1}$.

channel: different from the image size, this attribute of the original image $h_0$ always equals 3, corresponding to the RGB channels, and then increases to 64, 128, 256 and 512 (up to 1024 in DenseNet) through the various predefined sets.

Therefore, the number of input features of an FC layer is channel × width × height, and a vector $\vec{c}$ is produced as the result. According to this design, when applied to our classifiers satisfying Equation 3.3, there are multiple classifiers collecting every output feature from $h_1$ to $h_{-1}$. The number of classifiers is the same as the number of concatenated sets, and we use $\mathrm{Classifier}_t$ to denote the $t$-th classifier we employ, corresponding to the order of $\mathrm{Set}_t$. The first problem is to fix the channel differences, because the combination used in Equation 3.3 is not viable when the size of the features changes. Thus, our classifier makes use of one convolutional layer to adjust the number of channels: regardless of the output feature $h_t$, we increase its channels to match the last feature $h_{-1}$, using a small convolutional kernel size (3 × 3). At the end of the feature extraction, we append an activation function to ensure that the score is positive. We do not recommend using the
ReLU, because it becomes inactive for essentially all inputs smaller than zero. In this state, no gradients flow backward through the neuron, so the neuron falls into a permanently inactive state, becoming a dead neuron that is not conducive to scoring any category. In view of this, our design uses the Softplus, which can be viewed as a smooth version of ReLU: it is monotonic and differentiable, with a positive first-order derivative on $\mathbb{R}$. By stacking the above layers, we complete the feature extraction within the proposed classifiers, and each classifier generates a score vector for each category.

3.2. Score Normalization.
In addition to feature extraction, we must also normalize the score vector in order to collect the result of each classifier. We design the classifier to provide a confidence rate (score) rather than making a classification decision; the result is regarded as the scores for each category. Usually, this part can be achieved directly through the Softmax function, but its convergence is slow in practice, so we introduce a method using the L2 normalization to enhance the convergence of this part.

We first review the general form of the L1 and L2 normalization. For any $f(x_i)$,
$$\mathrm{L1}(f(x_i)) = \frac{f(x_i)}{\sum_k f(x_k)}, \qquad \mathrm{L2}(f(x_i)) = \frac{f(x_i)}{\sqrt{\sum_k f(x_k)^2}},$$
where the sums run over the $N$ categories, and, for $j \neq i$, the respective partial derivatives are
$$\frac{\partial\,\mathrm{L1}(f(x_i))}{\partial x_j} = \frac{1}{\sum_k f(x_k)}\left(-\frac{f(x_i)}{\sum_k f(x_k)}\right)\frac{\partial f(x_j)}{\partial x_j}, \qquad \frac{\partial\,\mathrm{L2}(f(x_i))}{\partial x_j} = \frac{1}{\sqrt{\sum_k f(x_k)^2}}\left(-\frac{f(x_i)f(x_j)}{\sum_k f(x_k)^2}\right)\frac{\partial f(x_j)}{\partial x_j}.$$
It is worth noting that $\mathrm{L1}(f(x))$ becomes the Softmax normalization if $f(x) = e^x$. The normalization method we propose here is $\mathrm{L2}(f(x))$ with $f(x) = \sqrt{e^x}$. We use $S(x)$ and $L(x)$ to denote Softmax and the proposed normalization method, respectively. According to the above discussion, we formulate the final forms as
$$S(x_i) = \mathrm{L1}(e^{x_i}) = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad L(x_i) = \mathrm{L2}\left(\sqrt{e^{x_i}}\right) = \sqrt{\frac{e^{x_i}}{\sum_k e^{x_k}}}.$$
The corresponding partial derivatives become

(3.4) $\displaystyle \frac{\partial S(x_i)}{\partial x_j} = \frac{1}{\sum_k e^{x_k}}\left(-\frac{e^{x_i}}{\sum_k e^{x_k}}\right)e^{x_j},$

(3.5) $\displaystyle \frac{\partial L(x_i)}{\partial x_j} = \frac{1}{\sqrt{\sum_k e^{x_k}}}\left(-\frac{\sqrt{e^{x_i}}\sqrt{e^{x_j}}}{\sum_k e^{x_k}}\right)\sqrt{e^{x_j}}.$
Since our goal is to enhance the convergence, we must find a condition under which the partial derivatives satisfy $\frac{\partial L(x_i)}{\partial x_j} \geq \frac{\partial S(x_i)}{\partial x_j}$, i.e.,
$$\frac{1}{\sqrt{\sum_k e^{x_k}}}\left(-\frac{\sqrt{e^{x_i}}\sqrt{e^{x_j}}}{\sum_k e^{x_k}}\right)\sqrt{e^{x_j}} \;\geq\; \frac{1}{\sum_k e^{x_k}}\left(-\frac{e^{x_i}}{\sum_k e^{x_k}}\right)e^{x_j}.$$
We simplify this and obtain

(3.6) $\displaystyle \sum_k e^{x_k} \leq e^{x_i}.$

It can be found that Equation 3.6 leads to the necessary condition $N < e^{x_i}$, which can be rewritten as $x_i > \ln N$, when we assume all $x_i > 0$, because there is a Softplus function at the end of the feature extraction. In summary, by using our normalization method under the necessary condition $x_1, x_2, \cdots, x_k > \ln N$, the enhancement of convergence becomes more and more obvious as the accuracy increases. Based on this theory, the proposed classifiers must respect the lower bound $\ln N$; the Softplus at the end of the feature extraction keeps all scores positive.

Figure 3.
The complete internal design of the proposed classifier: $h_t$ is the $t$-th output feature produced by $\mathrm{Set}_t$, and $h_{-1}$ is the last output feature, the same as in Figure 2; $h.ch$ and $h.size$ denote the number of channels and the size of feature $h$, respectively, and $N$ denotes the number of categories. The Downsample part consists of $\mathrm{Conv}(h_t.ch, h_{-1}.ch)$, $\mathrm{MaxPool}(h_t.size)$ and BatchNorm; the FC part of Flatten, $\mathrm{Linear}(h_{-1}.ch, N)$ and Softplus; and the Score part applies $\sqrt{e^x}$ followed by the L2 normalization.

As shown in Figure 3, we detail the structure of all the stacked layers in the classifier. There are three parts in the procedure of each classifier. First, Downsample performs the feature extraction. Next, FC projects the extracted information into each category as a reference. Finally, Score normalizes these references into the output vector $\vec{c}_t$ for the final decision making with the other classifiers.

3.3. Network Architectures.
Following the last stage of the classification neural network, we must finally provide a probability vector to calculate the cross-entropy loss through the
Softmax function. Considering the variable length of future concatenate-designed networks, the number of vectors $\vec{c}_t$ produced by our classifiers may also differ between networks. Therefore, to maintain scalability, we use the sum of the vectors in Equation 3.3 as the final output of the multiple classifier strategy. The overall design of the proposed method can inherit the backbone architecture of any concatenate-designed neural network, making it easy to implement and apply to other tasks. This can be achieved simply by adding a classifier following each $\mathrm{Set}_t$ in the existing classification networks. On a well-optimized deep learning platform, each classifier requires only a fixed amount of computational cost and memory consumption, making the deployment very efficient.

Figure 4. The complete structure of the proposed classification method. $\mathrm{Set}_t$ comes from the original network design (as in Figure 1). The proposed $\mathrm{Classifier}_t$ collects the output feature $h_t$ and produces a vector $\vec{c}_t$ as a reference for the final decision, instead of using a single classifier at the end.

As listed in Table 1 and Table 2, we measure the model complexity by counting the total number of training parameters within each neural network, and measure the computational cost of each deeper model using the floating-point operations per second (FLOPS). As found in the results, the required parameters of the neural network when using the proposed multiple classifiers are about 50.0% more than
that of the original network, and the FLOPS reaches up to 200.0% in a training epoch.

Table 1. The architecture and complexity of our re-implemented concatenate-designed neural networks. We present their required training parameters and computational cost using the FLOPS with an input size of one 3 × 32 × 32 image. [Table body: the per-stage layer configurations (Set 1 to Set 6 and Classifier) and the training-parameter and FLOPS counts for VGG16, ResNet18, DLA34 and DenseNet121; the numeric values are not recoverable from the extracted source.]

4. Experimental Results and Discussion
In order to evaluate the proposed method on a variety of state-of-the-art classification models, we applied our approach to the CIFAR-10 and CIFAR-100 datasets [19], which contain color images labeled with 10 and 100 classes, respectively, with 50k images for training and 10k for testing. All our experiments are conducted on an NVIDIA GeForce RTX 2080 Ti with 11.0 GB of video memory. In order to compare the proposed strategy with the original networks, we re-implement the VGG16 [14],
ResNet18 [18], DLA34 [43] and DenseNet121 [40], with and without our multiple classifiers, in PyTorch [44], and use the advanced gradient-related Adam optimizer [45] with a learning rate of 0.001. All experiments use the same dataset in each test, with a batch size of 100 per iteration, and with the same configuration and the same number of neural nodes, as shown in Table 1 and Table 2. For a fair comparison, we also train the original and the proposed models with the same training procedure. We use random cropping and horizontal flipping with color normalization, followed by the random erasing data augmentation [15]. We target 300 epochs of training to compare the accuracy and convergence of the various models. In addition, a scheduler adjusts the learning rate, reducing it when the loss becomes stagnant.

Table 2. The architecture and complexity of our re-implemented concatenate-designed neural networks with the proposed multiple classifier strategy. We present their required training parameters and computational cost using the FLOPS with an input size of one 3 × 32 × 32 image. [Table body: the per-stage layer configurations including the proposed classifiers; the numeric values are not recoverable from the extracted source.]
Figure 5. The accuracy on the training data over 300 epochs for the original and proposed VGG16 on CIFAR-10 and CIFAR-100.

For more credibility, each model has been tested eight times. However, we only show the results with the best accuracy in Figures 5 to 8, with the purpose of visualizing the convergence during the training period. As expected from subsection 3.2, we see that the convergence contributed by the proposed method is significantly improved: the models converge faster with increasing accuracy. All the experiments can be well trained within 150 epochs. Note that, in addition to the (best) results shown in these figures, the convergence of all the experiments using the proposed method is better than that of the original whenever the accuracy is greater than 0.5. On the other hand, Table 1 and Table 2 show that the proposed classifier requires more training parameters and the FLOPS also increases, but the increase in FLOPS can almost cover the time increase caused by the additional parameters, so the training takes only a little more than the original time.
Table 3. The average accuracy of CIFAR-10 test data with the error range, comparing the original and proposed VGG16, ResNet18, DLA34 and DenseNet121. [The numeric values are not recoverable from the extracted source.]
Figure 6. The accuracy on the training data over 300 epochs for the original and proposed ResNet18 on CIFAR-10 and CIFAR-100.
Table 4. The average accuracy of CIFAR-100 test data with the error range, comparing the original and proposed VGG16, ResNet18, DLA34 and DenseNet121. [The numeric values are not recoverable from the extracted source.]
Figure 7. The accuracy on the training data over 300 epochs for the original and proposed DLA34 on CIFAR-10 and CIFAR-100.

Please find the complete source code and experimental results in the supplemental files.

5. Conclusion
We present a multiple classifier method that can improve the performance from a new perspective; as the number of modules and their connectivity grow, the method becomes more effective. By adjusting the concatenated architectures used for classification tasks, we identify the need for multiple classifiers to participate, and make the final decision according to their results. We further discover a condition that enhances convergence, and embed it into the proposed classifier. Compared with the original models, our method is more accurate and makes more efficient use of parameters and computation. Experiments show that the dominant architectures can all be improved by using the multiple classifiers, with an obvious gap in accuracy improvement.
Figure 8. The accuracy on the training data over 300 epochs for the original and proposed DenseNet121 on CIFAR-10 and CIFAR-100.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1106–1114.
[2] S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back, Face recognition: a convolutional neural-network approach, IEEE Trans. Neural Networks 8 (1) (1997) 98–113.
[3] Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: Face recognition with very deep neural networks, CoRR abs/1502.00873 (2015).
[4] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, R. Chellappa, An all-in-one convolutional neural network for face analysis, in: FG, IEEE Computer Society, 2017, pp. 17–24.
[5] D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (1) (2017) 221–248.
[6] J. Jiang, P. R. Trundle, J. Ren, Medical image analysis with artificial neural networks, Comput. Medical Imaging Graph. 34 (8) (2010) 617–631.
[7] W. Nawaz, S. Ahmed, A. Tahir, H. A. Khan, Classification of breast cancer histology images using ALEXNET, in: ICIAR, Vol. 10882 of Lecture Notes in Computer Science, Springer, 2018, pp. 869–876.
[8] Y. Tian, K. Pei, S. Jana, B. Ray, Deeptest: automated testing of deep-neural-network-driven autonomous cars, in: ICSE, ACM, 2018, pp. 303–314.
[9] D. K. Kim, T. Chen, Deep neural network for real-time autonomous indoor navigation, CoRR abs/1511.04668 (2015).
[10] S. Gong, M. Cristani, C. C. Loy, T. M. Hospedales, The re-identification challenge, in: Person Re-Identification, Advances in Computer Vision and Pattern Recognition, Springer, 2014, pp. 1–20.
[11] L. Xiao, Q. Yan, S. Deng, Scene classification with improved alexnet model, in: ISKE, IEEE, 2017, pp. 1–6.
[12] L. Wang, C. Lee, Z. Tu, S. Lazebnik, Training deeper convolutional networks with deep supervision, CoRR abs/1505.02496 (2015).
[13] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, in: ICLR, OpenReview.net, 2017.
[14] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
[15] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: AAAI, AAAI Press, 2020, pp. 13001–13008.
[16] C. Shorten, T. M.
Khoshgoftaar, A survey on image data augmentation for deep learning, J.Big Data 6 (2019) 60.[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,A. Rabinovich, Going deeper with convolutions, in: CVPR, IEEE Computer Society, 2015,pp. 1–9.[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR,IEEE Computer Society, 2016, pp. 770–778.[19] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).[20] J. S. Bridle, Probabilistic interpretation of feedforward classification network outputs, withrelationships to statistical pattern recognition, in: NATO Neurocomputing, Vol. 68 of NATOASI Series, Springer, 1989, pp. 227–236.[21] T. G. Dietterich, G. Bakiri, Solving multiclass learning problems via error-correcting outputcodes, J. Artif. Intell. Res. 2 (1995) 263–286.[22] J.-D. Zhou, X.-D. Wang, H.-J. Zhou, Y.-H. Cui, S. Jing, Coding design for error correctingoutput codes based on perceptron, Optical Engineering 51 (5) (2012) 1–7.[23] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber, Convolutional neural networkcommittees for handwritten character classification, in: ICDAR, IEEE Computer Society,2011, pp. 1135–1139.[24] V. Vapnik, An overview of statistical learning theory, IEEE Trans. Neural Networks 10 (5)(1999) 988–999.[25] T. Joachims, Making large-scale svm learning practical, Technical report, Dortmund (1998).[26] H. Byun, S. Lee, A survey on pattern recognition applications of support vector machines,Int. J. Pattern Recognit. Artif. Intell. 17 (3) (2003) 459–486.[27] L. Naranjo, C. J. Perez, J. Mart´ın, Y. Campos-Roca, A two-stage variable selection andclassification approach for parkinson’s disease detection by using voice recording replications,Comput. Methods Programs Biomed. 142 (2017) 147–156.[28] M. Elleuch, R. Maalej, M. 
Kherallah, A new design based-svm of the CNN classifier archi-tecture with dropout for offline arabic handwritten recognition, in: ICCS, Vol. 80 of ProcediaComputer Science, Elsevier, 2016, pp. 1712–1723.[29] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal,D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. W. Wilson, CNN architecturesfor large-scale audio classification, in: ICASSP, IEEE, 2017, pp. 131–135.[30] J. E. Ball, D. T. Anderson, C. S. Chan, A comprehensive survey of deep learning in remotesensing: Theories, tools and challenges for the community, CoRR abs/1709.00308 (2017).[31] A. Mandelbaum, D. Weinshall, Distance-based confidence score for neural network classifiers,CoRR abs/1709.09844 (2017).[32] N. Ueda, Optimal linear combination of neural networks for improving classification perfor-mance, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2) (2000) 207–215.[33] K. Lai, D. Liu, S. Chang, M. Chen, Learning sample specific weights for late fusion, IEEETrans. Image Process. 24 (9) (2015) 2772–2783.[34] H. Cao, S. Bernard, L. Heutte, R. Sabourin, Dynamic voting in multi-view learning for ra-diomics applications, in: S+SSPR, Vol. 11004 of Lecture Notes in Computer Science, Springer,2018, pp. 32–41.[35] J. D´ıez-Pastor, J. J. Rodr´ıguez, C. I. Garc´ıa-Osorio, L. I. Kuncheva, Diversity techniquesimprove the performance of the best imbalance learning ensembles, Inf. Sci. 325 (2015) 98–117.[36] Z. Wu, C. Shen, A. van den Hengel, Wider or deeper: Revisiting the resnet model for visualrecognition, Pattern Recognit. 90 (2019) 119–133.[37] M. Hammann, M. Kraus, S. Shafaei, A. C. Knoll, Identity recognition in intelligent cars withbehavioral data and lstm-resnet classifier, CoRR abs/2003.00770 (2020).[38] Y. Chen, X. Jin, B. Kang, J. Feng, S. Yan, Sharing residual units through collective tensorfactorization in deep neural networks, CoRR abs/1703.02180 (2017).[39] S. Xie, R. B. Girshick, P. Doll´ar, Z. Tu, K. 
He, Aggregated residual transformations for deepneural networks, in: CVPR, IEEE Computer Society, 2017, pp. 5987–5995.
[40] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: CVPR, IEEE Computer Society, 2017, pp. 2261–2269.
[41] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[42] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: ICML, Vol. 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 448–456.
[43] F. Yu, D. Wang, E. Shelhamer, T. Darrell, Deep layer aggregation, in: CVPR, IEEE Computer Society, 2018, pp. 2403–2412.
[44] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch (2017).
[45] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR (Poster), 2015.
School of Applied Sciences, Macao Polytechnic Institute, Macao, China
Email address: [email protected]

Macao Polytechnic Institute, Macao, China

School of Applied Sciences, Macao Polytechnic Institute, Macao, China
Email address: