A Multiple Classifier Approach for Concatenate-Designed Neural Networks
KA-HOU CHAN*, SIO-KEI IM, AND WEI KE
Abstract.
This article introduces a multiple classifier method to improve the performance of concatenate-designed neural networks, such as ResNet and DenseNet, with the purpose of alleviating the pressure on the final classifier. We give the design of the classifiers, which collect the features produced between the network sets, and present the constituent layers and the activation function of the classifiers used to calculate the classification score of each classifier. We use the L2 normalization method to obtain the classifier scores instead of the Softmax normalization. We also determine the conditions that can enhance convergence. As a result, the proposed classifiers are able to improve the accuracy in the experimental cases significantly, and show that the method not only has better performance than the original models, but also produces faster convergence. Moreover, our classifiers are general and can be applied to all classification-related concatenate-designed network models.

1. Introduction
Image classification is one of the main topics of neural networks, starting from the success of AlexNet [1]. The availability of object classification can lay the foundation for advanced neural systems and is of great significance in the research of perceiving media data, such as face recognition [2–4], medical image analysis [5–7] and autonomous vehicles [8, 9].
Date: January 15, 2021.
Key words and phrases. Multiple classifier, convergence enhancement, concatenate-designed neural network, Softplus, L2 normalization.
*Corresponding author.

As we know, image classification using neural networks performs well in over 90.0% of cases, but the challenge remains of how to overcome the remaining cases with significant variations, such as poor illumination, image blurring and occlusions [10]. Therefore, the current methods are hardly applicable to cases where errors are strictly intolerable; for instance, autonomous vehicles often miss the traffic light when driving fast at midnight.

In order to increase the accuracy, studies on neural network architectures mostly focus on several aspects. The easiest way is to improve the accuracy by stacking numerous layers, but the gain of this method follows a logarithmic curve, so the later impact is very small. Therefore, even if we increase the number of AlexNet's layers, the result is only slightly better than nothing [11, 12]. Another aspect considers images as perceptual data, for which a learned model may describe random error or noise with redundancy, instead of the strict underlying data distribution [13]. Many data augmentation and regularization approaches have been proposed for preprocessing, such as random cropping [1], flipping [14] and random erasing [15]. Data augmentation is closely related to oversampling in data analysis, which can reduce overfitting when training machine learning models; by applying data augmentation, one can achieve a predictable accuracy improvement [16], but it cannot overcome the shortcomings of the training model itself. The most challenging aspect is to develop a new network model by combining layers strategically. In order to obtain better performance, the state-of-the-art deeper network models, such as GoogleNet [17], ResNet [18] and VGGNet [14], are designed with a large network architecture by stacking convolutional layers, and these models have performed well for image classification. With in-depth study, it can be found that they all concatenate sets of similar convolutional layers together, and connect one classifier at the end of the model output; see Figure 1. As they are realized by feedforward neural networks, some features of similar images are always diluted during several convolutions. For example, to distinguish the number '0' and the letter 'O', their main features are so similar that they can only be identified by the entire shape. Considering Convolutional Neural Networks (CNNs), the convolved results retain only the local features and discard the global information when pooling is performed with decreasing resolutions, so the decision becomes more difficult after convolution and pooling. However, no matter how good the fitting of the network design, the final classifier will be under great pressure to make decisions.

The main contributions of this paper are listed below.

• In order to alleviate the pressure on the final classifier, we introduce multiple classifiers in this process, and then make decisions based on their results.
• We present the constituent layers and activation function of the proposed classifiers, with the purpose of calculating the classification scores of each classifier.

• We introduce the L2 normalization method to obtain the classifier scores instead of using the Softmax normalization, and determine the conditions under which the convergence can be improved.

• The proposed method is applicable to the state-of-the-art network models and achieves comparable results.

In particular, our approach is compatible with all existing classification neural networks, and can further be extended to deeper concatenate-designed models. We conducted a comprehensive experiment on the CIFAR dataset [19] to show the accuracy of VGGNet with and without our approach. We show that the accuracy is clearly better, with faster convergence. Similar phenomena are also shown on ResNet, which indicates that the optimization is effective and that our approach is not just suitable for a particular network model.

The rest of this paper is organized as follows. Section 2 goes through the related work. Section 3 presents the design details and justification of the multiple classifier performance. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.

2. Related Work
The multiple classifier approach has attracted a lot of research attention and has been widely used for many kinds of perceived media, such as image classification and speech recognition. Researchers have studied multiple classifiers from different aspects, including the classifier design and the concurrent rules. Our work relates to three of these aspects: the classifier design, the type of classifier outputs, and the architectures.
Figure 1. The brief architecture of VGG-16 and ResNet-18: both series concatenate five sets of stacked layers (of Conv and ResBlock, respectively), with one classifier connected at the end of the model output.

For the traditional design in neural networks, by combining linear layers and activation functions, a classifier can have multiple output units and categorize a sample according to the class whose corresponding output gives the highest value among the multiple outputs [20]. Also, [21] proposed an error-correcting output code method to provide redundancy. Later, [22] proposed the idea of using an
additional single-layer perceptron neural network to enhance the error-correcting capabilities. In particular, a CNN always includes a number of convolutional and pooling layers, optionally accompanied by a fully connected linear design. [23] proved the robustness of their classifier by constructing seven CNNs, taking the average error rate obtained as the best result. Moreover, the Support Vector Machine (SVM), considered one of the strongest and most robust algorithms in machine learning, was created by [24] and made practical by [25]. It has become a well-known approach exploited in many domains [26, 27], such as pattern recognition, classification and image processing, obtaining the best performance in all of them. Later, [28] modified the CNN structure by replacing the fully connected output layer with an SVM classifier. In recent years, [29] involved multiple classifiers in their proposed model, then averaged all the sets of classifier outputs. In addition, in order to better realize the function of the classifier in a CNN, two state-of-the-art classifiers, Random Forest (RF) and Gradient Boosting (GB), have also been applied to deep learning [30]. Meanwhile, for the types of classifier outputs, researchers considered how to calculate the confidence of the classifier for each image category, where each sample was represented by multiple images [31]. Each classifier must be trained to increase the confidence of the corresponding category, so that the category can be determined using the confidence of each classifier [32]. Further, the results of the multiple classifiers can also be used in late-fusion multi-modal [33] or multi-voting [34] classification. Specifically, the above methods have a set of individual classifiers, each of which makes a decision on the input individually; the method then combines their decisions to form a composite result [35].
In order to invoke the classifier in a concatenate-designed model, [36] proposed convolutions as classifiers, instead of linear classifiers at the end of ResNet. Further, [37] combined "long-term dependencies" and ResNet networks in one classifier, showing that the accuracy was improved significantly while maintaining a suitable inference time.

Meanwhile, the deeper convolutional architecture was the most important work demonstrating the power of concatenate-designed neural networks, showing that building a deeper network with tiny convolution kernels is effective in increasing the performance of CNN-based network models. After VGGNet [14], ResNet was first proposed by [18]. It greatly alleviated the optimization difficulty and increased the depth to some hundreds of layers by using skip connections within the convolutional sets. Since then, different kinds of inner structures have been proposed, concentrating on various tasks and consistently achieving better performance in different areas [38, 39]. Further, [40] introduced DenseNet, which passes the input features to the output through a densely connected path, concatenating the input features with the convolved output as the DenseBlock result. Regarding these network designs, they aim to retain more of the original information for the classifier, so they tend to connect the input features directly to the output. However, the width of the connection path increases linearly as the depth rises, causing the number of training parameters to increase severely. This limits the building of deeper and wider networks that might further improve the accuracy.

In this work, inspired by the connection from the original information to the classifier, we adopt a proposed classifier following each layer set. These classifiers make interim decisions instead of the single decision of the last classifier. Based on these decisions, we then propose a novel combination method and add it to state-of-the-art concatenate-designed networks to achieve higher accuracy.

3. Multiple Classifier Strategy
Looking into concatenate-designed networks, there are several sets connected in order, as shown in Figure 1, composed of multiple convolutional layers and various types of blocks. We use $h_t$ to denote the output feature of $\mathrm{Set}_t$ at the $t$-th step, with attributes [batch, channel, width, height], and $h_0$ is the original image. For each step, $\mathrm{Set}_t$ refers to the feature extraction function that takes the previous feature as input and outputs the extracted information,

(3.1) $h_t = \mathrm{Set}_t(h_{t-1})$.

Then, the classification function Classifier transforms the last output feature $h_{-1}$ into a vector $\vec{c}$ of a specific dimension,

(3.2) $\vec{c} = \mathrm{Classifier}(h_{-1})$.

We further transform $\vec{c}$ into a probability vector through the Softmax function. In Equation 3.1 and Equation 3.2, we encapsulate the network rule of various concatenate-designed architectures in a generalized way. This observation shows that the connection path is essentially an extensible higher-order function that extracts information from the previous states. However, the feature size is reduced and the number of channels is increased as a feature passes through the sets. Although this can capture the major part of the classification target, the overall structure of the image is somewhat lost. In addition, some useful information in the later sets may also be discarded during the extraction from the earlier sets. These problems become more obvious as the number of connections increases. Therefore, the concatenation usually connects up to five sets, and the final classifier will be under great pressure to make decisions.

In order to address this issue, let us revisit the network and divide it by the pooling layers. In this view, the architecture can be considered a simple CNN network if $t = 1$, similar to LeNet-5 [41] if $t = 2$, and similar to AlexNet [1] if $t = 3$, etc. In particular, the most advanced CNN networks also satisfy the $t = 5$ case [14, 18, 40]. Thus, with the advancement of hardware performance, CNN models of different eras can be summarized as this series of neural networks with different $t$ values. From the above analysis, we observe that the number of sets can increase indefinitely. In practical applications, we only consider the part before the classifier, i.e., $\exists t \bullet \mathrm{Classifier}_t(h_t) \equiv \vec{c}$. Based on this idea, we propose to employ $t$ classifiers as

(3.3) $\forall t \bullet \sum_t \mathrm{Classifier}_t(h_t) = \sum_t \vec{c}_t \equiv \vec{c}$.

Meanwhile, such a decision-contributing strategy makes it possible to compromise between the global structure and the local patterns of an image: the former carries more of the original information, while the latter contains the extracted information.
All classifiers make decisions independently, based on their recently acquired features. This strategy provides more references for decision making, especially in controversial cases such as '0' and 'O', or '1' and 'l'. Multiple classifiers can alleviate this perplexity by providing high redundancy.
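The strategy of Equations 3.1–3.3 can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the classifier follows the structure later detailed in Figure 3 (a 3×3 channel-matching convolution, pooling over the whole feature, BatchNorm, a linear layer with Softplus, and the sqrt(e^x)/L2 score), and all hyperparameters here are assumptions.

```python
import torch
from torch import nn

class SetClassifier(nn.Module):
    """Sketch of Classifier_t: a 3x3 convolution lifts h_t to the channel
    width of the last feature h_{-1}, pooling and BatchNorm collapse the
    spatial size, and an FC layer with Softplus yields positive scores,
    which are then normalized as L2(sqrt(e^x))."""
    def __init__(self, in_ch, last_ch, num_classes):
        super().__init__()
        self.downsample = nn.Sequential(
            nn.Conv2d(in_ch, last_ch, kernel_size=3, padding=1),
            nn.AdaptiveMaxPool2d(1),      # pool over the whole h_t.size
            nn.BatchNorm2d(last_ch),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(last_ch, num_classes),
            nn.Softplus(),                # scores stay positive
        )

    def forward(self, h_t):
        s = self.fc(self.downsample(h_t))
        f = torch.exp(s / 2)              # f(x) = sqrt(e^x) = e^{x/2}
        return f / f.pow(2).sum(dim=1, keepdim=True).sqrt()  # L2 score

class MultiClassifierNet(nn.Module):
    """Equation 3.3: every Set_t feeds its own Classifier_t, and the
    score vectors c_t are summed into the final vector c."""
    def __init__(self, sets, classifiers):
        super().__init__()
        self.sets = nn.ModuleList(sets)
        self.classifiers = nn.ModuleList(classifiers)

    def forward(self, h):
        c = 0
        for set_t, clf_t in zip(self.sets, self.classifiers):
            h = set_t(h)        # h_t = Set_t(h_{t-1})   (Eq. 3.1)
            c = c + clf_t(h)    # accumulate c_t         (Eq. 3.3)
        return c
```

Wrapping, for example, the five convolutional sets of a VGG-style backbone with such classifiers reproduces the overall structure shown later in Figure 4.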
Figure 2. The general procedure of a classifier: it often consists of one Pooling layer and multiple Linear layers; these Linear layers are often collectively referred to as Fully-Connected (FC) layers.

3.1. Classifier Design.
Before we present our classifier design, we first analyze the classifiers used in current neural networks. It can be found that the procedure of their stacked layers is the same (see Figure 2). The classifier always receives the features produced by the last set, and must consider the following dynamic attributes:

width × height: the size of the image varies with different applications, but the Fully-Connected (FC) layer only accepts a fixed size. Classifiers have to pool to a fixed size in order to facilitate feature extraction from all sizes of $h_{-1}$.

channel: different from the image size, this attribute of the original image $h_0$ always equals 3, corresponding to the RGB channels, and then increases to 64, 128, 256 and 512 (up to 1024 in DenseNet) through the various predefined sets.

Therefore, the number of input features of an FC layer is channel × width × height, and a vector $\vec{c}$ is produced as the result. According to this design, when applied to our classifiers satisfying Equation 3.3, there are multiple classifiers collecting every output feature from $h_1$ to $h_{-1}$. The number of classifiers is the same as the number of concatenated sets, and we use $\mathrm{Classifier}_t$ to denote the $t$-th classifier we employ, corresponding to the order of $\mathrm{Set}_t$. The first problem is to fix the channel differences, because the combination used in Equation 3.3 is not viable when the size of the features changes. Thus, our classifier makes use of one convolutional layer to adjust the number of channels: regardless of the output feature $h_t$, we increase its channels to match the last feature $h_{-1}$, using a small convolutional kernel size (3 × 3). At the end of the feature extraction, we append an activation function to ensure that the score is positive. We do not recommend using the
ReLU, because it becomes inactive for essentially all inputs smaller than zero. In this state, no gradients flow backward through the neuron, so the neuron falls into a permanently inactive state, becoming a dead neuron that is not conducive to scoring any category. In view of this, our design uses the Softplus, which can be viewed as a smooth version of ReLU: it is monotonic and differentiable, with a positive first-order derivative on $\mathbb{R}$. By stacking the above layers, we complete the feature extraction within the proposed classifiers, and each classifier generates a score vector for each category.

3.2. Score Normalization.
In addition to feature extraction, we must also normalize the score vector in order to collect the result of each classifier. We design the classifier to provide a confidence rate (score) rather than making a classification decision; the result is regarded as the scores for each category. Usually, this part can be achieved directly through the Softmax function, but its convergence is slow in practice, so we introduce a method using the L2 normalization to enhance the convergence of this part.

We first review the general form of the L1 and L2 normalization. For any $f(x_i)$,
$$\mathrm{L1}(f(x_i)) = \frac{f(x_i)}{\sum_k f(x_k)}, \qquad \mathrm{L2}(f(x_i)) = \frac{f(x_i)}{\sqrt{\sum_k f(x_k)^2}},$$
where the sums run over the $N$ categories, and, for $j \neq i$, the respective partial derivatives are
$$\frac{\partial\,\mathrm{L1}(f(x_i))}{\partial x_j} = \frac{1}{\sum_k f(x_k)}\left(-\frac{f(x_i)}{\sum_k f(x_k)}\right)\frac{\partial f(x_j)}{\partial x_j}, \qquad \frac{\partial\,\mathrm{L2}(f(x_i))}{\partial x_j} = \frac{1}{\sqrt{\sum_k f(x_k)^2}}\left(-\frac{f(x_i)f(x_j)}{\sum_k f(x_k)^2}\right)\frac{\partial f(x_j)}{\partial x_j}.$$
It is worth noting that $\mathrm{L1}(f(x))$ becomes the Softmax normalization if $f(x) = e^x$. The normalization method we propose here is $\mathrm{L2}(f(x))$ with $f(x) = \sqrt{e^x}$. We use $S(x)$ and $L(x)$ to denote Softmax and the proposed normalization method, respectively. According to the above discussion, we formulate the final forms as
$$S(x_i) = \mathrm{L1}(e^{x_i}) = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad L(x_i) = \mathrm{L2}\left(\sqrt{e^{x_i}}\right) = \sqrt{\frac{e^{x_i}}{\sum_k e^{x_k}}}.$$
The corresponding partial derivatives become

(3.4) $\displaystyle \frac{\partial S(x_i)}{\partial x_j} = \frac{1}{\sum_k e^{x_k}}\left(-\frac{e^{x_i}}{\sum_k e^{x_k}}\right)e^{x_j},$

(3.5) $\displaystyle \frac{\partial L(x_i)}{\partial x_j} = \frac{1}{\sqrt{\sum_k e^{x_k}}}\left(-\frac{\sqrt{e^{x_i}}\sqrt{e^{x_j}}}{\sum_k e^{x_k}}\right)\sqrt{e^{x_j}}.$
Since our goal is to enhance the convergence, we must find a condition under which the partial derivatives satisfy $\frac{\partial L(x_i)}{\partial x_j} \geq \frac{\partial S(x_i)}{\partial x_j}$, i.e.,
$$\frac{1}{\sqrt{\sum_k e^{x_k}}}\left(-\frac{\sqrt{e^{x_i}}\sqrt{e^{x_j}}}{\sum_k e^{x_k}}\right)\sqrt{e^{x_j}} \;\geq\; \frac{1}{\sum_k e^{x_k}}\left(-\frac{e^{x_i}}{\sum_k e^{x_k}}\right)e^{x_j}.$$
We simplify this and obtain

(3.6) $\displaystyle \sum_k e^{x_k} \leq e^{x_i}.$

It can be found that Equation 3.6 leads to the necessary condition $N < e^{x_i}$, which can be rewritten as $x_i > \ln N$, when we assume all $x_i > 0$, because there is a Softplus function at the end of the feature extraction. In summary, by using our normalization method under the necessary condition $x_1, x_2, \cdots, x_k > \ln N$, the enhancement of convergence becomes more and more obvious as the accuracy increases. Based on this theory, the proposed classifiers must respect the lower bound $\ln N$; the Softplus at the end of the feature extraction keeps all scores positive.

Figure 3.
The complete internal design of the proposed classifier: $h_t$ is the $t$-th output feature produced by $\mathrm{Set}_t$, and $h_{-1}$ is the last output feature, the same as in Figure 2; $h.ch$ and $h.size$ denote the number of channels and the size of feature $h$, respectively, and $N$ denotes the number of categories. The Downsample part consists of $\mathrm{Conv}(h_t.ch, h_{-1}.ch)$, $\mathrm{MaxPool}(h_t.size)$ and BatchNorm; the FC part of Flatten, $\mathrm{Linear}(h_{-1}.ch, N)$ and Softplus; and the Score part applies $\sqrt{e^x}$ followed by the L2 normalization.

As shown in Figure 3, we detail the structure of all the stacked layers in the classifier. There are three parts in the procedure of each classifier. First, Downsample performs the feature extraction. Next, FC projects the extracted information into each category as a reference. Finally, Score normalizes these references into the output vector $\vec{c}_t$ for the final decision making with the other classifiers.

3.3. Network Architectures.
Following the last stage of the classification neural network, we must finally provide a probability vector to calculate the cross-entropy loss through the
Softmax function. Considering the variable length of future concatenate-designed networks, the number of vectors $\vec{c}_t$ produced by our classifiers may also differ between networks. Therefore, to maintain scalability, we use the sum of the vectors in Equation 3.3 as the final output of the multiple classifier strategy. The overall design of the proposed method can inherit the backbone architecture of any concatenate-designed neural network, making it easy to implement and apply to other tasks. This can be achieved simply by adding a classifier following each $\mathrm{Set}_t$ in the existing classification networks. On a well-optimized deep learning platform, each classifier requires only a fixed amount of computational cost and memory consumption, making the deployment very efficient.

Figure 4. The complete structure of the proposed classification method. $\mathrm{Set}_t$ comes from the original network design (as in Figure 1). The proposed $\mathrm{Classifier}_t$ collects the output feature $h_t$ and produces a vector $\vec{c}_t$ as a reference for the final decision, instead of using a single classifier at the end.

As listed in Table 1 and Table 2, we measure the model complexity by counting the total number of training parameters within each neural network, and measure the computational cost of each deeper model using the floating-point operations per second (FLOPS). As found in the results, the required parameters of the neural network when using the proposed multiple classifiers are about 50.0% more than
that of the original network, and the FLOPS reaches up to 200.0% in a training epoch.

Table 1. The architecture and complexity of our re-implemented concatenate-designed neural networks. We present their required training parameters and computational cost using the FLOPS with an input size of one 3 × 32 × 32 image. [Table body: the per-stage layer configurations (Set 1 to Set 6 and Classifier) and the training-parameter and FLOPS counts for VGG16, ResNet18, DLA34 and DenseNet121; the numeric values are not recoverable from the extracted source.]

4. Experimental Results and Discussion
In order to evaluate the proposed method on a variety of state-of-the-art classification models, we applied our approach to the CIFAR-10 and CIFAR-100 datasets [19], which contain color images labeled with 10 and 100 classes, respectively, with 50k images for training and 10k for testing. All our experiments are conducted on an NVIDIA GeForce RTX 2080 Ti with 11.0 GB of video memory. In order to compare the proposed strategy with the original networks, we re-implement the VGG16 [14],
ResNet18 [18], DLA34 [43] and DenseNet121 [40], with and without our multiple classifiers, in PyTorch [44], and use the advanced gradient-related Adam optimizer [45] with a learning rate of 0.001. All experiments use the same dataset in each test, with a batch size of 100 per iteration, and with the same configuration and the same number of neural nodes, as shown in Table 1 and Table 2. For a fair comparison, we also train the original and the proposed models with the same training procedure. We use random cropping and horizontal flipping with color normalization, followed by the random erasing data augmentation [15]. We target 300 epochs of training to compare the accuracy and convergence of the various models. In addition, a scheduler adjusts the learning rate, reducing it when the loss becomes stagnant.

Table 2. The architecture and complexity of our re-implemented concatenate-designed neural networks with the proposed multiple classifier strategy. We present their required training parameters and computational cost using the FLOPS with an input size of one 3 × 32 × 32 image. [Table body: the per-stage layer configurations including the proposed classifiers; the numeric values are not recoverable from the extracted source.]
Figure 5. The accuracy on the training data over 300 epochs for the original and proposed VGG16 on CIFAR-10 and CIFAR-100.

For more credibility, each model has been tested eight times. However, we only show the results with the best accuracy in Figures 5 to 8, with the purpose of visualizing the convergence during the training period. As expected from subsection 3.2, we see that the convergence contributed by the proposed method is significantly improved: the models converge faster with increasing accuracy. All the experiments can be well trained within 150 epochs. Note that, in addition to the (best) results shown in these figures, the convergence of all the experiments using the proposed method is better than that of the original whenever the accuracy is greater than 0.5. On the other hand, Table 1 and Table 2 show that the proposed classifier requires more training parameters and the FLOPS also increases, but the increase in FLOPS can almost cover the time increase caused by the additional parameters, so the training takes only a little more than the original time.
Table 3. The average accuracy of CIFAR-10 test data with the error range, comparing the original and proposed VGG16, ResNet18, DLA34 and DenseNet121. [The numeric values are not recoverable from the extracted source.]
Figure 6. The accuracy on the training data over 300 epochs for the original and proposed ResNet18 on CIFAR-10 and CIFAR-100.
Table 4. The average accuracy of CIFAR-100 test data with the error range, comparing the original and proposed VGG16, ResNet18, DLA34 and DenseNet121. [The numeric values are not recoverable from the extracted source.]
Figure 7. The accuracy on the training data over 300 epochs for the original and proposed DLA34 on CIFAR-10 and CIFAR-100.

Please find the complete source code and experimental results in the supplemental files.

5. Conclusion
We present a multiple classifier method that can improve the performance from a new perspective; as the number of modules and their connectivity grow, the method becomes more effective. By adjusting the concatenated architectures used for classification tasks, we identify the need for multiple classifiers to participate, and make the final decision according to their results. We further discover a condition that enhances convergence, and embed it into the proposed classifier. Compared with the original models, our method is more accurate and makes more efficient use of parameters and computation. Experiments show that the dominant architectures can all be improved by using the multiple classifiers, with an obvious gap in accuracy improvement.
Figure 8. The accuracy on the training data over 300 epochs for the original and proposed DenseNet121 on CIFAR-10 and CIFAR-100.

References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1106–1114.
[2] S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back, Face recognition: a convolutional neural-network approach, IEEE Trans. Neural Networks 8 (1) (1997) 98–113.
[3] Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: Face recognition with very deep neural networks, CoRR abs/1502.00873 (2015).
[4] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, R. Chellappa, An all-in-one convolutional neural network for face analysis, in: FG, IEEE Computer Society, 2017, pp. 17–24.
[5] D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (1) (2017) 221–248.
[6] J. Jiang, P. R. Trundle, J. Ren, Medical image analysis with artificial neural networks, Comput. Medical Imaging Graph. 34 (8) (2010) 617–631.
[7] W. Nawaz, S. Ahmed, A. Tahir, H. A. Khan, Classification of breast cancer histology images using ALEXNET, in: ICIAR, Vol. 10882 of Lecture Notes in Computer Science, Springer, 2018, pp. 869–876.
[8] Y. Tian, K. Pei, S. Jana, B. Ray, Deeptest: automated testing of deep-neural-network-driven autonomous cars, in: ICSE, ACM, 2018, pp. 303–314.
[9] D. K. Kim, T. Chen, Deep neural network for real-time autonomous indoor navigation, CoRR abs/1511.04668 (2015).
[10] S. Gong, M. Cristani, C. C. Loy, T. M. Hospedales, The re-identification challenge, in: Person Re-Identification, Advances in Computer Vision and Pattern Recognition, Springer, 2014, pp. 1–20.
[11] L. Xiao, Q. Yan, S. Deng, Scene classification with improved alexnet model, in: ISKE, IEEE, 2017, pp. 1–6.
[12] L. Wang, C. Lee, Z. Tu, S. Lazebnik, Training deeper convolutional networks with deep supervision, CoRR abs/1505.02496 (2015).
[13] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, in: ICLR, OpenReview.net, 2017.
[14] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
[15] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: AAAI, AAAI Press, 2020, pp. 13001–13008.
[16] C. Shorten, T. M.
Khoshgoftaar, A survey on image data augmentation for deep learning, J.Big Data 6 (2019) 60.[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,A. Rabinovich, Going deeper with convolutions, in: CVPR, IEEE Computer Society, 2015,pp. 1–9.[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR,IEEE Computer Society, 2016, pp. 770–778.[19] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).[20] J. S. Bridle, Probabilistic interpretation of feedforward classification network outputs, withrelationships to statistical pattern recognition, in: NATO Neurocomputing, Vol. 68 of NATOASI Series, Springer, 1989, pp. 227–236.[21] T. G. Dietterich, G. Bakiri, Solving multiclass learning problems via error-correcting outputcodes, J. Artif. Intell. Res. 2 (1995) 263–286.[22] J.-D. Zhou, X.-D. Wang, H.-J. Zhou, Y.-H. Cui, S. Jing, Coding design for error correctingoutput codes based on perceptron, Optical Engineering 51 (5) (2012) 1–7.[23] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber, Convolutional neural networkcommittees for handwritten character classification, in: ICDAR, IEEE Computer Society,2011, pp. 1135–1139.[24] V. Vapnik, An overview of statistical learning theory, IEEE Trans. Neural Networks 10 (5)(1999) 988–999.[25] T. Joachims, Making large-scale svm learning practical, Technical report, Dortmund (1998).[26] H. Byun, S. Lee, A survey on pattern recognition applications of support vector machines,Int. J. Pattern Recognit. Artif. Intell. 17 (3) (2003) 459–486.[27] L. Naranjo, C. J. Perez, J. Mart´ın, Y. Campos-Roca, A two-stage variable selection andclassification approach for parkinson’s disease detection by using voice recording replications,Comput. Methods Programs Biomed. 142 (2017) 147–156.[28] M. Elleuch, R. Maalej, M. 
Kherallah, A new design based-svm of the CNN classifier archi-tecture with dropout for offline arabic handwritten recognition, in: ICCS, Vol. 80 of ProcediaComputer Science, Elsevier, 2016, pp. 1712–1723.[29] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal,D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. W. Wilson, CNN architecturesfor large-scale audio classification, in: ICASSP, IEEE, 2017, pp. 131–135.[30] J. E. Ball, D. T. Anderson, C. S. Chan, A comprehensive survey of deep learning in remotesensing: Theories, tools and challenges for the community, CoRR abs/1709.00308 (2017).[31] A. Mandelbaum, D. Weinshall, Distance-based confidence score for neural network classifiers,CoRR abs/1709.09844 (2017).[32] N. Ueda, Optimal linear combination of neural networks for improving classification perfor-mance, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2) (2000) 207–215.[33] K. Lai, D. Liu, S. Chang, M. Chen, Learning sample specific weights for late fusion, IEEETrans. Image Process. 24 (9) (2015) 2772–2783.[34] H. Cao, S. Bernard, L. Heutte, R. Sabourin, Dynamic voting in multi-view learning for ra-diomics applications, in: S+SSPR, Vol. 11004 of Lecture Notes in Computer Science, Springer,2018, pp. 32–41.[35] J. D´ıez-Pastor, J. J. Rodr´ıguez, C. I. Garc´ıa-Osorio, L. I. Kuncheva, Diversity techniquesimprove the performance of the best imbalance learning ensembles, Inf. Sci. 325 (2015) 98–117.[36] Z. Wu, C. Shen, A. van den Hengel, Wider or deeper: Revisiting the resnet model for visualrecognition, Pattern Recognit. 90 (2019) 119–133.[37] M. Hammann, M. Kraus, S. Shafaei, A. C. Knoll, Identity recognition in intelligent cars withbehavioral data and lstm-resnet classifier, CoRR abs/2003.00770 (2020).[38] Y. Chen, X. Jin, B. Kang, J. Feng, S. Yan, Sharing residual units through collective tensorfactorization in deep neural networks, CoRR abs/1703.02180 (2017).[39] S. Xie, R. B. Girshick, P. Doll´ar, Z. Tu, K. 
He, Aggregated residual transformations for deepneural networks, in: CVPR, IEEE Computer Society, 2017, pp. 5987–5995.
[40] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: CVPR, IEEE Computer Society, 2017, pp. 2261–2269.
[41] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[42] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: ICML, Vol. 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 448–456.
[43] F. Yu, D. Wang, E. Shelhamer, T. Darrell, Deep layer aggregation, in: CVPR, IEEE Computer Society, 2018, pp. 2403–2412.
[44] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch (2017).
[45] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR (Poster), 2015.
School of Applied Sciences, Macao Polytechnic Institute, Macao, China
Email address: [email protected]

Macao Polytechnic Institute, Macao, China

School of Applied Sciences, Macao Polytechnic Institute, Macao, China
Email address: