PK-GCN: Prior Knowledge Assisted Image Classification using Graph Convolution Networks
Xueli Xiao, Chunyan Ji, Thosini Bamunu Mudiyanselage, Yi Pan
Xueli Xiao
Computer Science Department, Georgia State University, Atlanta, GA 30303, [email protected]
Chunyan Ji
Computer Science Department, Georgia State University, Atlanta, GA 30303, [email protected]
Thosini Bamunu Mudiyanselage
Computer Science Department, Georgia State University, Atlanta, GA 30303, [email protected]
Yi Pan
Computer Science Department, Georgia State University, Atlanta, GA 30303, [email protected]
September 28, 2020

Abstract
Deep learning has gained great success in various classification tasks. Typically, deep learning models learn underlying features directly from data, and no underlying relationship between classes is included. Similarity between classes can influence the performance of classification. In this article, we propose a method that incorporates class similarity knowledge into convolutional neural network models using a graph convolution layer. We evaluate our method on two benchmark image datasets, MNIST and CIFAR-10, and analyze the results across different data and model sizes. Experimental results show that our model can improve classification accuracy, especially when the amount of available data is small.
Keywords
Deep Learning · Prior Knowledge · Convolutional Neural Networks · Graph Convolutional Networks · Multi-Class Classification · Class Similarity
Deep learning has been successful in various domains: computer vision [1], speech recognition [2], natural language processing [3], bioinformatics [4], and so on. Deep learning models have shown great performance on classification tasks. For example, convolutional neural networks (CNNs) are frequently used for image classification. CNN models take images as inputs and output the probabilities of the images belonging to certain classes. In multi-class image classification, image features are extracted and learned through convolutional kernels and eventually mapped to one class among the others. Typically during this process, CNNs learn only from the information contained within the images, and no inter-class relationships are incorporated in the learning. The performance of classification can be greatly impacted by the similarity between classes. For example, a cat has a larger chance of being misidentified as a dog than as an airplane. By learning the similarity among classes and incorporating it into our models, we could potentially improve classification performance.

This work attempts to incorporate class similarity information into deep learning models to improve multi-class classification performance. Specifically, we experiment on image classification using CNNs. Our method is not task specific and has the potential to be applied to classification on other types of data besides images. There are some prior works that take class similarity into consideration. Lee et al. proposed DropMax [5], a method which adaptively drops classes at random to improve the accuracy of deep learning models. During the class dropping, classes that are more similar to the input data have larger chances of being kept. Chen et al. [6] used word semantic similarity to calculate the underlying structures between labels, and used a graph convolutional network (GCN) augmented classifier to do classification. Their method needs external information to define a label graph and initiate node representations. Using external information can have issues. For example, in many cases, labels' word semantic similarity cannot reflect the real similarity of the data, and to get the embeddings for classes, we need to rely on other machine learning models. Inspired by Chen et al.'s method [6], we propose a method that does not rely on any external information and can be applied to a much wider range of classification problems. A misclassification graph derived from the validation dataset can be used for class similarity, and class embeddings can be extracted from model weights. The goal of our work is not to beat the state of the art on the benchmark image classification datasets. Instead, we use the datasets to evaluate the effect of our method on various data and model sizes, and analyze under which scenarios our method works best. The contributions of our work are as follows.

• Our method incorporates class similarity knowledge into deep learning models to improve classification accuracy.

• We use a novel way of defining class similarities using misclassification graphs on the validation dataset. In our definition, similarities have directions.

• We propose a novel two-stage training model. The first training stage obtains class similarity knowledge, and the second training stage combines the original model's output with the convoluted class similarity graph outputs, improving classification performance.
• Our method does not require extra external information. Class similarity knowledge is extracted from the dataset itself. Class and data embeddings are both obtained from trained network weights.

• We analyze the effect of adding class similarity knowledge with different model and dataset sizes.

The rest of our work is organized as follows: Section II covers related work. Section III introduces our method: how class similarity knowledge is extracted and incorporated into the deep learning classification model. Section IV describes the experiment settings and results. Section V presents conclusions and possible future directions.
There are various ways of improving classification accuracy: tuning model hyperparameters [7][8][9], finding more data to train the model, data augmentation [10], adding more features, transfer learning [11], model ensembles [12], and so on.

Model hyperparameters play an important role in model performance. The size of the model decides how well the model can fit the data. Learning rates need to be tuned carefully to help the model converge to a better optimum [13]. The convolution window decides how much local information is examined at a time. And there are many more hyperparameters that are essential to a model's performance. Deciding on a good set of hyperparameters is difficult work, and there are many research efforts in this area [7][8][9][14][15][16][17].

Deep learning models also rely on the abundance of data. Data augmentation, which manufactures new data from existing data, can greatly help with model performance. Popular image augmentation methods include flipping, translating, and rotating images. There are also novel image augmentation methods such as overlaying two images [18] and masking part of an image [19].

Model ensembles take advantage of multiple diverse models and combine their predictions on the task. It is a very powerful technique; in Kaggle competitions, it is common to see the top results utilizing ensembles [12]. The popular regularization technique dropout [20] also behaves like training an ensemble of submodels.
Adding information or features could also improve model performance. Sometimes the information does not come from the data itself, but from other sources. There are various ways of incorporating prior knowledge into deep learning. Diligenti et al. [21] used first-order logic rules and translated them into constraints which are incorporated during the backpropagation process. Towell et al. [22] proposed a hybrid training method: first translate logic rules into a neural network, then train the neural network on classified examples. Xu et al. [23] derived a semantic loss function that bridges deep learning outputs and logic constraints. Hu et al. [24] proposed a method that iteratively distills the information in logic rules into the weights of neural networks. Ding et al. [25] integrated prior knowledge into indoor object recognition problems: color knowledge for indoor objects and object appearance frequencies are generated in the form of vectors and used to modify deep learning model outputs. Jiang et al. [26] incorporated semantic correlations between objects in a scene for object detection. Stewart et al. [27] performed supervised learning to detect objects using domain knowledge instead of data labels, which can be applied to problems where labels are scarce.

One kind of knowledge that can be incorporated is class similarity. In many works, label relations are incorporated into deep learning models. Word taxonomy can be used to improve image object recognition [28]. Associated image captions can improve entry-level labels of visual objects. Structured inferences can be made by incorporating label relations. GCN-augmented neural network classifiers can be used to incorporate underlying label structures.
There are various ways to obtain similarity between classes. Label relations can be used to define class similarities [29], where the semantic similarity of labels can be computed. The problem is that in many situations, label relations may not be able to fully capture the similarities between the actual data, and sometimes label relations cannot represent meaningful class similarities at all.

Class similarity knowledge does not necessarily need to come from external sources; it may be extracted from the data itself. Arino et al. [30] proposed to use the misclassification ratios of trained deep neural networks to obtain class similarities. Their proposed method uses symmetrical similarity between classes.
CNNs are successful in capturing the inner patterns of Euclidean data. However, lots of data in real-life scenarios exist in the form of graphs. For example, social networks are graph based: the nodes are people and the edges are the connections between them. Chemical molecules, atoms held together by chemical bonds, can naturally be modeled as graphs, and analyzing their graphical structure can determine their chemical properties. In traffic networks, points of intersection are linked together by roads, and we can predict the traffic at these intersections at future times. Images can be thought of as a special kind of graph, where adjacent pixels are connected together forming a pixel grid. When images are fed through a CNN model, the nearby pixels are convoluted and local spatial information is retained. However, CNNs cannot learn from graph data with more complex relations. Analogous to convolution on images, graph convolution can be performed on graph data, where each node learns a weighted average of its neighbors' information. A graph convolutional network (GCN) [31] performs the following graph convolution operation:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \quad (1)$$

where $\tilde{A}$ is the adjacency matrix with self-loops, $\tilde{D}$ is the node degree matrix, $W^{(l)}$ is a layer-specific trainable weight matrix, and $\sigma$ is an activation function. $H^{(l)}$ is the output from the $l$-th layer and $H^{(0)}$ is the input. During graph convolution, each node aggregates information from its neighbors.

In deep learning multi-class classification tasks, data are mapped to one of the predefined classes. The model extracts features from the data and no inter-class relationship is considered. Prior works have shown that incorporating class similarity knowledge can improve classification performance [5][30]. We incorporate class similarity knowledge into deep learning models using graph convolution to improve classification accuracy.

The class similarity knowledge is extracted directly from the data through training, and no additional external information is required. Information learned directly from the data should represent similarity more accurately than external sources. We train the model on the training dataset and obtain the misclassification graph on the validation dataset. The misclassification graph contains information about how often one class is misclassified as another class. If data in one class is frequently misclassified as another class, we consider the former class to be similar to the latter class.

The vector representations of classes and data are extracted from the learned model weights and hidden layer outputs, respectively. Together with the class similarity graph, the class and data representations are sent to a graph convolution layer. The graph convolution adjusts the results according to class similarity knowledge, and the class scores are finally sent to a softmax activation function to get the final classification results.
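For concreteness, the propagation rule in equation (1) can be written in a few lines. The following is a minimal PyTorch sketch of a single graph convolution layer; it is our own illustration rather than the authors' code, and `adj` stands for a weighted adjacency matrix over the graph nodes.

```python
import torch
import torch.nn as nn

class GraphConvolution(nn.Module):
    """One GCN layer implementing equation (1):
    H^(l+1) = sigma(D~^(-1/2) (A + I) D~^(-1/2) H^(l) W^(l))."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, h, adj):
        # adj: (num_nodes, num_nodes) weighted adjacency matrix
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)                   # D~^(-1/2) as a vector
        a_norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ h @ self.weight)                 # sigma = ReLU here
```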
We use misclassification graphs to represent class similarities. If data in one class is often misclassified as another class, we consider the two classes to have high similarity. Our misclassification graph is directed, which means similarities can be directional: Class A can be very similar to Class B, but not the other way around. This directional similarity can be observed from the misclassification information. In experiments, we can observe one class being frequently misclassified as another class, but not the reverse.

The misclassification graph is built based on a trained model's performance on the validation dataset. Figure 1 is an example of a misclassification graph. A CNN model is trained on a downsampled MNIST dataset. After the model is trained, we evaluate its performance on the validation dataset. Mistakes made on the validation dataset are recorded and plotted as a graph. Each node represents a class in the dataset, and edges denote how frequently data from one class are misclassified as another class. The thicker the edge, the more misclassifications between the two classes. As shown in Figure 1, there are 10 classes in the MNIST dataset: 0–9. Edges between some classes are thicker, for example from Class 8 to Class 1. This means many images that are actually 8's are mistaken as 1's. Note that edges have directions: while 8's are easily mistaken as 1's, barely any 1's are misclassified as 8's.

Figure 1: A misclassification graph example on the MNIST dataset.
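As an illustration of how such a graph can be assembled, one can accumulate a directed misclassification count matrix from the trained model's validation predictions. The function below and its normalization by per-class validation counts are our own assumptions for the sketch, not code from the paper.

```python
import numpy as np

def misclassification_graph(y_true, y_pred, num_classes):
    """Directed, weighted adjacency: entry [i, j] reflects how often
    validation samples of class i are misclassified as class j."""
    counts = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        if t != p:
            counts[t, p] += 1
    # Normalize each row by the number of validation samples of that class
    # (one plausible choice; the paper does not pin down a normalization).
    class_totals = np.bincount(y_true, minlength=num_classes).astype(float)
    class_totals[class_totals == 0] = 1.0   # avoid division by zero
    return counts / class_totals[:, None]

# Example: adjacency[8, 1] is large if many 8's are predicted as 1's,
# while adjacency[1, 8] may stay near zero -- the similarity is directional.
```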
Figure 2 shows the overall architecture of our model. There are two stages of training. In Stage 1, we train an original CNN model without the graph convolution layer. In the illustration we use a CNN model with two convolutional layers and a fully connected layer as an example. The model is trained for enough epochs that it has learned the training data well. Then a misclassification graph is obtained by feeding the validation data through the model. Data and class embeddings are also extracted from the CNN model to produce vector representations for graph nodes. After we obtain the misclassification graph and node representations, we can enter Stage 2 of training. In this stage, a graph convolution layer is added after the fully connected layer. The graph convolution contains the misclassification information. This layer takes in the latent data representation and class embedding information, performs convolution among the classes, and outputs new data and class embeddings with aggregated neighborhood information. The new data and class embeddings are further used to calculate class scores and are finally sent to the softmax activation function to produce the final results.

The graph convolution layer takes in data and class embeddings and outputs new vectors that contain aggregated neighborhood information. Figure 3 shows a five-class graph convolution example. The nodes in the graph represent individual classes; there are five nodes in this example corresponding to five classes. The edges between nodes represent how often one class is misclassified as another class, and edges have directions. Nodes are represented using vectors of numbers; in our case, the vectors are concatenations of the data embedding and the class embeddings.

Both the data embedding and the class embeddings can be obtained from the original CNN model. Data embeddings are obtained by taking the outputs from the layer right before the softmax classifier, as shown in Figure 4. Class embeddings are obtained from the weights connecting the classifier and the layer before it; Figure 5 shows how the class embeddings are obtained. Notice that the data and class embeddings have the same dimensions, and in the original CNN model, the inner products of data and class embeddings produce the class scores. The data embedding and a class embedding together form a graph node's vector representation.

After the node vector representations pass through the graph convolution layer, the graph-convoluted outputs are further transformed to get class scores before being sent to the softmax activation function. Vector representations for data and classes are extracted from the original CNN model. The data embedding for data $d$ is $\vec{d} = \{d_1, \ldots, d_k, \ldots, d_n\}$. The class embedding for class $i$ is $\vec{c}_i = \{c_{i1}, \ldots, c_{ik}, \ldots, c_{in}\}$. In the original CNN model, the dot product of $\vec{d}$ and $\vec{c}_i$ produces the class $i$ score of $d$.
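The following sketch illustrates this extraction with a hypothetical PyTorch base model. The architecture, names, and the omission of the classifier's bias term are our assumptions for illustration; they are not the authors' exact model.

```python
import torch
import torch.nn as nn

# Hypothetical base CNN whose last layer maps an n-dim hidden vector to 10 class scores.
class BaseCNN(nn.Module):
    def __init__(self, hidden_dim=128, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 14 * 14, hidden_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

model = BaseCNN()
images = torch.randn(8, 1, 28, 28)             # a dummy MNIST-sized batch

data_emb = model.features(images)              # (batch, n): outputs of the layer before the classifier
class_emb = model.classifier.weight.detach()   # (num_classes, n): one n-dim embedding per class

# Node i of the graph is the data embedding concatenated with class i's embedding.
nodes = torch.cat([data_emb.unsqueeze(1).expand(-1, class_emb.size(0), -1),
                   class_emb.unsqueeze(0).expand(data_emb.size(0), -1, -1)], dim=-1)
# nodes: (batch, num_classes, 2n)
```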
Figure 2: Overall model architecture. In Stage 1 training, the misclassification graph is obtained from the validation data. The graph represents class similarity information and is fed into the next stage. In Stage 2 training, a graph convolution layer is added to incorporate class similarity knowledge into the training.
Figure 3: The graph convolutional network takes in the data embedding concatenated with class embeddings, performs neighborhood aggregation, and outputs new node representations.

The vector representation for node $i$ is the concatenation of $\vec{d}$ and $\vec{c}_i$: $\vec{x}_i = \{d_1, \ldots, d_n, c_{i1}, \ldots, c_{in}\}$. After graph convolution, the output for node $i$ is $\vec{h}_i = \{h_1, \ldots, h_n, h_{n+1}, \ldots, h_{2n}\}$. We need to transform the outputs of the graph convolution into class scores before feeding them to the softmax activation function.

We use two ways of transforming the graph-convoluted outputs into class scores, which gives us two variants of the graph convolution assisted model: PK-GCN-1 and PK-GCN-2. Figure 6 illustrates the difference between these two variants. In PK-GCN-1, the outputs of the graph convolution layer are directly used to produce class scores. In PK-GCN-2, a fully connected layer is added after the graph convolution layer. The fully connected layer merges the inputs and outputs of the graph convolution layer: its input is the input to the graph convolution layer concatenated with the output of the graph convolution layer.
Figure 4: Obtain the data embedding from the model.
Figure 5: Obtain class embeddings for all classes from the model.

In PK-GCN-1, we produce the class score $s_i$ for class $i$ according to the following formula:

$$s_i = \sum_{k=1}^{n} d_k h_{i(n+k)} + \sum_{k=1}^{n} c_{ik} h_{ik} \quad (2)$$

For PK-GCN-2, we add a fully connected layer after the graph convolution layer. This layer merges the original data and class embeddings with the graph-convoluted output. Let the dimension of the output of the fully connected layer be $2l$. We use the following formula to produce the class score $s_i$ for class $i$:

$$s_i = \sum_{k=1}^{l} q_k q_{k+l} \quad (3)$$

where $\vec{q}$ is the output of the fully connected layer for node $i$.
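As a concrete reading of equations (2) and (3), the class scores could be computed from the graph-convoluted node outputs as follows. The tensor shapes, and the split of $\vec{q}$ into two halves of length $l$, are our assumptions for this sketch.

```python
import torch

def pk_gcn1_scores(data_emb, class_emb, h):
    """Equation (2): s_i = sum_k d_k * h_{i,(n+k)} + sum_k c_{ik} * h_{ik}.

    data_emb:  (batch, n)                  -- data embedding d
    class_emb: (num_classes, n)            -- class embeddings c_i, stacked
    h:         (batch, num_classes, 2n)    -- graph-convolution output per node
    """
    n = data_emb.size(1)
    h_first, h_second = h[..., :n], h[..., n:]        # two halves of each node's output
    # d is dotted with the second half, c_i with the first half, as in equation (2).
    scores = (data_emb.unsqueeze(1) * h_second).sum(-1) \
           + (class_emb.unsqueeze(0) * h_first).sum(-1)
    return scores                                      # (batch, num_classes), fed to softmax

def pk_gcn2_scores(q):
    """Equation (3): s_i = sum_k q_k * q_{k+l}, where q is the fully connected
    layer's output for node i, assumed to consist of two halves of length l."""
    l = q.size(-1) // 2
    return (q[..., :l] * q[..., l:]).sum(-1)
```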
Figure 6: Two ways of incorporating the graph convolution layer.

Our method can be summarized in the following steps (a schematic sketch of this procedure appears below):

1. Train a base CNN model until convergence. This is Stage 1 training.
2. Feed the validation dataset through the CNN model to get the misclassification graph. The graph is weighted and bidirectional.
3. Extract the vector representation of each of the $m$ classes, $\vec{c}_1, \vec{c}_2, \ldots, \vec{c}_m$, from the trained model weights (Figure 5).
4. Obtain the vector representation $\vec{d}$ of data $d$ from the last hidden layer outputs (Figure 4).
5. Add a graph convolution layer to the base CNN model. Node $i$ in the graph is represented by $\vec{x}_i$, which is the data representation (from step 4) concatenated with the class $i$ representation (from step 3): $\vec{x}_i = \{\vec{d}, \vec{c}_i\}$.
6. (PK-GCN-2 only) Add a fully connected layer after the graph convolution layer.
7. (PK-GCN-1 only) Produce class scores using equation 2.
8. (PK-GCN-2 only) Produce class scores using equation 3.
9. Continue training the model with the new layers until convergence. This is Stage 2 training.

We perform our experiments on the MNIST and CIFAR-10 datasets. We experiment with different CNN model sizes and different data sizes to evaluate how adding class similarity knowledge helps in different circumstances.
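A schematic driver for the steps above might look like the following. Here `train`, `predict`, and `build_pk_gcn` are placeholder helpers, and `misclassification_graph` matches the earlier sketch; none of these are functions from the paper's code.

```python
import torch

def two_stage_training(base_model, train_loader, val_loader, num_classes,
                       stage1_epochs, stage2_epochs):
    """Schematic PK-GCN training driver following steps 1-9 (helpers are placeholders)."""
    # Step 1 (Stage 1): train the base CNN until convergence.
    train(base_model, train_loader, epochs=stage1_epochs)

    # Step 2: directed misclassification graph from the validation set.
    y_true, y_pred = predict(base_model, val_loader)
    adj = torch.tensor(misclassification_graph(y_true, y_pred, num_classes),
                       dtype=torch.float32)

    # Step 3: class embeddings from the weights of the final classifier layer.
    class_emb = base_model.classifier.weight.detach().clone()

    # Steps 4-8: wrap the base model so that, for each input, the penultimate-layer
    # output (data embedding) is concatenated with each class embedding to form the
    # graph nodes, passed through the graph convolution (plus, for PK-GCN-2, a fully
    # connected layer), and turned into class scores.
    pk_gcn = build_pk_gcn(base_model, class_emb, adj)

    # Step 9 (Stage 2): continue training with the new layers.
    train(pk_gcn, train_loader, epochs=stage2_epochs)
    return pk_gcn
```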
The MNIST dataset contains 10 classes of handwritten digits: 0–9. The dataset provides a train-test split, with 60000 training images and 10000 testing images. The CIFAR-10 [32] dataset contains 60000 color images of size 32×32 in 10 classes; each class has 6000 images. There are 50000 images for training and 10000 for testing.
We use a two-stage training method for our models. In stage one training, a base CNN model is used to obtain the misclassification graph for class similarity. In stage two, a graph convolution layer is added to incorporate class similarity knowledge. The baselines for our method are the base CNN models we use in stage one training. We make sure the baseline and our proposed model are trained for the same number of epochs using the same optimizer.
We evaluate our method on the MNIST dataset. We experimented with two CNN base models: the first contains one convolution layer, and the second contains two convolution layers. We also experiment with different training and validation data sizes, making sure that the training and validation datasets have a balanced number of data from each class. We report the accuracy of the baseline and our models on the test dataset, which contains 10000 images.

Table 1 shows the results comparison between our models and the original CNN. When using base model 1, the original CNN model was trained for 200 epochs; PK-GCN-1 and PK-GCN-2 were trained for 40 epochs in the first training stage and 160 epochs in the second. When using base model 2, the original CNN model was trained for 200 epochs; PK-GCN-1 and PK-GCN-2 were trained for 80 epochs in the first training stage and 120 epochs in the second. All training uses the AdaDelta optimizer with the same setup, and all models were trained for the same number of epochs for a fair comparison.

From the table, we can see that our model outperforms the base model by as much as 1.56 percentage points. Generally, the improvements are bigger when the amount of available training data is smaller.

Table 1: Accuracy comparison on MNIST of the original CNN model and PK-GCN models with various data sizes. 'X' means no accuracy improvement is observed.
Data size (train = validation)    300       500       1000      1500      2000      2500      3000
Base model 1 (1 conv layer)
  Original CNN                    85.30%    88.78%    93.07%    94.42%    95.13%    95.78%    96.26%
  PK-GCN-1                        85.74%    89.49%    93.97%    94.90%    95.62%    96.03%    96.32%
                                  (+0.44)   (+0.71)   (+0.90)   (+0.48)   (+0.49)   (+0.25)   (+0.06)
  PK-GCN-2                        86.28%    89.51%    94.12%    94.94%    95.66%    95.98%    96.50%
                                  (+0.98)   (+0.73)   (+1.05)   (+0.53)   (+0.53)   (+0.20)   (+0.24)
Base model 2 (2 conv layers)
  Original CNN                    89.67%    92.08%    95.43%    96.00%    97.00%    97.44%    97.70%
  PK-GCN-1                        91.16%    93.46%    X         96.30%    97.01%    97.46%    97.74%
                                  (+1.49)   (+1.38)             (+0.30)   (+0.01)   (+0.02)   (+0.04)
  PK-GCN-2                        91.23%    93.19%    X         96.26%    X         X         97.80%
                                  (+1.56)   (+1.11)             (+0.26)                       (+0.10)
We evaluate our method on the CIFAR-10 dataset. The base model we use is VGG-11. We experiment with various data sizes, and the models are evaluated on a test dataset with 10000 images. Table 2 shows the results comparison between our models and the original CNN on CIFAR-10. The original CNN model is trained for 300 epochs; PK-GCN-1 and PK-GCN-2 were trained for 100 epochs in the first training stage and 200 epochs in the second. All training uses the AdaDelta optimizer with the same setup. When using VGG-11 as the base model, our model outperforms the baseline by as much as 2.55 percentage points.
In our work, we define the similarity between classes using the misclassification graph produced on the validation dataset, and use a graph convolution layer to incorporate that information into training. Experimental results on benchmark image classification datasets show that incorporating class similarity knowledge can improve multi-class classification accuracy, especially when the amount of available data is small.

Instead of obtaining the misclassification graph from validation data, rules can be used to define class relations. The relationships between classes are fuzzy. In the future, we plan to incorporate fuzzy logic [33][34][35] and rough set theories [36][37][38] into our work to define class relations. A graph attention [39] layer can also be used in place of the graph convolution layer. The advantage of graph attention is that we do not need to know the edge information in the graph: the edges are learned through training. We could potentially study the edges learned by graph attention to see if they correlate with class similarities.

Table 2: Test accuracy comparison on CIFAR-10 of the original CNN model and PK-GCN models with various data sizes.
Data size (train = validation)    300       500       1000      1500      2000      2500      3000
Base model: VGG-11
  Original CNN                    33.34%    35.51%    44.33%    50.09%    52.81%    56.50%    60.56%
  PK-GCN-1                        35.37%    36.01%    46.88%    51.16%    53.69%    56.98%    61.04%
                                  (+2.03)   (+0.50)   (+2.55)   (+1.07)   (+0.88)   (+0.48)   (+0.48)
  PK-GCN-2                        34.93%    37.45%    45.72%    51.21%    53.25%    56.54%    60.82%
                                  (+1.59)   (+1.94)   (+2.39)   (+1.12)   (+0.44)   (+0.04)   (+0.26)
Acknowledgment
The authors acknowledge Molecular Basis of Disease (MBD) at Georgia State University for supporting this research, as well as the high performance computing resources at Georgia State University (https://ursa.research.gsu.edu/high-performance-computing) for providing GPU resources. This research is also supported in part by an NVIDIA Academic Hardware Grant. The authors thank the Extreme Science and Engineering Discovery Environment (XSEDE) [40], which is supported by National Science Foundation grant number ACI-1548562. Specifically, the authors used the Bridges system [41], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The authors also thank the National Science Foundation of China (No. 61603313) and the Fundamental Research Funds for the Central Universities (No. 2682017CX097) for supporting this work.
References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-Rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[4] Min Zeng, Min Li, Fang-Xiang Wu, Yaohang Li, and Yi Pan. DeepEP: a deep learning framework for identifying essential proteins. BMC Bioinformatics, 20(S16):506, 2019.

[5] Hae Beom Lee, Juho Lee, Saehoon Kim, Eunho Yang, and Sung Ju Hwang. DropMax: Adaptive variational softmax. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 927–937, Montréal, Canada, 2018.

[6] Meihao Chen, Zhuoru Lin, and Kyunghyun Cho. Graph convolutional networks for classification with a structured label space. arXiv preprint arXiv:1710.04908v2, 2018.

[7] Steven R. Young, Derek C. Rose, Thomas P. Karnowski, Seung-Hwan Lim, and Robert M. Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments (MLHPC '15), pages 1–5, Austin, Texas, USA, 2015. ACM Press.

[8] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2902–2911, Sydney, NSW, Australia, 2017.

[9] Xueli Xiao, Ming Yan, Sunitha Basodi, Chunyan Ji, and Yi Pan. Efficient hyperparameter optimization in deep learning using a variable length genetic algorithm. arXiv preprint arXiv:2006.12703, 2020.

[10] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

[11] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

[12] Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), 2018.

[13] Sunitha Basodi, Chunyan Ji, Haiping Zhang, and Yi Pan. Gradient amplification: An efficient way to train deep neural networks. arXiv preprint arXiv:2006.10560, 2020.

[14] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

[15] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.

[16] Alejandro Baldominos, Yago Saez, and Pedro Isasi. Evolutionary convolutional neural networks: An application to handwriting recognition. Neurocomputing, 283:38–52, 2018.

[17] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '17), pages 497–504, Berlin, Germany, 2017. ACM Press.

[18] Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929v2, 2018.

[19] Ryo Takahashi, Takashi Matsubara, and Kuniaki Uehara. Data augmentation using random image cropping and patching for deep CNNs. arXiv preprint arXiv:1811.09030v2, 2018.

[20] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[21] Michelangelo Diligenti, Soumali Roychowdhury, and Marco Gori. Integrating prior knowledge into deep learning. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA 2017), pages 920–923, 2018.

[22] Geoffrey G. Towell and Jude W. Shavlik. Knowledge-based artificial neural networks. Artificial Intelligence, 70(1-2):119–165, 1994.

[23] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. arXiv preprint arXiv:1711.11157, 2018.

[24] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric P. Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2410–2420, Berlin, Germany, 2016. Association for Computational Linguistics.

[25] Xintao Ding, Yonglong Luo, Qingde Li, Yongqiang Cheng, Guorong Cai, Robert Munnoch, Dongfei Xue, Qingying Yu, Xiaoyao Zheng, and Bing Wang. Prior knowledge-based deep learning method for indoor object recognition and application. Systems Science & Control Engineering, 6(1):249–257, 2018.

[26] Chenhan Jiang, Hang Xu, Xiaodan Liang, and Liang Lin. Hybrid knowledge routed modules for large-scale object detection. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 1559–1570, Montreal, Canada, 2018. Curran Associates Inc.

[27] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 2576–2582, San Francisco, California, USA, 2017. AAAI Press.

[28] Sung Ju Hwang, Kristen Grauman, and Fei Sha. Semantic kernel forests from multiple taxonomies. In Advances in Neural Information Processing Systems, pages 1718–1726, 2012.

[29] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In Computer Vision – ECCV, pages 48–64, 2014.

[30] Kazuma Arino and Yohei Kikuta. ClassSim: Similarity between classes defined by misclassification ratios of trained classifiers. arXiv preprint arXiv:1802.01267v1, 2018.

[31] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907v4, 2017.

[32] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.

[33] Jie Hu, Yi Pan, Tianrui Li, and Yan Yang. TW-Co-MFC: Two-level weighted collaborative fuzzy clustering based on maximum entropy for multi-view data. Tsinghua Science and Technology, 26(2):185–198, 2020.

[34] T. K. Bamunu Mudiyanselage, X. Xiao, Y. Zhang, and Y. Pan. Deep fuzzy neural networks for biomarker selection for accurate cancer detection. IEEE Transactions on Fuzzy Systems, 2019.

[35] Hui Liu, Jie Li, Yan-Qing Zhang, and Yi Pan. An adaptive genetic fuzzy multi-path routing protocol for wireless ad-hoc networks. In Sixth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First ACIS International Workshop on Self-Assembling Wireless Networks, pages 468–475, 2005.

[36] Junbo Zhang, Jian-Syuan Wong, Tianrui Li, and Yi Pan. A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems. International Journal of Approximate Reasoning, 55(3):896–907, 2014.

[37] Junbo Zhang, Jian-Syuan Wong, Yi Pan, and Tianrui Li. A parallel matrix-based method for computing approximations in incomplete information systems. IEEE Transactions on Knowledge and Data Engineering, 27(2):326–339, 2015.

[38] Junbo Zhang, Yun Zhu, Yi Pan, and Tianrui Li. Efficient parallel boolean matrix based algorithms for computing composite rough set approximations. Information Sciences, 329:287–302, 2016.

[39] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903v3, 2017.

[40] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating scientific discovery. Computing in Science and Engineering, 16(5):62–74, 2014.

[41] Nicholas A. Nystrom, Michael J. Levine, Ralph Z. Roskies, and J. Ray Scott. Bridges: A uniquely flexible HPC resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, 2015.