Achieving Explainability for Plant Disease Classification with Disentangled Variational Autoencoders

Abstract

Agricultural image recognition tasks are becoming increasingly dependent on deep learning (DL). Despite its excellent performance, it is difficult to comprehend what type of logic or features DL uses in its decision making. This has become a roadblock for the implementation and development of DL-based image recognition methods because knowing the logic or features used in decision making, such as in a classification task, is very important for verification, algorithm improvement, training data improvement, knowledge extraction, etc. To mitigate such problems, we developed a classification method based on a variational autoencoder architecture that can show not only the location of the most important features but also what variations of that particular feature are used. Using the PlantVillage dataset, we achieved an acceptable level of explainability without sacrificing the accuracy of the classification. Although the proposed method was tested for disease diagnosis in some crops, the method can be extended to other crops as well as other image classification tasks. In the future, we hope to use this explainable artificial intelligence algorithm in disease identification tasks, such as the identification of potato blackleg disease and potato virus Y (PVY), and other image classification tasks.



HABARAGAMUWA Harshana a,*, OISHI Yu a, TANAKA Kenichi a,b
a National Agriculture and Food Research Organization (NARO), Japan
b Mitsubishi Electric Corporation
*Corresponding author at: National Agriculture and Food Research Organization. E-mail address: [email protected]


Keywords— Explainable deep learning, Variational autoencoder, Feature visualization, Crop disease classification, Disentangled representation

1. Introduction

Deep learning methods, especially those that use deep convolutional neural networks (DCNNs), are widely used in the agricultural image recognition field [1], [2]. However, once an image is recognized, the rationale behind the DCNN recognition (decision) needs to be explained, and this is still an active research field [3]. There are several reasons why we need an explanation of a DCNN decision. One reason is that the DCNN may unintentionally learn a false feature (an artifact) for discrimination [4], [5]. A well-known example is wolf and dog classification: most of the wolf images contained snow in the background, so the model learned snow as an indicator for wolves [5], which is not a reliable indicator. Moreover, machine learning algorithms, including DCNNs, can learn cognitive biases (annotations made with unreasonable assumptions) from the training data [6]. Another reason is that DCNNs tend to use textures for classification [7]. A robust DCNN may prefer shape over texture because texture may be an unreliable feature in real-life conditions [7]. To resolve the abovementioned issues, the user may need an explanation of the DCNN decision. Moreover, an explanation lets us learn from the data by revealing the relationships that the DCNN has captured.

Most popular algorithms for explaining DCNN decisions show which area of an image is the most important for a decision (a heat map) [8]–[10]. However, heat maps do not show the exact feature that was used in the classification (e.g., the color or the shape). For example, if yellowing and wrinkling occur at the same position on a leaf, a heat map cannot clarify which of the two was used in the classification. In addition, a dataset contains variations of a feature (e.g., different degrees of yellowing), and users need to understand how these variations are used as features in the classification. For these reasons, explanations such as heat maps are not very effective when users want to know exactly which features were used and which variations of the features are available in the dataset. In this research, we focus on developing a classification decision explanation method based on the variation of features, which is capable of showing what the classification system considers a feature while maintaining an acceptable classification accuracy.

2. Development of Explainable AI Algorithm

We developed a new deep learning method that can explain its decision using “classifiable latent features,” which are defined as features that are suitable for the classification of classes, and “non-classifiable latent features,” which are defined as features that are less related to the classification. Our method shows the characteristic classifiable features of the classes for classification by separating the classifiable and non-classifiable latent features. Then the classifiable features can be used to train a classifier for the intended classification task. After a classification of an image is performed, we can identify the most important classifiable features. Hereinafter, the proposed model that can explain its decisions using explainable classifiable latent features is called ECLF.

Figure 1 Major stages of the ECLF system

Figure 1 shows the major stages of the ECLF. In stage 1, latent features are divided into two categories (classifiable and non-classifiable) using a variational autoencoder (VAE). In stage 2, the classifiable features are used to train a classifier that classifies images into their respective classes. In stage 3, the features that are important for a given classification of an image are visualized. In addition, we also created a complementary model, the explainable classifiable latent features class-specific (ECLF-CS) system, which can extract class-specific information. In this research, our main objective was to develop a system that can be used to explain the classification decision of plant leaf images (diseased and healthy) and to show the disentangled or separated variations of the features that were used in the classification. To confirm the explainability and accuracy of our developed system, we verified the following:

1. The effect of dimensionality on disentanglement (total correlation), reconstruction accuracy, classification accuracy, and decision explanations.
2. The effect of disentanglement (total correlation) on reconstruction accuracy, classification accuracy, and decision explanations.
3. The visual quality of classification decision explanations.
4. ECLF-CS classification accuracy and explainability performance.

The ECLF is described in detail in Section 2.1, and ECLF-CS and its comparison to ECLF in Section 2.2. We show the experimental results of ECLF in Sections 4.2.1-4.2.5 and those of ECLF-CS in Sections 4.2.6 and 4.2.7.

We have three main targets that are essential for building explainability in the VAE training stage of our system:

1. To separate the classifiable and non-classifiable features in our system.
2. To learn the disentangled representation (Section 2.1.1.1) for these features, which encourages the reduction of correlation.
3. To make a human-interpretable representation from our system, which helps the users identify the important features from the reconstruction.

A VAE is an algorithm that tries to approximate the posterior distribution of the latent variable given a data point [11], which is represented by an image in our system. ECLF attempts to separate the latent feature vector ($\mathbf{z}$) of the VAE into classifiable features and non-classifiable features by training an adversarial discriminator $f_d(\cdot)$ on $\mathbf{z}$. First, the latent vector $\mathbf{z}$, which is produced by the encoder, is divided into two parts, a classifiable feature vector (CFV) and a non-classifiable feature vector (NCFV), by the following procedure (Figure 2). If we input an image $\mathbf{x}$ into the encoder function $g(\mathbf{x})$, it produces the parameters of a multivariate Gaussian distribution with diagonal variance ($\boldsymbol{\mu}$, the mean vector, and $\boldsymbol{\sigma}^{2}$, the diagonal variance vector). The production of $\mathbf{z}$ is shown in Equation (1):

$$\mathbf{z} \sim q_{\phi}(\mathbf{z} \mid \mathbf{x}_i) = N(\mathbf{z}; \boldsymbol{\mu}_i, \boldsymbol{\sigma}^{2}_i \mathbf{I}), \tag{1}$$

where $i$ is the image index, $q_{\phi}$ is the posterior distribution of $\mathbf{z}$ given $\mathbf{x}$, and $N$ is the normal distribution with a diagonal variance, which is represented by $\boldsymbol{\sigma}^{2}_i \mathbf{I}$. The latent vector is written as

$$\mathbf{z} = [\mathrm{CFV}, \mathrm{NCFV}]. \tag{2}$$

In this case, we can divide the CFV and NCFV parameters as

$$\mathrm{CFV} = N(\mathbf{z}; \boldsymbol{\mu}_{cfv_i}, \boldsymbol{\sigma}^{2}_{cfv_i} \mathbf{I}), \tag{3}$$

$$\mathrm{NCFV} = N(\mathbf{z}; \boldsymbol{\mu}_{ncfv_i}, \boldsymbol{\sigma}^{2}_{ncfv_i} \mathbf{I}). \tag{4}$$

If the decoder function is $f(\cdot)$, then the reconstructed image $\mathbf{x}'$ is given by Equation (5):

$$\mathbf{x}' = f([\mathrm{CFV}, \mathrm{NCFV}]). \tag{5}$$

Figure 2 Training procedure of the VAE and adversarial discriminator

Figure 2 shows the simplified training procedure for the adversarial discriminator and the variational autoencoder. The adversarial discriminator is used in the training of the NCFV during VAE training. The red arrow shows the discriminative loss from the adversarial discriminator; this loss attempts to remove classifiable features from the NCFV. The yellow arrow shows the reconstruction loss from the decoder to the encoder. The adversarial discriminator function $f_d(\cdot)$ tries to learn the class of the input image using only the NCFV [12], so $C_d$ is the class assigned to $\mathbf{x}$ by $f_d(\cdot)$ using only the NCFV. Given that the actual class of $\mathbf{x}$ is $C_{gt}$ (the ground truth class), we train $f_d(\cdot)$ using the classification loss between $C_{gt}$ and $C_d$. In contrast to [12], which used an autoencoder, we used a part of the VAE output, and we did not condition the CFV on an attribute label because the algorithm finds the classifiable attributes or features by itself.

$$C_d = f_d(\mathrm{NCFV}), \tag{6}$$

$$\mathcal{L}_d = \mathrm{lossfunction}(C_d, \neg C_{gt}). \tag{7}$$

If the adversarial discriminator can learn to discriminate the classes with good accuracy, then the NCFV contains information that can be used to clearly separate classes. Because the training is adversarial, an accurate prediction $C_d$ is treated as a loss $\mathcal{L}_d$ for the encoder (Equation (7)), so that the encoder is discouraged from producing classifiable features in the NCFV. Gradually, the encoder learns to produce the classifiable features in the CFV instead. We try to minimize the VAE loss function while minimizing $\mathcal{L}_d$.

We used a convolutional encoder and decoder for the architecture of the system. We need to ensure that the features extracted by the encoder are properly represented in the decoder. To achieve this, we construct the decoder with the same weights as the encoder, using the transposed version of the encoder weights. The authors of [13] proved that the hidden-layer activations in a DNN can be recovered using a generative network. Thus, we assume that our VAE encoder can be recovered with the decoder given proper conditions such as weight sharing.

The restrictive nature of the VAE loss function, which has an information bottleneck property [14], [15], may prevent NCFV features from going through the CFV. To support the formation of classifiable features, we added a supportive classifier to the CFV. This classifier plays a supportive rather than a major role in our system: it provides feedback to the VAE on the classifiability of the features. The supportive classifier (Figure 3), which is trained in parallel to the adversarial discriminator, performs the opposite function to the adversarial discriminator (Figure 2).
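The latent split and the two opposing classification signals can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the position of the CFV at the front of the latent vector, and in particular the reading of $\neg C_{gt}$ in Equation (7) as "a target spread uniformly over the wrong classes" are assumptions made only for illustration.

```python
import numpy as np

def split_latent(mu, log_var, n_cfv, rng):
    """Sample z ~ N(mu, diag(sigma^2)) and split it into (CFV, NCFV).

    Assumption: the first n_cfv dimensions of z form the CFV.
    """
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal(mu.shape)  # reparameterised sample
    return z[:, :n_cfv], z[:, n_cfv:]

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    """Mean cross-entropy between predicted probabilities and target distributions."""
    return float(-(targets * np.log(probs + 1e-12)).sum(axis=1).mean())

def discriminator_loss(probs, labels, n_classes):
    """Loss used to train f_d itself: predict the true class from the NCFV."""
    return cross_entropy(probs, np.eye(n_classes)[labels])

def adversarial_encoder_loss(probs, labels, n_classes):
    """Assumed reading of loss(C_d, not C_gt): the target is uniform over wrong classes.

    The loss is large when f_d confidently recovers the true class from the NCFV,
    which pushes the encoder to strip class information out of the NCFV.
    """
    one_hot = np.eye(n_classes)[labels]
    not_gt = (1.0 - one_hot) / (n_classes - 1)
    return cross_entropy(probs, not_gt)

# Toy usage with hypothetical sizes: 4 images, 10 latent dimensions, 6 of them CFV.
rng = np.random.default_rng(0)
cfv, ncfv = split_latent(np.zeros((4, 10)), np.zeros((4, 10)), n_cfv=6, rng=rng)
```

The supportive classifier would use the ordinary cross-entropy (`discriminator_loss` above) on the CFV instead of the NCFV, which is what makes its effect opposite to the adversarial signal.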

Figure 3 Role of supportive classifier in the VAE training

Figure 3 shows how the supportive classifier is used in the training of the VAE. In contrast to the red arrow in Figure 2, the loss in Figure 3 (green arrow) is created such that the features produced in the CFV support the classification in the supportive classifier. However, we set this loss to play a less prominent role than that of the adversarial classifier. Although this classifier could be used for the final classification (stage 2 in Figure 1), its hyperparameters are set to work with the VAE training, which may yield a suboptimal classification accuracy; for this reason, we use a separate classifier as the final classifier.

One of our main targets was to encourage disentanglement in the latent feature vector $\mathbf{z}$, in particular in the CFV. If

$$\mathrm{CFV} = [cf_1, cf_2, cf_3, \ldots, cf_n], \tag{8}$$

then, when we decode the latent vector while changing one CFV feature $cf_n$, only the image features that correspond to $cf_n$ will change in $\mathbf{x}'$. This is a necessary condition for visualizing which latent feature corresponds to which image features in $\mathbf{x}'$. Although the definition of disentanglement is under debate [16], in recent years several methods have been proposed to help variational autoencoders learn disentangled latent features [15], [17], [18]. We also used the algorithm explained in [19] to improve the disentanglement between our features in training (Equation (9)). This algorithm worked well because it factorizes the different factors of disentanglement, so that we could minimize only the required factors.

$$\mathcal{L}_{\beta} = \sum_{n=1}^{N} \left( \mathbb{E}_{q}[\log p(x_n \mid z)] - \beta \, KL\big(q(z \mid x_n) \,\|\, p(z)\big) \right). \tag{9}$$

We attempt to increase the evidence lower bound, which is $\mathcal{L}_{\beta}$. To increase it, we can reduce the second term of Equation (9), the $KL$ divergence between the two probability distributions $q(z \mid x_n)$ and $p(z)$, which can be considered the information bottleneck term [14], [15]. $N$ is the number of images in a batch, and $\beta$ is a coefficient that controls the information bottleneck. According to [19], we can divide the second term into three components:

$$\text{Index-code MI} = KL\big(q(z, n) \,\|\, q(z)\,p(n)\big), \tag{10}$$

$$\text{Total correlation} = KL\Big(q(z) \,\Big\|\, \prod_{j} q(z_j)\Big), \tag{11}$$

$$\text{Dimension-wise KL} = \sum_{j} KL\big(q(z_j) \,\|\, p(z_j)\big), \tag{12}$$

$$\mathbb{E}_{p(n)}\big[KL\big(q(z \mid n) \,\|\, p(z)\big)\big] = \text{Index-code MI} + \text{Total correlation } (TC) + \text{Dimension-wise KL } (DKL). \tag{13}$$

Of these terms, we give special focus to the total correlation term, which not only reduces correlation but also encourages independence.
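As a small worked example of the quantity being decomposed, the per-image divergence $KL(q(z \mid n) \,\|\, p(z))$ has a closed form when the posterior is a diagonal Gaussian. The sketch below assumes a standard normal prior $p(z)$, which the text does not state explicitly, and the function names are illustrative. Note that this is the left-hand side of the decomposition for a single image; the total correlation and dimension-wise KL terms involve the aggregate distribution $q(z)$ and need a separate estimator, which is not shown here.

```python
import numpy as np

def gaussian_kl_per_dim(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)) for each latent dimension.

    Assumption: the prior p(z) is a standard normal distribution.
    """
    return 0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0)

def gaussian_kl(mu, log_var):
    """KL(q(z|x) || p(z)) summed over the latent dimensions of each image."""
    return gaussian_kl_per_dim(mu, log_var).sum(axis=-1)

# A posterior that equals the prior carries no information about the image.
zero_kl = gaussian_kl(np.zeros((2, 3)), np.zeros((2, 3)))
```

Averaging `gaussian_kl` over a batch gives a Monte Carlo estimate of the expectation over $p(n)$ that appears on the left-hand side of the decomposition.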

For the training, we used the total loss function shown in Equation (14), where $L_{VAE}$ is the training loss of the VAE, $L_{rc}$ is the reconstruction loss, $L_{rd}$ is the discriminative regularization loss, $L_{d}$ is the adversarial discriminator loss, $L_{s}$ is the supportive classifier loss, and $\alpha$, $\epsilon$, $\varepsilon$, $\beta$, and $\gamma$ are constants in the training process.

$$L_{VAE} = L_{rc} + \alpha L_{rd} + \epsilon L_{d} + \varepsilon L_{s} + \beta \, TC + \gamma \, DKL. \tag{14}$$

In this function, we can divide the losses into three main categories. $L_{rc} + \alpha L_{rd}$ can be referred to as the reconstruction losses, which help in identifying the features reconstructed or changed by changing the $cf_n$ of the VAE; $L_{d} + L_{s}$ can be referred to as the feature separation (classifiable) losses; and $TC + DKL$ can be referred to as the disentangling terms.
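The grouping of the total loss into three categories can be made concrete with a short sketch. It assumes the six terms named in the text (reconstruction, discriminative regularization, adversarial, supportive, TC, and DKL); the class and function names are hypothetical, and the weight values used below are placeholders rather than the constants used in the experiments.

```python
from dataclasses import dataclass

@dataclass
class LossWeights:
    alpha: float       # weight of the discriminative regularization loss L_rd
    epsilon: float     # weight of the adversarial discriminator loss L_d
    varepsilon: float  # weight of the supportive classifier loss L_s
    beta: float        # weight of the total correlation term TC
    gamma: float       # weight of the dimension-wise KL term DKL

def total_vae_loss(l_rc, l_rd, l_d, l_s, tc, dkl, w):
    """Weighted sum of the loss terms, grouped into the three categories."""
    reconstruction = l_rc + w.alpha * l_rd
    separation = w.epsilon * l_d + w.varepsilon * l_s
    disentangling = w.beta * tc + w.gamma * dkl
    return reconstruction + separation + disentangling

# Placeholder weights, chosen only to show the call.
weights = LossWeights(alpha=1.0, epsilon=1.0, varepsilon=0.1, beta=4.0, gamma=1.0)
loss = total_vae_loss(0.5, 0.2, 0.3, 0.4, 0.05, 0.1, weights)
```

Keeping the three groups as separate intermediate values makes it easy to log them individually and see which objective dominates at a given point in training.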

Many researchers who work with explainable algorithms are concerned about the trade-off between explainability and accuracy [20], [21]. Explainability has many definitions in the deep learning literature [22]. In this research, we consider different aspects of explainability, such as the interpretability of the classifiable features and their distribution in the dataset, the interpretability of the linear classifier decision, and the human understandability of the produced explanations. In our system, accuracy can be defined as the final classification accuracy or the classification loss of the classifier. The first target of the VAE training stage, which is separating the CFV and NCFV, can be considered the accuracy term of the system. The second target, which concerns disentanglement, and the third target, which concerns human understandability, are responsible for the explainability of the system. There is a compromise between the first and second targets: we are moving towards lowering the number of dimensions, which makes the representation easier to understand, and towards making the features disentangled, but the reduction of dimensionality may affect feature formation in the CFV and NCFV. Since our system uses the reconstructed image for visualization, the explainability targets themselves also involve a compromise; that is, there is a trade-off between the second and third targets. Higher disentanglement leads to a higher loss in the reconstruction of $\mathbf{x}'$, so we can consider the disentanglement objective to be the deciding factor for both explainability and accuracy.

The classifier is responsible for recognizing the classes using the CFV. After we completed the training of the VAE, we trained the classifier on the CFV. Because the encoder produces the parameters of a distribution for the CFV, as shown in Equation (3), rather than a single vector, we do not train the classifier on the distribution itself, since this might reduce the classification accuracy. Instead, we selected the value that has the highest likelihood, which is $\boldsymbol{\mu}_{cfv_i}$. We use this value to train the classifier while keeping the encoder weights fixed. After the classifier is trained, we can use it to make the predictions (final classification). For the predictions, $\boldsymbol{\mu}_{cfv_i}$ is used as the input to the classifier.

Once the prediction is made, we need to explain that prediction using the features we learned in the VAE. If we use a linear classifier, it is easy to see which features are more important because the relationship between the features and the prediction is linear. However, for a more generalized approach, it is better to use a classifier algorithm that can use nonlinear as well as linear relationships for the classification. It is possible to use many types of nonlinear classifiers in this part of the algorithm. Through this visualization, we try to determine which features played an important role in selecting one class over another. When the classifier makes the decision on an image, it uses a local decision boundary to decide between classes; on a zoomed-in level, this decision boundary can be approximated using a linear classifier. This method (local interpretable model-agnostic explanation; LIME) was introduced in [5]. It uses a standard segmentation algorithm to create image superpixels (sets of pixels with the same properties) and then masks other superpixels to make sample points (input points to the algorithm). We used the LIME method because we were interested in local decision boundaries, even though there are many other methods, such as guided backpropagation [23], layer-wise relevance propagation [24], and Grad-CAM [9], for showing the importance of a feature to the decision.

Figure 4 LIME explaining electric guitar detection [5]

Figure 4 shows that the LIME algorithm determines the pixels responsible for the selection of the acoustic guitar class. However, the neural networks cannot show the variation of features. In our method, we train the VAE to show the variation of classifiable features, so it becomes a model-dependent explanation because of the way we create the sample points.

Figure 5 LIME toy example [5]

In Figure 5, the blue and pink backgrounds show the nonlinear function of a deep learning model. This nonlinear model cannot be approximated with a linear function; LIME attempts to explain the bright red cross point by sampling instances (data points) and sending them to the nonlinear function to obtain the predictions, which are weighted according to their proximity, shown by the size of the marker. The faithful local explanation is shown by the dashed line [5].

We used Monte Carlo sampling around the CFV of the image in classifiable feature space, using $\boldsymbol{\mu}_{cfv_i}$ and $\boldsymbol{\sigma}$ of the CFV, to produce training sample points for the linear classifier. These sample points can be classified into several classes. We focus on the two classes that are needed to explain the decision: one is the class selected by the classifier as the maximum likelihood class (this can be different from $C_{gt}$), and the other is the class against which the difference from the maximum likelihood class is determined. Then, we use the samples with the lowest probability (the points closest to the decision boundary) as training samples for a linear classifier, in contrast to [5], which performed sample weighting according to the distance from the decision boundary. Thus, we can find a linear boundary for the classification. Because this is a linear classifier, the weights of the classifier are directly proportional to the importance of the feature at that point. Since the weights of the linear classifier for both classes, Class A and Class B, play important roles in the linear classification, we used the following formulas to determine the importance of each feature, where $c$ is the class, $\mathbf{W}$ is the weight vector, and $\mathbf{IM}$ is the importance:

$$\mathbf{IM}_c = \mathbf{W}_c \times \mathbf{CFV}_c, \tag{15}$$

$$\mathbf{IM} = \mathbf{IM}_A - \mathbf{IM}_B. \tag{16}$$

We chose the maximum likelihood class over $C_{gt}$ because the classifier sometimes produces incorrect classifications. In that case, we must explain why that decision was made by the classifier.
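The explanation procedure described above can be sketched end to end in NumPy. This is an illustrative reading, not the authors' code: the helper names are hypothetical, the two-output linear classifier is fitted with plain gradient descent, "lowest probability" is interpreted as the least confident samples within each of the two predicted classes, and the product in Equation (15) is taken element-wise with the CFV of the explained image.

```python
import numpy as np

def sample_around(mu, sigma, n_samples, rng):
    """Monte Carlo samples around one CFV in classifiable feature space."""
    return mu + sigma * rng.standard_normal((n_samples, mu.size))

def select_boundary_samples(samples, probs, class_a, class_b, n_keep):
    """For each of the two classes, keep the samples predicted with the lowest
    probability for that class (assumed to be the closest to the boundary)."""
    pred = probs.argmax(axis=1)
    xs, ys = [], []
    for label, cls in ((0, class_a), (1, class_b)):
        idx = np.flatnonzero(pred == cls)
        order = np.argsort(probs[idx, cls])[:n_keep]  # lowest probability first
        xs.append(samples[idx[order]])
        ys.append(np.full(order.size, label))
    return np.vstack(xs), np.concatenate(ys)

def fit_two_output_linear(x, y, lr=0.1, steps=500):
    """Two-output softmax regression trained by plain gradient descent."""
    w = np.zeros((2, x.shape[1]))
    b = np.zeros(2)
    one_hot = np.eye(2)[y]
    for _ in range(steps):
        logits = x @ w.T + b
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - one_hot) / len(x)
        w -= lr * (grad.T @ x)
        b -= lr * grad.sum(axis=0)
    return w, b

def feature_importance(w, cfv):
    """IM_c = W_c * CFV (element-wise, Eq. 15) and IM = IM_A - IM_B (Eq. 16)."""
    return w[0] * cfv - w[1] * cfv
```

In use, `probs` would come from the trained nonlinear classifier applied to the sampled points, row 0 of `w` would correspond to Class A (the maximum likelihood class) and row 1 to Class B, and the features with the largest absolute importance would be passed on to the visualization step.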
Once the most important features are selected, we need to know exactly what each feature represents. To do so, we select the important features and perform feature interpolation, in which we change a given feature value toward the other class while keeping the other feature values constant. We visualize the change that corresponds to a particular image and classifiable feature in the reconstructed image $\mathbf{x}'$ by combining Equations (5) and (8):

$$\mathbf{x}' = f([[cf_1, cf_2, cf_3, \ldots, cf_n, \ldots], \mathrm{NCFV}]), \tag{17}$$

$$\Delta \mathbf{x}' = \frac{\partial f}{\partial cf_n}. \tag{18}$$

To understand a classifiable feature exactly, we need to visualize how the reconstruction responds to a change in the latent variable $cf_n$. We do this by changing $cf_n$ and visualizing the changes in $\mathbf{x}'$, as expressed in Equation (18). The visualization is performed by changing each classifiable feature towards the mean value of the other class. There are two ways to visualize this: a factor-wise change and a total change. First, we explain the factor-wise change method. Figure 6 shows how the original class point from Class A is changed to the mean point in Class B.

Figure 6 Feature interpolation for visualization

It is simple to select feature interpolation in an arbitrary manner. However, this may lead to misunderstandings about the decision boundary. We need to visualize how the features change in the local area where the decision is made. As Figure 6 shows, we make a representative data point by averaging the data points of Class B, and we call that point the mean point (the closest point may not be a very good representation of Class B). Then, we change the individual feature value $cf_n$ from the original data point to the mean point. Although this is the ideal condition, we may need to perform feature interpolation starting before the original point and ending after the mean point. Equation (19) shows how the interpolation vector is created:

$$cf_{nAI} = cf_{nA} - (cf_{nA} - cf_{nBM}) \times k, \tag{19}$$

where $cf_{nAI}$ is the changed input feature for the visualization, $cf_{nA}$ is the feature value of Class A at the original point, $cf_{nBM}$ is the Class B mean point feature value, $n$ is the feature number of the visualized feature, and $k$ is an interpolation constant.

Figure 7 Classification explanation overview

Figure 7 shows how important feature selection and visualization are conducted in the ECLF system:

1. An initial classification is made.
2. The classifier makes a linear approximation of the decision boundary.
3. The most important features are selected based on the linear approximation.
4. One of the most important features, $cf$ in this instance, is visualized by interpolating to $cf$, which is shown in Equation (19).

The bottom row shows the results of the interpolation: the variation of the most important feature that contributed to the classification decision. We can use a direct visualization of the feature by changing the specific feature in the CFV of the trained VAE. This provides a general idea of the classifiable features.

Another visualization method is to visualize the difference between the original point of Class A and the mean point of Class B. To visualize this difference, we take an image from a class (image A) and find its CFV, which we call $\mathrm{CFV}_a$. Then, we determine all Class B samples (images), whose latent vectors in classifiable feature space we call $\mathrm{CFV}_b$. We can visualize the decoded image of $\mathrm{CFV}_a$ and the decoded image of $\mathrm{CFV}_b$. Then, we can travel from $\mathrm{CFV}_a$ to the mean of $\mathrm{CFV}_b$ to visualize which features change. This can be done for individual features in the classifiable vector or for the total classifiable feature vector. However, this does not show which individual features of the CFV are used differently for the classification.
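Both the factor-wise change and the total change follow directly from the interpolation rule in Equation (19). The sketch below is a minimal illustration with hypothetical function names; the decoder is not included, so each returned row is a CFV that would still have to be concatenated with the NCFV and decoded to obtain an image.

```python
import numpy as np

def interpolate_feature(cfv_a, cfv_b_mean, n, ks):
    """Factor-wise change: move only feature n from the Class A point
    towards the Class B mean point, one row per interpolation constant k."""
    cfv_a = np.asarray(cfv_a, dtype=float)
    ks = np.asarray(ks, dtype=float)
    out = np.tile(cfv_a, (ks.size, 1))
    out[:, n] = cfv_a[n] - (cfv_a[n] - cfv_b_mean[n]) * ks
    return out

def interpolate_all(cfv_a, cfv_b_mean, ks):
    """Total change: move every classifiable feature at the same time."""
    cfv_a = np.asarray(cfv_a, dtype=float)
    cfv_b_mean = np.asarray(cfv_b_mean, dtype=float)
    ks = np.asarray(ks, dtype=float)[:, None]
    return cfv_a - (cfv_a - cfv_b_mean) * ks

# k = 0 reproduces the original point and k = 1 reaches the Class B mean;
# values below 0 or above 1 start before the original point or end after the mean.
steps = interpolate_feature([0.0, 0.0], [2.0, 4.0], n=1, ks=[-0.5, 0.0, 0.5, 1.0, 1.5])
```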

Even though the ECLF model can show the difference between classifiable features of individual instances of classes, it cannot show class-specific features. To address this limitation, we also developed a system that can show the class-specific features of a CFV. A class-specific feature vector can be defined as a feature vector that is specific to one class and can separate that class from at least one other class. Although it is possible to use ECLF-CS in multiclass situations, we only consider two-class situations in this paper. ECLF-CS uses a latent vector that contains the classifiable features specific to each class in two separate vectors, which we call $\mathrm{CFVS}_1$ and $\mathrm{CFVS}_2$; they are assigned to the two classes, which we call $c_1$ and $c_2$. In this case, the latent vector $\mathbf{z}$ can be expressed as

$$\mathbf{z} = [\mathrm{CFVS}_1, \mathrm{NCFV}, \mathrm{CFVS}_2]. \tag{20}$$

In the encoding phase, an input image $\mathbf{x}$ is sent through the encoder to produce the $\mathbf{z}$ vector. After that, the input to the decoder is decided based on the class of $\mathbf{x}$. Only the vector assigned to that class and the NCFV are passed to the decoder; for example, if the class is $c_1$, the vector passed to the decoder is $[\mathrm{CFVS}_1, \mathrm{NCFV}]$. For the NCFV, the adversarial loss is used as in Equation (7). The number of classes that can be trained is limited in this approach, and training is slow compared to ECLF because of the class-specific training procedure.
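The class-dependent routing of the latent vector into the decoder can be sketched as follows. The names `cfvs_1` and `cfvs_2`, the equal sizes of the two class-specific vectors, and the encoding of the two classes as labels 0 and 1 are assumptions for illustration only.

```python
import numpy as np

def split_eclf_cs(z, n_cfvs, n_ncfv):
    """Split z laid out as [class-specific vector 1, NCFV, class-specific vector 2].

    Assumption: both class-specific vectors have n_cfvs dimensions.
    """
    cfvs_1 = z[:n_cfvs]
    ncfv = z[n_cfvs:n_cfvs + n_ncfv]
    cfvs_2 = z[n_cfvs + n_ncfv:]
    return cfvs_1, ncfv, cfvs_2

def decoder_input(z, label, n_cfvs, n_ncfv):
    """Keep only the class-specific vector of the image's class, plus the NCFV."""
    cfvs_1, ncfv, cfvs_2 = split_eclf_cs(z, n_cfvs, n_ncfv)
    chosen = cfvs_1 if label == 0 else cfvs_2
    return np.concatenate([chosen, ncfv])

# Toy latent vector with 2 + 3 + 2 dimensions.
z = np.arange(7.0)
first_class_input = decoder_input(z, label=0, n_cfvs=2, n_ncfv=3)
```

Because the unused class-specific vector never reaches the decoder, it receives no reconstruction gradient for that image, which is consistent with the class-specific behaviour described in the text.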

As explained in Section 2.2, ECLF and ECLF-CS features have different characteristics; an ECLF feature shows a characteristic of the whole dataset, while ECLF-CS only shows a classifiable characteristic of one category. Thus, ECLF-CS features can provide direct information on the presence of a given category.

The classifiers for ECLF-CS are trained in the same way as the classifier for ECLF. Since we have two class-specific vectors ($\mathrm{CFVS}_1$ and $\mathrm{CFVS}_2$), we merge them before training the classifier:

$$\mathrm{CFVS} = [\mathrm{CFVS}_1, \mathrm{CFVS}_2]. \tag{21}$$

The important features are determined and visualized according to the procedure described in Section 2.1.3. In contrast to ECLF, when a feature is selected as important in ECLF-CS, the feature visualization depends on the feature vector from which the feature came. If it came from $\mathrm{CFVS}_1$, it is possible that $\mathrm{CFVS}_2$ was not used in producing $\mathbf{x}'$, and vice versa.

In training, the convolutional layers were pretrained using the whole PlantVillage dataset [25], [26], which contains diseased and healthy plant leaves. Then, training on specific datasets was conducted; we discuss the datasets in Section 4.1.1.

Figure 8 Network architecture of the VAE

As Figure 8 shows, the convolutional network of the VAE had five layers. The fully connected layer FC1 had an input and output size of 8,192, and the fully connected layers FC-d and FC-c, which produced the CFV and NCFV, each had an input size of 2,048. The fully connected layer FC-v, which produced the log variance ($\log(\boldsymbol{\sigma}^{2})$) of the variables in the CFV and NCFV (discussed in Section 2.1.1), had a size of 4,098, and the output size of the fully connected layers was determined by the size of the latent vector. The input and output sizes of the network are 128 × 128 × 3.

All classifiers and discriminators had three fully connected layers with rectified linear unit (ReLU) activations. The output size of the CFV determines the input size of the supportive classifier. The VAE was pretrained using the entire PlantVillage dataset for 106,000 iterations, using only the convolutional parts, which act as a pure encoder–decoder architecture.

Figure 9 Encoder–decoder architecture for pretraining

Figure 9 shows the encoder–decoder architecture that was used in the pretraining of the VAE. Up to 120,000 iterations, the system was trained using only $L_{rc} + \epsilon L_{d} + \varepsilon L_{s}$ of Equation (14). The warmup phase [27] was then applied to $L_{rd}$, $TC$, and $DKL$ for training up to 140,000 iterations. Although the warmup phase used KL divergence terms from [26], we also applied the warmup phase to $L_{rd}$ to balance the learning with the KL divergence terms. Although the results were saved and calculated every 20,000 iterations, we always waited until 1,500,000 iterations.

Ideally, $\mathbf{x}'$ would be a good representation of $\mathbf{x}$, but this is not always possible. In this case, the faithfulness of the reconstruction comes into question. In recent years, many researchers, including [28], have tried to solve this problem. Following [28], we also used a discriminative regularization loss. A VGG-16 network [29] was used as the discriminative regularizer, and the first three layers of the network were used for discriminative regularization.

The training was conducted for 5,000 iterations. The model with the best validation accuracy was used for testing. The classifier in the final classifier training had the same architecture as the supportive classifier.
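The staged schedule described above, in which the regularization and disentangling terms are switched on gradually between iterations 120,000 and 140,000, can be sketched as a weight schedule. The linear ramp is an assumption: the text names the warmup phase but does not give its exact shape, and the function name and the `final` parameter are hypothetical.

```python
def warmup_weight(iteration, start=120_000, end=140_000, final=1.0):
    """Weight applied to a warmed-up loss term at a given training iteration.

    Assumption: the weight rises linearly from 0 at `start` to `final` at `end`.
    """
    if iteration <= start:
        return 0.0
    if iteration >= end:
        return final
    return final * (iteration - start) / (end - start)

# The same schedule would scale the L_rd, TC, and DKL coefficients during warmup.
halfway = warmup_weight(130_000)
```

Multiplying the constant coefficient of each warmed-up term by this value leaves the first 120,000 iterations driven only by the reconstruction, adversarial, and supportive losses.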

The linear classifier, which is used in the explanation phase, was trained using samples generated by the VAE, as explained in Section 2.1.3. The input size of the linear classifier was the same as the size of the CFV, and the linear classifier had two outputs. We generated 100,000 sample points to train the classifier. Of these 100,000 samples, those with the lowest softmax probability for the selected class were selected because they were the closest points to the decision boundary. In some cases, the original $\boldsymbol{\sigma}$ was not sufficient to generate enough samples from the other class for linear classifier training because the point that needed explanation was far from the decision boundary; in such cases, we increased $\boldsymbol{\sigma}^{2}_{cfv_i}$ until enough samples were generated.
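The fallback of widening the sampling distribution when the explained point lies far from the decision boundary can be sketched as a simple loop. The multiplicative growth factor, the stopping rule, and the function and parameter names are assumptions for illustration; the text only states that the spread was increased until enough samples from the other class were generated.

```python
import numpy as np

def sample_until_enough(mu, sigma, predict, other_class, min_other,
                        n_samples, rng, growth=1.5, max_rounds=20):
    """Draw samples around mu, inflating the spread until at least
    `min_other` samples are predicted as `other_class`.

    `predict` is a hypothetical callable mapping an (n, d) array of CFVs
    to an array of n predicted class indices.
    """
    scale = 1.0
    for _ in range(max_rounds):
        noise = rng.standard_normal((n_samples, mu.size))
        samples = mu + scale * sigma * noise
        preds = predict(samples)
        if np.count_nonzero(preds == other_class) >= min_other:
            return samples, preds, scale
        scale *= growth  # assumed schedule: multiply the standard deviation
    raise RuntimeError("not enough samples from the other class")

# Toy classifier whose boundary (x0 = 3) is far from the explained point (origin).
toy_predict = lambda x: (x[:, 0] > 3.0).astype(int)
rng = np.random.default_rng(0)
samples, preds, scale = sample_until_enough(
    np.zeros(2), np.full(2, 0.1), toy_predict, other_class=1,
    min_other=50, n_samples=1000, rng=rng)
```

Returning the final scale makes it possible to report how far the sampling had to be widened, which is a rough indication of how far the explained point is from the boundary.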

3. Related Work

3.1. Explainability in Agriculture

As in many disciplines, explainability is a very important aspect in agriculture. Therefore, many researchers who developed agricultural image classification algorithms have tried to incorporate explainability into their approaches. One of the pioneering approaches to explanation is the use of activation maps [26], [30], and [31] introduced a thresholding approach to visualize activation maps. However, visualizing only the first activation layer is not sufficient for a classification that involves the top layers of the neural network. Thus, researchers started to use methods such as saliency and guided backpropagation, which involve the top layers, in their explanations [32]. Another approach is the occlusion map, in which a part of the image is occluded and changes to the activation are observed [33]. To use this method, the user must guess the exact size and shape of the occlusion. In recent years, the Grad-CAM algorithm [9] has become popular among researchers in the agricultural field [34], [35]. Some researchers in the agriculture sector have also developed their own approaches for explaining the classification, involving a U-Net architecture [36]. A good review of explainable approaches can be found in [8]. Most of these methods focus on visualizing the important area. However, these visualizations do not show how a feature is represented in a neural network.

Many attempts have been made to find interpretable representations from data, for example, InfoGAN [37], Beta-VAE [17], and FactorVAE [18]. However, in our research, we tried to understand the independent variations in the image dataset that can be used for the classification of the dataset.

4. Materials and Experiments

4.1. Data

4.1.1. PlantVillage Dataset Subsets

We used a part of the PlantVillage dataset from [25], [26], which contained 39 classes of images (12 healthy and 27 disease classes). This dataset contains a single leaf in each image. We used the segmented versions of the leaf images, which were created in [26]. If we used the full PlantVillage dataset, the system would have to learn not only the differences between respective diseases and between healthy and diseased plants but also the differences between plant types, for example, the differences between grape and potato leaves. In a real field or application condition, the type of plant is known. We only need to know which features in a leaf image the system used to separate one diseased leaf from another diseased leaf or a diseased leaf from a healthy leaf. Therefore, we created original datasets that contained only one type of leaf using the PlantVillage dataset (Table 1). Since some classes like “potato healthy” had a very low number of images, we had to restrict the datasets to 30 validation and 30 testing images per class.

Table 1 Dataset Statistics

| Original Dataset | Containing Classes | Number of Training Samples | Number of Validation Samples/Class | Number of Testing Samples/Class |
|---|---|---|---|---|
| Grape4 | Healthy, Black Rot, Black Measles, Leaf Blight | 3,823 | 30 | 30 |
| Grape2 | Healthy, Leaf Blight | 1,379 | 30 | 30 |
| Apple4 | Healthy, Black Rot, Rust, Scab | 2,931 | 30 | 30 |
| Apple2 | Healthy, Scab | 2,155 | 30 | 30 |
| Potato3 | Healthy, Early Blight, Late Blight | 1,972 | 30 | 30 |
| Potato2 | Healthy, Early Blight | 1,032 | 30 | 30 |
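The per-class split described in this section (30 validation and 30 testing images per class, with the remainder used for training) could be sketched as follows. This is an illustration only: the function name, the file names, and the per-class image counts are assumptions, with the counts chosen so that the training total matches the Potato2 row of Table 1.

```python
import random
from collections import defaultdict

def split_per_class(items, n_val=30, n_test=30, seed=0):
    """Split (path, label) pairs into train/val/test with fixed held-out counts per class."""
    by_class = defaultdict(list)
    for path, label in items:
        by_class[label].append(path)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        paths = sorted(by_class[label])
        rng.shuffle(paths)
        if len(paths) <= n_val + n_test:
            raise ValueError(f"class {label!r} has too few images")
        val += [(p, label) for p in paths[:n_val]]
        test += [(p, label) for p in paths[n_val:n_val + n_test]]
        train += [(p, label) for p in paths[n_val + n_test:]]
    return train, val, test

# Hypothetical file lists; the per-class counts are assumed, not taken from the paper.
items = [(f"potato_healthy_{i}.jpg", "healthy") for i in range(152)]
items += [(f"potato_early_blight_{i}.jpg", "early_blight") for i in range(1000)]
train, val, test = split_per_class(items)
print(len(train), len(val), len(test))
```

Holding out a fixed number of images per class, rather than a fixed fraction, keeps the validation and test sets balanced even when one class is much smaller than the others.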

Figure 10 Sample leaves from three original datasets

[...] with the dimensionality (Figure 11). This may be due to the relaxation of the information bottleneck, which forces the variables to become independent. Theoretically, a VAE can achieve very high reconstruction quality when its latent dimensionality equals or exceeds that required to represent the data [38]. However, this may adversely affect the explainability of the system since the classifiable features are more distributed in higher dimensions. Even though there is a trade-off between $L_{rc}$ and $TC$ in lower dimensions, it seems more advantageous to increase the dimensionality until $L_{rc}$ reaches the flat region. We can see that $TC$ increases with the dimensionality (Figure 11), which suggests that low dimensionality strongly supports the reduction of $TC$ [17]. The classification accuracy appears to increase with the dimensionality as expected (Figure 12) for the Apple4 and Grape4 datasets; this may also be due to the relaxation of the information bottleneck, which encourages an increase in the number of classifiable features. Moreover, since a VAE performs a type of principal component analysis [39], we can think of this as an increase in the number of principal components in the explanations. The flattening curve of the accuracy seems to support this hypothesis. Potato3 shows high accuracy even in low dimensions; 40 dimensions seem sufficient for this dataset. The potato dataset has prominent features for the classification, such as large lesions and color changes (Figure 10), which may explain its high accuracy in low dimensions. When class differences are considered, the apple class seems to reduce $L_{rc}$ faster than the [...]

Figure 13 Feature interpolation visualizations for grape late blight to healthy

In Figure 13, we show the three most important features for each dimension to visualize the discrimination between the grape late blight and healthy classes. From the top row of each dimension, the interpolated features are shown in descending order of importance for the classification. On most occasions, we tried to explain the difference between healthy and diseased image categories. Therefore, the features are interpolated from [...]

[...] β on the Decision Explanation of ECLF

Figure 16 Effect of β on the explanation

Figure 16 shows the effect of β on a 320-dimensional VAE; the visualization was conducted similarly to that in Figure 13. Higher values of β seem to remove the high-frequency features, and it became difficult to distinguish the differences between individual features. Thus, without dimensionality constraints, increasing β does not improve explainability. We trained the VAE with a 320-dimensional latent vector for 1,500,000 iterations. The classification accuracies increased slightly when we moved to 1,500,000 iterations. The reconstruction loss also increased. For the Grape4, Apple4, and Potato3 datasets, classification accuracies were 96.7%, 90.0%, and 94.4%, respectively. The visual quality of the reconstructions increased, although the reconstruction error seemed to increase due to overfitting. We can speculate that the classifiable features require more iterations to train.
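To put the reported accuracies in context, the short calculation below converts each percentage into an implied count of correctly classified images. It assumes that the accuracies were measured on the testing split of 30 images per class from Table 1, which the text does not state explicitly; the helper name is hypothetical.

```python
def correct_predictions(accuracy_percent, n_classes, test_per_class=30):
    """Return (correct, total) implied by a reported accuracy, assuming a balanced test set."""
    total = n_classes * test_per_class
    return round(accuracy_percent / 100 * total), total

# Reported ECLF accuracies and the number of classes in each dataset.
reported = {"Grape4": (96.7, 4), "Apple4": (90.0, 4), "Potato3": (94.4, 3)}
implied = {name: correct_predictions(acc, k) for name, (acc, k) in reported.items()}
for name, (correct, total) in implied.items():
    print(f"{name}: {correct}/{total} correct, one image = {100 / total:.2f} points")
```

Under this assumption, a single test image corresponds to roughly one percentage point, so small accuracy differences between settings amount to one or two images.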

Figure 17 Most important feature for the classification from healthy class

Figure 17 shows how the most important feature for the classification gradually changes from the diseased to the healthy class. We can see the features in the real image represented in the diseased column and the changes in those features from the diseased column toward the healthy side and the diseased side. A mixture of shape and color changes is considered as one feature by the system. When features travel further from the original point and the mean healthy point, we can see that some other disease class features also encroach into the healthy side. This may be because that feature represents the variation of other classes in the database. Thus, the feature may help separate diseased and healthy locally, but it can contain some other class features.

We encountered a few difficulties when obtaining explanations. One problem was that the reconstruction was not 100% accurate; the decoder recreates the image from the latent vector supplied by the encoder, and some information is lost in the latent vector. This loss of information may be the reason for this problem. Research is being conducted to create a VAE able to generate an image very close to the original image [40]. Another problem was that the explanations were sometimes not easily detectable to the human eye; since the interpolation is proportional to the difference between the original point and the mean point of the other class, some explanation interpolations were too small to interpret.

Figure 18 Difficult to interpret interpolations

Figure 18 shows a difficult-to-interpret interpolation, in which the change was so small that the explainability of the interpolation became very low.
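The interpolation discussed above, in which a feature moves from the original latent point toward the mean point of the other class, can be sketched as follows. This is a simplified illustration and not the authors' code: the `importance` vector is a hypothetical per-dimension score, and decoding each interpolated vector with the VAE decoder is omitted.

```python
import numpy as np

def interpolate_top_features(z, other_mean, importance, top_k=3, steps=5):
    """For each of the top_k most important latent dimensions, move that
    dimension alone from its value in z to its value in other_mean."""
    z = np.asarray(z, dtype=float)
    other_mean = np.asarray(other_mean, dtype=float)
    top = np.argsort(-np.abs(importance))[:top_k]  # descending importance
    alphas = np.linspace(0.0, 1.0, steps)
    frames_by_dim = {}
    for dim in top:
        frames = np.tile(z, (steps, 1))  # every other dimension stays fixed
        frames[:, dim] = z[dim] + alphas * (other_mean[dim] - z[dim])
        frames_by_dim[int(dim)] = frames
    return frames_by_dim

# Hypothetical 4-dimensional latent vectors and importance scores.
z = np.array([0.0, 1.0, 2.0, 3.0])            # latent vector of a diseased image
healthy_mean = np.array([4.0, 1.0, 0.0, 3.01])  # mean latent vector of the healthy class
importance = np.array([0.9, 0.1, -0.5, 0.7])
frames = interpolate_top_features(z, healthy_mean, importance)
```

In this example, dimension 3 is ranked as important, yet its interpolation covers a difference of only 0.01, which mirrors the difficult-to-interpret case: the step size is proportional to the distance between the original point and the other class mean.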

Using ECLF-CS, the classification accuracies for the Apple2, Grape2, and Potato2 datasets were 98.3%, 98.3%, and 100.0%, respectively, which were higher than those of ECLF. In particular, the accuracy was higher in the Apple2 dataset, which may be because we conducted class-specific training, and the classifier only had to handle two classes. Even with a very high number of iterations (1,400,000), the classification accuracies did not change significantly; they were 100.0%, 96.8%, and 100.0% for the Apple2, Grape2, and Potato2 datasets, respectively.

We used a 160-dimensional VAE to make ECLF-CS features. We trained the VAE for 1,400,000 iterations. We used the Apple4 and Grape4 classes for this experiment. As previously, we used the lowest loss point, where $L_{rc} + TC + D_{KL}$ reached its minimum. We could visualize two classes by dropping the other class in the visualization. For example, while the diseased side of the encoder produces a diseased vector, the decoder produces the image that is passed to it by the diseased vector. Therefore, there can be some information loss; however, what the decoder produces is what the classifier used for the classification. The healthy side also tries to produce an image that is close to a healthy image. These images show us what the VAE sees in the latent space (Figure 19).

Figure 19 Reconstruction of healthy and diseased images
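Selecting the lowest loss point described above amounts to taking the logged training step at which the sum of the three terms is smallest. A minimal sketch follows; the log structure, field names, and loss values are purely illustrative and are not results from the paper.

```python
def lowest_loss_checkpoint(history):
    """Return the logged entry where L_rc + TC + D_KL is smallest."""
    return min(history, key=lambda h: h["l_rc"] + h["tc"] + h["d_kl"])

# Illustrative log entries only; these numbers are made up for the example.
history = [
    {"iteration": 200_000, "l_rc": 410.0, "tc": 12.0, "d_kl": 55.0},
    {"iteration": 600_000, "l_rc": 380.0, "tc": 9.0, "d_kl": 50.0},
    {"iteration": 1_400_000, "l_rc": 395.0, "tc": 8.0, "d_kl": 49.0},
]
best = lowest_loss_checkpoint(history)
print(best["iteration"])
```

The lowest loss point need not be the final iteration, which is why the text compares results at the lowest loss point with those after 1,400,000 iterations.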

Figure 19 shows the difference between the reconstructed diseased and healthy images at the lowest loss point and after 1,400,000 iterations. ECLF-CS also seems to follow the multiclass classification because a higher number of iterations provides better visualizations.

Figure 20 Important feature visualization for ECLF-CS

In Figure 20, features are class-specific; however, in the classification stage, they are used to differentiate between classes; thus, it shows which part of the feature is considered as another class in the classification. We can clearly divide the features of two-class classification into classes A and B. Grape feature 1 belongs to the grape late-blight class, and grape feature 2 belongs to the healthy grape class. Apple features 1 and 2 of apple scab belong to the healthy apple class.

In ECLF, features are trained to cross the classes in VAE training, so it is easy to understand how the features behave. On the other hand, in ECLF-CS, features are not trained to cross the classes, so we must visualize the feature variation within the class images. Therefore, we know which features belong to which class.

Figure 21 Reconstruction comparison between ECLF and ECLF-CS

Figure 21 shows a reconstruction comparison between ECLF and ECLF-CS features. The reconstruction of ECLF-CS seems to be of better quality. This may be due to the class-specific training we conducted in the VAE training stage. Furthermore, class-specific training may also explain why the apple class showed a larger improvement in the classification accuracy. Compared with multiclass classification, two-class classification seems to handle subtle features more easily. The apple dataset has diseased features that are not very prominent to the eye, such as lesions and darkening.

We used the VGG-11 network [29] to compare the classification accuracy with ECLF and ECLF-CS. For the training of the ECLF for two classes, the same datasets as for the ECLF-CS were used, and the same training and testing parameters that were used in ECLF were used. We note that we did not specifically optimize the VGG-11 network for any class.

Table 2 Comparison of classification accuracies*

| Dataset | VGG-11 | ECLF | ECLF-CS |
|---|---|---|---|
| Grape4 | | | |
| Grape2 | | | |
| Apple4 | | | |
| Apple2 | | | |
| Potato3 | | | |
| Potato2 | | | |

* For the ECLF, the accuracies are shown at a latent vector size of 320, and for ECLF-CS, the accuracies are shown at a latent vector size of 160.

For the Grape4 dataset, we achieved almost the same classification accuracy using ECLF-CS and ECLF. However, ECLF showed a decrease in the classification accuracy in the Apple4 and Potato3 datasets. On the other hand, the ECLF-CS and ECLF showed competitive performances in the two-class classifications. ECLF-CS outperformed VGG-11 on the Apple2 dataset. The reason for this performance must be further studied.

5. Conclusions

We devised a new approach that deviates from the conventional important area visualization. Our approach could explain why the classification occurred based on the dataset variation. We trained the proposed networks with the datasets extracted from the PlantVillage dataset and achieved acceptable accuracy with high explainability. There are some limitations, including low quality of visualization, which must be improved in future studies.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

Acknowledgments

We would like to express our deep gratitude to Dr. TAKEYA Masaru for research management in carrying out this research. We also thank the Plant Protection Station of the Ministry of Agriculture, Forestry and Fisheries of Japan (MAFF), the Hokkaido Research Organization (HRO), the Tokachi Federation of Agricultural Cooperatives, the Center for Seeds and Seedlings (NCSS), the Institute of Vegetable and Floriculture Science (NIVFS), the Institute for Agro-Environmental Sciences (NIAES), and the Hokkaido Agricultural Research Center (HARC) of NARO for their invaluable cooperation.

Funding

Part of this research was supported by grants from the Project of the NARO Bio-oriented Technology Research Advancement Institution (Research Program on Development of Innovative Technology, project ID: 01022C) and a research project of the Ministry of Agriculture, Forestry and Fisheries and a public/private R&D investment strategic expansion program (PRISM) of the Cabinet Office of Japan.

References

[1] A. Kamilaris and F. X. Prenafeta-Boldú, “Deep learning in agriculture: A survey,” Computers and Electronics in Agriculture, vol. 147, pp. 70–90, 2018.
[2] H. Habaragamuwa, Y. Ogawa, T. Suzuki, T. Shiigi, M. Ono, and N. Kondo, “Detecting greenhouse strawberries (mature and immature), using deep convolutional neural network,” Eng. Agric. Environ. Food, vol. 11, no. 3, pp. 127–138, 2018.
[3] A. Barredo Arrieta et al., “Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI,” Inf. Fusion, vol. 58, pp. 82–115, 2020.
[4] J. T. Leek et al., “Tackling the widespread and critical impact of batch effects in high-throughput data,” Nat. Rev. Genet., vol. 11, no. 10, pp. 733–739, 2010.
[5] M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should I trust you?: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[6] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Debiasing word embedding,” in , 2016, pp. 1–9.
[7] R. Geirhos, C. Michaelis, F. A. Wichmann, P. Rubisch, M. Bethge, and W. Brendel, “ImageNet-Trained CNNs Are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness,” arXiv Prepr. arXiv:1811.12231, 2018.
[8] Y. Toda and F. Okura, “How Convolutional Neural Networks Diagnose Plant Disease,” Plant Phenomics, vol. 2019, pp. 1–14, 2019.
[9] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[10] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. R. Müller, “Explaining nonlinear classification decisions with deep Taylor decomposition,” Pattern Recognit., vol. 65, pp. 211–222, 2017.
[11] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” arXiv Prepr. arXiv:1906.02691, 2019.
[12] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, “Fader networks: Generating image variations by sliding attribute values,” in , 2017, pp. 5969–5978.
[13] S. Arora, Y. Liang, and T. Ma, “Why are deep nets reversible: A simple theory, with implications for training,” arXiv Prepr. arXiv:1511.05653, 2015.
[14] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv Prepr. arXiv:1612.00410, 2016.
[15] C. P. Burgess et al., “Understanding disentangling in β-VAE,” arXiv Prepr. arXiv:1804.03599, 2018.
[16] E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh, “Disentangling disentanglement in variational autoencoders,” in International Conference on Machine Learning, 2019, pp. 4402–4412.
[17] I. Higgins et al., “beta-VAE: Learning basic visual concepts with a constrained variational framework,” 2016.
[18] H. Kim and A. Mnih, “Disentangling by factorising,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 2649–2658.
[19] T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 2615–2625.
[20] Y. Luo, H.-H. Tseng, S. Cui, L. Wei, R. K. Ten Haken, and I. El Naqa, “Balancing accuracy and interpretability of machine learning approaches for radiation treatment outcomes modeling,” BJR|Open, vol. 1, no. 1, p. 20190021, 2019.
[21] L. Veiber, K. Allix, Y. Arslan, T. F. Bissyandé, and J. Klein, “Challenges towards production-ready explainable machine learning,” in OpML 2020 - 2020 USENIX Conference on Operational Machine Learning, 2020, pp. 3–5.
[22] N. Xie, G. Ras, M. van Gerven, and D. Doran, “Explainable deep learning: A field guide for the uninitiated,” arXiv Prepr. arXiv:2004.14545, 2020.
[23] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv Prepr. arXiv:1412.6806, 2014.
[24] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. R. Müller, and W. Samek, “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation,” PLoS One, vol. 10, no. 7, pp. 1–46, 2015.
[25] D. P. Hughes and M. Salathé, “An open access repository of images on plant health to enable the development of mobile disease diagnostics,” arXiv Prepr. arXiv:1511.08060, 2015.
[26] S. P. Mohanty, D. P. Hughes, and M. Salathé, “Using deep learning for image-based plant disease detection,” Front. Plant Sci., vol. 7, pp. 1–10, 2016.
[27] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in Advances in Neural Information Processing Systems, 2016, pp. 3745–3753.
[28] A. Lamb, V. Dumoulin, and A. Courville, “Discriminative Regularization for Generative Models,” arXiv Prepr. arXiv:1602.03220, 2016.
[29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv Prepr. arXiv:1409.1556, 2015.
[30] P. Jiang, Y. Chen, B. Liu, D. He, and C. Liang, “Real-Time Detection of Apple Leaf Diseases Using Deep Learning Approach Based on Improved Convolutional Neural Networks,” IEEE Access, vol. 7, pp. 59069–59080, 2019.
[31] S. Ghosal, D. Blystone, A. K. Singh, B. Ganapathysubramanian, A. Singh, and S. Sarkar, “An explainable deep machine vision framework for plant stress phenotyping,” Proc. Natl. Acad. Sci. U. S. A., vol. 115, no. 18, pp. 4613–4618, May 2018.
[32] M. Brahimi, M. Arsenovic, S. Laraba, S. Sladojevic, K. Boukhalfa, and A. Moussaoui, “Deep Learning for Plant Diseases: Detection and Saliency Map Visualisation,” in Human and Machine Learning, Springer, Cham, 2018, pp. 93–117.
[33] M. Brahimi, K. Boukhalfa, and A. Moussaoui, “Deep Learning for Tomato Diseases: Classification and Symptoms Visualization,” Appl. Artif. Intell., vol. 31, no. 4, pp. 299–315, 2017.
[34] S. V. Desai, V. N. Balasubramanian, T. Fukatsu, S. Ninomiya, and W. Guo, “Automatic estimation of heading date of paddy rice using deep learning,” Plant Methods, vol. 15, no. 1, pp. 1–11, 2019.
[35] M. F. Hansen et al., “Towards on-farm pig face recognition using convolutional neural networks,” Comput. Ind., vol. 98, pp. 145–152, 2018.
[36] M. Brahimi, S. Mahmoudi, K. Boukhalfa, and A. Moussaoui, “Deep interpretable architecture for plant diseases classification,” in , 2019, pp. 111–116.
[37] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” Adv. Neural Inf. Process. Syst., pp. 2180–2188, 2016.
[38] B. Dai and D. Wipf, “Diagnosing and Enhancing VAE Models,” arXiv Prepr. arXiv:1903.05789, 2019.
[39] M. Rolinek, D. Zietlow, and G. Martius, “Variational autoencoders pursue PCA directions (by accident),” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12406–12415.
[40] X. Hou, K. Sun, L. Shen, and G. Qiu, “Improving variational autoencoder with deep feature consistent and generative adversarial training,” Neurocomputing, vol. 341, pp. 183–194, 2019.