ACM Turing Award Celebration Conference - China (ACM TURC 2021) | 2021

Fighting Adversarial Images With Interpretable Gradients


Abstract


Adversarial images are specifically designed to fool neural networks into making wrong decisions about what they see, severely degrading network accuracy. Recent empirical and theoretical evidence suggests that robust neural network models tend to have more interpretable gradients. We therefore conjecture that improving the interpretability of a model's gradients may also improve its robustness. We propose two methods that add gradient-dependent constraint terms to the loss function, both of which improve model robustness. The first method adds a fused lasso penalty on the saliency maps to the loss function, which encourages the saliency maps to take a more natural, piecewise-smooth form and thereby improves their interpretability; it also replaces the standard ReLU with a gradient-enhanced ReLU to strengthen the effect of the regularization term on the saliency maps. The second method adds a cosine-similarity penalty between the input gradients and the image contours to the loss function, pushing the input gradients toward the image contours. This method has some biological motivation, since the human visual system relies on contour information to recognize images. Both methods improve the interpretability of the model's gradients. On MNIST, the first method outperforms most regularization methods other than adversarial training; on CIFAR-10 and CIFAR-100, the second method even surpasses adversarial training under white-box attacks.
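The abstract describes two gradient-based penalty terms added to the training loss. The sketch below is a minimal illustration, not the paper's exact formulation: it assumes the saliency map is approximated by the input gradient of the loss, approximates the fused lasso term with a total-variation-style difference penalty, and uses a precomputed edge map as the "image contour"; the gradient-enhanced ReLU mentioned in the first method is omitted, and the weights `lambda_fl` and `lambda_cos` are placeholder hyperparameters.

```python
# Sketch of the two gradient-dependent penalties (assumptions noted above).
import torch
import torch.nn.functional as F

def fused_lasso_penalty(saliency):
    """Fused-lasso-style penalty on saliency maps: encourages neighbouring
    pixels to take similar attribution values (piecewise-smooth saliency)."""
    dh = (saliency[:, :, 1:, :] - saliency[:, :, :-1, :]).abs().mean()
    dw = (saliency[:, :, :, 1:] - saliency[:, :, :, :-1]).abs().mean()
    return dh + dw

def contour_alignment_penalty(input_grad, contour):
    """1 - cosine similarity between the input gradient and a precomputed
    image contour (e.g. an edge map), computed per sample."""
    g = input_grad.flatten(1)
    c = contour.flatten(1)
    return (1.0 - F.cosine_similarity(g, c, dim=1)).mean()

def regularized_loss(model, x, y, contour, lambda_fl=0.01, lambda_cos=0.1):
    """Cross-entropy loss plus the two illustrative gradient penalties."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    # Input gradient kept in the graph (create_graph=True) so that the
    # penalty terms can be back-propagated through a double backward pass.
    grad = torch.autograd.grad(ce, x, create_graph=True)[0]
    loss = ce
    loss = loss + lambda_fl * fused_lasso_penalty(grad)
    loss = loss + lambda_cos * contour_alignment_penalty(grad, contour)
    return loss
```

In practice each method would use only one of the two penalties; they are shown together here purely to keep the sketch compact.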

DOI 10.1145/3472634.3472644
Language English
Journal ACM Turing Award Celebration Conference - China (ACM TURC 2021)
