Active Subspace of Neural Networks: Structural Analysis and Universal Attacks
Chunfeng Cui, Kaiqi Zhang, Talgat Daulbaev, Julia Gusak, Ivan Oseledets, Zheng Zhang
∗ Submitted to the editors in October 2019.
Funding: Chunfeng Cui, Kaiqi Zhang, and Zheng Zhang are supported by the UCSB start-up grant. Talgat Daulbaev, Julia Gusak, and Ivan Oseledets are supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).
† University of California Santa Barbara, Santa Barbara, CA, USA ([email protected], [email protected], [email protected]).
‡ Skolkovo Institute of Science and Technology, Moscow, Russia ([email protected], [email protected]).
§ Skolkovo Institute of Science and Technology and Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow, Russia ([email protected]).
Codes are available at: https://github.com/chunfengc/ASNet

Abstract.
Active subspace is a model reduction method widely used in the uncertainty quantification community. In this paper, we propose analyzing the internal structure and vulnerability of deep neural networks using the active subspace. Firstly, we employ the active subspace to measure the number of "active neurons" at each intermediate layer, which indicates that the number of neurons can be reduced from several thousands to several dozens. This motivates us to change the network structure and to develop a new and more compact network, referred to as ASNet, that has significantly fewer model parameters. Secondly, we propose analyzing the vulnerability of a neural network using the active subspace by finding an additive universal adversarial attack vector that can misclassify a dataset with a high probability. Our experiments on CIFAR-10 show that ASNet can achieve 23.98× parameter and 7.30× flops reduction. The universal active-subspace attack vector can achieve around 20% higher attack ratio compared with the existing approaches in our numerical experiments. The PyTorch codes for this paper are available online. Key words.
Active Subspace, Deep Neural Network, Network Reduction, Universal Adversarial Perturbation
AMS subject classifications.
1. Introduction.
Deep neural networks have achieved impressive performance in many applications, such as computer vision [35], natural language processing [58], and speech recognition [23]. Most neural networks use a deep structure (i.e., many layers) and a huge number of neurons to achieve high accuracy and expressive power [44, 19]. However, it is still unclear how many layers and neurons are necessary. Employing an unnecessarily complicated deep neural network can cause huge extra costs in run-time and hardware resources. Driven by resource-constrained applications such as robotics and the internet of things, there is an increasing interest in building smaller neural networks by removing network redundancy. Representative methods include network pruning and sharing [17, 25, 27, 39, 38], low-rank matrix and tensor factorization [49, 26, 18, 36, 43], parameter quantization [12, 15], knowledge distillation [28, 46], and so forth. However, most existing methods delete model parameters directly without changing the network architecture [27, 25, 7, 38].

Another important issue of deep neural networks is the lack of robustness. A deep neural network should maintain good performance for noisy or corrupted data in order to be deployed in safety-critical applications such as autonomous driving and medical image analysis. However, recent studies have revealed that many state-of-the-art deep neural networks are vulnerable to small perturbations [54]. A substantial number of methods have been proposed to generate adversarial examples; representative works can be classified into four classes [52]: optimization methods [8, 41, 40, 54], sensitive features [22, 45], geometric transformations [16, 32], and generative models [4]. However, these methods share a fundamental limitation: each perturbation is designed for a given data point, and the algorithm has to be run again to generate a perturbation for a new data sample. Recently, several methods have been proposed to compute a universal adversarial attack that fools a whole dataset simultaneously (rather than one data sample) in various applications, such as computer vision [40], speech recognition [42], audio [1], and text classification [5]. However, all of the above methods solve a series of data-dependent sub-problems. In [33], Khrulkov et al. proposed to construct universal perturbations by computing the so-called (p, q)-singular vectors of the Jacobian matrices of the hidden layers of a network.

This paper investigates the above two issues with the active subspace method [48, 9, 10], which was originally developed for uncertainty quantification. The key idea of the active subspace is to identify the low-dimensional subspace spanned by the important directions that contribute significantly to the variance of a multi-variable function. These directions correspond to the principal components of the uncentered covariance matrix of the gradients. Afterwards, a response surface can be constructed in this low-dimensional subspace to reduce the number of parameters for partial differential equations [10] and uncertainty quantification [11]. However, the power of the active subspace in analyzing and attacking deep neural networks has not been explored.
The contribution of this manuscript is twofold.
• Firstly, we apply the active subspace to some intermediate layers of a deep neural network and try to answer the following question: how many neurons and layers are important in a deep neural network? Based on the active subspace, we propose the definition of "active neurons". Fig. 1 (a) shows that even though there are tens of thousands of neurons, only dozens of them are important from the active-subspace point of view. Fig. 1 (b) further shows that most of the neural network parameters are distributed in the last few layers. This motivates us to cut off the tail layers and replace them with a smaller and simpler new framework called ASNet. ASNet contains three parts: the first few layers of a deep neural network, an active-subspace layer that maps the intermediate neurons to a low-dimensional subspace, and a polynomial chaos expansion layer that projects the reduced variables to the outputs. Our numerical experiments show that the proposed ASNet has far fewer model parameters than the original network. ASNet can also be combined with existing structured re-training methods (e.g., pruning and quantization) to get better accuracy while using fewer model parameters.
• Secondly, we use the active subspace to develop a new universal attack method to fool deep neural networks on a whole data set. We formulate this problem as a ball-constrained loss maximization problem and propose a heuristic projected gradient descent algorithm to solve it. At each iteration, the ascent direction is the dominant active-subspace direction, and the stepsize is decided by a backtracking algorithm. Fig. 1 (c) shows that the attack ratio of the active-subspace direction is much higher than that of a random vector.

Figure 1: Structural analysis of deep neural networks by the active subspace (AS). All experiments are conducted on CIFAR-10 with VGG-19. (a) The number of neurons can be significantly reduced by the active subspace; the number of active neurons is defined by Definition 3.1 with a threshold ǫ = 0.05. (b) Most of the parameters are distributed in the last few layers (number of parameters versus the cut-off layer l). (c) The active-subspace direction perturbs the network much more than a random direction (testing accuracy versus the ℓ-norm of the perturbation).

The rest of this manuscript is organized as follows. In Section 2, we review the key idea of the active subspace. Based on the active-subspace method, Section 3 shows how to find the number of active neurons in a deep neural network and further proposes a new and compact network, referred to as ASNet. Section 4 develops a new universal adversarial attack method based on the active subspace. The numerical experiments for both ASNet and the universal adversarial attack are presented in Section 5. Finally, we conclude this paper in Section 6.
2. Active Subspace.
Active subspace is an efficient tool for functional analysis and dimension reduction. Its key idea is to construct a low-dimensional subspace for the input variables in which the function value changes dramatically. Given a continuous function c(x) with x described by the probability density function ρ(x), one can construct an uncentered covariance matrix for the gradient: C = E[∇c(x)∇c(x)^T]. Suppose the matrix C admits the following eigenvalue decomposition,

(2.1) C = VΛV^T,

where V includes all orthogonal eigenvectors and

(2.2) Λ = diag(λ_1, · · · , λ_n), with λ_1 ≥ · · · ≥ λ_n ≥ 0 because C is positive semidefinite.

One can split the matrix V into two parts,

(2.3) V = [V_1, V_2], where V_1 ∈ R^{n×r} and V_2 ∈ R^{n×(n−r)}.

The subspace spanned by the matrix V_1 ∈ R^{n×r} is called an active subspace [48], because c(x) is sensitive to perturbation vectors inside this subspace.
Remark 2.1 (Relationship with principal component analysis).
Given a set of data X = [x_1, . . . , x_m] with each column representing a data sample and each row being zero-mean, the first principal component w inherits the maximal variance of X, namely,

(2.4) w = argmax_{||w||=1} Σ_{i=1}^m (w^T x_i)^2 = argmax_{||w||=1} w^T XX^T w.

The variance is maximized when w is the eigenvector associated with the largest eigenvalue of XX^T, and the first r principal components are the r eigenvectors associated with the r largest eigenvalues of XX^T. The main difference from the active subspace is that principal component analysis uses the covariance matrix of the input data set X, whereas the active-subspace method uses the covariance matrix of the gradient ∇c(x). Hence, a perturbation along the direction w from (2.4) only guarantees variability in the data and does not necessarily cause a significant change in the value of c(x).

The following lemma quantitatively describes that c(x) varies more on average along the directions defined by the columns of V_1 than along the directions defined by the columns of V_2.

Lemma 2.2. [10] Suppose c(x) is a continuous function and C is obtained from (2.1). For the matrices V_1 and V_2 generated by (2.3) and the reduced vectors

(2.5) z = V_1^T x and z̃ = V_2^T x,

it holds that

(2.6) E_x[∇_z c(x)^T ∇_z c(x)] = λ_1 + . . . + λ_r, E_x[∇_z̃ c(x)^T ∇_z̃ c(x)] = λ_{r+1} + . . . + λ_n.

Sketch of proof [10]: E_x[∇_z c(x)^T ∇_z c(x)] = trace(E_x[∇_z c(x)∇_z c(x)^T]) = trace(E_x[V_1^T ∇_x c(x)∇_x c(x)^T V_1]) = trace(V_1^T C V_1) = λ_1 + . . . + λ_r.

When λ_{r+1} = . . . = λ_n = 0, Lemma 2.2 implies that ∇_z̃ c(x) is zero everywhere, i.e., c(x) is z̃-invariant. In this case, we may reduce x ∈ R^n to a low-dimensional vector z = V_1^T x ∈ R^r and construct a new response surface g(z) to represent c(x). Otherwise, if λ_{r+1} is small, we may still construct a response surface g(z) that approximates c(x) with a bounded error, as shown in the following lemma. For a fixed z, the best guess for g is the conditional expectation of c given z, i.e.,

(2.7) g(z) = E_z̃[c(x) | z] = ∫ c(V_1 z + V_2 z̃) ρ(z̃ | z) d z̃.

Based on the Poincaré inequality, the following approximation error bound is obtained [10].
Lemma 2.3. Assume that c(x) is absolutely continuous and square integrable with respect to the probability density function ρ(x). Then the approximation function g(z) in (2.7) satisfies

(2.8) E[(c(x) − g(z))^2] ≤ O(λ_{r+1} + . . . + λ_n).

Sketch of proof [10]:

E_x[(c(x) − g(z))^2] = E_z[E_z̃[(c(x) − g(z))^2 | z]]
≤ const × E_z[E_z̃[∇_z̃ c(x)^T ∇_z̃ c(x) | z]]  (Poincaré inequality)
= const × E_x[∇_z̃ c(x)^T ∇_z̃ c(x)]
= const × (λ_{r+1} + . . . + λ_n)  (Lemma 2.2)
= O(λ_{r+1} + . . . + λ_n).

In other words, the active-subspace approximation error will be small if λ_{r+1}, . . . , λ_n are negligible.
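To make the construction above concrete, the following sketch (a minimal NumPy illustration, not the authors' implementation) estimates C from Monte Carlo gradient samples, computes its eigendecomposition as in (2.1)-(2.3), and truncates to the leading directions that capture a prescribed fraction of the gradient energy; the toy function, sampling routine, and threshold are illustrative assumptions.

```python
import numpy as np

def active_subspace(grad_fn, sample_fn, n_samples=2000, energy=0.99):
    """Estimate an active subspace from gradient samples.

    grad_fn(x)  : gradient of c at x (1-D array of length n)
    sample_fn(m): m samples of x drawn from rho(x), shape (m, n)
    """
    X = sample_fn(n_samples)
    G = np.stack([grad_fn(x) for x in X], axis=1)   # gradient matrix, n x m
    C = (G @ G.T) / n_samples                       # uncentered covariance E[grad grad^T]
    lam, V = np.linalg.eigh(C)                      # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]                  # sort descending, as in (2.2)
    ratio = np.cumsum(lam) / np.sum(lam)            # fraction of gradient energy captured
    r = int(np.searchsorted(ratio, energy) + 1)     # smallest r reaching the target fraction
    return lam, V[:, :r], r                         # eigenvalues, V_1, and the active dimension

# Toy example: c(x) = sin(a^T x) varies only along the single direction a.
n = 20
a = np.linspace(1.0, 2.0, n)
grad = lambda x: np.cos(a @ x) * a
sample = lambda m: np.random.default_rng(0).uniform(-1.0, 1.0, size=(m, n))

lam, V1, r = active_subspace(grad, sample)
print("estimated active dimension:", r)             # 1 for this toy function
z = sample(5) @ V1                                   # reduced variables z = V_1^T x
```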
3. Active Subspace for Structural Analysis and Compression of Deep Neural Networks.
This section applies the active subspace to analyze the internal layers of a deep neural network and to reveal the number of important neurons at each layer. Afterward, a new network called ASNet is built to reduce the storage and computational complexity.
A deep neural network can be described as

(3.1) f(x) = f_L(f_{L−1}(. . . (f_1(x)))),

where x ∈ R^{n_0} is an input, L is the total number of layers, and f_l : R^{n_{l−1}} → R^{n_l} is a function representing the l-th layer (e.g., a combination of convolution or fully connected, batch normalization, ReLU, or pooling layers). For any 1 ≤ l ≤ L, we rewrite the above feed-forward model as a superposition of functions, i.e.,

(3.2) f(x) = f^l_post(f^l_pre(x)),

where the pre-model f^l_pre(·) = f_l(. . . (f_1(·))) denotes all operations up to the l-th layer and the post-model f^l_post(·) = f_L(. . . (f_{l+1}(·))) denotes all succeeding operations. The intermediate neuron x_l = f^l_pre(x) ∈ R^{n_l} usually lies in a high-dimensional space. We aim to study whether such a high dimensionality is necessary and, if not, how to reduce it. Denote loss(·) as the loss function, and let

(3.3) c_l(x_l) = loss(f^l_post(x_l)).

The covariance matrix C = E[∇c_l(x_l)∇c_l(x_l)^T] admits the eigenvalue decomposition C = VΛV^T with Λ = diag(λ_1, · · · , λ_{n_l}). We extract the active subspace of c_l(x_l) and reduce the intermediate vector x_l to a low dimension. Here the intermediate neuron x_l, the covariance matrix C, the eigenvalues Λ, and the eigenvectors V all depend on the layer index l, but we omit the index for simplicity.
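As an illustration of the splitting in (3.2)-(3.3), the PyTorch sketch below splits a network into a pre-model and a post-model at a layer index l and evaluates the gradient of the loss with respect to the intermediate neurons x_l, which is the quantity whose covariance defines the layer-wise active subspace. The toy network, the layer index, and the batch are placeholders, not the authors' released code.

```python
import torch
import torch.nn as nn

# A small stand-in network; any sequential-style model can be split the same way.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10),
)
l = 3                                                      # layers 0..l-1 form the pre-model
pre_model = nn.Sequential(*list(model.children())[:l])     # f_pre^l
post_model = nn.Sequential(*list(model.children())[l:])    # f_post^l
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 28, 28)                              # a mini-batch of inputs
y = torch.randint(0, 10, (8,))                             # labels

x_l = pre_model(x).detach()                                # intermediate neurons x_l = f_pre^l(x)
x_l.requires_grad_(True)                                   # treat x_l as the variable of c_l
c_l = loss_fn(post_model(x_l), y)                          # c_l(x_l) = loss(f_post^l(x_l))
grad_x_l = torch.autograd.grad(c_l, x_l)[0]                # gradient of c_l w.r.t. x_l
print(grad_x_l.shape)                                      # one gradient per sample in the batch
```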
Definition 3.1 (number of active neurons). Suppose Λ is computed by (2.2) for the function c_l in (3.3). For any layer index 1 ≤ l ≤ L, we define the number of active neurons n_{l,AS} as follows:

(3.4) n_{l,AS} = arg min { i : (λ_1 + . . . + λ_i) / (λ_1 + . . . + λ_{n_l}) ≥ 1 − ǫ },

where ǫ > 0 is a user-specified threshold. Then c_l(x_l) can be approximated by the following n_{l,AS}-dimensional function with a high accuracy:

(3.5) g_l(z) = E_z̃[c_l(x_l) | z].

Here z = V_1^T x_l ∈ R^{n_{l,AS}} plays the role of the active neurons, z̃ = V_2^T x_l ∈ R^{n_l − n_{l,AS}}, and V = [V_1, V_2].

Lemma 3.1.
Suppose the input x is bounded. Consider a deep neural network with the following operations: convolution, fully connected, ReLU, batch normalization, and max-pooling, equipped with the cross entropy loss function. Then for any l ∈ {1, . . . , L}, x_l = f^l_pre(x), and c_l(x_l) = loss(f^l_post(x_l)), the n_{l,AS}-dimensional function g_l(z) defined in (3.5) satisfies

(3.6) E_z[(g_l(z))^2] ≤ 2 E_x[(c(x))^2] + O(ǫ).
Proof. Denote c_l(x_l) = loss(f_L(. . . (f_{l+1}(x_l)))), where loss(y) = −log( exp(y_b) / Σ_{i=1}^{n_L} exp(y_i) ) is the cross entropy loss function, b is the true label, and n_L is the total number of classes. We first show that c_l(x_l) is absolutely continuous and square integrable, and then apply Lemma 2.3 to derive (3.6).

Firstly, all components of c_l(x_l) are Lipschitz continuous because (1) the convolution, fully connected, and batch normalization operations are all linear; (2) the max-pooling and ReLU functions are non-expansive, where a mapping m is non-expansive if ||m(x̄) − m(x)|| ≤ ||x̄ − x||; and (3) the cross entropy loss function is smooth with a bounded gradient, i.e., ||∇loss(y)|| = ||e_b − exp(y)/Σ_{i=1}^{n_L} exp(y_i)|| ≤ √n_L. The composition of two Lipschitz continuous functions is also Lipschitz continuous: if the Lipschitz constants of f_1 and f_2 are α_1 and α_2, respectively, then ||f_2(f_1(x̄)) − f_2(f_1(x))|| ≤ α_2 ||f_1(x̄) − f_1(x)|| ≤ α_2 α_1 ||x̄ − x|| for any vectors x̄ and x. By recursively applying this rule, c_l(x_l) is Lipschitz continuous:

||c_l(x̄_l) − c_l(x_l)|| = ||loss(f_L(. . . (f_{l+1}(x̄_l)))) − loss(f_L(. . . (f_{l+1}(x_l))))|| ≤ √n_L α_L . . . α_{l+1} ||x̄_l − x_l||.

The intermediate neuron x_l lies in a bounded domain because the input x is bounded and all functions f_i(·) are either linear or non-expansive. Based on the fact that any Lipschitz-continuous function is also absolutely continuous on a compact domain [47], we conclude that c_l(x_l) is absolutely continuous.

Secondly, because x_l is bounded and c_l(x_l) is continuous, both c_l(x_l) and its square integral are bounded, i.e., ∫ (c_l(x_l))^2 ρ(x_l) d x_l < ∞.

Finally, by Lemma 2.3, it holds that E_x[(c_l(x_l) − g_l(z))^2] ≤ O(λ_{n_{l,AS}+1} + . . . + λ_{n_l}). From Definition 3.1, we have

λ_{n_{l,AS}+1} + . . . + λ_{n_l} ≤ (λ_1 + . . . + λ_{n_l}) ǫ = ||C^{1/2}||_F^2 ǫ = O(ǫ).

In the last equality, we used that ||C^{1/2}||_F^2 is upper bounded because c_l(x_l) is Lipschitz continuous with a bounded gradient. Consequently,

E_x[(g_l(z))^2] = E_x[(g_l(z) − c_l(x_l) + c_l(x_l))^2] ≤ 2 E_x[(c_l(x_l))^2] + 2 E_x[(c_l(x_l) − g_l(z))^2] = 2 E_x[(c(x))^2] + 2 E_x[(c_l(x_l) − g_l(z))^2] ≤ 2 E_x[(c(x))^2] + O(ǫ).

The proof is completed.

The above lemma shows that the active subspace method can reduce the number of neurons at the l-th layer from n_l to n_{l,AS}. The loss of the low-dimensional function g_l(z) is bounded by two terms: the loss c(x) of the original network and the threshold ǫ related to n_{l,AS}. This loss is the cross entropy loss, not the classification error; however, a small loss is generally believed to lead to a small classification error. Further, the result in Lemma 3.1 holds for the fixed parameters in the pre-model; in practice, we can fine-tune the pre-model to achieve better accuracy.

Furthermore, a small number of active neurons n_{l,AS} is critical for a high compression ratio. From Definition 3.1, n_{l,AS} depends on the eigenvalue distribution of the covariance matrix C. For a proper network structure and a good choice of the layer index l, if the spectrum of C is dominated by the first few eigenvalues, then n_{l,AS} will be small. For instance, in Fig. 5 (a), the eigenvalues for layers 4 ≤ l ≤ 7 of VGG-19 decay almost exponentially, so only a few dozen active neurons are needed.

This subsection proposes a new network called ASNet that can reduce both the storage and computational cost.
Given a deep neural network, we first choose a proper layer l and project the high-dimensional intermediate neurons onto a low-dimensional vector in the active subspace. Afterward, the post-model is deleted completely and replaced with a nonlinear model that maps the low-dimensional active feature vector to the output directly. This new network, called ASNet, has three parts:
(1) Pre-model: the first l layers of the original deep neural network.
(2) Active subspace layer: a linear projection from the intermediate neurons to the low-dimensional active subspace.
(3) Polynomial chaos expansion layer: the polynomial chaos expansion [20, 56] maps the active-subspace variables to the output.
The initializations of the active subspace layer and the polynomial chaos expansion layer are presented in Sections 3.4 and 3.5, respectively. We can also retrain all the parameters to increase the accuracy. The whole procedure is illustrated in Fig. 2 (b) and Algorithm 3.1.
Algorithm 3.1
The training procedure of the active subspace network (ASNet)
Input: A pretrained deep neural network, the layer index l, and the number of active neurons r.
Step 1 (Initialize the active subspace layer). The active subspace layer is a linear projection whose projection matrix V ∈ R^{n_l × r} is computed by Algorithm 3.2. If r is not given, we use r = n_{l,AS} defined in (3.4) by default.
Step 2 (Initialize the polynomial chaos expansion layer). The polynomial chaos expansion layer is a nonlinear mapping from the reduced active-subspace variables to the outputs, as shown in (3.10). The weights c_α are computed by (3.12).
Step 3 (Construct ASNet). Combine the pre-model (the first l layers of the deep neural network) with the active subspace and polynomial chaos expansion layers into a new network, referred to as ASNet.
Step 4 (Fine-tuning). Retrain all the parameters in the pre-model, the active subspace layer, and the polynomial chaos expansion layer of ASNet for several epochs by stochastic gradient descent.
Output: A new network, ASNet.
Figure 2: (a) The original deep neural network; (b) the proposed ASNet with three parts: a pre-model, an active subspace (AS) layer, and a polynomial chaos expansion (PCE) layer.
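The sketch below shows one way to assemble the three parts of ASNet as a PyTorch module. It is a simplified stand-in, not the released implementation: the pre-model and the projection matrix are placeholders for the outputs of Algorithm 3.2, and the degree-2 polynomial feature map plays the role of the PCE layer with the coefficients c_α as trainable weights.

```python
import torch
import torch.nn as nn

class PolyChaosLayer(nn.Module):
    """Maps the r active variables z to the n_out outputs through a degree-2
    polynomial feature map followed by a trainable linear combination; the
    linear weights play the role of the PCE coefficients c_alpha."""
    def __init__(self, r, n_out):
        super().__init__()
        n_feat = 1 + r + r * (r + 1) // 2            # constant, linear, and quadratic terms
        self.linear = nn.Linear(n_feat, n_out, bias=False)
        self.idx = torch.triu_indices(r, r)          # index pairs for the quadratic terms

    def features(self, z):
        quad = (z.unsqueeze(2) * z.unsqueeze(1))[:, self.idx[0], self.idx[1]]
        ones = torch.ones(z.shape[0], 1, device=z.device)
        return torch.cat([ones, z, quad], dim=1)

    def forward(self, z):
        return self.linear(self.features(z))

class ASNet(nn.Module):
    def __init__(self, pre_model, projection, n_out):
        super().__init__()
        self.pre_model = pre_model                   # first l layers of the original network
        n_l, r = projection.shape                    # projection matrix is n_l x r
        self.as_layer = nn.Linear(n_l, r, bias=False)   # active subspace projection z = V_1^T x_l
        with torch.no_grad():
            self.as_layer.weight.copy_(projection.t())
        self.pce = PolyChaosLayer(r, n_out)

    def forward(self, x):
        x_l = self.pre_model(x).flatten(1)           # intermediate neurons
        z = self.as_layer(x_l)                       # r active variables
        return self.pce(z)                           # predicted logits

# Example with placeholder components (shapes only):
pre = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4))
V1 = torch.linalg.qr(torch.randn(8 * 4 * 4, 10)).Q   # stand-in for the Algorithm 3.2 output
net = ASNet(pre, V1, n_out=10)
print(net(torch.randn(2, 3, 32, 32)).shape)           # torch.Size([2, 10])
```

After this initialization, the whole module can be fine-tuned end to end (Step 4 of Algorithm 3.1) with an ordinary optimizer.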
This subsection presents an efficient method to project the high-dimensional neurons onto the active subspace. Given a dataset D = {x_1, . . . , x_m}, the empirical covariance matrix is computed by Ĉ = (1/m) Σ_{i=1}^m ∇c_l(x_{l,i}) ∇c_l(x_{l,i})^T, where x_{l,i} = f^l_pre(x_i). When ReLU is applied as an activation, c_l(x_l) is not differentiable everywhere; in this case, ∇ denotes a sub-gradient with a slight abuse of notation.

Instead of calculating the eigenvalue decomposition of Ĉ, we compute the singular value decomposition of Ĝ to save computational cost:

(3.7) Ĝ = [∇c_l(x_{l,1}), . . . , ∇c_l(x_{l,m})] = V̂ Σ̂ Û^T ∈ R^{n_l × m}, with Σ̂ = diag(σ̂_1, · · · , σ̂_{n_l}).

The eigenvectors of C are approximated by the left singular vectors V̂, and the eigenvalues of C are approximated by the squared singular values of Ĝ, i.e., Λ ≈ Σ̂^2.

We use the memory-saving frequent direction method [21] to compute the r dominant singular components, i.e., Ĝ ≈ V̂_r Σ̂_r Û_r^T, where r is smaller than the total number of samples. The frequent direction approach only stores an n_l × r matrix S. At the beginning, each column of S ∈ R^{n_l × r} is initialized by a gradient vector. Then the randomized singular value decomposition [24] is used to generate S = VΣU^T. Afterwards, S is updated by the soft-thresholding step

(3.8) S ← V sqrt(Σ^2 − σ_r^2 I).

Now the last column of S is zero, and we replace it with the gradient vector of a new sample. By repeating this process, SS^T approximates Ĝ Ĝ^T with a high accuracy, and V approximates the left singular vectors of Ĝ. The algorithm framework is presented in Algorithm 3.2.

Algorithm 3.2 The frequent direction algorithm for computing the active subspace
Input: A dataset with m_AS input samples {x_j}_{j=1}^{m_AS}, a pre-model f^l_pre(·), a subroutine for computing ∇c_l(x_l), and the dimension r of the truncated singular value decomposition.
Select r samples x_i, compute x_{l,i} = f^l_pre(x_i), and construct an initial matrix S ← [∇c_l(x_{l,1}), . . . , ∇c_l(x_{l,r})].
for t = 1, 2, . . . do
  Compute the singular value decomposition VΣU^T ← svd(S), where Σ = diag(σ_1, . . . , σ_r).
  If the maximal number of samples m_AS is reached, stop.
  Update S by the soft-thresholding step (3.8).
  Get a new sample x_0^new, compute x^new = f^l_pre(x_0^new), and replace the last column of S (now all zeros) by the gradient vector: S(:, r) ← ∇c_l(x^new).
end for
Output: The projection matrix V ∈ R^{n_l × r} and the singular values Σ ∈ R^{r × r}.

After obtaining Σ = diag(σ_1, . . . , σ_r), we can approximate the number of active neurons as

(3.9) n̂_{l,AS} = arg min { i : sqrt(σ_1^2 + . . . + σ_i^2) / sqrt(σ_1^2 + . . . + σ_r^2) ≥ 1 − ǫ }.

Under the condition that σ_i^2 → λ_i for i = 1, . . . , r and λ_i → 0 for i = r + 1, . . . , n_l, (3.9) approximates n_{l,AS} in (3.4) with a high accuracy. Further, the projection matrix V̂_1 is chosen as the first n̂_{l,AS} columns of V. The storage cost of the decomposition is reduced from O(n_l m) to O(n_l r), and the computational cost is reduced accordingly.

We continue by constructing a new surrogate model to approximate the post-model of the deep neural network. This problem can be regarded as an uncertainty quantification problem if we treat z as a random vector. We choose a nonlinear polynomial model because it has higher expressive power than a linear function. By the polynomial chaos expansion [55], the network output y ∈ R^{n_L} is approximated by
a linear combination of orthogonal polynomial basis functions:

(3.10) ŷ ≈ Σ_{|α|=0}^p c_α φ_α(z), where |α| = α_1 + . . . + α_r.

Here φ_α(z) is a multivariate polynomial basis function chosen based on the probability density function of z. When the components z = [z_1, . . . , z_r]^T are independent, both the joint density function and the multivariate basis function factor into products of one-dimensional functions, i.e., ρ(z) = ρ_1(z_1) . . . ρ_r(z_r) and φ_α(z) = φ_{α_1}(z_1) φ_{α_2}(z_2) . . . φ_{α_r}(z_r). The marginal basis function φ_{α_i}(z_i) is uniquely determined by the marginal density function ρ_i(z_i). The scatter plot in Fig. 3 shows that the marginal probability density of each z_i is close to a Gaussian distribution.

Figure 3: Distribution of the first two active-subspace variables at the 6-th layer of VGG-19 for CIFAR-10.

Suppose ρ_i(z_i) follows a Gaussian distribution; then φ_{α_i}(z_i) is a Hermite polynomial [37], i.e.,

(3.11) φ_0(z) = 1, φ_1(z) = 2z, φ_2(z) = 4z^2 − 2, φ_{p+1}(z) = 2z φ_p(z) − 2p φ_{p−1}(z).

In general, the elements of z can be non-Gaussian correlated. In this case, the basis functions {φ_α(z)} can be built via the Gram-Schmidt approach described in [13].

The coefficients c_α can be computed by a linear least-squares optimization. Denote z_j = V̂_1^T f^l_pre(x_j) as the reduced samples and y_j as the network output for j = 1, . . . , m_PCE. The coefficient vectors c_α are computed by

(3.12) min_{c_α} (1/m_PCE) Σ_{j=1}^{m_PCE} || y_j − Σ_{|α|=0}^p c_α φ_α(z_j) ||^2.

Based on the Nyquist-Shannon sampling theorem, the number of samples used to train c_α needs to satisfy m_PCE ≥ 2 binom(r+p, p), where binom(r+p, p) is the number of basis functions. However, this number can be reduced by selecting a smaller set of "important" samples via the D-optimal design [59] or the sparse regularization approach [14].

The polynomial chaos expansion builds a surrogate model that approximates the deep neural network output y. This idea is similar to knowledge distillation [28], where a pre-trained teacher network teaches a smaller student network to learn the output feature. However, our polynomial-chaos layer uses one nonlinear projection whereas knowledge distillation uses a series of layers; therefore, the polynomial chaos expansion is more efficient in terms of computational and storage cost. The polynomial chaos expansion layer also differs from a polynomial activation because the dimension of z may differ from that of the output y.

The problem (3.12) is convex, and any first-order method can reach a globally optimal solution. Denote the optimal coefficients as c*_α and the final objective value as δ*, i.e.,

(3.13) δ* = (1/m_PCE) Σ_{j=1}^{m_PCE} || y_j − ψ*(z_j) ||^2, where ψ*(z_j) = Σ_{|α|=0}^p c*_α φ_α(z_j).

If δ* = 0, the polynomial chaos expansion is a good approximation to the original deep neural network on the training dataset. However, the approximation loss on the testing dataset may still be large because of overfitting.

The objective function in (3.12) is an empirical approximation to the expected error

(3.14) E_{(z,y)}[ || y − ψ(z) ||^2 ], where ψ(z) = Σ_{|α|=0}^p c_α φ_α(z).

According to Hoeffding's inequality [29], the expected error (3.14) is close to the empirical error (3.12) with a high probability. Consequently, the loss of ASNet with the polynomial chaos expansion layer is bounded as follows.
Lemma 3.2.
Suppose that the optimal solution of problem (3.12) is c*_α, the optimal polynomial chaos expansion is ψ*(z), and the optimal residue is δ*. Assume that there exist constants a, b such that ||y_j − ψ*(z_j)||^2 ∈ [a, b] for all j. Then the loss of ASNet is upper bounded:

(3.15) E_z[(loss(ψ*(z)))^2] ≤ 2 E_x[(c(x))^2] + 2 n_L (δ* + t) with probability 1 − γ*,

where t is a user-defined threshold and γ* = exp(−2 t^2 m_PCE / (b − a)^2).

Proof. Since the cross entropy loss function is √n_L-Lipschitz continuous, we have

(3.16) E_{(y,z)}[(loss(y) − loss(ψ*(z)))^2] ≤ n_L E_{(y,z)}[ ||y − ψ*(z)||^2 ].

Denote T_j = ||y_j − ψ*(z_j)||^2 for j = 1, . . . , m_PCE. The {T_j} are independent under the assumption that the data samples are independent. By Hoeffding's inequality, for any constant t it holds that

(3.17) E[T] ≤ (1/m_PCE) Σ_j T_j + t with probability 1 − γ*,

with γ* = exp(−2 t^2 m_PCE / (b − a)^2). Equivalently,

(3.18) E_{(y,z)}[ ||y − ψ*(z)||^2 ] ≤ δ* + t with probability 1 − γ*.

Consequently,

E_z[(loss(ψ*(z)))^2] ≤ 2 E_y[(loss(y))^2] + 2 E_{(y,z)}[(loss(ψ*(z)) − loss(y))^2] ≤ 2 E_x[(c(x))^2] + 2 n_L (δ* + t) with probability 1 − γ*.

The last inequality follows from c(x) = c_l(x_l) = loss(y) together with equations (3.16) and (3.18). This completes the proof.

Lemma 3.2 shows that, with a high probability 1 − γ*, the expected error of ASNet without fine-tuning is bounded by the pre-trained error of the original network, the accuracy loss in solving the polynomial chaos subproblem (3.13), and the number of classes n_L. The probability γ* is controlled by the threshold t as well as by the number of training samples m_PCE. In practice, we always re-train ASNet for several epochs, and the resulting accuracy of ASNet is beyond the scope of Lemma 3.2.
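To make the initialization of the PCE layer in (3.10)-(3.12) concrete, the sketch below fits the coefficients c_α by ordinary least squares using degree-≤2 products of Hermite polynomials; the reduced samples and teacher outputs are synthetic stand-ins for z_j = V̂_1^T f^l_pre(x_j) and y_j.

```python
import numpy as np
from itertools import combinations_with_replacement

def hermite_features(Z, p=2):
    """Products of (physicists') Hermite polynomials H_0 = 1, H_1 = 2z, H_2 = 4z^2 - 2,
    one feature per multi-index alpha with |alpha| <= p (here p = 2)."""
    H = {0: lambda t: np.ones_like(t), 1: lambda t: 2 * t, 2: lambda t: 4 * t**2 - 2}
    m, r = Z.shape
    feats = [np.ones(m)]                                   # alpha = 0
    for deg in range(1, p + 1):
        for combo in combinations_with_replacement(range(r), deg):
            f = np.ones(m)
            for i in set(combo):
                f = f * H[combo.count(i)](Z[:, i])         # product of marginal polynomials
            feats.append(f)
    return np.stack(feats, axis=1)                         # m x n_basis design matrix

rng = np.random.default_rng(0)
r, n_L, m_pce = 5, 10, 200
Z = rng.standard_normal((m_pce, r))                        # reduced samples z_j (synthetic here)
Y = rng.standard_normal((m_pce, n_L))                      # teacher outputs y_j (synthetic here)

Phi = hermite_features(Z, p=2)                             # basis evaluations phi_alpha(z_j)
C, *_ = np.linalg.lstsq(Phi, Y, rcond=None)                # coefficients c_alpha, one row per basis
delta_star = np.mean(np.sum((Y - Phi @ C) ** 2, axis=1))   # empirical residue as in (3.13)
print(Phi.shape, C.shape, delta_star)
```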
The pre-model can be further compressed by various techniques such as network pruning and sharing [25], low-rank factorization [43, 36, 18], or data quantization [15, 12]. Denote θ as the weights of ASNet, i.e., all the parameters in the pre-model, the active subspace layer, and the polynomial chaos expansion layer, and denote {x_1, . . . , x_m} as the training dataset. We re-train the network by solving the following regularized optimization problem:

(3.19) θ* = arg min_θ (1/m) Σ_{i=1}^m loss(f(θ; x_i), y_i) + λ R(θ).

Here (x_i, y_i) is a training sample, m is the total number of training samples, loss(·) is the cross-entropy loss function, R(θ) is a regularization function, and λ is a regularization parameter. Different regularization functions result in different model structures. For instance, an ℓ_1 regularizer R(θ) = ||θ||_1 [2, 50, 57] returns sparse weights, an ℓ_{2,1}-norm regularizer results in column-wise sparse weights, and a nuclear-norm regularizer results in low-rank weights. At each iteration, we solve (3.19) by a stochastic proximal gradient descent algorithm [53]:

(3.20) θ^{k+1} = arg min_θ (θ − θ^k)^T g^k + (1/(2α_k)) ||θ − θ^k||^2 + λ R(θ).

Here g^k = (1/|B_k|) Σ_{i ∈ B_k} ∇_θ loss(f(θ; x_i), y_i) is the stochastic gradient, B_k is the mini-batch at the k-th step, and α_k is the stepsize. In this work, we choose the ℓ_1 regularization to obtain sparse weight matrices. In this case, problem (3.20) has a closed-form solution:

(3.21) θ^{k+1} = S_{α_k λ}(θ^k − α_k g^k), where S_λ(x) = x ⊙ max(0, 1 − λ/|x|)

is a soft-thresholding operator.
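A minimal PyTorch sketch of one iteration of the proximal update (3.20)-(3.21): a stochastic gradient step followed by elementwise soft-thresholding of all parameters. The model, mini-batch, and hyperparameter values are placeholders.

```python
import torch
import torch.nn as nn

def soft_threshold_(param, thresh):
    """In-place soft-thresholding S_thresh(x) = x * max(0, 1 - thresh/|x|), as in (3.21)."""
    with torch.no_grad():
        param.copy_(param.sign() * torch.clamp(param.abs() - thresh, min=0.0))

model = nn.Linear(20, 5)                      # placeholder for ASNet
loss_fn = nn.CrossEntropyLoss()
alpha, lam = 1e-2, 1e-3                       # stepsize alpha_k and l1 weight lambda

x = torch.randn(32, 20)                       # a mini-batch B_k
y = torch.randint(0, 5, (32,))

model.zero_grad()
loss_fn(model(x), y).backward()               # stochastic gradient g_k
with torch.no_grad():
    for p in model.parameters():
        p -= alpha * p.grad                   # gradient step theta_k - alpha_k * g_k
for p in model.parameters():
    soft_threshold_(p, alpha * lam)           # proximal step S_{alpha_k * lambda}(.)

sparsity = sum((p == 0).sum().item() for p in model.parameters())
print("zeroed parameters:", sparsity)
```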
Figure 4: Perturbations along an active-subspace direction and along a principal-component direction. (a) The function f(x) = a^T x − b. (b) The perturbed function along the active-subspace direction. (c) The perturbed function along the principal component analysis direction.
4. Active-Subspace for Universal Adversarial Attacks.
This section investigates how to generate a universal adversarial attack by the active-subspace method. Given a function f(x), the maximal perturbation direction is defined by

(4.1) v*_δ = argmax_{||v|| ≤ δ} E_x[(f(x + v) − f(x))^2].

Here, δ is a user-defined perturbation upper bound. By the first-order Taylor expansion, we have f(x + v) ≈ f(x) + ∇f(x)^T v, and problem (4.1) can be reduced to

(4.2) v_AS = argmax_{||v||=1} E_x[(∇f(x)^T v)^2] = argmax_{||v||=1} v^T E_x[∇f(x)∇f(x)^T] v.

The vector v_AS is exactly the dominant eigenvector of the covariance matrix of ∇f(x). The solution of (4.1) can then be approximated by +δ v_AS or −δ v_AS; both v_AS and −v_AS solve (4.2), but their effects on (4.1) differ.

Example 4.1. Let f(x) = a^T x − b with a = [1, −1]^T and b = 1, and let x follow a uniform distribution on the two-dimensional square [0, 1]^2, as shown in Fig. 4 (a). It follows from direct computation that ∇f(x) = a and the covariance matrix is C = aa^T. The dominant eigenvector of C, i.e., the active-subspace direction, is v_AS = a/||a|| = [1/√2, −1/√2]^T. We use δ v_AS to perturb f(x) and plot f(x + δ v_AS) in Fig. 4 (b), which shows a significant difference even for a small perturbation size δ. In contrast, Fig. 4 (c) shows the perturbation along the first principal component w = [1/√2, 1/√2]^T, where w is the dominant eigenvector of the second-moment matrix E_x[xx^T] = [1/3, 1/4; 1/4, 1/3]. However, w does not result in any perturbation because a^T w = 0. This example illustrates the difference between the active subspace and principal component analysis: the active-subspace direction captures the sensitivity of f(x), whereas the principal component is independent of f(x).

Given a dataset D and a classification function j(x) that maps an input sample to an output label, the universal perturbation seeks a vector v* whose norm is upper bounded by δ, such that the class label is changed with a high probability, i.e.,

(4.3) v* = argmax_{||v|| ≤ δ} prob_{x ∈ D}[ j(x + v) ≠ j(x) ] = argmax_{||v|| ≤ δ} E_x[ 1_{j(x + v) ≠ j(x)} ],

where the indicator 1_d equals one if the condition d is satisfied and zero otherwise. Solving problem (4.3) directly is challenging because both 1_d and j(x) are discontinuous. By replacing j(x) with the loss function c(x) = loss(f(x)) and the indicator function 1_d with a quadratic function, we reformulate problem (4.3) as

(4.4) max_v E_x[(c(x + v) − c(x))^2] s.t. ||v|| ≤ δ.

The ball-constrained optimization problem (4.4) can be solved by various numerical techniques such as the spectral gradient descent method [6] and the limited-memory projected quasi-Newton method [51]. However, these methods can only guarantee convergence to a local stationary point. Instead, we are interested in computing a direction that achieves a better objective value by a heuristic algorithm.
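The two-dimensional example above can be checked numerically. The sketch below (illustrative only, with Monte Carlo sampling standing in for the closed-form expectations) compares the active-subspace direction with the leading principal component of E[xx^T] for f(x) = a^T x − b and confirms that only the former changes the function value.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = np.array([1.0, -1.0]), 1.0
X = rng.uniform(0.0, 1.0, size=(100000, 2))      # x uniform on [0, 1]^2

# Active subspace: the gradient of f is the constant a, so C = a a^T and v_AS = a / ||a||.
grads = np.tile(a, (X.shape[0], 1))
C = grads.T @ grads / X.shape[0]
v_as = np.linalg.eigh(C)[1][:, -1]               # dominant eigenvector of C

# PCA direction: dominant eigenvector of the uncentered second moment E[x x^T].
M = X.T @ X / X.shape[0]                         # approximately [[1/3, 1/4], [1/4, 1/3]]
w = np.linalg.eigh(M)[1][:, -1]                  # approximately [1, 1] / sqrt(2)

f = lambda x, v: (x + v) @ a - b
delta = 0.2
print("change along v_AS:", np.mean(np.abs(f(X, delta * v_as) - f(X, 0))))  # approx delta*||a||
print("change along w   :", np.mean(np.abs(f(X, delta * w) - f(X, 0))))     # approx 0, since a^T w = 0
```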
Using the first-order Taylor expansion c(x + v) ≈ c(x) + v^T ∇c(x), we reformulate problem (4.4) as a ball-constrained quadratic problem

(4.5) max_v v^T E_x[∇c(x)∇c(x)^T] v s.t. ||v|| ≤ δ.

Problem (4.5) is easy to solve because its closed-form solution is exactly the dominant eigenvector of the covariance matrix C = E_x[∇c(x)∇c(x)^T], i.e., the first active-subspace direction. However, this dominant eigenvector alone may not be effective because c(x) is nonlinear. Therefore, we compute v recursively by

(4.6) v^{k+1} = proj(v^k + s_k d_v^k), where proj(v) = v · min(1, δ/||v||),

s_k is the stepsize, and d_v^k is computed by

(4.7) d_v^k = argmax_{d_v} d_v^T E_x[∇c(x + v^k) ∇c(x + v^k)^T] d_v, s.t. ||d_v|| ≤ 1.

Namely, d_v^k is the dominant eigenvector of C^k = E_x[∇c(x + v^k) ∇c(x + v^k)^T]. Because d_v^k maximizes the change E_x[(c(x + v + d_v) − c(x + v))^2], we expect the attack ratio to keep increasing, i.e., r(v^{k+1}; D) ≥ r(v^k; D), where

(4.8) r(v; D) = (1/|D|) Σ_{x_i ∈ D} 1_{j(x_i + v) ≠ j(x_i)}.

The backtracking line search approach [3] is employed to choose s_k such that the attack ratio of v^k + s_k d_v^k is higher than the attack ratios of both v^k and v^k − s_k d_v^k, i.e.,

(4.9) s_k = min_i { s_{i,t}^k : r(v_{i,t}^{k+1}; D) > max(r(v_{i,−t}^{k+1}; D), r(v^k; D)) },

where s_{i,t}^k = t s_0 γ^i with t ∈ {1, −1}, s_0 is the initial stepsize, γ < 1 is the decrease ratio, and v_{i,t}^{k+1} = proj(v^k + s_{i,t}^k d_v^k). If such a stepsize s_k exists, we update v^{k+1} by (4.6) and repeat the process. Otherwise, we record a failure, and the algorithm stops when the number of failures exceeds a threshold. The overall flow is summarized in Algorithm 4.1. In practice, instead of using the whole dataset to train the attack vector, we use a subset D; the impact of the number of training samples is discussed in Section 5.2.2.

Algorithm 4.1 Recursive Active-Subspace Universal Attack
Input: A pre-trained deep neural network with loss c(x), a classification oracle j(x), a training dataset D, an upper bound δ for the attack vector, an initial stepsize s_0, a decrease ratio γ < 1, and the failure threshold α in the stopping criterion.
Initialize the attack vector as v^0 = 0.
for k = 0, 1, . . . do
  Select the training dataset D_k = { x_i + v^k : x_i ∈ D and j(x_i + v^k) = j(x_i) }, then compute the dominant active-subspace direction d_v^k by Algorithm 3.2.
  for i = 0, 1, . . . , I do
    Let s_{i,±}^k = ±s_0 γ^i and v_{i,±}^{k+1} = proj(v^k + s_{i,±}^k d_v^k).
    Compute the attack ratios r(v_{i,+}^{k+1}) and r(v_{i,−}^{k+1}) by (4.8).
    If either r(v_{i,+}^{k+1}) or r(v_{i,−}^{k+1}) is greater than r(v^k), stop the inner loop and return s_k = t s_0 γ^i, where t = 1 if r(v_{i,+}^{k+1}) ≥ r(v_{i,−}^{k+1}) and t = −1 otherwise.
  end for
  If no stepsize s_k is returned, let s_k = s_0 γ^I and record this step as a failure. Compute the next iterate v^{k+1} by the projection (4.6).
  If the number of failures is greater than the threshold α, stop.
end for
Output: The universal active-subspace adversarial attack vector v_AS.
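A condensed PyTorch sketch of the attack loop: at each outer iteration it estimates the dominant active-subspace direction of the loss gradients on the not-yet-fooled samples and takes a projected step whose sign and size are accepted only if the attack ratio improves. It is a simplified stand-in for Algorithm 4.1 (the full frequent-direction SVD, the exact backtracking schedule, and the failure counting are omitted); the model, data, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def attack_ratio(model, X, y_clean, v):
    with torch.no_grad():
        return (model(X + v).argmax(1) != y_clean).float().mean().item()

def as_universal_attack(model, X, delta=5.0, steps=10, s0=1.0, gamma=0.5, tries=5):
    loss_fn = nn.CrossEntropyLoss()
    with torch.no_grad():
        y_clean = model(X).argmax(1)                       # classification oracle j(x)
    v = torch.zeros_like(X[0])
    for _ in range(steps):
        with torch.no_grad():
            mask = model(X + v).argmax(1) == y_clean       # samples not yet fooled
        if not mask.any():
            break
        Xk = (X[mask] + v).clone().requires_grad_(True)
        loss_fn(model(Xk), y_clean[mask]).backward()
        G = Xk.grad.flatten(1)                             # gradients of c at x + v^k
        # dominant eigenvector of the gradient covariance = top right singular vector of G
        d = torch.linalg.svd(G, full_matrices=False).Vh[0].view_as(v)
        r_v = attack_ratio(model, X, y_clean, v)
        best_r, best_v = r_v, v
        for i in range(tries):                             # backtracking on the attack ratio
            for sign in (1.0, -1.0):
                cand = v + sign * s0 * gamma**i * d
                cand = cand * min(1.0, (delta / cand.norm()).item())   # project onto the delta-ball
                r = attack_ratio(model, X, y_clean, cand)
                if r > best_r:
                    best_r, best_v = r, cand
            if best_r > r_v:
                break
        v = best_v
    return v

# Toy usage with a random classifier and random "images".
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
X = torch.randn(64, 3, 32, 32)
v = as_universal_attack(model, X)
print("final attack ratio:", attack_ratio(model, X, model(X).argmax(1), v))
```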
5. Numerical Experiments.
In this section, we show the power of the active subspace in revealing the number of active neurons, compressing neural networks, and computing universal adversarial perturbations. All codes are implemented in PyTorch and are available online at https://github.com/chunfengc/ASNet. We test the ASNet constructed by Algorithm 3.1 and, by default, set the polynomial order as p = 2, the number of active neurons as r = 50, and the threshold in Equation (3.4) as ǫ = 0.05.

Inspired by knowledge distillation [28], we retrain all the parameters in ASNet by minimizing the following loss function:

min_θ Σ_{i=1}^m [ β H(ASNet_θ(x_i), f(x_i)) + (1 − β) H(ASNet_θ(x_i), y_i) ].

Here, the cross entropy is H(p, q) = −Σ_j s(p)_j log s(q)_j with the softmax function s(x)_j = exp(x_j)/Σ_j exp(x_j), and β ∈ (0, 1) balances the teacher-output term and the label term. The stepsizes for the pre-model are chosen separately for VGG-19 and ResNet, and a separate stepsize is used for the active subspace layer and the polynomial chaos expansion layer.

We also seek sparser weights in ASNet by the proximal stochastic gradient descent method of Section 3.6, again with separate default stepsizes for the pre-model and for the active subspace and polynomial chaos expansion layers. The maximal number of epochs is set as 100. The obtained sparse model is denoted as ASNet-s.

In all figures and tables, the number in the bracket of ASNet(·) or ASNet-s(·) indicates the index of the cut-off layer. We report the performance for different cut-off layers in terms of accuracy, storage, and computational complexity.

We first show the influence of the number of reduced neurons r, the tolerance ǫ, and the cut-off layer index l for VGG-19 on CIFAR-10 in Table 1. VGG-19 achieves 93.28% testing accuracy with 76.45 MB of storage. Here, ǫ = (λ_{r+1} + · · · + λ_{n_l}) / (λ_1 + · · · + λ_{n_l}). For different choices of r, we display the corresponding tolerance ǫ, the storage speedup compared with the original teacher network, and the testing accuracy reduction of ASNet before and after fine-tuning compared with the original teacher network.

Table 1: Comparison of different numbers of active neurons r for VGG-19 on CIFAR-10. For the storage speedup, higher is better; for the accuracy reduction before or after fine-tuning, lower is better. For each r ∈ {25, 50, 75} and each cut-off layer (ASNet(5), ASNet(6), ASNet(7)), the table reports the tolerance ǫ, the storage speedup, and the accuracy reduction before and after fine-tuning.

Table 1 shows that when the cut-off layer is fixed, a larger r usually results in a smaller tolerance ǫ and a smaller accuracy reduction, but also a smaller storage speedup. This is consistent with Lemma 3.1: the error of ASNet before fine-tuning is upper bounded by O(ǫ). Comparing r = 50 with r = 75, we find that r = 50 achieves almost the same accuracy as r = 75 with a higher storage speedup; r = 50 even achieves better accuracy than r = 75 at layer 7, probably because of overfitting. This guides us to choose r = 50 in the following numerical experiments. For different layers, a later cut-off layer index produces a lower accuracy reduction but a smaller storage speedup. In other words, the choice of the layer index is a trade-off between accuracy reduction and storage speedup.
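The retraining objective introduced at the beginning of this section can be written compactly in PyTorch. The sketch below is one common reading of that loss (soft cross entropy against the teacher outputs plus standard cross entropy against the labels); the student/teacher modules and the value of β are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, beta=0.9):
    """beta * H(teacher, student) + (1 - beta) * H(label, student); beta is an assumed default."""
    soft = -(F.softmax(teacher_logits, dim=1) *
             F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    hard = F.cross_entropy(student_logits, labels)
    return beta * soft + (1.0 - beta) * hard

# Placeholder student (ASNet) and teacher (original network) with matching output sizes.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))

with torch.no_grad():
    t_logits = teacher(x)                        # teacher outputs f(x_i)
loss = distillation_loss(student(x), t_logits, y)
loss.backward()                                  # gradients for retraining the student
print(float(loss))
```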
We next show the effectiveness of ASNet constructed by Steps 1-3 of Algorithm 3.1, without any fine-tuning, by investigating the following three properties.
(1) Redundancy of neurons.
The distributions of the first 200 singular values of the matrix Ĝ (defined in (3.7)) are plotted in Fig. 5 (a). The singular values decrease almost exponentially for layers l ∈ {4, 5, 6, 7}. Although the total numbers of neurons are 8192, 16384, 16384, and 16384, the numbers of active neurons are only 105, 84, 54, and 36, respectively.

Figure 5: Structural analysis of VGG-19 on the CIFAR-10 dataset. (a) The first 200 singular values for layers 4 ≤ l ≤ 7; (b) the accuracy (without any fine-tuning) obtained by active subspace (AS) and polynomial chaos expansion (PCE), compared with principal component analysis (PCA) and logistic regression (LR).

(2) Redundancy of the layers.
We cut off the deep neural network at an intermediate layer and replace the subsequent layers with one simple logistic regression [30]. As shown by the red bar in Fig. 5 (b), the logistic regression can achieve relatively high accuracy. This verifies that the features trained by the first few layers already have high expressive power, since replacing all subsequent layers with a simple model loses little accuracy.
(3)
Efficiency of the active-subspace and polynomial chaos expansion.
We compare the proposed active-subspace layer with principal component analysis [31] for projecting the high-dimensional neurons to a low-dimensional space, and we compare the polynomial chaos expansion layer with logistic regression in terms of their efficiency in extracting class labels from the low-dimensional variables. Fig. 5 (b) shows that the combination of active subspace and polynomial chaos expansion achieves the best accuracy.
We continue to present the results of ASNet and ASNet-s on CIFAR-10 for two widely used networks, VGG-19 and ResNet-110, in Tables 2 and 3, respectively. The second column shows the testing accuracy of the corresponding network. We report the storage and computational costs for the pre-model, the post-model (i.e., active subspace plus polynomial chaos expansion for ASNet and ASNet-s), and the overall network, respectively. For both examples, ASNet and ASNet-s achieve an accuracy similar to the teacher network yet with much smaller storage and computational cost. For VGG-19, ASNet achieves 14.43× storage savings and 3.44× computational reduction; ASNet-s achieves 23.98× storage savings and 7.30× computational reduction. For most ASNet and ASNet-s networks, the storage and computational costs of the post-models obtain significant performance boosts from our proposed network structure changes. It is not surprising that increasing the layer index (i.e., cutting off the deep neural network at a later layer) produces a higher accuracy; however, it also results in a smaller compression ratio. In other words, the choice of the layer index is a trade-off between the accuracy reduction and the compression ratio.
Table 2: Accuracy and storage on VGG-19 for CIFAR-10. Here, "Pre-M" denotes the pre-model, i.e., layers 1 to l of the original deep neural network; "AS" and "PCE" denote the active subspace and polynomial chaos expansion layers, respectively.
Network | Accuracy | Storage (MB): Pre-M / AS+PCE / Overall | Flops (10^6): Pre-M / AS+PCE / Overall
VGG-19 | 93.28% | – / – / 76.45 | – / – / 398.14
ASNet(5) | 91.46% | 2.12 / 3.18 / 5.30 | 115.02 / 0.83 / 115.85
ASNet-s(5) | 90.40% | 1.14 / 2.05 / 3.19 | 54.03 / 0.54 / 54.57
ASNet(6) | 93.01% | 4.38 / 3.18 / 7.55 | 152.76 / 0.83 / 153.60
ASNet-s(6) | 91.08% | 1.96 / 1.81 / 3.77 | 67.37 / 0.48 / 67.85
ASNet(7) | – | overall storage speedup 7.80× | AS+PCE flops speedup 249.41×, overall flops speedup 2.08×
ASNet-s(7) | 90.87% | 2.61 / 1.91 / 4.52 | 80.23 / 0.50 / 80.73

Table 3: Accuracy and storage on ResNet-110 for CIFAR-10. The column layout is the same as in Table 2.
Network | Accuracy | Storage (MB): Pre-M / AS+PCE / Overall | Flops (10^6): Pre-M / AS+PCE / Overall
ResNet-110 | 93.78% | – / – / 6.59 | – / – / 252.89
ASNet(61) | 89.56% | 1.15 / 1.61 / 2.77 | 140.82 / 0.42 / 141.24
ASNet-s(61) | 89.26% | 0.83 / 1.23 / 2.06 | 104.05 / 0.32 / 104.37
ASNet(67) | 90.16% | 1.37 / 1.61 / 2.98 | 154.98 / 0.42 / 155.40
ASNet-s(67) | 89.69% | 1.00 / 1.22 / 2.22 | 116.38 / 0.32 / 116.70
ASNet(73) | – | overall storage speedup 2.06× | AS+PCE flops speedup 198.07×, overall flops speedup 1.49×
ASNet-s(73) | 90.02% | 1.18 / 1.16 / 2.34 | 128.65 / 0.30 / 128.96

For ResNet-110, our results are not as good as those on VGG-19. We find that the eigenvalues of its covariance matrix do not decrease exponentially as they do for VGG-19, which results in a large number of active neurons or a large error ǫ when fixing r = 50. A possible reason is that ResNet updates as x_{l+1} = x_l + f_l(x_l); hence, the partial gradient ∂x_{l+1}/∂x_l = I + ∇f_l(x_l) is less likely to be low-rank.
Table 4: Accuracy and storage on VGG-19 for CIFAR-100. The column layout is the same as in Table 2, with Top-1 and Top-5 accuracies reported.
Network | Top-1 | Top-5 | Storage (MB): Pre-M / AS+PCE / Overall | Flops (10^6): Pre-M / AS+PCE / Overall
VGG-19 | 71.90% | 89.57% | – / – / 76.62 | – / – / 398.18
ASNet(7) | 70.77% | 91.05% | 6.63 / 3.63 / 10.26 | 190.51 / 0.83 / 191.35
ASNet-s(7) | 70.20% | 90.90% | 5.20 / 3.24 / 8.44 | 144.81 / 0.85 / 145.66
ASNet(8) | 69.50% | 90.15% | 8.88 / 1.29 / 10.17 | 228.26 / 0.22 / 228.48
ASNet-s(8) | 69.17% | 89.73% | 6.87 / 1.22 / 8.09 | 172.69 / 0.32 / 173.01
ASNet(9) | – | – | overall storage speedup 4.95× | AS+PCE flops speedup 357.10×, overall flops speedup 1.61×
ASNet-s(9) | 71.38% | 90.28% | 9.38 / 1.94 / 11.32 | 183.27 / 0.51 / 183.78

Table 5: Accuracy and storage on ResNet-110 for CIFAR-100. The column layout is the same as in Table 4.
Network | Top-1 | Top-5 | Storage (MB): Pre-M / AS+PCE / Overall | Flops (10^6): Pre-M / AS+PCE / Overall
ResNet-110 | 71.94% | 91.71% | – / – / 6.61 | – / – / 252.89
ASNet(75) | 63.01% | 88.55% | 1.79 / 1.29 / 3.08 | 172.67 / 0.22 / 172.89
ASNet-s(75) | 63.16% | 88.65% | 1.47 / 1.20 / 2.67 | 143.11 / 0.31 / 143.42
ASNet(81) | 65.82% | 90.02% | 2.64 / 1.29 / 3.93 | 186.83 / 0.22 / 187.04
ASNet-s(81) | 65.73% | 89.95% | 2.20 / 1.21 / 3.41 | 155.61 / 0.32 / 155.93
ASNet(87) | – | – | overall storage speedup 1.38× | AS+PCE flops speedup 238.04×, overall flops speedup 1.26×
ASNet-s(87) | 67.65% | 90.10% | 2.91 / 1.21 / 4.12 | 166.50 / 0.32 / 166.81

Next, we present the results of VGG-19 and ResNet-110 on CIFAR-100 in Tables 4 and 5, respectively. On VGG-19, ASNet achieves 7.45× storage savings and 2.08× computational reduction, and ASNet-s achieves 9.06× storage savings and 2.73× computational reduction. The accuracy loss is negligible for VGG-19 but larger for ResNet-110. The performance boost of ASNet is obtained by just changing the network structure, without any model compression (e.g., pruning, quantization, or low-rank factorization).

This subsection demonstrates the effectiveness of the active subspace in identifying a universal adversarial attack vector. We denote the result generated by Algorithm 4.1 as "AS" and compare it with the "UAP" method in [40] and with a "Random" Gaussian vector. The parameters in Algorithm 4.1 are set as α = 10 and δ = 5, . . . ,
10. The default parameters of UAP are applied except for the maximal number of iterations: in the implementation of [40], the maximal iteration count is set as infinity, which is time-consuming when the training dataset or the number of classes is large, so in our experiments we set the maximal number of iterations to 10. In all figures and tables, we report the average attack ratio and the CPU training time over ten repeated experiments with different training datasets. A higher attack ratio means the corresponding algorithm is better at fooling the given deep neural network. The datasets are chosen in two ways. We first test data points from one class (e.g., trousers in Fashion-MNIST), because these data points share many common features and have a higher probability of being attacked by a single universal perturbation vector. We then conduct experiments on the whole dataset to show that our proposed algorithm also provides better performance than the baseline even when the dataset has diverse features.
Figure 6: Universal adversarial attacks on Fashion-MNIST with respect to different ℓ-norm perturbations. (a)-(c): results for attacking a one-class dataset. (d)-(f): results for attacking the whole dataset.

Firstly, we present the adversarial attack results on Fashion-MNIST with a 4-layer neural network. There are two convolutional layers with kernel size 5 × 5; the numbers of output channels of the two convolutional layers are 20 and 50, respectively. Each convolutional layer is followed by a ReLU activation and a max-pooling layer with a kernel size of 2 × 2. There are two fully connected layers; the first fully connected layer has 800 input features and 500 output features. Fig. 6 presents the attack ratio of our active-subspace method compared with the baseline UAP method [40] and Gaussian random vectors.
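The 4-layer network described above can be written as the following PyTorch module; the sizes match the stated description, while the ReLU after the first fully connected layer and the final 500 → 10 classification layer are assumptions inferred from the ten Fashion-MNIST classes.

```python
import torch
import torch.nn as nn

class FashionMNISTNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 24x24 -> 12x12
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 12x12 -> 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 4 * 4, 500), nn.ReLU(),   # 800 input features, 500 output features
            nn.Linear(500, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(FashionMNISTNet()(torch.randn(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```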
Figure 7: The effect of our attack method on one data sample in the Fashion-MNIST dataset. (a) A trouser from the original dataset. (b) An active-subspace perturbation vector with ℓ-norm equal to 5. (c) The perturbed sample is misclassified as a t-shirt/top by the deep neural network.

The top figures of Fig. 6 show the results for just one class (i.e., trouser), and the bottom figures show the results for all ten classes. For all perturbation norms, the active-subspace method achieves around 30% higher attack ratio than UAP while being more than 10 times faster. This verifies that the active-subspace method has better universal representation ability than UAP, because the active subspace finds one universal direction while UAP solves data-dependent subproblems independently. With the active-subspace approach, the attack ratios for the first class and for the whole dataset are around 100% and 75%, respectively. This coincides with our intuition that data points within one class have higher similarity than data points from different classes.

In Fig. 7, we plot one image from Fashion-MNIST and its perturbation by the active-subspace attack vector. The attacked image in Fig. 7 (c) still looks like a trouser to a human; however, the deep neural network misclassifies it as a t-shirt/top.
Next, we show the numerical results of attacking VGG-19 on CIFAR-10. Fig. 8 compares the active-subspace method with the baselines, UAP and Gaussian random vectors. The top figures show the results for the dataset of the first class (i.e., automobile), and the bottom figures show the results for all ten classes. In both cases, the proposed active-subspace attack achieves a 20% higher attack ratio while being three times faster than UAP. This is similar to the results on Fashion-MNIST, because the active subspace has a better ability to capture global information.

We further show the effect of the number of training samples in Fig. 9. As the number of samples increases, the testing attack ratio improves. In our numerical experiments, we set the number of samples to 100 for the one-class experiments and 200 for the all-classes experiments.

We continue by showing the cross-model performance on four different ResNet networks and one VGG network: we test the performance of the attack vector trained on one model against all other models. Each row of Table 6 shows the results on the same deep neural network, and each column shows the results of the same attack vector. Table 6 shows that ResNet-20 is easier to attack than the other models.
Figure 8: Universal adversarial attacks of VGG-19 on CIFAR-10 with respect to different ℓ-norm perturbations. (a)-(c): the training attack ratio, the testing attack ratio, and the CPU time in seconds for attacking a one-class dataset. (d)-(f): the corresponding results for attacking the ten-class dataset.
Table 6: Cross-model attack performance on CIFAR-10. Each row corresponds to the attacked network and each column to the network on which the attack vector was trained (ResNet-20, ResNet-44, ResNet-56, ResNet-110, and VGG-19).

This agrees with our intuition that a simple network structure such as ResNet-20 is less robust. On the contrary, VGG-19 is the most robust. The success of cross-model attacks indicates that these neural networks may rely on similar features.
Finally, we show the results on CIFAR-100 for both the first class (i.e., dolphin) and all classes. Similar to Fashion-MNIST and CIFAR-10, Fig. 10 shows that the active subspace achieves higher attack ratios than both UAP and Gaussian random vectors. Further, compared with CIFAR-10, CIFAR-100 is easier to attack, partially because it has more classes. We summarize the results for the different datasets in Table 7; the second column shows the number of classes in each dataset.
Figure 9:
Adversarial attack of VGG-19 on CIFAR-10 with different numbers of training samples. The ℓ-norm of the perturbation is fixed at 10. (a) Results of attacking the dataset from the first class; (b) results of attacking the whole dataset with 10 classes.
Table 7: Summary of the universal attack on the different datasets by the active subspace (AS) compared with UAP and the random vector. The norm of the perturbation is equal to 10. For each dataset (Fashion-MNIST, CIFAR-10, and CIFAR-100), the table reports the number of classes together with the training attack ratio, the testing attack ratio, and the CPU training time (s) of each method.
In terms of the testing attack ratio on the whole dataset, the active subspace achieves noticeably higher attack ratios than UAP on Fashion-MNIST, CIFAR-10, and CIFAR-100. In terms of CPU time, the active subspace achieves 42×, 5×, and 14× speedups over UAP on Fashion-MNIST, CIFAR-10, and CIFAR-100, respectively.

Figure 10: Universal adversarial attacks on CIFAR-100 with respect to different ℓ-norm perturbations. (a)-(c): the results for attacking the dataset from the first class. (d)-(f): the results for attacking ten classes together.
6. Conclusions and Discussions.
This paper has analyzed deep neural networks by the active subspace method originally developed for dimensionality reduction in uncertainty quantification. We have investigated two problems: how many neurons and layers are necessary (or important) in a deep neural network, and how to generate a universal adversarial attack vector that can be applied to a set of testing data? Firstly, we have presented a definition of “the number of active neurons” and have shown its theoretical error bounds for model reduction. Our numerical study has shown that many neurons and layers are not needed. Based on this observation, we have proposed a new network, called ASNet, built by cutting off the whole neural network at a proper layer and replacing all subsequent layers with an active subspace layer and a polynomial chaos expansion layer.
The numerical experiments show that the proposed structural analysis method can produce a new network with significant storage savings and computational speedup, yet with little accuracy loss. Our method can be combined with existing model compression techniques (e.g., pruning, quantization, and low-rank factorization) to develop compact deep neural network models that are more suitable for deployment on resource-constrained platforms. Secondly, we have applied the active subspace to generate a universal attack vector that is independent of any specific data sample and can be applied to a whole dataset. Our proposed method achieves a much higher attack ratio than the existing work [40] and enjoys a lower computational cost.

ASNet has two main goals: to detect the necessary neurons and layers, and to compress the existing network. To fulfill the first goal, we require a pre-trained model because, by Lemmas 3.1 and 3.2, the accuracy of the reduced model will then approach that of the original one. For the second goal, the pre-trained model helps us obtain a good estimate of the number of active neurons, a proper layer at which to cut, and a good initialization for the active subspace layer and the polynomial chaos expansion layer. However, a pre-trained model is not strictly required, because we can also construct ASNet in a heuristic way (as is done for most DNNs): a reasonable guess for the number of active neurons and the cut-off layer, and a random parameter initialization for the pre-model, the active subspace layer, and the polynomial chaos expansion layer.
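To make the ASNet construction above concrete, the following is a schematic PyTorch sketch, not the authors' implementation: a truncated pre-model, an active-subspace projection, and a simple second-order polynomial head standing in for the polynomial chaos expansion layer. The class name, the use of plain monomials instead of an orthogonal polynomial basis, and all dimensions are our assumptions.

```python
import torch
import torch.nn as nn

class ASNetSketch(nn.Module):
    """Schematic ASNet-style model: pre-model -> active-subspace projection ->
    polynomial head (an illustrative stand-in for the PCE layer)."""
    def __init__(self, pre_model: nn.Module, feat_dim: int, r: int, num_classes: int):
        super().__init__()
        self.pre_model = pre_model                      # first layers of the original network
        self.proj = nn.Linear(feat_dim, r, bias=False)  # active subspace layer (projection onto r directions)
        n_poly = 1 + r + r * (r + 1) // 2               # constant, linear, and quadratic terms
        self.head = nn.Linear(n_poly, num_classes)      # linear map on the polynomial basis

    def forward(self, x):
        z = self.proj(self.pre_model(x).flatten(1))     # project hidden features onto the active subspace
        iu = torch.triu_indices(z.size(1), z.size(1), device=z.device)
        quad = (z.unsqueeze(2) * z.unsqueeze(1))[:, iu[0], iu[1]]   # unique products z_i * z_j
        basis = torch.cat([torch.ones_like(z[:, :1]), z, quad], dim=1)
        return self.head(basis)
```

With a pre-trained model, the projection could be initialized from the dominant active-subspace directions of the chosen hidden layer and the head fitted by regression before fine-tuning; in the heuristic setting described above, both layers can simply be initialized randomly.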
Acknowledgement. We thank the associate editor and referees for their valuable comments and suggestions.
References

[1] S. Abdoli, L. G. Hafemann, J. Rony, I. B. Ayed, P. Cardinal, and A. L. Koerich, Universal adversarial audio perturbations, arXiv preprint arXiv:1908.03173, (2019).
[2] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, Net-trim: Convex pruning of deep neural networks with performance guarantee, in Advances in Neural Information Processing Systems, 2017, pp. 3177–3186.
[3] L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics, 16 (1966), pp. 1–3.
[4] S. Baluja and I. Fischer, Adversarial transformation networks: Learning to generate adversarial examples, arXiv preprint arXiv:1703.09387, (2017).
[5] M. Behjati, S.-M. Moosavi-Dezfooli, M. S. Baghshah, and P. Frossard, Universal adversarial attacks on text classifiers, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 7345–7349.
[6] E. G. Birgin, J. M. Martínez, and M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal on Optimization, 10 (2000), pp. 1196–1211.
[7] H. Cai, L. Zhu, and S. Han, ProxylessNAS: Direct neural architecture search on target task and hardware, arXiv preprint arXiv:1812.00332, (2018).
[8] N. Carlini and D. Wagner, Towards evaluating the robustness of neural networks, in 2017 IEEE Symposium on Security and Privacy (SP), IEEE, 2017, pp. 39–57.
[9] P. G. Constantine, Active subspaces: Emerging ideas for dimension reduction in parameter studies, vol. 2, SIAM, 2015.
[10] P. G. Constantine, E. Dow, and Q. Wang, Active subspace methods in theory and practice: applications to kriging surfaces, SIAM Journal on Scientific Computing, 36 (2014), pp. A1500–A1524.
[11] P. G. Constantine, M. Emory, J. Larsson, and G. Iaccarino, Exploiting active subspaces to quantify uncertainty in the numerical simulation of the HyShot II scramjet, Journal of Computational Physics, 302 (2015), pp. 1–20.
[12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1, arXiv preprint arXiv:1602.02830, (2016).
[13] C. Cui and Z. Zhang, Stochastic collocation with non-Gaussian correlated process variations: Theory, algorithms and applications, IEEE Transactions on Components, Packaging and Manufacturing Technology, (2018).
[14] C. Cui and Z. Zhang, High-dimensional uncertainty quantification of electronic and photonic IC with non-Gaussian correlated process variations, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, (2019).
[15] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, GXNOR-Net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework, Neural Networks, 100 (2018), pp. 49–58.
[16] G. K. Dziugaite, Z. Ghahramani, and D. M. Roy, A study of the effect of JPG compression on adversarial images, arXiv preprint arXiv:1608.00853, (2016).
[17] J. Frankle and M. Carbin, The lottery ticket hypothesis: Finding sparse, trainable neural networks, arXiv preprint arXiv:1803.03635, (2018).
[18] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, Ultimate tensorization: compressing convolutional and FC layers alike, arXiv preprint arXiv:1611.03214, (2016).
[19] R. Ge, R. Wang, and H. Zhao, Mildly overparametrized neural nets can memorize training data efficiently, arXiv preprint arXiv:1909.11837, (2019).
[20] R. G. Ghanem and P. D. Spanos, Stochastic finite element method: Response statistics, in Stochastic Finite Elements: A Spectral Approach, Springer, 1991, pp. 101–119.
[21] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff, Frequent directions: Simple and deterministic matrix sketching, SIAM Journal on Computing, 45 (2016), pp. 1762–1792.
[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572, (2014).
[23] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 369–376.
[24] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217–288.
[25] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149, (2015).
[26] C. Hawkins and Z. Zhang, Bayesian tensorized neural networks with automatic rank selection, arXiv preprint arXiv:1905.10478, (2019).
[27] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, AMC: AutoML for model compression and acceleration on mobile devices, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
[28] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, stat, 1050 (2015), p. 9.
[29] W. Hoeffding, Probability inequalities for sums of bounded random variables, in The Collected Works of Wassily Hoeffding, Springer, 1994, pp. 409–426.
[30] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant, Applied logistic regression, vol. 398, John Wiley & Sons, 2013.
[31] I. Jolliffe, Principal component analysis, in International Encyclopedia of Statistical Science, Springer, 2011, pp. 1094–1096.
[32] C. Kanbak, S.-M. Moosavi-Dezfooli, and P. Frossard, Geometric robustness of deep networks: analysis and improvement, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4441–4449.
[33] V. Khrulkov and I. Oseledets, Art of singular vectors and universal adversarial perturbations, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8562–8570.
[34] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[36] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned CP-decomposition, arXiv preprint arXiv:1412.6553, (2014).
[37] D. R. Lide, Handbook of mathematical functions, in A Century of Excellence in Measurements, Standards, and Technology, CRC Press, 2018, pp. 135–139.
[38] L. Liu, L. Deng, X. Hu, M. Zhu, G. Li, Y. Ding, and Y. Xie, Dynamic sparse graph for efficient deep learning, arXiv preprint arXiv:1810.00859, (2018).
[39] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, Rethinking the value of network pruning, arXiv preprint arXiv:1810.05270, (2018).
[40] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, Universal adversarial perturbations, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1765–1773.
[41] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, DeepFool: a simple and accurate method to fool deep neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2574–2582.
[42] P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. McAuley, and F. Koushanfar, Universal adversarial perturbations for speech recognition systems, arXiv preprint arXiv:1905.03828, (2019).
[43] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, Tensorizing neural networks, in Advances in Neural Information Processing Systems, 2015, pp. 442–450.
[44] S. Oymak and M. Soltanolkotabi, Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, arXiv preprint arXiv:1902.04674, (2019).
[45] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, The limitations of deep learning in adversarial settings, in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2016, pp. 372–387.
[46] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, FitNets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, (2014).
[47] H. L. Royden, Real Analysis, Macmillan, 2010.
[48] T. M. Russi, Uncertainty quantification with experimental data and complex system models, PhD thesis, UC Berkeley, 2010.
[49] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, Low-rank matrix factorization for deep neural network training with high-dimensional output targets, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6655–6659.
[50] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, Group sparse regularization for deep neural networks, Neurocomputing, 241 (2017), pp. 81–89.
[51] M. Schmidt, E. Berg, M. Friedlander, and K. Murphy, Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm, in Artificial Intelligence and Statistics, 2009, pp. 456–463.
[52] A. C. Serban and E. Poll, Adversarial examples - a complete characterisation of the phenomenon, arXiv preprint arXiv:1810.01185, (2018).
[53] S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in International Conference on Machine Learning, 2014, pp. 64–72.
[54] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199, (2013).
[55] D. Xiu and G. E. Karniadakis, Modeling uncertainty in steady state diffusion problems via generalized polynomial chaos, Computer Methods in Applied Mechanics and Engineering, 191 (2002), pp. 4927–4948.
[56] D. Xiu and G. E. Karniadakis, The Wiener–Askey polynomial chaos for stochastic differential equations, SIAM Journal on Scientific Computing, 24 (2002), pp. 619–644.
[57] S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, S. Liu, J. Tang, et al., Progressive DNN compression: A key to achieve ultra-high weight pruning and quantization rates using ADMM, arXiv preprint arXiv:1903.09769, (2019).
[58] T. Young, D. Hazarika, S. Poria, and E. Cambria, Recent trends in deep learning based natural language processing, IEEE Computational Intelligence Magazine, 13 (2018), pp. 55–75.