Reconstruction of training samples from loss functions
Akiyoshi Sannai
Center for Advanced Intelligence Project, RIKEN
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
[email protected]
Abstract
This paper presents a new mathematical framework to analyze the loss functions of deep neural networks with ReLU activations. As an application of this theory, we prove that the loss function can reconstruct the inputs of the training samples up to scalar multiplication (as vectors) and can provide the number of layers and nodes of the deep neural network. Namely, if we have all inputs and outputs of a loss function (or, equivalently, every possible learning process), then for the input of each training sample $x_i \in \mathbb{R}^n$ we can obtain a vector $x'_i \in \mathbb{R}^n$ satisfying $x_i = c_i x'_i$ for some $c_i \neq 0$. To prove this theorem, we introduce the notion of virtual polynomials, which are polynomials written as the output of a node in a deep neural network. Using virtual polynomials, we find an algebraic structure on the loss surfaces: they are semi-algebraic sets. We analyze these loss surfaces from the algebro-geometric point of view. Factorization of polynomials is one of the most standard ideas in algebra; accordingly, we express the factorization of virtual polynomials in terms of their active paths. This framework can be applied to the leakage problem in the training of deep neural networks. The main theorem of this paper indicates that there are many risks associated with the training of deep neural networks. For example, if we have $N + 1$ nonsmooth points on the loss surface (where $N$ is the dimension of the weight space) that are sufficiently close to each other, we can obtain the input of a training sample up to scalar multiplication. We also point out that parts of the structure of the loss surfaces depend on the shape of the deep neural network and not on the training samples.

1 Introduction

Deep learning has had great success in many fields. Deep learning models perform extremely well in computer vision [21], image processing, video processing, face recognition [27], speech recognition [15], and natural language processing [1, 6, 30]. Deep learning has also been used in more complex systems that are able to play games [14, 19, 28] or diagnose and classify diseases [2, 9, 10].

Along with the development of deep learning, decisions made using deep-learning principles are being implemented in a wide range of applications. In order for deep learning to be more useful for human beings, it is necessary to ensure that there is no leakage of personal or confidential information in the learning process or in the decision-making process. In this paper, we point out the leakage problem in deep learning: the learning process itself might leak sample data. This phenomenon is specific to deep-learning methods with ReLU activations. The following example illustrates the difference between the linear and ReLU models. Consider the one-dimensional least-squares model
$$g(a) := \sum_{i=1}^{m} (y_i - a x_i)^2.$$
Because many sample sets $\{(x_i, y_i) \mid i = 1, \ldots, m\}$ give the same $g(a)$ (see [25]), we cannot reconstruct $\{(x_i, y_i) \mid i = 1, \ldots, m\}$ from $g(a)$. However, consider the one-dimensional ReLU least-squares model
$$h(a) := \sum_{i=1}^{m} \bigl(\max(0, y_i - a x_i)\bigr)^2.$$
Since $\{y_i / x_i \mid i = 1, \ldots, m\}$ are the nonsmooth points of $h(a)$, we can obtain these points from the values of $h$. Hence, we can reconstruct $\{(x_i, y_i) \mid i = 1, \ldots, m\}$ from $h(a)$ up to scalar multiplication; namely, we obtain $\{(x'_i, y'_i) \mid i = 1, \ldots, m\}$ satisfying $a x'_i = x_i$ and $a y'_i = y_i$ for some $a (\neq 0) \in \mathbb{R}$.
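To make the example concrete, here is a minimal numerical sketch (ours, not from the paper) of how an observer who can only query $h(a)$ could recover the ratios $y_i/x_i$ from its nonsmooth points. The sample values, grid, and third-difference detector are illustrative choices; since each term $(\max(0, y_i - a x_i))^2$ is $C^1$ with a jump in its second derivative at $a = y_i/x_i$, the kinks show up as spikes in the discrete third difference.

```python
import numpy as np

x = np.array([1.0, 2.0, -0.5])      # hidden training inputs (never read directly)
y = np.array([0.5, 3.0, 1.0])       # hidden training outputs

def h(a):                           # one-dimensional ReLU least-squares loss
    return np.sum(np.maximum(0.0, y - a * x) ** 2)

aa = np.linspace(-6.0, 6.0, 12001)
vals = np.array([h(a) for a in aa])

# h is piecewise quadratic; its second derivative jumps at a = y_i / x_i,
# so the discrete third difference spikes there.
d3 = np.abs(np.diff(vals, n=3))
hits = aa[2:-1][d3 > 100 * np.median(d3) + 1e-9]

# Merge grid-adjacent hits into one estimate per nonsmooth point.
kinks = [float(np.mean(g)) for g in
         np.split(hits, np.where(np.diff(hits) > 0.01)[0] + 1)]
print(sorted(kinks))                # approximately sorted(y / x) = [-2.0, 0.5, 1.5]
```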
We discuss the loss functions of fully connected deep neural networks with square losses. All notations are essentially taken from the deep learning book by Goodfellow et al. [13]. Let $L$ be the number of layers. We do not use the notion of "hidden layer", for consistency with the other definitions. We denote the weight parameters by $w \in \mathbb{R}^N$, which consists of the entries of the parameter matrices corresponding to each layer:
$$W_{L-1} \in \mathbb{R}^{d_L \times d_{L-1}}, \;\ldots,\; W_k \in \mathbb{R}^{d_{k+1} \times d_k}, \;\ldots,\; W_1 \in \mathbb{R}^{d_2 \times d_1}.$$
Here, $d_k$ denotes the width of the $k$-th layer, where the first layer is the input layer and the $L$-th layer is the output layer. We write "$(i,k)$-node" for the $i$-th node in the $k$-th layer. We denote its output by $x^{(k)}_i$ and its pre-output by $z^{(k)}_i$; namely, $x^{(k)}_i = \max(0, z^{(k)}_i)$ and
$$W_k \begin{pmatrix} x^{(k)}_1 \\ x^{(k)}_2 \\ \vdots \\ x^{(k)}_{d_k} \end{pmatrix} = \begin{pmatrix} z^{(k+1)}_1 \\ z^{(k+1)}_2 \\ \vdots \\ z^{(k+1)}_{d_{k+1}} \end{pmatrix}.$$
We simply write $x_i$ for $x^{(1)}_i$. We denote the output map by $F_w : \mathbb{R}^{d_1} \to \mathbb{R}^{d_L}$; namely,
$$F_w \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d_1} \end{pmatrix} = \begin{pmatrix} z^{(L)}_1 \\ z^{(L)}_2 \\ \vdots \\ z^{(L)}_{d_L} \end{pmatrix}.$$
Let $\Omega = \{(a_i, b_i) \mid i = 1, \ldots, M\} \subset \mathbb{R}^{d_1} \times \mathbb{R}^{d_L}$ be a training sample set. Then we define the loss function as
$$E(w) = \sum_{(a_i, b_i) \in \Omega} \lVert b_i - F_w(a_i) \rVert^2,$$
where $\lVert \cdot \rVert$ is the Frobenius norm.
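The following is a minimal sketch (our illustration, in numpy) that spells out this notation as code: `forward` computes $F_w$ for a bias-free fully connected ReLU network as defined above, and `loss` computes $E(w)$. The function names, weight layout, and example sizes are ours.

```python
import numpy as np

def forward(Ws, x):
    """F_w(x): Ws = [W_1, ..., W_{L-1}], with W_k of shape (d_{k+1}, d_k).
    ReLU is applied after every layer except the last, whose pre-output
    z^(L) is the network output."""
    for W in Ws[:-1]:
        x = np.maximum(0.0, W @ x)    # x^(k+1) = max(0, W_k x^(k))
    return Ws[-1] @ x                  # z^(L)

def loss(Ws, samples):
    """Square loss E(w) = sum_i || b_i - F_w(a_i) ||^2 over the training set."""
    return sum(float(np.sum((b - forward(Ws, a)) ** 2)) for a, b in samples)

# Tiny usage example: L = 3 layers with widths (2, 3, 1).
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
samples = [(np.array([1.0, -0.5]), np.array([0.3]))]
print(loss(Ws, samples))
```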
The main theorem of this paper is given below.

Theorem 1.1 Let $E(w)$ be the loss function of a deep neural network with ReLU activations. Assume that we can obtain all inputs and outputs of $E(w)$. Then we can obtain $\{a'_i\}$ satisfying $c_i a'_i = a_i$ for some $c_i (\neq 0) \in \mathbb{R}$, the number of layers, and the number of nodes in each layer.

This theorem means that the inputs and outputs of the loss function $E(w)$ can reconstruct the inputs of the training samples up to scalar multiplication. In other words, if we can observe every possible training process of deep learning, $\{a_i\}$ is reconstructed as $\{a'_i\}$. In general, $a'_i$ is not equal to $a_i$. However, if we know one entry of $a_i$, we can determine $c_i$ in Theorem 1.1 and hence obtain $a_i$ itself. This indicates that revealing the training process of deep learning carries many risks; consequently, we need to conceal the values of loss functions to protect training samples. We can give a stronger statement after proper mathematical preparation (see Theorem 2.11).

Note that Theorem 1.1 can be generalized as follows. First, we can add any smooth function $r(w)$ to the loss function $E(w)$ as a regularization term. Second, we can change the activation function to any piecewise-linear function such as Leaky ReLU, Maxout, or LWTA [17, 22]. For simplicity, in this paper we treat only the simplest case.

In this section, we prepare the definitions and theorems needed to prove the main result. Our focus is on the loss surface defined by
$$X = \{(w, y) \in \mathbb{R}^N \times \mathbb{R} \mid y = E(w)\},$$
where $E(w)$ is the loss function defined above. From the viewpoint of deep learning, we are interested in the local minima of loss functions. We provide some mathematical frameworks from algebraic geometry. This is a new method to analyze loss surfaces, which can contribute to the theoretical understanding of generalization. For the standard notations of algebraic geometry, we refer to [8, 16]. First, we define semi-algebraic sets, a notion from pure mathematics. Let $X$ be a subset of $\mathbb{R}^N$. $X$ is said to be a semi-algebraic set if $X$ is a finite union of sets defined by polynomial equations $f_i = 0$ and polynomial inequalities $g_j > 0$. If $X$ is a semi-algebraic set, we call the $f_i$ defining equations of $X$ and the $g_j$ defining inequations of $X$. For other notations of semi-algebraic geometry, we refer to [7]. The following theorem shows that loss surfaces are semi-algebraic sets.
Theorem 2.1 (Structure theorem 1) Let $X$ be the loss surface of a square loss function. Then $X$ is a semi-algebraic set of codimension 1.

[Figure 1: Decomposition of a loss surface. The loci $u_i = 0$ and $u_j = 0$ cut the surface into subsurfaces $X_k$.]

Figure 1 indicates the meaning of the theorem. The polynomials $u_i$ divide the loss surface into subsurfaces $X_k$, and each $X_k$ is defined by a (fixed) polynomial. We give the precise description of the $u_i$ later. This theorem allows us to use algebraic geometry for analyzing loss surfaces.
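As a quick empirical check of Theorem 2.1, the following sketch (ours; the network size, random seed, and interval are arbitrary choices) restricts $E$ to a line in weight space: on any region where the ReLU activation pattern is constant, $E$ agrees exactly with a single polynomial of degree at most 4, while no single polynomial fits across regions.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = np.array([1.0, -0.5]), 0.3                 # one training sample
w0, d = rng.normal(size=6), rng.normal(size=6)    # a line w(t) = w0 + t*d in R^6

def E(t):
    """Loss of a 2-2-1 ReLU net along the line; weight layout is ours."""
    w = w0 + t * d
    W1, W2 = w[:4].reshape(2, 2), w[4:].reshape(1, 2)
    return float((b - W2 @ np.maximum(0.0, W1 @ a)) ** 2)

def pattern(t):
    """Hidden-layer ReLU activation pattern at w(t)."""
    w = w0 + t * d
    return tuple((w[:4].reshape(2, 2) @ a) > 0)

ts = np.linspace(-2.0, 2.0, 4001)
ys = np.array([E(t) for t in ts])

# Inside one activation region, E(t) is exactly a polynomial of degree <= 4:
mask = np.array([pattern(t) == pattern(ts[0]) for t in ts])
c = np.polyfit(ts[mask], ys[mask], 4)
print(np.abs(np.polyval(c, ts[mask]) - ys[mask]).max())   # tiny: exact fit

# Across regions a single degree-4 polynomial no longer fits (if the pattern
# happens not to change on this particular line, re-draw the seed):
c_all = np.polyfit(ts, ys, 4)
print(np.abs(np.polyval(c_all, ts) - ys).max())           # visibly nonzero
```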
The second theorem concerns the decomposition of $X$ as a semi-algebraic set. To describe it, we define virtual polynomials, which are functions written as the outputs of nodes. The concept of virtual polynomials plays an important role in this paper.

Definition 2.2
Fix an input $x$ and a weight $w$ on the fixed deep neural network. An $(i,k)$-node is said to be active if its output $x^{(k)}_i$ is positive. We define the set $\{(i, k, q) \mid q \in \{\text{active}, \text{negative}\}\}$, recording the state $q$ of each node, to be the ReLU activation set.

When we speak of a ReLU activation set by itself, it is just a formal assignment of a state to each node; hence, it is irrelevant whether it is realized by an actual input and weight. A ReLU activation set induces a deep linear network. We define virtual polynomials by using these networks.
Definition 2.3
Fix an input $x$. A polynomial $u$ in the weight variables is defined to be a virtual polynomial of type $(i,k)$ if $u = x^{(k)}_i$, where $x^{(k)}_i$ is the output of the $i$-th node in the $k$-th layer of the deep linear network induced by some ReLU activation set. We simply call $u$ a virtual polynomial if $u$ is a virtual polynomial of type $(i,k)$ for some ReLU activation set and some $(i,k)$.

See Figure 2. The virtual polynomials of type $(1,3)$ in this neural network are
$$\{\, w_1 w_5 x_1 + w_2 w_5 x_2 + w_3 w_6 x_1 + w_4 w_6 x_2, \;\; w_1 w_5 x_1 + w_2 w_5 x_2, \;\; w_3 w_6 x_1 + w_4 w_6 x_2, \;\; 0 \,\}.$$
The corresponding ReLU activation sets are
$$\{(1,1,\text{active}), (1,2,\text{active}), (2,1,\text{active}), (2,2,\text{active})\},$$
$$\{(1,1,\text{active}), (1,2,\text{active}), (2,1,\text{active}), (2,2,\text{negative})\},$$
$$\{(1,1,\text{active}), (1,2,\text{negative}), (2,1,\text{active}), (2,2,\text{active})\},$$
$$\{(1,1,\text{active}), (1,2,\text{negative}), (2,1,\text{active}), (2,2,\text{negative})\}.$$

[Figure 2: Neural network with inputs $x_1, x_2$, two hidden nodes, output $v$, and weights $w_1, \ldots, w_6$.]

If we fix a ReLU activation set, we obtain a virtual polynomial. However, even if we fix the virtual polynomial, the ReLU activation set that provides it is not unique. For example, $\{(1,1,\text{negative}), (1,2,\text{negative}), (2,1,\text{negative}), (2,2,\text{negative})\}$ also gives $0$ as a virtual polynomial in the example above.
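A small symbolic sketch (ours, using sympy) that enumerates these four virtual polynomials of type $(1,3)$ by running over the formal activation states of the two hidden nodes; the inputs are kept as symbols purely for display:

```python
import itertools
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
w1, w2, w3, w4, w5, w6 = sp.symbols("w1:7")      # weights as in Figure 2

z = [w1 * x1 + w2 * x2, w3 * x1 + w4 * x2]       # hidden pre-outputs z^(2)_i

# Each ReLU activation set of the hidden layer induces a linear network:
# an "active" node passes its pre-output through, a "negative" one passes 0.
for states in itertools.product(["active", "negative"], repeat=2):
    out = sum(wo * (zi if s == "active" else 0)
              for wo, zi, s in zip([w5, w6], z, states))
    print(states, sp.expand(out))
```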
Now we can state the second theorem.

Theorem 2.4 (Structure theorem 2) Let $X$ be the loss surface of a square loss function, and let $\mathrm{Sing}(X)$ be the set of nonsmooth points of $X$. Then:
• The shortest decomposition (the decomposition that cannot be reduced by removing defining inequations) is given by $\mathrm{Sing}(X)$.
• $\mathrm{Sing}(X)$ is purely of codimension 1 in $X$ (see [3]) and is locally defined by a virtual polynomial.
• $\mathrm{Sing}(X)$ is a semi-algebraic set.
This indicates that $\mathrm{Sing}(X)$ is a natural set not only from the differential-geometric view but also from the algebro-geometric view. By this theorem, $\mathrm{Sing}(X)$ is locally defined by virtual polynomials. Hence, from the algebro-geometric point of view, we need to know the irreducible decomposition of virtual polynomials in order to obtain the geometric structure of $\mathrm{Sing}(X)$.

In this section, we review the factorization of polynomials. We first define the irreducibility of polynomials. Let $f$ be a polynomial with real coefficients in $n$ variables. $f$ is said to be irreducible if we cannot write $f$ as a product of two non-constant polynomials; namely,
$$f = gh \;\Rightarrow\; g \text{ or } h \text{ is constant}.$$
It is well known that polynomials have an irreducible decomposition [5, 8, 24]. Namely, let $f$ be a polynomial with real coefficients in $n$ variables. Then $f$ has a unique decomposition of the form
$$f = f_1 \cdots f_n,$$
where each $f_i$ is an irreducible polynomial with real coefficients, unique up to constant multiplication. We call the $f_i$ the irreducible components of $f$. In Section 2.4, we give the irreducible decomposition of virtual polynomials (see Theorem 2.8).

2.3 Homogeneous polynomials

In this subsection, we review the concepts of homogeneous polynomials and multidegree. Let $x_1, \ldots, x_n$ be variables. The multidegree of each $x_i$ is defined as an element of $\mathbb{Z}^n$. For any monomial $m = x_1^{a_1} \cdots x_n^{a_n}$, we define $\deg(m) = \sum_i a_i \deg(x_i)$. A deep neural network induces a natural multidegree.
Definition 2.5
Let $w^{(k)}_{(i,j)}$ be the weight variable on the path passing from the $i$-th node in the $k$-th layer to the $j$-th node in the $(k+1)$-th layer. Then we define
$$\deg\bigl(w^{(k)}_{(i,j)}\bigr) = (0, \ldots, 0, 1, 0, \ldots, 0),$$
where the $1$ is in the $k$-th entry. We call this multidegree the layer-wise degree.

Fix a multidegree. A polynomial $f$ is said to be homogeneous if every monomial appearing in $f$ has the same multidegree. In this case, we define $\deg(f) = \deg(m)$, where $m$ is any monomial appearing in $f$; $\deg(m)$ does not depend on the choice of $m$. It is well known that any irreducible component of a homogeneous polynomial is homogeneous (see [5, 8, 24]).

We can see an example of the layer-wise degree in Figure 2. The layer-wise degrees of the weights of this neural network are
$$\deg(w_i) = (1, 0) \;\; (i = 1, 2, 3, 4), \qquad \deg(w_i) = (0, 1) \;\; (i = 5, 6).$$
The following theorem points out a key feature of virtual polynomials.
Theorem 2.6
Virtual polynomials of type $(i,k)$ are homogeneous polynomials with respect to the layer-wise degree, with
$$\deg(f) = (1, \ldots, 1, 0, \ldots, 0),$$
where the $1$'s occupy the entries from the first up to the $(k-1)$-th.
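A one-line symbolic check of Theorem 2.6 on the Figure 2 network (our sketch, using sympy): scaling all first-layer weights by $s$ and all second-layer weights by $t$ multiplies the type-$(1,3)$ virtual polynomial by $st$, i.e., it is homogeneous of layer-wise degree $(1,1)$.

```python
import sympy as sp

x1, x2, s, t = sp.symbols("x1 x2 s t")
w1, w2, w3, w4, w5, w6 = sp.symbols("w1:7")

# Type-(1,3) virtual polynomial of the Figure 2 network (all nodes active).
u = sp.expand(w5 * (w1 * x1 + w2 * x2) + w6 * (w3 * x1 + w4 * x2))

scaled = u.subs({w1: s * w1, w2: s * w2, w3: s * w3, w4: s * w4,   # layer 1 -> s
                 w5: t * w5, w6: t * w6})                          # layer 2 -> t
print(sp.expand(scaled - s * t * u))   # 0: u is homogeneous of degree (1, 1)
```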
In this subsection, we give a necessary and sufficient condition for the irreducible decomposition of virtual polynomials.

Definition 2.7
Let $P$ be a ReLU activation set for a fixed input $x$ and weight $w$. Then the $P$-active neural network is the subnetwork consisting of the $P$-active nodes and the paths between them.

An example of a $P$-active neural network is given below; see Figures 2 and 3. We can regard the neural network in Figure 2 as a subnetwork of the one in Figure 3. Assume that $v$ in Figure 3 is negative for some input and weight while the earlier outputs are positive. Then, with this ReLU activation set $P$, the $P$-active neural network is equal to the one in Figure 2.
Theorem 2.8 (Irreducible decomposition theorem) Let $P$ be a ReLU activation set for a fixed input $x$ and weight $w$, and let $u$ be the virtual polynomial of type $(i, L)$ induced by $P$. Then $u = g_1 \cdots g_n$ if and only if the $P$-active neural network has $n - 1$ layers each containing a unique node. Furthermore, we can write each $g_i$ as the output of the subnetwork that starts from one unique node and ends at the next unique node.

A typical example of Theorem 2.8 is given below. Let $u$ be a virtual polynomial with the $P$-active neural network of Figure 3. Then we have the following irreducible decomposition of $u$:
$$u = (w_1 w_5 x_1 + w_2 w_5 x_2 + w_3 w_6 x_1 + w_4 w_6 x_2)(w_7 w_9 + w_8 w_{10}).$$
The first component of the decomposition corresponds to the output of the unique node in the third layer. The second component corresponds to the function that starts from the third layer and ends at the output. Hence, the theorem tells us the irreducible decomposition of a virtual polynomial from its active nodes.

[Figure 3: A $P$-active neural network with input $x$, weights $w_1, \ldots, w_{10}$, and a unique active node in the third layer.]

We can check that $u$ in this example is realized by an input $x$ and weight $w$; hence, $u$ is one of the defining equations of $\mathrm{Sing}(X)$. The decomposition implies that $u = 0$ if and only if
$$w_1 w_5 x_1 + w_2 w_5 x_2 + w_3 w_6 x_1 + w_4 w_6 x_2 = 0 \quad \text{or} \quad w_7 w_9 + w_8 w_{10} = 0.$$
This means that $w_7 w_9 + w_8 w_{10}$ is a defining equation of $\mathrm{Sing}(X)$. However, $w_7 w_9 + w_8 w_{10}$ does not depend on the training samples. Hence, the loss surfaces have differential-geometric structures that are independent of the training samples. Suitable algorithms using such structures could be developed.
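The decomposition can be checked mechanically. Below is our sympy sketch for the Figure 3 network (the weight labels follow our reading of the figure): `factor` returns exactly one factor per bottleneck layer with a unique active node, as Theorem 2.8 predicts.

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 = sp.symbols("w1:11")

# Two inputs -> two hidden nodes -> a unique active node in the third
# layer -> two parallel paths -> the output.
third = w5 * (w1 * x1 + w2 * x2) + w6 * (w3 * x1 + w4 * x2)
u = sp.expand(w9 * w7 * third + w10 * w8 * third)

print(sp.factor(u))
# -> (w7*w9 + w8*w10)*(w1*w5*x1 + w2*w5*x2 + w3*w6*x1 + w4*w6*x2)
```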
Proposition 2.9 $\mathrm{Sing}(X)$ has irreducible components that do not depend on the training samples.

Corollary 2.10
Linear components of virtual polynomials are weight parameters or come from the second layer.
We now state the main result of this paper. Note that, if we know the inputs and outputs of the loss function, we know the defining equations and the defining inequations of $\mathrm{Sing}(X)$.
Theorem 2.11 (Weak reconstruction theorem)
$\mathrm{Sing}(X)$ reconstructs the number of layers, the number of nodes in each layer, and the training samples (those not equal to a unit vector) up to scalar multiplication. Namely, for the input of each training sample $a_i \in \mathbb{R}^n$, a vector $a'_i \in \mathbb{R}^n$ satisfying $a_i = c_i a'_i$ for some $c_i \neq 0$ is reconstructed.

In this theorem, we need infinitely many points on the loss surface. However, if we assume that the points are sufficiently close to each other, we can reconstruct the input of a training sample from finitely many points.
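To illustrate the reconstruction in the simplest possible setting, here is a numerical sketch (ours; the one-hidden-node network, sample values, and kink detector are illustrative choices): querying only $E(w)$, we locate a nonsmooth point along one weight coordinate and read off the hidden input direction up to scale.

```python
import numpy as np

a, b = np.array([0.7, -1.3]), 0.5   # hidden training sample: never accessed directly

def E(w1, w2, w3):
    """Loss of a ReLU net with one hidden node:
    prediction = w3 * max(0, w1*a1 + w2*a2)."""
    return (b - w3 * max(0.0, w1 * a[0] + w2 * a[1])) ** 2

# Scan w1 with w2, w3 fixed. E is piecewise polynomial in w1, with a kink
# where the second-layer virtual polynomial w1*a1 + w2*a2 vanishes,
# i.e. at w1 = -w2*a2/a1 (cf. Theorem 2.4).
w2, w3 = 1.0, 2.0
w1s = np.linspace(-4.0, 4.0, 80001)
vals = np.array([E(w1, w2, w3) for w1 in w1s])
kink = w1s[1:-1][np.argmax(np.abs(np.diff(vals, n=2)))]

print(kink, -w2 * a[1] / a[0])   # detected vs. true nonsmooth location
print([1.0, -kink / w2])         # direction (1, a2/a1): equals a up to scale
```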
Proposition 2.12
Assume that we have $N + 1$ nonsmooth points on the loss surface that are sufficiently close to each other. Then we can obtain a vector $a' \in \mathbb{R}^n$ satisfying $a_i = c a'$ for some $i$ and some $c \neq 0$.

We now give a theoretical algorithm to obtain the weak reconstruction of training samples. The algorithm requires a long running time, but it terminates in finite time. We first estimate the degree of the defining equations from the dimension of the weight space. Then we take a random weight $w_0$ and obtain the defining equation $y = f(w)$ around $w_0$ by taking finitely many points near $w_0$. After that, we find an adjacent division and its defining equation $y = g(w)$ by taking random points $w'$ around $w_0$ and comparing the values of $f(w')$ and $E(w')$. The intersection of these two equations is an irreducible component of $\mathrm{Sing}(X)$; in other words, $f(w) - g(w)$ is a virtual polynomial. We repeat this procedure until we find all the divisions.
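The hyperplane-fitting step behind Proposition 2.12 (see also the final remark of the appendix) can be sketched as follows (our illustration): given $N + 1$ noisily detected nonsmooth weight points lying on the zero set $\{w : a \cdot w = 0\}$ of a second-layer virtual polynomial, the normal vector of the hyperplane through them, recovered here by SVD, is the sample input up to scale.

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.7, -1.3, 2.1])     # hidden sample; N = 3 first-layer weights

# Synthesize N + 1 nonsmooth points: weights on the ReLU boundary a.w = 0,
# plus tiny noise standing in for numerical kink detection.
W = rng.normal(size=(4, 3))
W -= np.outer(W @ a, a) / (a @ a)   # project onto the hyperplane {w : a.w = 0}
W += 1e-9 * rng.normal(size=W.shape)

# The normal of the hyperplane through the points is the right singular
# vector of W with the smallest singular value; it equals a up to scale.
a_hat = np.linalg.svd(W)[2][-1]
print(a / np.linalg.norm(a))        # true direction
print(a_hat)                        # recovered direction (up to sign)
```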
Sketch of proofs

We give sketches of the proofs in this section; the complete proofs are given in the appendix.
Sketch of the proof of Theorem 2.1. Let $u_{(i,k)}$ be a virtual polynomial at $(i,k)$, and put
$$W_u = \{ w \in \mathbb{R}^N \mid u_{(i,k)}(w) = 0 \}.$$
Since $W_u$ is defined by a single polynomial, $W_u$ divides the weight space into two areas defined by inequalities. Adding all the $W_u$ divides the weight space into many areas defined by inequalities. Fix an input $x$ and take two weights $w_1$ and $w_2$. If $w_1$ and $w_2$ belong to the same area, the ReLU activation set associated with $(x, w_1)$ and the one associated with $(x, w_2)$ are the same. This implies that, while the weights stay in the same area, every entry of $F_w$ is a polynomial; hence the loss function is a polynomial there. This means that the loss surface is a semi-algebraic set.
Sketch of the proof of Theorem 2.11. By assumption, we know the defining equations of $\mathrm{Sing}(X)$, so we can pick out the linear polynomials among them. We show that, if such a linear polynomial is not a weight parameter, its coefficients are equal to the input of some sample up to scalar multiplication. First, we remark that the coefficients of the virtual polynomials of the second layer are equal to the inputs of some samples up to scalar multiplication. Since the defining polynomials of $\mathrm{Sing}(X)$ include the virtual polynomials of the second layer, it is enough to show that any linear polynomial appearing in the defining polynomials is a virtual polynomial of the second layer. If a linear polynomial appears in the defining polynomials of $\mathrm{Sing}(X)$, it is an irreducible component of a virtual polynomial. By Theorem 2.8, it is then a virtual polynomial of the second layer or a weight parameter: a linear irreducible component that is not a virtual polynomial of the second layer must start from the $l$-th layer with one active node and end at the $(l+1)$-th layer with one active node, and such a component is a weight parameter. Hence, we reconstruct the input of a sample up to scalar multiplication, together with the weights on the paths from the first layer to the second layer. We can then find a quadric polynomial among the defining equations of $\mathrm{Sing}(X)$ that contains the weights on the paths from the first layer to the second layer; the remaining weights in it are the weights on the paths from the second layer to the third layer. Inductively, we can reconstruct the number of nodes and layers.

Conclusion

In this paper, we presented a new mathematical framework based on algebraic geometry and some new concepts, including virtual polynomials. Using these, we proved a structure theorem for loss surfaces and an irreducible decomposition theorem for virtual polynomials. The main contribution of this paper is the reconstruction theorem for samples: the training process of deep learning can leak information about the samples. While this fact is important on its own, the proposed framework contributes more broadly. It enables researchers in machine learning and algebraic geometry to pursue research on deep learning, and it may lead to new algorithms for security issues as well as more efficient training algorithms. More theoretical understanding of deep learning is required, and this framework has the potential to contribute to it.
Acknowledgments
The author would like to thank Prof. Masashi Sugiyama and Kenichi Bannai for the opportunity to study machine learning at RIKEN AIP. The author would like to thank Prof. Jun Sakuma and Takanori Maehara for carefully reading the draft and offering valuable advice. The author would also like to thank Prof. Shuji Yamamoto and Sumio Watanabe for fruitful discussions. The author was partially supported by JSPS Grant-in-Aid for Young Scientists (B) 16K17581.

References

[1] A. Abdulkader, A. Lakshmiratan, and J. Zhang. (2016) Introducing DeepText: Facebook's text understanding engine. https://tinyurl.com/jj359dv
[2] A. Cruz-Roa, J. Ovalle, A. Madabhushi, and F. Osorio. (2013) A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer Berlin Heidelberg, 403–410.

[3] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and G. Felici. (2015) Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10, 3, 137–150.

[4] P. Baldi and K. Hornik. (1989) Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.

[5] W. Bruns and H. J. Herzog. (1998) Cohen-Macaulay Rings. Cambridge University Press.

[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537.

[7] M. Coste. (2002) An introduction to semialgebraic geometry. Tech. rep., Institut de Recherche Mathematiques de Rennes.

[8] D. Cox, J. Little, and D. O'Shea. (1992) Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer.

[9] DeepMind. (2016) DeepMind Health: Clinician-led, patient-centred. https://deepmind.com/applied/deepmind-health/

[10] R. Fakoor, F. Ladhak, A. Nazi, and M. Huber. (2013) Using deep learning to enhance cancer diagnosis and classification. In The 30th International Conference on Machine Learning (ICML 2013), WHEALTH workshop.

[11] M. Fredrikson, S. Jha, and T. Ristenpart. (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, 1322–1333.

[12] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart. (2014) Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), 17–32.

[13] I. Goodfellow, Y. Bengio, and A. Courville. (2016) Deep Learning. MIT Press.

[14] Google DeepMind. (2016) AlphaGo, the first computer program to ever beat a professional player at the game of Go. https://deepmind.com/alpha-go

[15] A. Graves, A. Mohamed, and G. Hinton. (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 6645–6649.

[16] R. Hartshorne. (1977) Algebraic Geometry. Springer-Verlag, New York, Graduate Texts in Mathematics, No. 52.

[17] K. He, X. Zhang, S. Ren, and J. Sun. (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision.

[18] B. Hitaj, G. Ateniese, and F. Perez-Cruz. (2017) Deep models under the GAN: Information leakage from collaborative deep learning. CoRR, abs/1702.07464.

[19] M. Lai. (2015) Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549.

[20] K. Kawaguchi. (2016) Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 586–594.

[21] Y. LeCun, K. Kavukcuoglu, and C. Farabet. (2010) Convolutional networks and applications in vision. In ISCAS, 253–256.

[22] Z. Liao and G. Carneiro. On the importance of normalisation layers in deep learning with piecewise linear activation units. arXiv preprint arXiv:1508.0033.

[23] R. Livni, D. Lehavi, S. Schein, H. Nachlieli, S. Shalev-Shwartz, and A. Globerson. (2013) Vanishing component analysis. In 30th International Conference on Machine Learning.

[24] H. Matsumura. Commutative Ring Theory. Cambridge Studies in Advanced Mathematics.

[25] M. Marshall. (2008) Positive Polynomials and Sums of Squares. Mathematical Surveys and Monographs, Volume 146.

[26] J. Pennington and Y. Bahri. (2017) Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2798–2806.

[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. (2014) DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 1701–1708.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[29] S. Watanabe. (2009) Algebraic Geometry and Statistical Learning Theory. Cambridge University Press.

[30] X. Zhang and Y. LeCun. (2016) Text understanding from scratch. arXiv preprint arXiv:1502.01710v5.
Appendix
Proof of Theorem 2.1. Let $a_p$ be the input of a training sample, and let $V_p$ be the set of virtual polynomials for the input $a_p$. We show that the loss surface is defined by the inequations in
$$V = \bigcup_{p=1}^{M} \bigl\{\, u^{(p)}_{(i,k)}(w) > 0, \;\; u^{(p)}_{(i,k)}(w) < 0 \;\bigm|\; u^{(p)}_{(i,k)} \in V_p \,\bigr\}$$
together with the corresponding equations. Put
$$W(u^{(p)}_{(i,k)}) = \{ w \in \mathbb{R}^N \mid u^{(p)}_{(i,k)}(w) = 0 \}, \quad W^{+}(u^{(p)}_{(i,k)}) = \{ w \mid u^{(p)}_{(i,k)}(w) > 0 \}, \quad W^{-}(u^{(p)}_{(i,k)}) = \{ w \mid u^{(p)}_{(i,k)}(w) < 0 \}.$$
Take two weights $w_1$ and $w_2$ from
$$\Bigl(\bigcap_{u^{(p)}_{(i,k)} \in V^{+}} W^{+}(u^{(p)}_{(i,k)})\Bigr) \cap \Bigl(\bigcap_{u^{(p)}_{(i,k)} \in V^{-}} W^{-}(u^{(p)}_{(i,k)})\Bigr),$$
where $V^{+} \cup V^{-} = V$. Then the ReLU activation set associated with $(a_p, w_1)$ and the one associated with $(a_p, w_2)$ are the same. This implies that, while the weights stay in the same area, each entry of $F_w(a_p)$ is a polynomial; hence the loss function is a fixed polynomial $f(w)$ in this area, and the loss surface is defined by $y = f(w)$ there. If the weights lie in $\bigcap_{u^{(p)}_{(i,k)} \in V'} W(u^{(p)}_{(i,k)})$ for some subset $V'$, put $V^{*} = V - V'$; the same argument applies to $V^{*}$. This shows that the loss surface is a semi-algebraic set.

Proof of Theorem 2.4. The proof of Theorem 2.1 implies that the sets $W(u^{(p)}_{(i,k)})$ give a decomposition. Put
$$\widetilde{W} = \Bigl(\bigcap_{u^{(p)}_{(i,k)} \in V^{+}} W^{+}(u^{(p)}_{(i,k)})\Bigr) \cap \Bigl(\bigcap_{u^{(p)}_{(i,k)} \in V^{-}} W^{-}(u^{(p)}_{(i,k)})\Bigr), \qquad V^{+} \cup V^{-} = V.$$
We claim that the loss surface is smooth over the domain $\widetilde{W}$. By the proof of Theorem 2.1, the loss function is a polynomial on $\widetilde{W}$, so the loss surface is defined there in the form $y = f(w)$ with $f$ a polynomial; by the Jacobian criterion, the loss surface is smooth over $\widetilde{W}$ for any $V^{+}, V^{-}$.

We next claim that we can erase the virtual polynomial $u^{(p)}_{(i,k)}$ from the defining inequations if and only if $W(u^{(p)}_{(i,k)})$ consists of smooth points. Take a point $x$ in $W(u^{(p)}_{(i,k)})$. If the point is smooth, we can take the Taylor expansion of $E(w)$ there; since $E(w)$ is a polynomial in a neighborhood of $x$, the Taylor expansion of $E(w)$ is a polynomial, and we can erase $u^{(p)}_{(i,k)}$ from the defining inequations. Conversely, assume that we can erase $u^{(p)}_{(i,k)}$ from the defining inequations. Since the loss surface around any point of $W(u^{(p)}_{(i,k)})$ is then defined by a single polynomial, every such point is smooth.

Finally, we show that $\mathrm{Sing}(X)$ is a semi-algebraic set of codimension 1 in $X$. Note that $\mathrm{Sing}(X)$ is locally of the form $W(u^{(p)}_{(i,k)})$, which implies that $\mathrm{Sing}(X)$ has codimension 1 in $X$. Consider again the decomposition from the proof of Theorem 2.1. In each domain, whether $W(u^{(p)}_{(i,k)})$ is singular or not is fixed, because $u^{(p)}_{(i,k)}$ is a polynomial in each domain. This implies that $\mathrm{Sing}(X)$ is a semi-algebraic set.

Proof of Theorem 2.6. By the construction of a virtual polynomial, the weights of each layer appear exactly once in each monomial. This implies that the layer-wise degree is equal to $(1, \ldots, 1, 0, \ldots, 0)$.

Proof of Theorem 2.8. We prove the theorem for any connected deep neural network by induction on the number of layers. If $n = 1$, the statement is clear. Assume $n > 1$. First, by Theorem 2.6, virtual polynomials of type $(i, L)$ are homogeneous polynomials of layer-wise degree $(1, 1, \ldots, 1)$, where the layer-wise degree assigns the degree $(0, \ldots, 0, 1, 0, \ldots, 0)$, with the $1$ in the $l$-th entry, to the weights on the paths passing from the $l$-th layer to the $(l+1)$-th layer.
Assume that $u = g_1 \cdots g_n$. Then, by the general theory of commutative algebra, the $g_i$ are homogeneous polynomials with respect to the layer-wise degree, and we have
$$\sum_{i=1}^{n} \deg(g_i) = \deg(u) = (1, 1, \ldots, 1).$$
We may assume that the entries of $\deg(g_1)$ are $1$ from the first entry to the $l$-th entry and that the $(l+1)$-th entry of $\deg(g_1)$ is zero; hence we may also assume that the $(l+1)$-th entry of $\deg(g_2)$ is $1$. Let $w^{(l)}_{(i,j)}$ be the weight on the path passing from the $i$-th node in the $l$-th layer to the $j$-th node in the $(l+1)$-th layer. By the layer-wise degree, there are monomials containing $w^{(l)}_{(i,j)}$ in $g_1$ and monomials containing $w^{(l+1)}_{(r,s)}$ in $g_2$. By the construction of virtual polynomials, there is no monomial in $u$ containing $w^{(l)}_{(i,j)} w^{(l+1)}_{(r,s)}$ if $j \neq r$. However, $u = g_1 \cdots g_n$ implies that $u$ has monomials containing $w^{(l)}_{(i,j)} w^{(l+1)}_{(r,s)}$ for all such $i, j, r, s$. This implies that the $(l+1)$-th layer of the $P$-active neural network has a unique node. Hence, by the construction of virtual polynomials, we obtain $u = g' u'$, where $g'$ is the output of the unique node in the $(l+1)$-th layer. By the general theory of commutative algebra (uniqueness of the decomposition), we obtain $g_1 = g'$. Since we can regard $u'$ as a virtual polynomial of the subnetwork that starts from the $(l+1)$-th layer, induced by the output of the $(l+1)$-th layer and the same weights, the theorem holds for $u'$ by the inductive hypothesis. This completes the proof of the first statement of Theorem 2.8; the remaining claim follows from the construction of $g'$.
Proof of Theorem 2.11. By assumption, we know the defining equations of $\mathrm{Sing}(X)$, so we can pick out the linear polynomials among them. Let $L_i = \sum_{k=1}^{m} a'_k w_k$ be such a linear polynomial. We show that, if $L_i$ is not a weight parameter, then $(a'_k)$ is equal to the input of some sample up to scalar multiplication. First, we remark that the coefficients of the virtual polynomials of the second layer are equal to the inputs of some samples. Since the defining polynomials of $\mathrm{Sing}(X)$ include the virtual polynomials of the second layer, it is enough to show that any linear polynomial appearing in the defining polynomials is a virtual polynomial of the second layer. If a linear polynomial appears in the defining polynomials of $\mathrm{Sing}(X)$, it is an irreducible component of a virtual polynomial. By Theorem 2.8, it is then a virtual polynomial of the second layer or a weight parameter: a linear irreducible component that is not a virtual polynomial of the second layer must start from the $l$-th layer with one active node and end at the $(l+1)$-th layer with one active node, and such a component is a weight parameter. Hence, we reconstruct the inputs of the samples up to scalar multiplication, together with the weights on the paths from the first layer to the second layer.

Next, we identify the weights on the paths from the $l$-th layer to the $(l+1)$-th layer by induction on $l$. By the discussion above, we have identified the weights of the first layer, namely the case $l = 1$. Assume that we have identified the weights of the $l$-th layer for all $l < k$. Pick the polynomials $f$ among the defining inequations of degree $k$ such that each monomial of $f$ consists of weights of layers $l < k$ except for one weight parameter. One easily sees by induction on $l$ that such a weight lies on a path from the $k$-th layer to the $(k+1)$-th layer. Hence, we can also obtain the number of layers. This completes the proof.

Proof of Proposition 2.12. Note that the ambient space of the loss surface is $\mathbb{R}^{N+1}$. Hence, we can determine the hyperplane going through the $N + 1$ points. This hyperplane is a linear component of $\mathrm{Sing}(X)$, and we obtain a sample up to scalar multiplication.