Dropping Activation Outputs with Localized First-layer Deep Network for Enhancing User Privacy and Data Security
Hao Dong, Chao Wu, Zhen Wei, Yike Guo
Department of Computing, Imperial College London
{hao.dong11, chao.wu, zhen.wei11, y.guo}@imperial.ac.uk

Abstract—Deep learning methods can play a crucial role in anomaly detection, prediction, and decision support for applications such as personal health-care and pervasive body sensing. However, the current architecture of deep networks suffers from a privacy issue: users must give their data to the model (typically hosted on a server or a cluster on the Cloud) for training or prediction. This problem is more severe for sensitive health-care or medical data (e.g. fMRI, or body-sensor measurements such as EEG signals). In addition, there is a security risk of leaking these data during transmission from the user to the model (especially over the Internet). Targeting these issues, in this paper we propose a new architecture for deep networks in which users do not reveal their original data to the model. In our method, feed-forward propagation and data encryption are combined into one process: we migrate the first layer of the deep network to users' local devices, apply the activation functions locally, and then use the "dropping activation output" method to make the output non-invertible. The resulting approach makes model predictions without accessing users' sensitive raw data. The experiments conducted in this paper show that our approach achieves the desired privacy protection and demonstrates several advantages over the traditional approach with encryption / decryption.
I. INTRODUCTION
Deep learning, also known as deep neural networks [1], has proved successful for classification and regression tasks. Once trained, deep learning models exhibit the promising performance desired by many real-time prediction / decision-making applications [2], such as automatic speech recognition, image recognition, and natural language processing.

For these applications, hardware limitations (i.e. computation capability constrained by power consumption) make it very expensive to run deep learning algorithms entirely on users' local devices (sensors, mobiles, and even laptops). As a result, the typical approach is to collect data locally, transmit the data to a remote server / cluster, and then apply the deep network for training / prediction. However, this approach suffers from an important privacy issue, especially for sensitive data, e.g. patients' clinical data (such as fMRI images) and sensitive environment data (such as video monitoring of public spaces). While these kinds of data are becoming ever more important for big data analytics (e.g. for personalized health-care), sending them directly to a remote model raises a privacy concern. For example, it would have severe consequences if a company acting as a model provider, holding many users' health-care data such as DNA, EHR, and fMRI, were invaded by malicious hackers. There is also the security risk of leaking data during transmission from the user to the model (especially over the Internet). It is therefore problematic to transfer users' sensitive data directly to a deep learning model hosted on a remote server.

There has been some research on running neural network prediction on encrypted data rather than the original data, so that users / sensors need not share raw data with remote servers [6], [7]. However, these approaches bring extra computation cost, and they are no longer safe once the encryption key is hacked.

Targeting this issue, we propose a novel deep network architecture in which users do not need to reveal their original data to the model. Instead of the traditional approach shown in Fig. 1, we split the layers between the local device and the server, migrating the first layer to the local device (as shown in Fig. 2). We apply the activation functions of the first layer locally and transfer the first layer's outputs to the server. We then use the "dropping activation output" method, which randomly drops some outputs of the activation function, to make the output non-invertible. In this method, feed-forward propagation and data encryption are combined into one process.

We investigated both invertible and non-invertible activations and their performance in privacy protection. We prove that this method ensures the original data cannot be recovered from the transmitted data. Concretely, the main contributions of this paper are as follows:

• We introduce a new deep network architecture that combines neural networks and data encryption to address the privacy and data security problem. With privacy-preserving feed-forward propagation, the server (model provider) can serve users without directly accessing their data.

• We prove and evaluate that dropping a few activation outputs can encrypt data for invertible activation functions. We also show that the ramp function is better than the rectifier in terms of encryption.

• We propose privacy-preserving error-back propagation, by which the server can train a neural network for a user without accessing the user's data.
• With our method, data compression for data transmission comes for free when the number of neurons is smaller than the size of the input data.

The paper is organized as follows. Section II first introduces the new architecture of the deep network and then discusses invertible and non-invertible activation functions, focusing mainly on encrypting the data for invertible activation functions. We then give the details of the "dropping activation output" method and how it works during feed-forward propagation without any additional computation. Section III provides mathematical proofs and explanations for our method. Section IV presents our experiments and their results. We conclude our work in Section V.

II. METHOD: DROPPING ACTIVATION OUTPUTS WITH LOCALIZED FIRST-LAYER DEEP NETWORK
We first introduce our new architecture of the localized first-layer deep network, and then introduce the proposed encryption methods for both invertible activations (like sigmoid) and non-invertible activations (like the rectifier). The main content focuses on a method for encrypting data under invertible activation functions.
A. Localized First-layer Deep Network
Fig. 1 illustrates the traditional deep network architecture: the local device sends the data to the server with encryption / decryption processes, and the server uses the original data to compute the result. The general equation of a single layer can be written as

a = f(x ∗ W + b)    (1)

where x is a row vector of original input data, W is the weight matrix, b is the row vector of biases, a is a row vector of activation outputs, and f is the activation function, such as sigmoid, hyperbolic tangent, softplus, rectifier [9], maxout [18], etc.

From a security standpoint, the activation outputs a can be captured by a network sniffer, and the weight matrix W and bias b can be acquired by hacking the software. As a result, given a, W, b and f, the input data x can be fully reconstructed by Equation (2), where W^{-1} is the inverse matrix of W and f^{-1} is the inverse activation function. Therefore, encrypting the activation outputs a is desired for data privacy and security.

x = (f^{-1}(a) − b) ∗ W^{-1}    (2)

Fig. 1. Implementing feed-forward propagation in the server; x is the input data, y is the output value, P(y|x) is the probability of y given x.

Our new architecture decomposes the layers of a neural network between the local device and the server, as shown in Fig. 2. The local device transfers the activation outputs of the first hidden layer to the server, and the server can only use these activation outputs to compute the result.

Fig. 2. Splitting the neural network during feed-forward propagation.

We will now discuss how this architecture enables privacy protection with both non-invertible and invertible activation functions.
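The following minimal sketch (Python / NumPy; the shapes and random weights are illustrative, not a trained model) simulates the split feed-forward of Fig. 2 together with the reconstruction attack of Equation (2), using a sigmoid first layer and a square, invertible W:

import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))        # first-layer weights, square so W^-1 exists
b = rng.normal(size=d)             # first-layer biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Local device: compute and transmit only the first-layer activations.
x = rng.random(d)                  # user's raw data, never sent
a = sigmoid(x @ W + b)             # sent to the server

# Attacker knowing W, b and f applies Equation (2): x = (f^-1(a) - b) W^-1.
x_rec = (-np.log(1.0 / a - 1.0) - b) @ np.linalg.inv(W)
print(np.allclose(x, x_rec))       # True: unprotected activations leak x

Without further protection, anyone holding a, W and b recovers x exactly, which motivates the encryption methods discussed next.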
B. Non-invertible Activation Functions
Several recent activation functions are non-invertible [9], [18]. For example, the rectifier is commonly used in deep neural networks and can reach a network's best performance without any unsupervised pre-training on unlabeled data [9]. Its output is a linear function of its positive inputs, so the vanishing gradient problem is reduced. The rectifier is shown in Equation (3), where z = x ∗ W + b; it sets all negative activation outputs to zero.

a = f(z) = { 0, when z < 0
             z, otherwise }    (3)

Non-invertible activation functions naturally encrypt the activation outputs to some degree. For non-invertible activation functions, the best way to reconstruct the input data is to use an approximated inverse activation function f̂^{-1} and the transposed weight matrix W^T (instead of W^{-1}), as Equation (4) describes. The mathematical explanation can be found in Section III.

x̂ = (f̂^{-1}(a) − b) ∗ W^T    (4)

However, due to the linearity of the rectifier, the activation outputs remain highly related to the scale of the feature patterns in the inputs, so the general features of the original data can still be recovered by Equation (4). To alleviate this problem, we introduce the ramp activation function of Equation (5), where z = x ∗ W + b and v is a small value. A smaller v cuts away more scale information, so the activation outputs contain less scale information and more information about feature combinations (i.e. a set of small feature identifiers). The rectifier and ramp functions are evaluated in Section IV-A.

a = f(z) = { 0, when z < 0
             z, when 0 ≤ z < v
             v, when z ≥ v }    (5)

From an information point of view, the non-invertible outputs cause information loss, so the original data cannot be fully reconstructed. From a neural network perspective, however, the non-invertible function selects features; the information that matters is not lost, and accuracy can even improve [9]. On this principle, even if just one activation output in the rectifying layer is turned to zero from a negative value, the original input data cannot be reconstructed exactly. In practice, due to the sparsity of a rectifying layer, a large portion of the rectifying outputs will be zero, so the reconstructed data will be completely distorted.

Hence, data encryption and privacy are realized through non-invertible activation functions, as the input of each hidden layer cannot be reconstructed. Even if all parameters of the model are known, the server still cannot reconstruct the original inputs, so privacy is preserved for users. Apart from the rectifier and ramp functions, other activation functions such as the binary step and maxout [18] are also non-invertible.
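A minimal sketch of Equations (3)-(5) follows (random, untrained weights of the same shape as our 784-800 first layer; correlation is used only as a rough similarity measure). It compares how much of x survives the approximate reconstruction of Equation (4):

import numpy as np

def rectifier(z):
    return np.maximum(0.0, z)          # Equation (3)

def ramp(z, v=0.05):
    return np.clip(z, 0.0, v)          # Equation (5); smaller v keeps less scale

rng = np.random.default_rng(1)
x = rng.random(784)
W = rng.normal(scale=784 ** -0.5, size=(784, 800))
b = np.zeros(800)

for f in (rectifier, ramp):
    a = f(x @ W + b)
    x_hat = (a - b) @ W.T              # Equation (4): W^T in place of W^-1
    corr = np.corrcoef(x, x_hat)[0, 1]
    print(f.__name__, round(corr, 3))  # ramp typically tracks x less closely

The lower similarity for the ramp reflects its removal of scale information.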
C. Invertible Activation Functions

Unlike non-invertible activation functions, reconstruction is solvable when the activation function is invertible, e.g. sigmoid, hyperbolic tangent, softplus, etc. The sigmoid and its inverse are shown in Equations (6) and (7).

a = 1 / (1 + e^{−(x ∗ W + b)})    (6)

x = (−ln(1/a − 1) − b) ∗ W^{-1}    (7)

Therefore, the only ways to encrypt x are to modify a, W or b. Any modification brings uncertainty into the neural network, so the key is to find a modification method that does not affect the final prediction result.

Due to the sparse behavior of state-of-the-art neural networks, accuracy is not affected by slightly changing activation outputs and weights. First, during training, the proposed method does not compromise the learning results, because both Dropout and Dropconnect are state-of-the-art ways to avoid overfitting. For testing / inference, a non-invertible activation does not affect the result at all. For invertible activations, dropping a few activation outputs also does not harm the final result, because:

1) From a theoretical point of view, Dropout training can be considered as training many sub-networks to do the same job; the result is the average over all sub-networks, which is a form of ensemble learning and has recently been shown to correspond to a Gaussian process [11], [15]. Moreover, reasonable uncertainty in a neural network during testing does not affect the test result if the network was trained with uncertainty [15], [14], [16], [17].

2) From an empirical point of view, the dropping probability required by our method (0.5%) is quite low compared with the dropping probability used during training (usually 20 to 50%), and does not affect the final averaged result over the sub-networks. Tables I and II show the effect of different dropping probabilities with the sigmoid function; as expected, our method does not lead to noticeable performance decreases, since a small dropping probability is enough for encryption.

The sparse behavior of neural networks can be achieved by Dropout, Dropconnect, and Autoencoders ([10], [11], [12], [13]). In this paper, we propose the ideas of Dropping activation outputs and Dropping connections. Their definitions differ from Dropout and Dropconnect: they are applied during feed-forward propagation, whereas Dropout and Dropconnect are applied during error-back propagation. Both are illustrated in Fig. 3.

Dropping activation outputs during feed-forward process:
To encrypt the data during the feed-forward propagation process, we propose a method called Dropping activation outputs. Mathematically, the modified activation output â is written as

â = d ⊙ f(x ∗ W + b)    (8)

where ⊙ denotes element-wise multiplication (the Hadamard product), and d is a random binary vector that sets some activation outputs to zero. Setting an activation output to zero is equivalent to removing all connections between a neuron and the input data. We describe this in more detail in Section III. Equations (2) and (4) can both be used to attempt reconstruction; as with non-invertible activations, we found that Equation (4) reconstructs the data better.
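A minimal sketch of Equation (8) (the 0.5% rate anticipates Section IV; the mask generation is illustrative):

import numpy as np

def drop_activation_outputs(x, W, b, f, p=0.005, rng=None):
    # Equation (8): a_hat = d ⊙ f(xW + b), with d a random binary vector.
    if rng is None:
        rng = np.random.default_rng()
    a = f(x @ W + b)
    d = (rng.random(a.shape) >= p).astype(a.dtype)   # ≈ p of entries become 0
    return d * a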
Another proposed method is called Dropping connections. In dropping connections, some elements of the weight matrix W are set to zero. The modified weight matrix Ŵ and the modified activation output ã are written as

Ŵ = W ⊙ D    (9)

ã = f(x ∗ Ŵ + b)    (10)

where D is a binary matrix with some elements set to zero. However, our experiments show that this method is not effective enough to encrypt the data.
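For comparison, a minimal sketch of Equations (9) and (10) under the same assumptions:

import numpy as np

def drop_connections(x, W, b, f, p=0.005, rng=None):
    # Equations (9)-(10): a_tilde = f(x (W ⊙ D) + b), D a random binary matrix.
    if rng is None:
        rng = np.random.default_rng()
    D = (rng.random(W.shape) >= p).astype(W.dtype)   # zero ≈ p of the weights
    return f(x @ (W * D) + b)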
III. MATHEMATICAL EXPLANATION

If the activation is linear, we have:

a = x ∗ W + b    (11)

x = (a − b) ∗ W^{-1}    (12)

We now discuss why dropping connections does not encrypt as well as dropping activation outputs.

Fig. 3. Dropping activation outputs (Left) and Dropping connections (Right) during feed-forward propagation.
A. Dropping connections
Let ∆W be the difference between W and Ŵ, i.e. ∆W = W − Ŵ. Replacing a in Equation (12) using Equation (11), and letting ã denote the activation computed with the modified weights, the reconstructed input x̃ can be written as

x̃ = (ã − b) W^{-1}
   = (x ∗ (W ⊙ D) + b − b) W^{-1}
   = x (W ⊙ D) W^{-1}    (13)
   = x (W − ∆W) W^{-1}
   = x (I − ∆W W^{-1})

Note that only a few entries of ∆W are non-zero, hence the product ∆W W^{-1} is very small and can be neglected. The reason is that W W^{-1} is an identity matrix: the sum of the products of the entries in row i of W with the corresponding entries in column j of W^{-1} is 1 when i = j and 0 otherwise. Since ∆W changes only a few of the terms in each such sum, ∆W W^{-1} ≈ 0, so (I − ∆W W^{-1}) approximates an identity matrix. This means the reconstruction is nearly identical to the original input, i.e. x̃ ≈ x, and the encryption performance is therefore poor. Mathematically, this corresponds to the no-change case of an affine transformation [28].

This result applies to any c-by-d matrix W with c ≤ d: since ∆W_{c×d} changes only a few entries, only a few terms are affected in each summation of d terms, so ∆W_{c×d} W^{-1}_{d×c} ≈ 0; hence I_c − ∆W_{c×d} W^{-1}_{d×c} approximates the identity matrix I_c, which means x̃_{1×c} ≈ x_{1×c}.

As an example, let W be a 3-by-5 matrix, so that W W^{-1} = I_3, i.e. Σ_{k=1}^{5} W_{ik} W^{-1}_{kj} = 1 when i = j and 0 otherwise. Writing ∗ for a non-zero entry:

• When the changed entries all lie in a single row of ∆W, only that row of ∆W W^{-1} is non-zero; for instance, changes confined to the first, second, or third row give, respectively,

∆W W^{-1} = [ ∗ ∗ ∗ ; 0 0 0 ; 0 0 0 ],  [ 0 0 0 ; ∗ ∗ ∗ ; 0 0 0 ],  [ 0 0 0 ; 0 0 0 ; ∗ ∗ ∗ ]    (14)-(16)

• When the changed entries lie in several rows, the corresponding rows of ∆W W^{-1} are non-zero; e.g. with changes in two rows or in all three rows,

∆W W^{-1} = [ ∗ ∗ ∗ ; ∗ ∗ ∗ ; 0 0 0 ],  [ ∗ ∗ ∗ ; ∗ ∗ ∗ ; ∗ ∗ ∗ ]    (17), (18)

In every case, because each affected sum of five terms changes in only a few terms, every entry of ∆W W^{-1} is small; hence I − ∆W W^{-1} approximates the identity matrix I, which means x̃_{1×3} ≈ x_{1×3}.
B. Dropping activation outputs

For Dropping activation outputs, a linear activation is again used in the explanation. Let the modified activation output be â and the corresponding reconstructed input be x̂; then

x̂ = (â − b) W^{-1}
   = (a − ∆a − b) W^{-1}    (19)
   = (a − b) W^{-1} − ∆a W^{-1}
   = x − ∆a W^{-1}

Hence for any activation output vector a of size 1-by-d and weight matrix W of size c-by-d with c ≤ d, if a few entries of a change, the reconstructed vector x̂ differs from x by the term ∆a W^{-1}, whose entries are in general not small. Therefore the corresponding affine transformation is not the no-change case, and x̂ ≠ x.

For example, let a = (a₁, a₂, a₃), ∆a = (∆a₁, ∆a₂, ∆a₃), and

W^{-1}_{3×2} = [ w₁₁ w₁₂ ; w₂₁ w₂₂ ; w₃₁ w₃₂ ]

Then x can be written as

x = (x₁, x₂) = a W^{-1}_{3×2} = (a₁w₁₁ + a₂w₂₁ + a₃w₃₁, a₁w₁₂ + a₂w₂₂ + a₃w₃₂)    (20)

and x̂ can be written as

x̂ = (x̂₁, x̂₂) = â W^{-1}_{3×2}
   = ((a₁−∆a₁)w₁₁ + (a₂−∆a₂)w₂₁ + (a₃−∆a₃)w₃₁, (a₁−∆a₁)w₁₂ + (a₂−∆a₂)w₂₂ + (a₃−∆a₃)w₃₂)    (21)

which means

(x̂₁, x̂₂) = (x₁ − (∆a₁w₁₁ + ∆a₂w₂₁ + ∆a₃w₃₁), x₂ − (∆a₁w₁₂ + ∆a₂w₂₂ + ∆a₃w₃₂))    (22)

In this case, the terms ∆a₁w₁₁ + ∆a₂w₂₁ + ∆a₃w₃₁ and ∆a₁w₁₂ + ∆a₂w₂₂ + ∆a₃w₃₂ are unknown and unrestricted; they can be any numbers, so the transformation from x to x̂ is not the identity.
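Both conclusions of this section can be checked numerically. The sketch below uses a linear activation and a square random W so that W^{-1} exists exactly; the size and the number of dropped entries are illustrative:

import numpy as np

rng = np.random.default_rng(2)
d = 200
x = rng.random(d)
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
W_inv = np.linalg.inv(W)
a = x @ W + b

def rel_err(x_hat):
    return np.linalg.norm(x_hat - x) / np.linalg.norm(x)

# Dropping connections (Equation 13): zero a few individual weight entries.
D = np.ones_like(W)
D[rng.integers(0, d, 20), rng.integers(0, d, 20)] = 0.0
x_conn = ((x @ (W * D) + b) - b) @ W_inv
print("connections:", rel_err(x_conn))   # comparatively small perturbation

# Dropping activation outputs (Equation 19): zero a few whole outputs.
a_hat = a.copy()
a_hat[rng.integers(0, d, 20)] = 0.0
x_act = (a_hat - b) @ W_inv
print("activations:", rel_err(x_act))    # roughly an order of magnitude larger

The first error stays comparatively small because each dropped weight entry removes a single term x_i W_{ij} from one activation, whereas dropping an activation output removes an entire sum of d terms, in line with Equations (13) and (19); the gap grows with the layer width d.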
IV. EXPERIMENTAL STUDIES

We used the MNIST dataset to evaluate our methods; it has a training set of 50k images, a validation set of 10k, and a test set of 10k. Each image is a 28 x 28 grey-scale digit, with 10 classes in total. We adopted the accuracy of the classification task to evaluate our model, a standard way of evaluating machine learning algorithms: accuracy is defined as the percentage of correct predictions on the test set.

We further evaluated dropping activation outputs on the CIFAR10 dataset, which is more challenging than MNIST and has 50k training images and 10k test images. Each image is a 32 x 32 RGB image, with 10 classes in total (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
A. Non-invertible Activation Functions
Rectifier
Our experimental network has three rectifying hidden layers of 800 neurons each, so the model can be represented as (784-800-800-800-10) from input layer to output layer.

Dropout was applied during training to prevent overfitting. The Dropout probability from the input layer to the first layer is …, and the probabilities for the other layers are …; no weight decay was used. All neural networks were trained using Adam gradient descent [25] with mini-batches of size 500 for 2000 epochs. An accuracy of … was found (under the null hypothesis of the pairwise test with p = 0.…).

To evaluate the rectifier, Equations (23) and (24) are used to reconstruct approximate input data. Fig. 4 shows the original input data and the input data reconstructed from the activation outputs of the first hidden layer using Equation (23); the reconstruction clearly fails.

x̂ ≈ (a − b) ∗ W^{-1}    (23)

x̂ ≈ (a − b) ∗ W^T    (24)
Fig. 4. Reconstructed input data from rectifying outputs using x̂ ≈ (a − b) ∗ W^{-1} (Right); original input data x (Left); activation output a = max(0, x ∗ W + b). The KL divergences between x and x̂ for digits 2 and 6 are 1.91 and 1.88.

The input data can be better reconstructed by Equation (24). As the middle column of Fig. 5 demonstrates, even though the reconstructed data are heavily distorted, the outline of the original input can still be recognized.

Further encryption can be applied by transferring the activation outputs of the second hidden layer instead, i.e. encrypting the data twice. The right-hand side of Fig. 5 shows the corresponding reconstruction by Equation (24), where the outline of the digit can no longer be recognized. However, applying more local rectifying layers leads to higher local computation.
Fig. 5. Reconstructed input data from rectifying outputs by x̂ ≈ (a − b) ∗ W^T: original input data x (Left); reconstruction from the outputs of the first hidden layer (Middle; the KL divergences between x and x̂ (1st layer) are 2.32, 1.49, 1.51 for digits 1, 2, 6); reconstruction from the outputs of the second hidden layer (Right; the KL divergences between x and x̂ (2nd layer) are 2.63, 1.58, 1.56, higher than for x̂ (1st layer)).

Ramp
To let the activation outputs of the first hidden layer contain less scale information, the ramp activation function can be applied. The two test networks are the same as the previous rectifier network, except for the activation function of the first layer, which is ramp with v of 0.2 and 0.05 respectively. The accuracies of these two networks are both …, approximately the same as the previous pure-rectifier neural network.

According to Fig. 6, a smaller v encrypts the input data better, and the topology of the input data cannot be recognized from the reconstructed data.

The above experiments can be summarized as follows:

• With the proposed architecture, no matter what kind of activation function is used, the original input data cannot be reconstructed from the activation outputs without knowing the model parameters in the local device.

• Even knowing the model parameters in the local device, a single rectifying layer is able to encrypt the input data. Better encryption can be achieved by applying more rectifying layers before transferring the activation outputs.

• Further encryption of the data can be achieved with the ramp activation function, as it reduces the scale of the feature patterns.
Fig. 6. Reconstructed input data from ramp activation outputs using x̂ ≈ (a − b) ∗ W^T (Middle and Right) and original input data x (Left). The KL divergence between x and x̂ (v = 0.2), 2.35, 1.54, 1.55 for digits 1, 2, 6, is smaller than that between x and x̂ (v = 0.05), 2.42, 1.63, 1.62.

B. Invertible Activation Functions - Dropping Activation Outputs
For invertible activation functions, the first experiment concerns Dropping activation outputs. We consider a network of 3 hidden layers of 800 units each; the activation function is sigmoid, which is invertible. All hyper-parameters and training methods are the same as for the previous rectifier neural network.

Since a few activation outputs are set to zero, the inverse sigmoid of Equation (7) becomes unsolvable, as division by zero is undefined. An approximate reconstruction can be achieved with Equations (25) and (4), i.e. when a is out of range, the approximate inverse value is set to zero¹ (a sketch of this approximate inverse is given at the end of this subsection).

f̂^{-1}(a) = { 0, when a ≥ 1 or a ≤ 0
              −ln(1/a − 1), otherwise }    (25)

Fig. 7 shows the reconstructed input data under dropping probabilities p of 0.5%, 1% and 2%. The reconstructed input data are clearly distorted, and no further distortion was observed as the dropping probability increased.

Therefore, dropping a few activation outputs is enough to encrypt the input data, and does not lead to noticeable performance decreases, as Table I shows. In addition, adding some noise to a few activation outputs leads to the same result. If we instead choose Equation (2) for reconstruction, the reconstructed input data are as shown in Fig. 8, which is worse than Fig. 7.

¹Besides zero, several other constants (negative and positive) were explored, all giving similar results to those in Fig. 7.

TABLE I
RANDOMLY SETTING ACTIVATION OUTPUTS TO ZERO IN THE FIRST HIDDEN LAYER DURING FEED-FORWARD PROPAGATION OVER … TRIALS

p      Accuracy (%)   Standard Deviation   Max/Min Accuracy (%)
0%     …              NA                   NA
0.5%   …              …                    …
…      …              …                    …

For CIFAR10, we used a convolutional neural network (CNN). The network architecture is: N64F5S1 > Sigmoid > F3S2 > LRN > N64F5S1 > ReLU > LRN > F3S2 > D384 > ReLU > D192 > ReLU > D10 > Soft-max, where N64F5S1 represents a convolutional layer with 64 filters of size 5 and stride 1, F3S2 represents max-pooling with size 3 and stride 2, D10 represents a fully-connected layer with 10 units, and LRN represents local response normalization. The model was trained for 50000 epochs with a learning rate of 0.0001 and Adam gradient descent. We randomly dropped activation outputs in the 1st convolutional layer. The results are shown in Table II.

TABLE II
RANDOMLY SETTING ACTIVATION OUTPUTS TO ZERO IN THE FIRST CONVOLUTIONAL LAYER DURING FEED-FORWARD PROPAGATION OVER … TRIALS

p      Accuracy (%)   Standard Deviation   Max/Min Accuracy (%)
0%     …              NA                   NA
0.1%   …              …                    …
…      …              …                    …

The experiments can be summarized as follows:

• For invertible activation functions, dropping a few activation outputs during feed-forward propagation provides data encryption and privacy.

• To make the inverse function unsolvable, the activation value needs to be set outside its feasible range, for example outside (0, 1) for sigmoid and outside (−1, 1) for hyperbolic tangent.

• Instead of setting activation outputs outside their feasible range, adding small noise values to a few activation outputs has a similar effect.
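A minimal sketch of the approximate inverse of Equation (25) and the corresponding reconstruction attempt (the helper names are illustrative):

import numpy as np

def approx_inv_sigmoid(a):
    # Equation (25): out-of-range values (including dropped zeros) map to 0.
    out = np.zeros_like(a)
    ok = (a > 0.0) & (a < 1.0)
    out[ok] = -np.log(1.0 / a[ok] - 1.0)
    return out

def reconstruct(a_hat, W, b):
    # Equation (4); W.T stands in for W^-1 since W (784x800) is not square.
    return (approx_inv_sigmoid(a_hat) - b) @ W.T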
C. Invertible Activation Functions - Dropping Connections
Randomly setting some weight values to zero indirectly modifies the activation outputs. However, in our experiments on the MNIST dataset, when combining the sigmoid function with dropping connections during the feed-forward process, no encryption effect appears: the reconstructed data are almost the same as the original input data.
D. Autoencoder
A small number of activation outputs reduces both the computation and communication cost of the local device. In that case, an Autoencoder can be applied.

Even for an invertible activation function, when the number of activation outputs is smaller than the number of input values, the original input data cannot be reconstructed exactly from the activation outputs, as a lower-dimensional representation of the original data is used. The input data can only be approximately reconstructed via Equation (7).

Fig. 9 shows input data reconstructed from an Autoencoder with sigmoid activation, where the number of neurons is half the number of input values. In other words, only part of the information of the original input can be reconstructed; note, however, that the shape of the digit image can still be identified clearly. Therefore, an Autoencoder with a non-invertible activation function or with dropping activation outputs is recommended, so as to provide both data encryption and data compression.
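A quick numerical sketch of this dimensionality argument (a linear layer with random weights stands in for the trained sigmoid Autoencoder):

import numpy as np

rng = np.random.default_rng(3)
x = rng.random(784)
W = rng.normal(size=(784, 392))      # undercomplete: 392 outputs for 784 inputs
a = x @ W                            # linear stand-in for the sigmoid layer
x_hat = a @ np.linalg.pinv(W)        # best least-squares reconstruction
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))   # substantial residual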
E. Privacy Preserving
Here we analyze the privacy-preserving property of our method by discussing how it behaves under a brute-force attack and under the honest-but-curious model.

Assume we drop M neurons. As discussed before, the attack cost rises exponentially: the attacker has to pre-compute all possible activation results. Take the sigmoid function as an example. Since the outputs of the neurons are continuous values, assume the attacker pre-defines N possible output values; if the algorithm drops M of the outputs, the number of possible combinations is N^M. In our sigmoid experiment with dropping probability 0.5%, M is 4; if N is defined as 100, the number of possible combinations is 100 million, and after reconstructing 100 million images the attacker still needs an algorithm to recognize the good one. If we used the rectifier, the attacker would not even know where and how many neurons were dropped by our method, so the number of possible combinations becomes significantly larger.

We illustrate this with a brute-force experiment on the MNIST classifier with the sigmoid function. There are 800 outputs and M is 4 (0.5%). We divided [0, 1] into 101 grid values, i.e. 0, 0.01, 0.02, ..., 1. There are therefore 104,060,401 possible inputs, and it takes 5083 hours to compute all possibilities on a Titan X Pascal GPU. Note that the attacker then needs an extra algorithm to select one candidate out of the 104,060,401; since the attacker does not know the content of the data, it is difficult to define the selection criterion.

Fig. 7. Equations (25) and (4) are used to reconstruct input data from the first sigmoid hidden layer after dropping activation outputs. The first column is the original input data x; the dropping probabilities increase from 0.5% to 2% from the second to the fourth column. Note that 0.5% is only 4 neurons in a layer of 800 neurons. The KL divergences between x and x̂ (p = 0.5%) are 2.03, 1.34, 1.35, similar to x̂ (p = 1%) (2.02, 1.35, 1.35) and x̂ (p = 2%) (2.04, 1.35, 1.36).

In the honest-but-curious model, for an invertible activation function, the intact activation outputs can be obtained if the local device feeds the same data into the network multiple times and the dropping positions do not overlap. The probability that the dropping positions of a second pass do not overlap with those of the first is C(N−M, M) / C(N, M), where N is here the number of outputs. As M is much smaller than N, the original input data can be decrypted with high probability if the data are fed into the network twice.
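The counting argument can be reproduced directly (the 5083-hour figure is the measured one reported above; the rest is computed):

from math import comb

N, M = 101, 4              # grid values on [0, 1]; dropped neurons
print(N ** M)              # 104060401 candidate value assignments
print(N ** M / 5083)       # ≈ 20472 reconstructions per hour on the Titan X

# Honest-but-curious: chance that a second query drops a completely
# disjoint set of M positions among n = 800 outputs.
n = 800
print(comb(n - M, M) / comb(n, M))   # ≈ 0.98, so two queries usually suffice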
Fig. 8. Reconstructed input data from the first sigmoid hidden layer after dropping activation outputs, using Equations (25) and (2). Original input data x (first column); from the second to the third column, the dropping probabilities increase from 0.25% to 0.5%. The KL divergences between x and x̂ (p = 0.25%) are 2.34, 1.51, 1.45, and for x̂ (p = 0.5%) are 2.49, 1.52, 1.46, higher than in Fig. 7.

Fig. 9. Reconstructed input data from sigmoid outputs using Equation (7) (Right); original input data x (Left). The number of neurons is half the number of input values.

As a result, in practice, to prevent this situation we drop the same positions for the same data: we use the maximum value in the data as the random seed for selecting the dropping positions. We note that for non-invertible activation functions, since activation outputs are dropped inherently, the honest-but-curious model cannot work even without using dropping activation outputs.
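A minimal sketch of this countermeasure (our method fixes only the choice of seed, the maximum value of the data; the remaining details below are illustrative):

import numpy as np

def seeded_drop_mask(x, n_outputs, p=0.005):
    # Derive the dropping positions from the data itself, so identical
    # inputs always drop the same neurons (defeating repeated queries).
    seed = abs(int(float(x.max()) * 1e9))    # data-dependent seed
    rng = np.random.default_rng(seed)
    n_drop = max(1, round(p * n_outputs))
    mask = np.ones(n_outputs)
    mask[rng.choice(n_outputs, size=n_drop, replace=False)] = 0.0
    return mask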
F. Speed

To verify the impact of implementing the sigmoid function on users' local devices, we ran a system test to measure it. We used the Metal framework (https://developer.apple.com/metal/) to implement the sigmoid function on the GPU (called from Swift):

kernel void sigmoid(const device float *inVector [[ buffer(0) ]],
                    device float *outVector      [[ buffer(1) ]],
                    uint id [[ thread_position_in_grid ]])
{
    outVector[id] = 1.0 / (1.0 + exp(-inVector[id]));
}

The device we used is an iPhone 6 with iOS 10.
NSDate() was used to measure the time lapse of computing the first-layer activations. The input is a vector x with |x| = 40000 (200x200-pixel grayscale images), and we applied Sigmoid(x ∗ W + b) with W of size 40000x1000 and b of size 1000, i.e. 1000 neurons. Below is a sample output of the measured time lapses; the average induced latency is calculated from such runs.

D/Sigmoid(601): onActivation: begin
D/Sigmoid(601): onActivation: 1311 ms
D/Sigmoid(601): onActivation: end, 1311 ms
V. CONCLUSION
In this paper, we proposed a new architecture for deep networks in which the first layer of the network is localized. We investigated how this architecture supports better privacy protection in model prediction. Invertible activation functions combined with dropping activation outputs during feed-forward propagation are proved able to encrypt the original input data and preserve privacy. The whole encryption process is improved by combining feed-forward propagation and data encryption into one process, which means no specialized data-encryption process is needed on the local device, nor a data-decryption process on the server.

Both invertible and non-invertible activation functions have been discussed and mathematically proved capable of encryption. For error-back propagation, splitting the neural network between the local device and the server can likewise provide data privacy during training. In other words, the server is able to provide a model-learning service via error-back propagation without accessing the original input data on the local device.

ACKNOWLEDGMENT