FFConv: Fast Factorized Neural Network Inference on Encrypted Data
Yuxiao Lu, Jie Lin, Chao Jin, Zhe Wang, Khin Mi Mi Aung, Xiaoli Li

Institute for Infocomm Research (I2R), A*STAR, Singapore. This work was done during Yuxiao Lu's internship at I2R and is supported by A*STAR under its AME Programmatic Funds (Project No. A19E3b0099). Correspondence to: Jie Lin <[email protected]>.

Abstract
Homomorphic Encryption (HE), which allows computations on encrypted data (ciphertext) without decrypting it first, enables secure but prohibitively slow neural network inference for privacy-preserving applications in clouds. To reduce the latency of HE-enabled Neural Network (HENN) inference, one approach is to pack multiple messages into a single ciphertext, in order to reduce the number of ciphertexts and support massive parallelism of Homomorphic Multiply-Add (HMA) operations between ciphertexts. However, different ciphertext packing schemes have to be designed for different convolution layers, and each of them introduces overheads that are far more expensive than HMA operations. In this paper, we propose a low-rank factorization method called FFConv to unify convolution and ciphertext packing. To our knowledge, FFConv is the first work that is capable of accelerating the overheads induced by different ciphertext packing schemes simultaneously, without incurring a significant increase in noise budget. Compared to prior art LoLa and Falcon, our method reduces the inference latency by up to 87% and 12%, respectively, with comparable accuracy on MNIST and CIFAR-10.
1. Introduction
Homomorphic Encryption (HE) (Gentry, 2009; Brakerski and Vaikuntanathan, 2011; Fan and Vercauteren, 2012) is one of the promising cryptographic systems that enable secure neural network inference for privacy-preserving applications in clouds, at the cost of high inference latency. At the client, the plaintext data is encrypted in the form of ciphertext and then transmitted to the cloud server. At the cloud, neural network inference is evaluated homomorphically on the ciphertexts to generate an encrypted prediction. The encrypted prediction is returned to the client for decryption. Since the cloud cannot decrypt the data, data privacy is protected. Despite the high level of security, HE-enabled Neural Network (HENN) inference is prohibitively slow, mainly due to the large number of ciphertexts and the expensive Homomorphic Multiply-Add (HMA) operations on them. For instance, the inference latency of a shallow network (1 convolution layer and 2 fully connected layers) on one encrypted 28x28 MNIST image (LeCun et al., 1998) is more than 200 seconds on multi-core CPUs (Dowlin et al., 2016).

Modern HE cryptographic systems (Fan and Vercauteren, 2012; Brakerski et al., 2014; Cheon et al., 2017; Brutzkus et al., 2019) use packed encryption to accelerate HENN inference, in which the ciphertext structure is configured as a vector of slots and each slot encrypts a different message. Packing can therefore significantly reduce the number of ciphertexts required to encrypt a given amount of data. In addition, packing enables massively parallel execution of HMA operations between ciphertexts, following the Single-Instruction Multiple-Data (SIMD) execution model (Smart and Vercauteren, 2014). Multiplying (resp. adding) two packed ciphertexts is equivalent to concurrent slot-wise multiplications (resp. additions) of the underlying vectors of the two ciphertexts. In particular, LoLa (Brutzkus et al., 2019) introduced several ciphertext packing schemes, which reduced the inference latency to around 2 seconds for the shallow network on MNIST (Dowlin et al., 2016).

Despite the faster HENN inference, ciphertext packing schemes introduce expensive operations by themselves, which prolong HENN inference latency as networks grow wider and deeper for better performance on larger problems. As summarized in LoLa (Brutzkus et al., 2019), there are two major packed representations, named Dense Packing (DensePack) and Convolution Packing (ConvPack), in which DensePack requires a Rotation operation to align the slots inside ciphertexts before each addition and ConvPack requires
Im2Col (a combination of HMA and Rotation operations) to reorganize ciphertexts for the transition between layers. Compared to HMA operations, rotations are 10x more expensive (Lou et al., 2020). For instance, LoLa (Brutzkus et al., 2019) reported that a 3-layer network inference (2 convolution layers and 1 fully connected layer) on one encrypted 32x32x3 CIFAR-10 image (Krizhevsky, 2009) takes over 700 seconds, in which the rotations account for over 90% of the time.

In this paper, we introduce a low-rank matrix factorization method called FFConv to unify convolution/fully-connected layers and ciphertext packing, which enables significantly reduced inference latency. We summarize our contributions as follows:

- We introduce a low-rank factorization framework designed for accelerating neural network inference on encrypted data, without imposing any constraints on the configurations of convolution layers.
- We propose Factorized Packing (FactPack) to seamlessly integrate the factorized convolutions with different ciphertext packing schemes, in order to maximize the reduction of overheads introduced by packing.
- To our knowledge, FFConv is the first work that is capable of reducing the overheads induced by different ciphertext packing schemes simultaneously.
- Compared to prior art LoLa and Falcon, our method reduces the inference latency by up to 87% and 12%, respectively, with comparable accuracy on MNIST and CIFAR-10. Moreover, our method incurs significantly less noise budget than the most recent work Falcon.
Remarks. Since HE can only support multiplication and addition operations, non-linear layers in modern neural networks, such as the Rectified Linear Unit (ReLU) (Glorot et al., 2011), are not supported by HE. To address this issue, CryptoNets (Dowlin et al., 2016) proposed to approximate ReLU with a simple Square function. There are also secure neural network inference solutions that combine HE with secure multi-party computation (MPC) techniques (Liu et al., 2017; Juvekar et al., 2018). These solutions usually use MPC to evaluate the non-linear activation function, thus eliminating the need for a polynomial approximation of the activation function. However, they also incur high communication cost among multiple parties and require each party to have considerable computation power and be constantly online, rendering them less attractive than non-interactive HE solutions in many use scenarios. In this paper, we focus on secure neural network inference enabled by HE only.
2. Homomorphic Encryption (HE)
HE (Gentry, 2009) has always been an intriguing technology due to its ability to compute on encrypted data in the absence of the decryption key. In HE, the plaintexts and ciphertexts are elements in polynomial rings. HE provides the user with two main computational operations on ciphertexts: homomorphic multiplication and homomorphic addition. These operations manipulate ciphertexts and produce encrypted results that are equivalent to the corresponding plaintext results after decryption.

HE ciphertexts conceal plaintext messages with noise that can be identified and removed with the secret key (Brakerski and Vaikuntanathan, 2011). The noise magnitude accumulates inside a ciphertext along with the computation performed on it. As long as the noise is below a certain threshold that is controlled by the encryption parameters, decryption can filter out the noise and retrieve the plaintext message successfully; otherwise, the plaintext message could be corrupted and decryption could fail. Although HE schemes include a primitive (known as bootstrapping) to reduce the noise inside ciphertexts (Gentry, 2009), it is extremely computationally intensive. A more practical way is to carefully select the encryption parameters to provide just enough noise budget for a ciphertext to accommodate a predefined maximum depth of computation under a specific evaluation circuit.

In this work, following LoLa (Brutzkus et al., 2019) and Falcon (Lou et al., 2020), we employ the Brakerski-Fan-Vercauteren (BFV) HE scheme (Fan and Vercauteren, 2012). The BFV scheme is governed by three important parameters: t, Q, and N. First, the plaintext space is controlled by the plaintext coefficient modulus t. To prevent computational overflows, t needs to be large enough to accommodate any intermediate result of the homomorphic evaluations. Second, the scheme imposes a limit on the number of homomorphic operations that can be performed on a ciphertext before decryption fails. We refer to this limit as the computation noise budget, which is controlled by the ciphertext coefficient modulus Q. Third, the underlying ring dimension N is set to guarantee the targeted security level λ. For typical security requirements in practical applications, λ is set to a minimum of 128 bits. We remark that the choice of N and Q significantly impacts the performance of HE schemes in terms of computational and memory requirements. They also affect the data expansion rate due to encryption. More specifically, the ciphertext size can be estimated to be at least 2 * N * log2(Q) bits.
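As a rough illustration of how these parameters drive memory cost, the following sketch (ours, with example parameter values rather than the settings used later in Section 5) estimates the size of a single BFV ciphertext and the resulting data expansion rate.

```python
# Illustrative sketch (not from the paper): estimating the size of a single BFV
# ciphertext and the data-expansion rate from the parameters N and Q discussed above.
# The concrete parameter values below are hypothetical examples.

def ciphertext_size_bits(N: int, log2_Q: int) -> int:
    """A BFV ciphertext holds (at least) two degree-N polynomials with
    coefficients modulo Q, so its size is at least 2 * N * log2(Q) bits."""
    return 2 * N * log2_Q

N = 8192             # ring dimension (example value)
log2_Q = 218         # bit-length of the ciphertext coefficient modulus (example value)
plain_bits = N * 16  # assume each of the N slots packs a 16-bit message (assumption)

ct_bits = ciphertext_size_bits(N, log2_Q)
print(f"ciphertext size ~ {ct_bits / 8 / 1024:.1f} KiB")
print(f"expansion rate  ~ {ct_bits / plain_bits:.1f}x")
```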
3. HE-enabled Neural Network Inference
Applying HE to convolutional neural networks for private inference poses unique challenges. Typical CNNs are composed of linear and non-linear function blocks. Linear function blocks, such as Convolution, Fully-Connected, and Average Pooling layers, can be converted into matrix operations with simple additions and multiplications. On the contrary, non-linear function blocks usually contain complex or comparison operations, such as the ReLU layer (Glorot et al., 2011), which cannot be supported by HE directly. To enable compatibility with HE primitives, these non-linear operations need to be approximated by polynomial functions with only additions and multiplications. For instance, CryptoNets (Dowlin et al., 2016) suggested using the Square function to approximate ReLU in the network for classifying MNIST images.

Figure 1. Low-rank Factorization for a Convolution layer.

In a typical Machine Learning as a Service (MLaaS) scenario, network models are deployed on cloud servers to provide inference services to client users. We assume the models are kept in plaintext, while the client users encrypt their data into HE ciphertexts before sending them to the cloud server for private inference. Next, we introduce how to compute convolution layers and other linear layers in HENN with plaintext weights and ciphertext inputs. Note that other linear layers, such as fully-connected and average-pooling layers, can be treated as special cases of convolution layers: for fully-connected layers, the filter sizes are the same as the input tensor sizes, and for average-pooling layers, the weights inside a single filter are set to the same constant value.
A Convolution layer is essentially a set of dot products between the filter weights and local patches cropped from the input. Assume the weights of a regular Convolution layer form a 4D tensor Y in R^{d x d x I_c x O_c}, with kernel size d, I_c input channels and O_c output channels; the input and output of the Convolution layer are 3D tensors X in R^{I_w x I_h x I_c} and Z in R^{O_w x O_h x O_c}, where I_w, I_h / O_w, O_h are the input/output width and height, respectively. Similar to the Fully Connected layer, the convolution operation Z = X * Y of a Convolution layer can be formulated as a matrix multiplication:

Z_hat = I x W,    (1)

where I in R^{S x K} is a matrix with S = O_w O_h rows and K = d^2 I_c columns; each row of I is a vector stretched out from a 3D patch of size d x d x I_c cropped from the input X at one spatial location of the filters. W in R^{K x O_c} is the weight matrix; each column of W is a filter with K parameters, as illustrated in Fig. 1 (top).

There are several ways to map the convolution-layer computation onto HE ciphertexts. A straightforward way is to encrypt each input value as a separate ciphertext, and the computation process would then be the same as with plaintext convolution. However, this requires too many ciphertexts to be created and fails to utilize the parallel computation provided by SIMD packing to accelerate the convolution. On the contrary, a convolution layer with packed ciphertexts has the dual benefits of fewer ciphertexts and parallelized computation. There are mainly two ways to pack the input values into ciphertexts, namely DensePack and ConvPack, which allow the convolution layer to be computed in two different manners.

Table 1. Complexity of DensePack. Besides regular HE operations (MulPC: plaintext-ciphertext multiplication, AddCC: ciphertext-ciphertext addition), DensePack introduces Rotation operations. O = O_w O_h O_c, O' = O_w O_h O'_c.

Scheme | MulPC | AddCC | Rot
LoLa | O | O log N | O log N
Falcon | ~O/p | ~O log N / p | ~O log N / p
Ours | O' | O' log N | O' log N

Table 2. Complexity of ConvPack. Falcon (Lou et al., 2020) is not applicable to ConvPack. The Im2Col operation induced by ConvPack is discussed in Table 3. K = d^2 I_c.

Scheme | MulPC | AddCC
LoLa | O_c K | O_c (K - 1)
Falcon | O_c K | O_c (K - 1)
Ours | O'_c (K + O_c) | O'_c (K + O_c) - K - O'_c
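As a sanity check on Eq. (1), the following NumPy sketch (our illustration, not code from the paper) builds the im2col matrix I for a small random input and verifies that I x W reproduces a direct convolution; the helper `im2col` and all tensor sizes are our own choices.

```python
# A minimal NumPy sketch verifying Eq. (1): a convolution equals the matrix product
# of the im2col matrix I (S x K) with the reshaped filter bank W (K x O_c),
# where S = O_w*O_h and K = d*d*I_c.
import numpy as np

def im2col(x, d):
    """x: (I_h, I_w, I_c); returns I of shape (O_h*O_w, d*d*I_c), stride 1, no padding."""
    I_h, I_w, I_c = x.shape
    O_h, O_w = I_h - d + 1, I_w - d + 1
    cols = [x[i:i + d, j:j + d, :].reshape(-1)
            for i in range(O_h) for j in range(O_w)]
    return np.stack(cols)                          # (S, K)

rng = np.random.default_rng(0)
d, I_c, O_c = 3, 4, 8
x = rng.standard_normal((10, 10, I_c))
Y = rng.standard_normal((d, d, I_c, O_c))          # filter bank

I_mat = im2col(x, d)                               # (S, K)
W = Y.reshape(-1, O_c)                             # (K, O_c)
Z_hat = I_mat @ W                                  # Eq. (1): (S, O_c)

# Reference: direct convolution (valid mode, stride 1)
O_h = O_w = 10 - d + 1
Z_ref = np.zeros((O_h, O_w, O_c))
for i in range(O_h):
    for j in range(O_w):
        Z_ref[i, j] = np.tensordot(x[i:i+d, j:j+d, :], Y, axes=([0, 1, 2], [0, 1, 2]))
assert np.allclose(Z_hat.reshape(O_h, O_w, O_c), Z_ref)
```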
Dense Packing (DensePack). In the DensePack style, the input tensor of the convolution layer is flattened into a one-dimensional vector along the width, height, and channel dimensions, and then packed sequentially into a ciphertext, as shown in Fig. 2 (a). For one step of the convolution between one filter and the input ciphertext, the filter is first extended to the same size as the input tensor by padding zeros and flattened into a plaintext vector, followed by a slot-wise multiplication with the input ciphertext; then all slots in the resulting ciphertext are summed up to produce the convolution (dot-product) result. As shown in Fig. 2 (b), the entire convolution layer is computed by permuting all the filters and the shifted locations of each filter, each doing one convolution step with the input ciphertext, and arranging all the convolution results into the final output ciphertext.
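To make the DensePack computation concrete, here is a small plaintext NumPy simulation (ours, not the paper's implementation) of one convolution step between one filter and the packed input; on real ciphertexts the slot-wise product is a MulPC and the final sum over slots is realized with rotations and additions.

```python
# Plaintext simulation of the DensePack step for one filter at one spatial location:
# the packed input is a single flattened vector, the filter is zero-extended to the
# same length, and the dot product is a slot-wise multiplication followed by a sum
# over all slots.
import numpy as np

rng = np.random.default_rng(1)
I_h, I_w, I_c, d = 6, 6, 3, 3
x = rng.standard_normal((I_h, I_w, I_c))
w = rng.standard_normal((d, d, I_c))        # one filter

packed_input = x.reshape(-1)                # DensePack: flatten along h, w, c

# Zero-extended plaintext filter aligned with the packed layout for output position (i0, j0)
i0, j0 = 2, 1
extended = np.zeros_like(x)
extended[i0:i0 + d, j0:j0 + d, :] = w
packed_filter = extended.reshape(-1)

products = packed_input * packed_filter     # slot-wise MulPC
dot = products.sum()                        # sum over all slots (rotations + AddCC on ciphertexts)

ref = np.sum(x[i0:i0 + d, j0:j0 + d, :] * w)
assert np.isclose(dot, ref)
```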
Figure 2. (a) DensePack for 1 filter at 1 spatial location. (b) DensePack for 1 Convolution. (c) ConvPack for 1 Convolution.
Table 3. Complexity analysis of the Im2Col operations between two layers (transition of ciphertexts), in terms of MulPC, AddCC, and Rotation, for different combinations of DensePack (DP) and ConvPack (CP) and different kernel sizes. CP-CP: a regular Convolution layer with ConvPack followed by another regular Convolution layer with ConvPack; DP-CP and CP-DP are defined similarly. DP-DP is not included here since Im2Col operations are induced by CP. I = I_w I_h I_c, K = d^2 I_c, O = O_w O_h O_c. The column "Kernel" denotes the kernel size of the 2nd layer, with default kernel stride 1. In all combinations, the transition overhead is much smaller when the 2nd layer uses a 1 x 1 kernel (d = 1) than when d > 1.

Convolution Packing (ConvPack). In the ConvPack style, as shown in Fig. 2 (c), the input tensor of the convolution layer is packed into K ciphertexts, where K is the number of weights in a single filter. Ciphertext k is packed with the input values that will be multiplied by the k-th filter weight at all shifted locations. Subsequently, the K weights of a filter are multiplied with the K ciphertexts separately, and the resulting ciphertexts are added up into one ciphertext, which is exactly the convolution result between the filter and the input. The same process is applied to all O_c filters in the convolution layer, yielding O_c ciphertexts, each encrypting a separate channel of the output tensor.

Ciphertext Packing between Layers. A typical HENN is stacked with multiple layers, and it is essential to support a smooth transition of ciphertext packing between layers, i.e., to formulate the packing of the input ciphertexts of a certain layer from the output ciphertexts of its preceding layer. The transition (i.e., Im2Col in Fig. 5 (a)) overheads in terms of MulPC, AddCC, and Rotation operations for different combinations of ciphertext packing schemes between two consecutive layers (the 1st and the 2nd layer, respectively) are shown in Table 3. Generally, the overheads are related to the tensor sizes, the kernel sizes, and the number of kernels; smaller kernel sizes usually result in smaller overheads. Notably, when a kernel of size d = 1 is used, there is much less additional transition overhead between two ConvPack layers or between DensePack and ConvPack layers. As will be illustrated in the next section, our FFConv design takes advantage of this property to optimize the computation of convolution layers.
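Analogously, the following plaintext NumPy sketch (again our illustration, with arbitrary tensor sizes) mimics ConvPack: the input is split into K packed vectors, and each filter is applied with K scalar multiplications and K - 1 additions, producing one packed output channel without any rotation.

```python
# Plaintext simulation of ConvPack: the input is packed into K = d*d*I_c vectors,
# where vector k holds, for every output position, the input value that is multiplied
# by filter weight k. Each filter then needs K plaintext-ciphertext multiplications
# and K-1 additions, yielding one packed vector with one full output channel.
import numpy as np

def im2col(x, d):
    I_h, I_w, I_c = x.shape
    O_h, O_w = I_h - d + 1, I_w - d + 1
    cols = [x[i:i + d, j:j + d, :].reshape(-1)
            for i in range(O_h) for j in range(O_w)]
    return np.stack(cols)                       # (S, K)

rng = np.random.default_rng(2)
d, I_c, O_c = 3, 3, 4
x = rng.standard_normal((8, 8, I_c))
Y = rng.standard_normal((d, d, I_c, O_c))

I_mat = im2col(x, d)                            # (S, K)
K = I_mat.shape[1]
conv_packed = [I_mat[:, k] for k in range(K)]   # K packed "ciphertexts", one per filter weight
W = Y.reshape(K, O_c)

# One output ciphertext per filter: sum_k  w[k, o] * ciphertext_k
out = [sum(W[k, o] * conv_packed[k] for k in range(K)) for o in range(O_c)]

assert np.allclose(np.stack(out, axis=1), I_mat @ W)
```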
4. FFConv
In Section 4.1, we introduce the low-rank factorization for convolution, which enables fast inference on data encrypted as packed ciphertexts. Section 4.2 describes how the low-rank factorized convolutions can be seamlessly integrated with various ciphertext packing representations, in order to largely reduce the overheads brought to a regular convolution by a single packing scheme (DensePack or ConvPack) itself, which in turn further accelerates inference on encrypted data. In Section 4.3, we analyze the advantages of our method over the state-of-the-art. Here we only consider the linear layers, as the non-linear layers are computed through element/slot-wise operations, which are generally not affected by the packing schemes.
Figure 3. Example of FactPack: Factorized DensePack-ConvPack to accelerate a regular convolution layer with DensePack. For brevity, we omit the plaintext weights in this figure.
Figure 4. Example of FactPack: Factorized ConvPack-ConvPack to accelerate a regular convolution layer with ConvPack. For brevity, we omit the plaintext weights in this figure.
Figure 5. (a) Im2Col operation for the transition of ciphertext packing between layers. (b) Grouping operation to split the ciphertexts. Grouping, which is nearly free, is dedicated to 1 x 1 convolutions.

4.1. Low-rank Factorization

Low-rank matrix factorization is a popular technique to reduce the number of multiply-add operations and parameters of convolution layers (Lebedev et al., 2015; Kim et al., 2016). It factorizes the learned weight matrix W in R^{K x O_c} into a product of low-dimensional matrices W1 in R^{K x O'_c} and W2 in R^{O'_c x O_c}:

min_{W1, W2} || W - W1 W2 ||_F   s.t.   rank(W1 W2) <= O'_c < O_c,    (2)

where ||.||_F is the Frobenius norm. W1 W2 is a low-rank approximation of W with rank O'_c smaller than O_c. Based on the Eckart-Young-Mirsky theorem (Eckart and Young, 1936), the low-rank matrices W1 and W2 can be solved analytically by the truncated Singular Value Decomposition (SVD). As a result, the number of multiplication operations is reduced from O_c K S to O'_c K S, and the parameter size from O_c K to O'_c (K + O_c).

As illustrated in Fig. 1 (bottom), the matrix multiplication for a regular convolution layer, I x W, is transformed into I x W1 x W2, which is essentially equivalent to two small convolution layers: the first with O'_c filters of size d x d, W1 in R^{d x d x I_c x O'_c}, followed by a second with O_c filters of size 1 x 1, W2 in R^{1 x 1 x O'_c x O_c}.

With a pre-trained network, we apply low-rank factorization to decompose each regular convolution layer (with kernel size larger than 1 x 1) into two small convolutions with rank O'_c < O_c. It is worth noting that accuracy may drop with a smaller O'_c. To restore accuracy, one can re-train the factorized network, with the weights of the two factorized convolutions initialized by the truncated SVD.
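The truncated-SVD solution of Eq. (2) can be summarized in a few lines of NumPy; the sketch below is our illustration (the rank r and the tensor sizes are arbitrary), showing how the factors W1 and W2 are obtained and folded back into a d x d and a 1 x 1 convolution.

```python
# Factorize the reshaped convolution weights W (K x O_c) into W1 (K x O'_c) and
# W2 (O'_c x O_c) with a truncated SVD, then fold W1 back into a d x d convolution
# with O'_c filters and W2 into a 1 x 1 convolution with O_c filters.
import numpy as np

def factorize_conv(Y, r):
    """Y: (d, d, I_c, O_c) conv weights; returns (Y1, Y2) with
    Y1: (d, d, I_c, r) and Y2: (1, 1, r, O_c) such that Y ~ Y1 * Y2."""
    d, _, I_c, O_c = Y.shape
    W = Y.reshape(d * d * I_c, O_c)                  # K x O_c
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = U[:, :r] * S[:r]                            # K x r (best rank-r factor, Eckart-Young)
    W2 = Vt[:r, :]                                   # r x O_c
    return W1.reshape(d, d, I_c, r), W2.reshape(1, 1, r, O_c)

rng = np.random.default_rng(3)
d, I_c, O_c, r = 3, 16, 64, 16
Y = rng.standard_normal((d, d, I_c, O_c))
Y1, Y2 = factorize_conv(Y, r)

approx = Y1.reshape(-1, r) @ Y2.reshape(r, O_c)
err = np.linalg.norm(Y.reshape(-1, O_c) - approx) / np.linalg.norm(Y)
print(f"relative Frobenius error at rank {r}: {err:.3f}")
# In FFConv, Y1 and Y2 would initialize the two factorized layers before re-training.
```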
4.1.1. Discussions
We briefly discuss how low-rank factorization relates to efficient network architectures and filter pruning. One can refer to the supplementary material for more details on experiments and analysis.
Relationship to Efficient Network Architectures. The factorized d x d convolution W1 with a small number of filters followed by the 1 x 1 convolution W2 with a large number of filters is in spirit similar to the manually designed Bottleneck module in modern networks such as ResNet (He et al., 2016), which first reduces the number of filters with a 1 x 1 convolution, applies a spatial (e.g., 3 x 3) convolution, and finally increases the number of filters with another 1 x 1 convolution. In principle, one could directly train W1 and W2 from scratch, without the need for low-rank factorization. However, we found that training from scratch with randomly initialized weights is inferior to low-rank factorization, which initializes the weights with the truncated SVD. This is because HE-enabled neural networks usually use the Square function to replace the non-linear ReLU (Dowlin et al., 2016). Unlike ReLU, training with the Square function may cause instability and convergence to a poor local minimum, since the activations explode more easily during training. Low-rank factorization can alleviate this problem through proper weight initialization.

Relationship to Filter Pruning. Another straightforward idea to reduce the computations of a convolution layer is filter pruning (Li et al., 2017), in which redundant filters are identified and pruned. In this sense, filter pruning only needs
to maintain the first small convolution W1, while low-rank factorization keeps two factorized convolutions. Nevertheless, low-rank factorization is superior to filter pruning. First, we observed that low-rank factorization achieves significantly faster inference on encrypted data than filter pruning, with comparable accuracy. Second, filter pruning cannot reduce the Im2Col overheads (transition of ciphertext packing between layers) induced by ConvPack, while the factorized convolutions can.

4.2. Factorized Packing (FactPack)

Low-rank factorization decomposes a regular convolution layer with O_c filters into a small convolution W1 with the same kernel size and O'_c filters (O'_c < O_c), followed by another 1 x 1 convolution W2 with O'_c input channels and O_c output channels. Correspondingly, we introduce the design principle Factorized Packing (FactPack) to pack W1 and W2 efficiently onto ciphertexts.

Packing W2. If we only consider the complexity of ConvPack itself and ignore the Im2Col operations (transition cost of ciphertexts between layers) induced by ConvPack, ConvPack is much more efficient than DensePack for ciphertext packing, since DensePack introduces a large number of expensive rotation operations (Table 1) while ConvPack does not (Table 2). On the other hand, as shown in Table 3, the transition cost of ciphertext packing between two layers is significantly reduced when the convolution kernel size equals 1, regardless of the combination of ConvPack and DensePack used. Therefore, the second 1 x 1 convolution W2 factorized by our method perfectly matches ConvPack. Next, we introduce how the ConvPack for W2 can be seamlessly integrated with either DensePack or ConvPack for the first small convolution W1, which in turn incurs little or nearly zero transition overhead between layers (Fig. 3 and Fig. 4).

Packing W1. W1 can be packed by either DensePack or ConvPack, depending on its configuration.

- DensePack for W1. As mentioned in Section 3, DensePack in LoLa introduces O log N rotations into the computation of each convolution layer, which is the main reason for the extremely high inference latency. Since the inference latency induced by DensePack grows linearly with O_c, reducing O_c greatly reduces the latency. As shown in Table 1, Falcon reduced the number of rotations by exploiting the property of the discrete Fourier transform (DFT) with block circulant matrices, which completely changes the computation process of the DensePack convolution. Falcon reduces the number of rotations by a factor of p (the size of each block circulant matrix), where p has to be a power of 2. Moreover, the multiplicative depth of each convolution layer in Falcon is increased from 1 to 3, which limits the network depth that can be supported. As described in Section 4.1, another idea is to reduce O_c via low-rank factorization. If we reduce O_c to O'_c for the first factorized convolution W1 with O'_c/O_c <= 1/p, we achieve fewer rotations and faster inference than Falcon (see Table 7). A numeric illustration of this rotation-count comparison is given after Table 5 below. Moreover, our method can flexibly adjust the reduction of inference latency according to the required model accuracy. Figure 3 illustrates an example of Factorized DensePack-ConvPack for W1 and W2.

- ConvPack for W1. Since there is no rotation operation in ConvPack, Falcon cannot be applied to reduce the inference latency of a convolution with ConvPack. In contrast, our factorized convolution W1 with ConvPack is accelerated because it reduces the number of filters from O_c to O'_c. As a result, the number of MulPC and AddCC operations required is reduced (Table 2). More importantly, since W2 is a 1 x 1 convolution, the transition from W1 to W2 only involves the nearly free Grouping operation (Fig. 5 (b)). Thus, the expensive Im2Col operations between layers are avoided, as shown in Table 3 ("CP-CP" with d = 1). Figure 4 illustrates an example of Factorized ConvPack-ConvPack for W1 and W2.

Table 4. Comparisons of non-interactive HENNs.

Features | CryptoNets | LoLa | Falcon | Ours
Faster DensePack | × | × | √ | √
Faster ConvPack | × | × | × | √

Table 5. MNIST results.

Method | Latency (s) | Accuracy (%)
CryptoNets | 205 | 98.95
nGraph-HE | 135 | 98.95
FCryptoNets | 39.1 | 98.71
LoLa | 2.1 | 98.95
Falcon | 1.2 | 98.95
LoLa-TinyNet | 0.45 | 98.23
FFConv-TinyNet (Ours) | 0.37 | 98.40
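The following back-of-the-envelope calculation (ours, with purely illustrative values rather than the measured counts reported in Section 5) applies the rotation counts of Table 1 to show why choosing O'_c/O_c <= 1/p makes FFConv's DensePack rotation count no larger than Falcon's.

```python
# Illustrative arithmetic only (example values): comparing the asymptotic rotation
# counts of Table 1 for a DensePack convolution. LoLa needs about O*log2(N) rotations,
# Falcon divides this by the circulant block size p, and FFConv replaces
# O = O_w*O_h*O_c by O' = O_w*O_h*O'_c via factorization.
import math

N = 16384                      # ring dimension (example)
O_w, O_h, O_c = 8, 8, 128      # output tensor shape (example)
p = 4                          # Falcon block-circulant size (must be a power of 2)
O_c_prime = 16                 # factorization rank, so O'_c/O_c <= 1/p here

O = O_w * O_h * O_c
O_prime = O_w * O_h * O_c_prime
logN = math.log2(N)

print(f"LoLa   : ~{O * logN:,.0f} rotations")
print(f"Falcon : ~{O * logN / p:,.0f} rotations")
print(f"FFConv : ~{O_prime * logN:,.0f} rotations")
# With O'_c/O_c <= 1/p, FFConv's count is no larger than Falcon's, without the
# extra multiplicative depth that Falcon's DFT/IDFT incurs.
```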
4.3. Comparison with the State-of-the-art

Table 4 summarizes the comparison of our FFConv with state-of-the-art non-interactive HENNs.

Faster DensePack/ConvPack. CryptoNets (Dowlin et al., 2016) was the first work on secure neural network inference on encrypted data. It packed each pixel from a batch of images into a ciphertext, resulting in a huge number of HMA operations for a prediction at batch size 1. Although LoLa (Brutzkus et al., 2019) proposed DensePack and ConvPack to accelerate inference by reducing the number of ciphertexts, these packing strategies introduce expensive rotations or Im2Col operations that prolong the inference latency of deeper and wider networks. The most recent work, Falcon (Lou et al., 2020), proposed a frequency-domain neural network to reduce the number of rotations in DensePack; however, it cannot be applied to ConvPack. Our FFConv is the first work that is able to reduce the overheads induced by DensePack and ConvPack simultaneously.
Table 6. TinyNet operations on MNIST for our FFConv and LoLa (Brutzkus et al., 2019).

LoLa-TinyNet
Layer | Input size | Representation | LoLa HE Operation | Time (s)
convolution | 64 x 144 | convolution | convolution vector - row major multiplication | 0.156
| 54 x 144 | dense | combine to one vector using 53 rotations and additions | 0.125
square | 1 x 8064 | dense | square | 0.031
fc | 1 x 8064 | dense | dense vector - row major multiplication | 0.140
output | 1 x 10 | dense | |

FFConv-TinyNet
Layer | Input size | Representation | FFConv HE Operation | Time (s)
convolution | 64 x 144 | convolution | convolution vector - row major multiplication | 0.031
| 13 x 144 | convolution | convolution vector - row major multiplication | 0.046
| 54 x 144 | dense | combine to one vector using 53 rotations and additions | 0.125
square | 1 x 8064 | dense | square | 0.031
fc | 1 x 8064 | dense | dense vector - row major multiplication | 0.140
output | 1 x 10 | dense | |
Table 7. CIFAR-10 results with WideNet.
Noise Budget. Our FFConv requires a significantly smaller noise budget than Falcon, since our increase in multiplicative depth is smaller. The multiplicative depth of a regular convolution in FFConv increases from 1 to 2 after matrix factorization, while Falcon increases the depth from 1 to 3 due to the DFT and inverse DFT operations attached to each convolution. A larger multiplicative depth requires a larger noise budget. As shown in Section 5, the noise budget of our FFConv for WideNet on CIFAR-10 is 380 bits, versus 430 bits for Falcon.
5. Experiments
CNN Architecture
We evaluate our method on the MNIST and CIFAR-10 datasets. MNIST contains 28 x 28 grayscale images divided into two sets of 60,000 training and 10,000 test samples. For MNIST, we designed a small neural network, TinyNet, which contains only an 8 x 8 convolution layer with a stride of (2, 2) and 56 output channels, followed by a fully connected layer. After 100 epochs of training, the model accuracy reaches 98.23%. TinyNet is designed to evaluate the effectiveness of our Factorized ConvPack-ConvPack. Specifically, FFConv-TinyNet factorizes the first convolution layer of TinyNet into an 8 x 8 convolution with a reduced number of output channels followed by a 1 x 1 convolution with 56 output channels, and reaches 98.40% accuracy with SVD-based initialization after 100 epochs of training. CIFAR-10 is an image classification dataset that contains 60,000 colored 32 x 32 images with 10 object classes. For CIFAR-10, we evaluate on a wider network, WideNet. FFConv-WideNet replaces the second convolutional layer of WideNet with a 6 x 6 convolution with a reduced number of output channels followed by a 1 x 1 convolution, and reaches comparable accuracy after 200 epochs of training with our SVD-based weight initialization. The weights and activations of all models for MNIST and CIFAR-10 are quantized to 8 bits.
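For reference, a plaintext (re-)training version of TinyNet and FFConv-TinyNet could be expressed as in the PyTorch sketch below; this is our reconstruction under assumptions, where the factorization rank r, the absence of padding, and the use of LazyLinear are our choices rather than details given in the text.

```python
# PyTorch sketch (our reconstruction, not the paper's training code) of TinyNet and its
# factorized counterpart for plaintext training. The 8x8/stride-2 convolution with 56
# channels, the Square activation, and the final FC layer follow the text; the rank `r`
# is a hypothetical choice.
import torch
import torch.nn as nn

class Square(nn.Module):
    def forward(self, x):
        return x * x                     # HE-friendly replacement for ReLU

def tiny_net():
    return nn.Sequential(
        nn.Conv2d(1, 56, kernel_size=8, stride=2),   # 8x8 conv, 56 output channels
        Square(),
        nn.Flatten(),
        nn.LazyLinear(10),               # fully connected classifier
    )

def ffconv_tiny_net(r: int = 13):        # rank r is a hypothetical choice
    return nn.Sequential(
        nn.Conv2d(1, r, kernel_size=8, stride=2),    # factorized 8x8 conv, r << 56 filters
        nn.Conv2d(r, 56, kernel_size=1),             # 1x1 conv restoring 56 channels
        Square(),
        nn.Flatten(),
        nn.LazyLinear(10),
    )

x = torch.randn(2, 1, 28, 28)            # a batch of MNIST-sized inputs
print(tiny_net()(x).shape, ffconv_tiny_net()(x).shape)   # both: torch.Size([2, 10])
```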
Cryptosystem Settings. We use the BFV scheme to implement all models, based on the message representations and homomorphic operations used in LoLa. We set different parameters in order to maximize the performance of each model. In particular: (1) LoLa-TinyNet: ring dimension N = 8192, plaintext coefficient modulus t = 1099511922689; (2) FFConv-TinyNet: N = 8192, t = 576460752303439873; (3) LoLa-WideNet: N = 16384, with t given as the product of 34359771137 and a second modulus; (4) FFConv-WideNet: N = 16384, with t given as the product of 9007199255560193 and a second modulus. For a fair comparison with the baseline, all experiments are run on an Azure standard B8ms virtual machine with 8 vCPUs and 32 GB DRAM.
Table 8. FFConv-WideNet on CIFAR-10: layer, input size, message representation, and FFConv HE operation at each layer.
Table 6 summarizes the message representation, homomorphic operation, and inference latency of LoLa-TinyNet and FFConv-TinyNet at each layer. Both TinyNet and FFConv-TinyNet apply Im2Col to preprocess the input and encode it into 64 ciphertexts, each containing 144 elements. After performing a convolution vector - row major multiplication on each of the 64 ciphertexts, the two models produce 54 and 13 output messages and consume 0.156 seconds and 0.031 seconds, respectively. For FFConv-TinyNet, an additional layer of convolution vector - row major multiplication is required to complete the first convolution layer, which produces the dense output messages in 0.046 seconds. Although FFConv-TinyNet uses two layers of convolution vector - row major multiplication, it still reduces the inference time of the first convolution layer from 0.156 seconds by 50.64% to 0.031 + 0.046 = 0.077 seconds. The remaining layers of FFConv-TinyNet are the same as in LoLa-TinyNet and therefore take the same time. As shown in Table 5, the total inference latency of FFConv-TinyNet is 0.37 seconds, a 17.78% reduction from 0.45 seconds, which is also much faster than Faster-CryptoNets (Chou et al., 2018) and nGraph-HE (Boemer et al., 2019). It is worth noting that FFConv also improves the accuracy of TinyNet. The reason is that FFConv reduces the number of model parameters, which helps prevent overfitting and enhances generalization. Since TinyNet's first convolutional layer does not contain any rotation operation, Falcon cannot be used to accelerate it.

Table 8 summarizes the message representation, homomorphic operation, and number of rotations that FFConv-WideNet uses at each layer. In LoLa-WideNet, the second convolution layer consumes 711 seconds, accounting for more than 97% of the total inference latency. The reason lies in its nearly 500,000 parameters and the use of a large number of extremely time-consuming rotation operations. Therefore, we focus on optimizing this convolutional layer. FFConv-WideNet replaces the second convolutional layer in LoLa-WideNet with two sub-convolutional layers, reducing the 52,975 rotations by 86.48% to 7000 + 162 = 7162 rotations, the 4075 MulPC by 7.75% to 3575 MulPC, and the 52,975 AddCC by 81.61% to 9740. The message representation and homomorphic operations of the remaining layers remain unchanged, so their time is unchanged. Falcon-WideNet transforms the second spatial-domain convolution layer into the frequency domain in order to reduce the number of rotations, but at the same time introduces a large number of MulPC operations, while FFConv-WideNet requires fewer. Moreover, the increase in noise caused by Falcon's homomorphic DFT and inverse DFT operations requires a larger noise budget of Q = 430 bits, whereas FFConv-WideNet only needs 380 bits. As shown in Table 7, by reducing the number of rotations and MulPC, the inference latency of FFConv-WideNet is 96.4 seconds, which is 87.04% and 11.59% lower than LoLa-WideNet and Falcon-WideNet, respectively.
6. Conclusion
In this paper, we propose a low-rank factorization framework called FFConv to accelerate secure neural network inference on encrypted data. FFConv factorizes a regular convolution into two small convolutions, which can be encrypted efficiently with Factorized Packing (FactPack). Experimental results show that FFConv enables faster inference than the state-of-the-art. FFConv is the first non-interactive HENN that is capable of reducing the computational overheads induced by different ciphertext packing schemes simultaneously.
References
Boemer, F., Lao, Y., Cammarota, R., and Wierzynski, C. (2019). nGraph-HE: A graph compiler for deep learning on homomorphically encrypted data. In Proceedings of the 16th ACM International Conference on Computing Frontiers, pages 3-13.

Brakerski, Z., Gentry, C., and Vaikuntanathan, V. (2014). (Leveled) fully homomorphic encryption without bootstrapping. ACM Transactions on Computation Theory (TOCT), 6(3):13.

Brakerski, Z. and Vaikuntanathan, V. (2011). Fully homomorphic encryption from ring-LWE and security for key dependent messages. In Annual Cryptology Conference, pages 505-524. Springer.

Brutzkus, A., Gilad-Bachrach, R., and Elisha, O. (2019). Low latency privacy preserving inference. In International Conference on Machine Learning, pages 812-821.

Cheon, J. H., Kim, A., Kim, M., and Song, Y. (2017). Homomorphic encryption for arithmetic of approximate numbers. In International Conference on the Theory and Application of Cryptology and Information Security, pages 409-437. Springer.

Chou, E., Beal, J., Levy, D., Yeung, S., Haque, A., and Fei-Fei, L. (2018). Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953.

Dowlin, N., Gilad-Bachrach, R., Laine, K., Lauter, K., Naehrig, M., and Wernsing, J. (2016). CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211-218.

Fan, J. and Vercauteren, F. (2012). Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012:144.

Gentry, C. (2009). Fully homomorphic encryption using ideal lattices. In STOC '09, pages 169-178. ACM.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315-323.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Juvekar, C., Vaikuntanathan, V., and Chandrakasan, A. (2018). GAZELLE: A low latency framework for secure neural network inference. In 27th USENIX Security Symposium, pages 1651-1669. USENIX Association.

Kim, Y., Park, E., Yoo, S., Choi, T., Yang, L., and Shin, D. (2016). Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations.

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.

Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. (2015). Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations.

LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. (2017). Pruning filters for efficient ConvNets. In International Conference on Learning Representations (ICLR '17).

Liu, J., Juuti, M., Lu, Y., and Asokan, N. (2017). Oblivious neural network predictions via MiniONN transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17), pages 619-631.

Lou, Q., Lu, W.-j., Hong, C., and Jiang, L. (2020). Falcon: Fast spectral inference on encrypted data. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Smart, N. P. and Vercauteren, F. (2014). Fully homomorphic SIMD operations. Designs, Codes and Cryptography, 71(1):57-81.