Photonic Convolution Neural Network Based on Interleaved Time-Wavelength Modulation
Yue Jiang, Wenjia Zhang*, Fan Yang, and Zuyuan He
State Key Laboratory of Advanced Optical Communication Systems and Networks, Shanghai Jiao Tong University, Shanghai 200240, China. *[email protected]
February 22, 2021

ABSTRACT
Convolution neural network (CNN), as one of the most powerful and popular technologies, has achieved remarkable progress for image and video classification since its invention in 1989. However, with the explosion of high-definition video data, the convolution layers in the CNN architecture occupy a great amount of computing time and memory resources due to the high computational complexity of the matrix multiply accumulate operation. In this paper, a novel integrated photonic CNN is proposed based on double correlation operations through interleaved time-wavelength modulation. Micro-ring based multi-wavelength manipulation and a single dispersion medium are utilized to realize the convolution operation and replace the conventional optical delay lines. 200 images from the MNIST dataset are tested, with an accuracy of 85.5% in our photonic CNN versus 86.5% on a 64-bit computer. We also analyze the computing error of the photonic CNN caused by various micro-ring parameters, operation baud rates, and the characteristics of the micro-ring weighting bank. Furthermore, a tensor processing unit based on a 4 × 4 mesh with 1.2 TOPS (tera-operations per second at 100% utilization) computing capability at a 20 Gbaud rate is proposed and analyzed to form a parallelized photonic CNN.

As the driving force of Industry 4.0, artificial intelligence (AI) technology is leading dramatic changes in many spheres such as vision, voice, and natural language classification [1]. Convolution neural networks (CNN), as one of the most powerful and popular technologies, have achieved remarkable progress for image classification by extracting feature maps from thousands of images [2]. In particular, CNNs with various structures, such as AlexNet [2], VGG16 (or 19) [3] and GoogleNet [4], mainly consist of two parts: convolution feature extractors, which extract the feature map through multiple cascaded convolution layers, and fully connected layers serving as a classifier.
In the CNN architecture, the convolution layers occupy most of the computing time and resources [5] due to the high computational complexity of multiply accumulate and matrix multiply accumulate (MMAC) operations [6]. Therefore, the image-to-column algorithm combined with general matrix multiplication (GeMM) [7, 8] and Winograd algorithms [9] were proposed to accelerate the original 2-D convolution operation (2Dconv) by improving memory efficiency [10]. With the explosion of high-definition video data, however, algorithm innovation cannot achieve outstanding performance gains without hardware evolution. Therefore, innovative hardware accelerators have been proposed and commercialized in the forms of application specific integrated circuits (ASIC) [11], graphics processing units (GPU) [12, 13], and tensor processing units (TPU) [14]. However, it has become overwhelming for conventional electronic computing hardware to keep up with the continually evolving CNN algorithms [15].

In the meantime, integrated photonic computing technology presents unique potential for the next generation of high-performance computing hardware due to its intrinsic parallelism, ultrahigh bandwidth, and low power consumption [16]. Recently, significant progress has been achieved in designing and realizing integrated optical neural networks (ONN) [17, 18, 19]. The fundamental components, including Mach-Zehnder interferometers (MZI) [18] and micro-ring resonators (MRR) [19], have been widely employed to compose an optical matrix multiplier unit (OMU), which is used to complete the MMAC operation. In order to construct a full CNN architecture, an electrical control unit such as a field programmable gate array (FPGA) is required to send slices of input images as voltage control signals to optical modulators and also to operate the nonlinear activation. For instance, an OMU controlled by an FPGA has been proposed using a fan-in-out structure based on microring resonators [20]. Similarly, the CNN accelerator based on the Winograd algorithm in [21] is also composed of an MRR-based OMU and an electronic buffer. However, such photonic CNN architectures controlled by electronic buffers rely on electrical components to repeatedly access memory and extract the corresponding image slices (or slice vectors), and are finally constrained by memory access speed and capacity. In 2018, a photonic CNN using optical delay lines to replace the electronic buffer was first proposed in [22]. Based on a similar idea, an optical patching scheme completing the 2-D convolution was developed in [23], where the wavelength division multiplexing (WDM) method of [22] is used.

In our previous work [24], wavelength-domain weighting based on interleaved time-wavelength modulation was demonstrated to complete the MMAC operation. The idea of multi-wavelength modulation and dispersed time delay can realize matrix-vector multiplication by employing time and wavelength domain multiplexing. However, the cross-correlation operation between an input vector and a single column of the weighting matrix is operated through a sampling process that generates a large amount of useless data. Moreover, a 2Dconv operation can be decomposed as the sum of multiple double correlation operations between vectors. In this paper, a novel integrated photonic CNN is proposed based on double correlation operations through interleaved time-wavelength modulation. Microring-based multi-wavelength manipulation and a single dispersion medium are utilized to realize the convolution operation and replace the conventional optical delay lines used in [22] and [23]. 200 images from the MNIST dataset are tested, with an accuracy of 85.5% in our PCNN versus 86.5% on a 64-bit computer.
We also analyze the error of the PCNN caused by high baud rates and by the characteristics of the MRR weighting bank. Furthermore, a tensor processing unit based on a 4 × 4 mesh with 1.2 TOPS (tera-operations per second at 100% utilization) computing capability at a 20 Gbaud rate for the MZM architecture is proposed and analyzed to form a parallelized photonic CNN.

The convolution layer is the key building block of a convolution network and performs most of the computational heavy lifting. The convolution operation essentially performs dot products between the kernel and local regions of the input, iterated over the input image with a given stride along both width and height. This operation therefore consumes a large amount of memory, since some values in the input volume are replicated multiple times due to the striding nature of the process.

In the proposed photonic CNN, shown in Fig. 1(a), the optical convolution unit (OCU) consists of the OMU and a dispersed time delay unit (TDU). A single 2Dconv operation between the M × M input image A and the N × N convolution kernel w is executed during one period in the OCU, which can be written as:

Y_{m,n} = Σ_{i=1}^{N} Σ_{j=1}^{N} w_{i,j} · A_{m+i−1, n+j−1}   (1)

Taking M = 3, N = 2 as the example in Fig. 1(b), the input image A is flattened into a normalized 1 × M² vector A′, which is modulated by an MZI modulator onto a multi-wavelength optical signal with N² wavelengths λ_1, λ_2, ..., λ_{N²} at a certain baud rate (marked as BR in equations). The intensity of each frequency after modulation, I_{A′}(t), can be written as

I_{A′}(t) = Σ_{l=1}^{M} Σ_{k=1}^{M} I_{input} · A_{l,k} · Square(t),
Square(t) = U[t − ((l−1) × M + k)/BR] − U[t − ((l−1) × M + k + 1)/BR]   (2)

where U(t) is the step function and I_{input} is the intensity of a single channel of the WDM source, which is equal for all frequencies. Optical signals of different wavelengths are separated by the DEMUX and sent to the corresponding MRRs. N² MRRs R_1, R_2, ..., R_{N²} compose an MRR weighting bank. The transmission T_{(i−1)×N+j} of each MRR is set to w_{i,j} and tuned by the voltage bias from a voltage source or an arbitrary waveform generator. The control signal is generated from the w–V database, which stores the mapping between w and V. The output intensity of each MRR, I_{R_{(i−1)×N+j}}(t), with circuit time delay τ_c, can be written as

I_{R_{(i−1)×N+j}}(t) = I_{A′}(t − τ_c) · w_{i,j}   (3)

Optical signals of different wavelengths are combined in the time domain as the matrix B shown in Fig. 1(b) by passing through the MUX. The output intensity I_{OMU}(t) of the OMU, with time delay τ′_c, is

I_{OMU}(t) = Σ_{i=1}^{N} Σ_{j=1}^{N} I_{A′}(t − τ′_c) · w_{i,j}   (4)

which is equal to the MMAC operation between the flattened convolution kernel vector and the matrix [A′^T, ..., A′^T], which contains N² copies of A′. As depicted in Fig. 1(b), to complete the 2Dconv operation between A and w, the corresponding elements in (1) should lie in the same column of the matrix B′, which can be realized by introducing a different time delay τ_{(i−1)×N+j} for wavelength λ_{(i−1)×N+j} in the TDU to complete the zero padding operation:

τ_{(i−1)×N+j} = [(N − i) × M + (N − j)]/BR   (5)

The intensity of the light wave passing through the TDU, with the wavelength-independent circuit time delay τ″_c, can be written as

I_{TDU}(t) = Σ_{i=1}^{N} Σ_{j=1}^{N} I_{A′}(t − τ″_c − τ_{(i−1)×N+j}) · w_{i,j}   (6)

When the optical signal is received by the photo-detector (PD), I_{TDU}(t) is converted to V_{PD}(t). Referring to (6), there are M² + (N − 1) × (M + 1) elements in each row of matrix B′, and the q-th column occupies one time slice in V_{PD}(t), from τ″_c + (q − 1)/BR to τ″_c + q/BR. Comparing (1) and (6), when

q = (M − N + 1) × (m − 1) + (M + m) + n   (7)

where 1 ≤ m, n ≤ M − N + 1, and with a parameter σ between 0 and 1, we have:

Y_{m,n} = V_{PD}[τ″_c + (q − 1 + σ)/BR]   (8)

For M = 3, N = 2, as shown in Fig. 1(b), the column sums B′_{i,5}, B′_{i,6}, B′_{i,8}, and B′_{i,9} correspond to Y_{1,1}, Y_{1,2}, Y_{2,1}, and Y_{2,2}, respectively. A programmed sampling function following (7) and (8) is necessary in digital signal processing, and the parameter σ decides the position of the optimal sampling point, which needs to be adjusted at different baud rates. According to (5), the rows B′_q of matrix B′ can be divided into N groups of N vectors, composed as Group_{i,j} = B′_{(i−1)×N+j}, where i, j ≤ N. The kernel elements multiplied with the vector A′ in Group_i are [w_{i,1}, w_{i,2}, ..., w_{i,N}], which are the elements of the same row of the convolution kernel w. Referring to (5), the time delay difference between two adjacent rows in the same group is 1/BR, whereas the time delay difference between Group_{i,j} and Group_{i+1,j} is M/BR. The sum of the q-th column within the same group of B′ can be written as

X_{Group_i}(q) = Σ_{j=1}^{N} w_{i,j} · A′_{q+j−N}   (9)

which is exactly the expression of the cross-correlation (marked as R(x, y)) between the vector [w_{i,1}, w_{i,2}, ..., w_{i,N}] and A′. Therefore, the 2Dconv operation can be decomposed as the sum of multiple double correlation operations between vectors as follows:

Σ_{p=1}^{N²} B′_p = Σ_{i=1}^{N} R[R(A′, w_i), Flatten(C_i)]   (10)

where Σ_{i=1}^{N} C_i is an identity matrix of size N × N; the element at the i-th row and i-th column of C_i is equal to 1 and all other elements are 0. The matrix C_i is flattened into a 1 × N² vector, and the cross-correlation operation is denoted R(A′, w_i).

MRRs based on the electro-optic or thermo-optic effect are used in the weighting bank of the OCU.
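As a numerical sanity check of the decomposition in (1)–(8), the scheme can be sketched as follows: each wavelength carries a weighted, delayed copy of the flattened image, the delayed copies are summed, and sampling the summed trace at the column positions given by (7) reproduces the valid 2-D convolution. The sketch below (variable names are ours, and circuit delays and σ are omitted for simplicity) is an illustration, not the authors' simulator:

```python
import numpy as np

M, N = 3, 2                      # image and kernel sizes from the example
rng = np.random.default_rng(0)
A = rng.random((M, M))           # input image
w = rng.random((N, N))           # convolution kernel

A_flat = A.flatten()             # the 1 x M^2 modulated vector A'
row_len = M * M + (N - 1) * (M + 1)   # row length of B' per the text

# Each wavelength (i, j) carries w[i, j] * A', delayed per Eq. (5)
# (0-based indices: delay = (N-1-i)*M + (N-1-j) symbol slots).
summed = np.zeros(row_len)
for i in range(N):
    for j in range(N):
        delay = (N - 1 - i) * M + (N - 1 - j)
        summed[delay:delay + M * M] += w[i, j] * A_flat

# Sample the summed trace at the positions given by Eq. (7) (1-based q).
Y = np.zeros((M - N + 1, M - N + 1))
for m in range(1, M - N + 2):
    for n in range(1, M - N + 2):
        q = (M - N + 1) * (m - 1) + (M + m) + n
        Y[m - 1, n - 1] = summed[q - 1]

# Direct valid 2-D convolution in the correlation form of Eq. (1).
ref = np.array([[np.sum(w * A[m:m + N, n:n + N])
                 for n in range(M - N + 1)] for m in range(M - N + 1)])
assert np.allclose(Y, ref)
```

The same loop structure works for any M and N, which is how the scheme scales to the 28 × 28 MNIST images used later.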
Referring to (3), the elements of the convolution kernel w_{i,j}, trained on a 64-bit computer, are usually normalized between 0 and 1 and need to be mapped into the transmissions of the MRRs. As shown in Fig. 2(a), according to [25, 26], the transmission of the through port of an MRR based on the electro-optic effect is tuned by the voltage bias V loaded on the electrode of the MRR, which can be written as:

T = 1 − (1 − α²)(1 − τ²) / [(1 − ατ)² + 4ατ sin²(θ/2)],   θ = θ_0 + πV/V_π   (11)

where τ is the amplitude transmission constant between the ring and the waveguide, α is the round-trip loss factor, θ is the round-trip phase shift, θ_0 is the bias phase of the MRR, and V_π is the voltage loaded on the MRR when θ − θ_0 = π, which is decided by the physical parameters of the waveguide. The V–T curve is shown in Fig. 2(c). A voltage source with a specific precision (10-bit in our evaluation) sweeps the output voltage with its minimum step from 0 to 0.4 V, which is loaded on the MRR, and the transmission of the MRR at each voltage is recorded accordingly. As shown in Fig. 2(d), this process is equivalent to sampling the V–T curve with an analog-to-digital converter (ADC) of the same precision as the voltage source. If |w_{i,j}| ≤ 1, w_{i,j} can be mapped directly into T, and the weighting voltage V can be found by searching for the value closest to w_{i,j} in the T–V database. Otherwise, the whole convolution kernel should be normalized by dividing by the maximum of w_{i,j}, and the normalized matrix w_nor is used to generate the control signal matrix V. Another mapping method uses part of the quasi-linear region of the V–T curve of the MRR, where the matrix w needs to be normalized by multiplying by max(T_linear)/max(w). Note that a weighting error occurs during the mapping process, as shown in Fig. 2(d): there is a difference w′ between the actual transmission T′ of the MRR and the ideal mapping point T. The weighting error and the outcome Y′ of the OMU can be written as (12), where Y is the theoretical outcome of the OMU, and Y′ → Y when w′ → 0.
w′ = T′ − T
Weighting Error = [A′^T, ..., A′^T] × w′
Y′ = [A′^T, ..., A′^T] × (w + w′) = Y + Weighting Error   (12)

The zero padding operation is executed by providing a different time delay for each channel of the multi-wavelength light source in the time delay unit. In our previous work [24], the OMU based on the wavelength-division weighting method with a single dispersion compensating fiber (DCF) was proposed, where the correlation operation between two vectors is realized in the time domain following (9). Based on the OMU in [24], the TDU can be implemented with a single dispersion medium combined with a programmed multi-wavelength source (PMWS), shown in Fig. 3, which can be generated by a shaped optical frequency comb following (5). The programmed light source contains N groups of wavelengths; each group contains N wavelengths with a spacing of Δλ, and the wavelength spacing between adjacent groups is M × Δλ. The requirements on the programmed multi-wavelength source can be written as

PMWS_{i,j} − PMWS_{i,j−1} = Δλ
PMWS_{i,j} − PMWS_{i−1,j} = M × Δλ   (13)

where PMWS denotes the programmable multi-wavelength source, which is sent to the dispersion medium of length L (km) and dispersion D (ps/nm/km). The time delay difference (marked as TDD in (14)) introduced for the optical signal at wavelength PMWS_{i,j} relative to PMWS_{1,1} is

TDD_{i,j} = (PMWS_{i,j} − PMWS_{1,1}) × L × D   (14)

When TDD_{i,j} − TDD_{i,j−1} = 1/BR, (14) is equivalent to (5), i.e., the zero padding operation is conducted when the multi-wavelength signals pass through the dispersion medium. Note that there exist challenging tasks in implementing the TDU structure shown in Fig. 3: it is essential to design a frequency comb with a large enough number and density of lines, combined with a dispersion medium with flat, large enough D (ps/nm/km) and low loss. The required bandwidth B, number of comb lines k, and length L of DCF can be calculated as:

B = (M + 1) × (N − 1) × Δλ
k = B/Δλ + 1
L = (BR × D × Δλ)^{−1}   (15)

In this paper we take a frequency comb with Δλ ≈ 0.2 nm, as reported in [27], and a DCF (assuming D is flat over all wavelengths) with D = −150 ps/nm/km, to perform the MNIST handwritten digit recognition task with M = 28 and N = 3; following (15), B = 11.6 nm, k = 59 lines, and L = 1.67 km at BR = 20 G.
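The design rules in (15) can be evaluated directly; the short sketch below uses the example values from the text (M = 28, N = 3, 0.2 nm comb spacing, |D| = 150 ps/nm/km, 20 Gbaud) and reproduces the quoted bandwidth, line count, and fiber length:

```python
# Comb/DCF design rules of Eq. (15), with the example values from the text.
M, N = 28, 3
d_lambda = 0.2                    # comb line spacing, nm
D = 150e-12                       # dispersion magnitude, s/(nm*km)
BR = 20e9                         # baud rate, baud

B = (M + 1) * (N - 1) * d_lambda  # required optical bandwidth: 11.6 nm
k = round(B / d_lambda) + 1       # number of comb lines: 59
L = 1.0 / (BR * D * d_lambda)     # DCF length: ~1.67 km
```

Note that L scales inversely with the baud rate, which is why higher baud rates ease the fiber-length requirement in the discussion below.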
Another widely discussed dispersed-delay architecture is based on a multi-wavelength source and an arrayed fiber grating, where the PMWS is not necessary and the cost of source and bandwidth is much lower. However, at least N² SMFs are needed, which makes it hard to control the time delay of each wavelength precisely. N² tunable time delay units for short delays, such as fiber Bragg gratings or Si₃N₄ waveguides, can be employed with a proper delay controller to compensate the time delay error in each channel caused by the fabrication process. Furthermore, since the size M_l of the input images for the l-th convolution layer is equal to half of M_{l−1} after the pooling operation with a stride of 2, the SMF lengths for the l-th convolution layer need to be adjusted according to M_l, whereas the TDU based on the PMWS and a single dispersion medium can regulate the time delay with high robustness by reprogramming the WDM source according to (14).

As shown in Fig. 4(a), a simplified AlexNet convolution neural network for the MNIST handwritten digit recognition task is trained offline on a 64-bit computer in the TensorFlow framework (TCNN). It is composed of 3 convolution layers, with 2 kernels (3 × 3 × 1), 4 kernels (3 × 3 × 2), and 4 kernels (3 × 3 × 4) in the 1st, 2nd, and 3rd convolution layers, respectively. The size of the samples in the MNIST handwritten digit dataset is 28 × 28 × 1 (Width × Height × Channel), the output shapes of the layers are (13 × 13 × 2), (5 × 5 × 4), and (3 × 3 × 4), and finally a (1 × 36) flatten feature vector (marked as FFV in equations) is output by the flatten layer. A PCNN simulator with the same architecture is set up based on Lumerical and Matlab to implement the optical domain and the DSP part of the OCU. The V–T database is established by recording the transmission of the corresponding wavelength at the through port of the default MRR offered by Lumerical while sweeping the voltage bias from 0 to 1.2 V with 10-bit precision. Then the mapping process shown in Fig. 2 is conducted to load the convolution kernels into the PCNN simulator.
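The layer shapes above follow from valid 3 × 3 convolutions and 2 × 2 max pooling with stride 2; a quick arithmetic walk (our sketch of the shape bookkeeping, not the training code) confirms the 36-element flatten feature vector:

```python
# Shape walk of the simplified AlexNet (TCNN): 'valid' 3x3 convolutions
# and 2x2 max pooling with stride 2, as described in the text.
def conv_out(size, k=3):
    return size - k + 1            # valid convolution shrinks by k - 1

def pool_out(size, stride=2):
    return size // stride          # 2x2 max pooling with stride 2

size = 28
size = pool_out(conv_out(size))    # Conv.1 + pool: 26 -> 13
size = pool_out(conv_out(size))    # Conv.2 + pool: 11 -> 5
size = conv_out(size)              # Conv.3 (no pooling): 3
ffv_len = size * size * 4          # 4 output channels -> 36 elements
```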
The feature maps extracted at each convolution layer for the input figure "8" from TensorFlow and the reshaped feature vectors of the PCNN are compared in Fig. 4(b), which demonstrates the feature map extraction ability of the PCNN. Finally, 200 test samples are extracted randomly from MNIST and sent to the PCNN; the test accuracy is 85.5% at a 10 G baud rate. Note that the TCNN is a simplified AlexNet whose classification accuracy for the same 200 test samples is only 86.5% on our 64-bit computer. The confusion matrices of the TCNN and the PCNN at a 10 G baud rate are shown in Fig. 5(a) and (b), respectively.

Equation (12) shows that the weighting error occurs during the mapping process, and it depends on the mapping precision P(v_i) of the MRR weighting bank. P(v_i) can be evaluated from the difference of T(v_i) [20]:

P(v_i) = log₂[∇T(v_i)]⁻¹ = log₂[T(v_i) − T(v_{i−1})]⁻¹   (16)

As shown in Fig. 6, we numerically analyze the P(v_i) of MRRs with different finesse at distinct ADC precision levels, following (11) and (16). In Fig. 6(b), the MRR with smaller finesse has higher P(v_i) in the quasi-linear region (v_i ≤ v_l, where v_l is the boundary of the quasi-linear region). However, when v_i ≥ v_l, P(v_i) increases with the finesse. The precision of the ADC also has an impact on the P(v_i) of the MRR: as depicted in Fig. 6(c), P(v_i) increases with the precision of the ADC. The weighting error separated from the PCNN is added to the flatten feature vector extracted from the TensorFlow CNN. The test accuracy with this flatten feature vector is 87%, with the confusion matrix shown in Fig. 5(c). Note that the test accuracy of the flatten feature vector with error is higher than that of TensorFlow; the handwritten digit recognition task in this paper is a 36-dimensional optimization task. Here we use a 1-dimensional optimization function g(x) to explain: as shown in Fig. 6(d), there is a distance D between the optimal point and the convergence point of TensorFlow. The convergence point of the PCNN can be treated as the optimal point of the TCNN with added noise within the error range. This deviation can lead to a location closer to the optimal point and therefore a higher test accuracy with a certain probability. The test accuracy of MRRs with different finesse at distinct ADC precision levels is shown in Fig. 6(e), where w_{i,j} is mapped into T from 0 to 1, whereas in Fig. 6(f) w_{i,j} is mapped into T in the quasi-linear region. Comparing the two figures, MRRs with low finesse and high ADC precision are preferred in a high-speed photonic CNN.
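The mapping and precision definitions can be made concrete with a small sketch: sample an MRR transmission curve of the form of (11) with a 10-bit sweep, map a weight by nearest-neighbour lookup, and evaluate the precision metric of (16). The parameter values (alpha, tau, theta0, Vpi, and the 0–0.4 V sweep range) are illustrative assumptions for this sketch, not the device used in the paper:

```python
import numpy as np

# Through-port transmission of an all-pass MRR, in the form of Eq. (11).
# alpha, tau, theta0 and Vpi are illustrative values, not device data.
def mrr_transmission(V, alpha=0.98, tau=0.98, theta0=0.0, Vpi=3.0):
    theta = theta0 + np.pi * V / Vpi
    return 1.0 - (1 - alpha**2) * (1 - tau**2) / (
        (1 - alpha * tau)**2 + 4 * alpha * tau * np.sin(theta / 2)**2)

# T-V database from a 10-bit voltage sweep, as in the mapping of Fig. 2.
v = np.linspace(0.0, 0.4, 2**10)
T = mrr_transmission(v)

def map_weight(w):
    """Nearest-neighbour lookup of the drive voltage for a kernel weight."""
    idx = int(np.argmin(np.abs(T - w)))
    return v[idx], T[idx]

V_set, T_actual = map_weight(0.8)
weighting_error = T_actual - 0.8        # the w' term of Eq. (12)

# Mapping precision of Eq. (16), in bits, at each sampled voltage:
P = np.log2(1.0 / np.abs(np.diff(T)))   # steep regions resolve fewer bits
```

For these parameters the curve is monotonic over the sweep, so the residual weighting error is bounded by the local step size of the sampled curve, which is exactly what (16) quantifies.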
Table 1: EXECUTION SPEED AT DIFFERENT BAUD RATES FOR PCNN WITH 1 OCU
| Baud rate | Conv.1 (M=28, Period=2) | Conv.2 (M=13, Period=8) | Conv.3 (M=5, Period=16) | Total time | Ops   | Execution speed (average) | Execution speed (2Dconv) |
| 5G        | 340 ns                  | 320 ns                  | 128 ns                  | 788 ns     | 44352 | 56 GOPS                   | 71 GOPS                  |
| 10G       | 170 ns                  | 160 ns                  | 64 ns                   | 394 ns     | 44352 | 112 GOPS                  | 143 GOPS                 |
| 15G       | 114 ns                  | 112 ns                  | 40 ns                   | 266 ns     | 44352 | 166 GOPS                  | 213 GOPS                 |
| 20G       | 86 ns                   | 80 ns                   | 32 ns                   | 198 ns     | 44352 | 224 GOPS                  | 282 GOPS                 |
| 25G       | 68 ns                   | 64 ns                   | 24 ns                   | 156 ns     | 44352 | 284 GOPS                  | 357 GOPS                 |
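The per-layer times in Table 1 can be approximated as Period × [M × (M + 2) + 2]/BR, the per-period duration given later in Eq. (18) with the circuit delay t_c neglected. A quick check of this model (our sketch; the table values appear to be rounded to whole symbol slots):

```python
# Approximate per-layer execution times behind Table 1: each 2Dconv period
# lasts [M*(M+2)+2]/BR seconds, and a single OCU needs (channels x kernels)
# periods per layer.
layers = [(28, 2), (13, 8), (5, 16)]   # (M, periods) for Conv.1-3

def period_time(M, BR):
    return (M * (M + 2) + 2) / BR      # seconds per 2Dconv period

BR = 10e9                              # 10 Gbaud
total = sum(p * period_time(M, BR) for M, p in layers)
# total is ~385 ns, within a few percent of the 394 ns listed in Table 1.
```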
Distortion is introduced when high-bandwidth signals pass through filters such as MRRs. Moreover, quantization noise for high-frequency signals also induces extra error, which can be extracted following (17):

Error = FFV_PCNN − FFV_TCNN − Weighting Error   (17)

where the Weighting Error is fixed at any baud rate in our simulator. We run the photonic CNN at baud rates of 5, 10, 15, 20, and 25 Gbaud for 10 samples. The distribution statistics of the Error, with 360 elements at each baud rate, are shown in Fig. 7(a) to (e). To analyze the impact of the level of error on the test accuracy at different baud rates, the probability density function (PDF) of the error at each baud rate is calculated. The PDF follows a normal distribution, and the Gaussian fit of the PDF at each baud rate is shown in Fig. 7(f). The mean of the Gaussian fit decreases whereas the variance increases at higher baud rates of the input vector, meaning that the error grows with the baud rate. 10 random error sequences Error′_i are generated according to the PDF at each baud rate and added to (FFV_TCNN + Weighting Error); the results are combined into new flatten feature vectors with errors, which are sent to the classifier for testing. The performance of the photonic CNN at different baud rates is shown in Fig. 8. Note that the distance between the optimal point and the convergence point is shown in Fig. 6(d); the average accuracy at each baud rate and the standard deviation of the test accuracy should therefore be considered instead. In Fig. 8, the performance degrades with increasing baud rate, showing that a high-speed photonic CNN pays a cost in computation performance. However, a high operation baud rate means less computing time, which can be roughly calculated as

t_2Dconv = [M × (M + 2) + 2]/BR + t_c   (18)

Table 2: EXECUTION SPEED AT DIFFERENT BAUD RATES FOR 4 × 4 PCNN MESH
| Baud rate | Conv.1 (M=28) | Conv.2 (M=13) | Conv.3 (M=5) | Total time | Ops   | Execution speed (54% utilization) | Execution speed (100% utilization) |
| 5G        | 170 ns        | 40 ns         | 8 ns         | 218 ns     | 44352 | 203 GOPS                          | 324 GOPS                           |
| 10G       | 85 ns         | 20 ns         | 4 ns         | 109 ns     | 44352 | 406 GOPS                          | 648 GOPS                           |
| 15G       | 57 ns         | 14 ns         | 2.5 ns       | 73.5 ns    | 44352 | 603 GOPS                          | 1.03 TOPS                          |
| 20G       | 43 ns         | 10 ns         | 2 ns         | 55 ns      | 44352 | 806 GOPS                          | 1.29 TOPS                          |
| 25G       | 34 ns         | 8 ns          | 1.5 ns       | 43.5 ns    | 44352 | 1.02 TOPS                         | 1.73 TOPS                          |
where t_c is the time delay in the OMU, which is usually less than 100 ps in our system. Thus, the execution speeds at different baud rates are as shown in Table 1. Note that an operation in the TCNN is a 4-dimensional (tensor) operation over width, height, channel, and kernel, whereas each OCU realizes only a 2-dimensional operation over width and height during one period. In a layer of a photonic CNN with C input channels and K kernels, one OCU can be used repeatedly to complete the 4-dimensional operation in C × K periods. To improve the execution speed, parallelization of the photonic CNN is necessary in the future. In this paper, a candidate mesh with MRR weighting banks, shown in Fig. 9, is proposed to complete the tensor operation during one period. Each row of the mesh is combined as one kernel with all channels, and the same channel of the input figure is copied and sent to the same column of the mesh. For the first layer of the photonic CNN, the input image "8" is flattened into a 1 × 784 vector and duplicated into two copies by a splitter for MWB_{1,1} and MWB_{2,1}. The two 1 × 784 vectors are sent to the DSP through the TDU and PD in the 1st and 2nd rows of the mesh. Note that the lengths of the optical paths through the mesh and the dispersion medium should be equal. The execution speed of the 4 × 4 mesh at different baud rates is shown in Table 2. Note that the mesh is not 100% utilized in each period when loaded with the simplified AlexNet shown in Fig. 4(a): the average utilization of the PCNN can be calculated as (2/16 + 8/16 + 16/16)/3 = 54%, and the average execution time for one sample is much lower due to the nature of parallelization. Referring to (15) and Tables 1 and 2, the photonic CNN running at a higher baud rate has a faster execution speed and a smaller delay scale. However, the selection of the baud rate depends on the requirements on CNN performance and time delay resolution. As shown in Fig. 8, the performance degenerates significantly at Baud Rate = 25 G. Moreover, if we choose the delay structure in Fig. 3 and set the length of the DCF to L = 2 km with a comb line spacing of 0.2 nm, the time delay resolution is 60 ps according to (15), which allows Baud Rate ≤ 16.7 G.

The photonic CNN using an electronic buffer based on the 2Dconv and GeMM algorithms needs to access memory repeatedly to extract the corresponding image slices; the number of memory accesses is N² × (M − N + 1)². As shown in Fig. 10(a), the number of memory accesses for the 2Dconv and GeMM algorithms increases significantly with the width of the input image, since the multiplication, addition, and zero padding operations require a large amount of data in memory, as shown in Fig. 10(b). However, the photonic CNN only needs to read out the flattened image vector and store the convolution results, i.e., only 2 memory accesses are needed. Furthermore, the intermediate data stored in the optical delay unit incur less memory cost than in the electrical counterpart, as shown in Fig. 10, and come very close to the theoretical lower limit.

In this paper, we propose a novel integrated photonic CNN based on double correlation operations through interleaved time-wavelength modulation. 200 images from the MNIST dataset are tested, with an accuracy of 85.5% in our PCNN versus 86.5% on a 64-bit computer. The error caused by the distortion induced by filters and the ADC increases with the baud rate of the input images, leading to a degradation of the classification performance. A tensor processing unit based on a 4 × 4 mesh with 1.2 TOPS (tera-operations per second at 100% utilization) computing capability at a 20 Gbaud rate is proposed and analyzed to form a parallelized photonic CNN.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In
Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2014.
[5] Yangqing Jia. Learning semantic image representations at a large scale. 2014.
[6] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
[7] Marat Dukhan. The indirect convolution algorithm. 2019.
[8] Zhi-Gang Liu, Paul N Whatmough, and Matthew Mattina. Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference. IEEE Computer Architecture Letters, 19(1):34–37, 2020.
[9] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. 2015.
[10] Minsik Cho and Daniel Brand. MEC: Memory-efficient convolution for deep neural network. arXiv preprint arXiv:1706.06873, 2017.
[11] Tao Luo, Shaoli Liu, Ling Li, Yuqing Wang, and Yunji Chen. DaDianNao: A neural network supercomputer. IEEE Transactions on Computers, 66(1):73–88, 2016.
[12] Shunsuke Suita, Takahiro Nishimura, Hiroki Tokura, Koji Nakano, Yasuaki Ito, Akihiko Kasagi, and Tsuguchika Tabaru. Efficient cudnn-compatible convolution-pooling on the GPU. In International Conference on Parallel Processing and Applied Mathematics, pages 46–58. Springer, 2019.
[13] Shunsuke Suita, Takahiro Nishimura, Hiroki Tokura, Koji Nakano, and Tsuguchika Tabaru. Efficient convolution pooling on the GPU.
Journal of Parallel and Distributed Computing, 138:222–229, 2020.
[14] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In
Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017.
[15] Yu Wang. Neural networks on chip: From CMOS accelerators to in-memory-computing. pages 1–3. IEEE, 2018.
[16] H John Caulfield and Shlomi Dolev. Why future supercomputing requires optics. Nature Photonics, 4(5):261–263, 2010.
[17] Yichen Shen, Nicholas C Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441, 2017.
[18] Philip Y Ma, Alexander N Tait, Thomas Ferreira de Lima, Chaoran Huang, Bhavin J Shastri, and Paul R Prucnal. Photonic independent component analysis using an on-chip microring weight bank. Optics Express, 28(2):1827–1844, 2020.
[19] Alexander N Tait, Hasitha Jayatilleka, Thomas Ferreira De Lima, Philip Y Ma, Mitchell A Nahmias, Bhavin J Shastri, Sudip Shekhar, Lukas Chrostowski, and Paul R Prucnal. Feedback control for microring weight banks. Optics Express, 26(20):26422–26443, 2018.
[20] Qixiang Cheng, Jihye Kwon, Madeleine Glick, Meisam Bahadori, Luca P Carloni, and Keren Bergman. Silicon photonics codesign for deep learning. Proceedings of the IEEE, 2020.
[21] Armin Mehrabian, Mario Miscuglio, Yousra Alkabani, Volker J Sorger, and Tarek El-Ghazawi. A Winograd-based integrated photonics accelerator for convolutional neural networks. IEEE Journal of Selected Topics in Quantum Electronics, 26(1):1–12, 2019.
[22] Hengameh Bagherian, Scott Skirlo, Yichen Shen, Huaiyu Meng, Vladimir Ceperic, and Marin Soljacic. On-chip optical convolutional neural networks. arXiv preprint arXiv:1808.03303, 2018.
[23] Shaofu Xu, Jing Wang, and Weiwen Zou. Optical patching scheme for optical convolutional neural networks based on wavelength-division multiplexing and optical delay lines. Optics Letters, 45(13):3689–3692, 2020.
[24] Yuyao Huang, Wenjia Zhang, Fan Yang, Jiangbing Du, and Zuyuan He. Programmable matrix operation with reconfigurable time-wavelength plane manipulation and dispersed time delay. Optics Express, 27(15):20456–20467, 2019.
[25] Hidehisa Tazawa, Ying-Hao Kuo, Ilya Dunayevskiy, Jingdong Luo, Alex K-Y Jen, Harold R Fetterman, and William H Steier. Ring resonator-based electrooptic polymer traveling-wave modulator. Journal of Lightwave Technology, 24(9):3514, 2006.
[26] Bartosz Bortnik, Yu-Chueh Hung, Hidehisa Tazawa, Byoung-Joon Seo, Jingdong Luo, Alex K-Y Jen, William H Steier, and Harold R Fetterman. Electrooptic polymer ring resonator modulation up to 165 GHz. IEEE Journal of Selected Topics in Quantum Electronics, 13(1):104–110, 2007.
[27] Junqiu Liu, Erwan Lucas, Arslan S. Raja, Jijun He, and Tobias J. Kippenberg. Author correction: Photonic microwave generation in the X- and K-band using integrated soliton microcombs. Nature Photonics, pages 1–1, 2020.
PREPRINT - FEBRUARY 22, 2021
[Figure 1(b) graphic: zero padding of the M×M input image; convolution with an N×N kernel yields an (M−N+1)×(M−N+1) feature map, which is then flattened (example shown for M=3, N=2).]
Figure 1: (a) Structure of the OCU, where the 2Dconv operation shown in (b) is performed. MZM: Mach-Zehnder modulator; W-V Data Base: set up following the process shown in Fig. 2(b) to generate the voltage control signal loaded on the MRR weighting bank; PD: photodetector to convert the optical signal into the electric domain; ADC and DAC: analog-to-digital and digital-to-analog converter, respectively; DSP: digital signal processing, where the sampling, nonlinear, and pooling operations are done.
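As a concrete check of the (M−N+1)×(M−N+1) output size quoted in the caption, a minimal NumPy sketch of the valid 2D convolution (correlation-style, without kernel flipping, as is conventional in CNNs; all input values are illustrative):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution: an M x M image and an N x N kernel
    yield an (M-N+1) x (M-N+1) feature map."""
    M, N = image.shape[0], kernel.shape[0]
    out = np.zeros((M - N + 1, M - N + 1))
    for i in range(M - N + 1):
        for j in range(M - N + 1):
            out[i, j] = np.sum(image[i:i + N, j:j + N] * kernel)
    return out

image = np.arange(9.0).reshape(3, 3)   # M = 3
kernel = np.ones((2, 2))               # N = 2
fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # (2, 2), i.e. (M-N+1) x (M-N+1)
```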
Figure 2: (a) Schematic of the MRR based on the EO effect. (b) Mapping process from w to T−V. (c) v−T and ∇T(v) curves of the MRR; the QLR (quasi-linear region) in this paper is defined as the region between 0 V and the voltage corresponding to the highest / of the ∇T(v) curve. (d) v−T curve sampled by the ADC with 10-bit precision; note that errors w′ exist between the theoretical mapping points w_{i,j} and the true mapping points T′_{i,j}.
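The weighting error of Fig. 2(d) can be illustrated with a toy quantization model; the linear w-to-T mapping and 10-bit step used below are assumptions for illustration, not the paper's measured v−T curve:

```python
import numpy as np

BITS = 10                # ADC precision assumed in Fig. 2(d)
LEVELS = 2 ** BITS

def map_weight(w):
    """Map a target weight w in [0, 1] onto the nearest of 2**BITS
    transmission levels (a simplified, linear stand-in for the real
    v-T mapping of the MRR)."""
    return np.round(w * (LEVELS - 1)) / (LEVELS - 1)

w_target = 0.3333
t_realized = map_weight(w_target)
weighting_error = abs(t_realized - w_target)
# The quantization error is bounded by half an ADC step.
assert weighting_error <= 0.5 / (LEVELS - 1)
```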
Figure 3: TDU based on a single dispersion medium and a programmed multi-wavelength source, which is generated by the optical comb and wave shaper, with N groups of wavelengths and N wavelengths in each group, with a wavelength distance of ∆λ and the wavelength spacing between adjacent groups marked as WSBG = ∆λ · M.
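A hedged sketch of the wavelength bookkeeping in the caption, assuming WSBG is measured from the last line of one group to the first line of the next (the center wavelength, spacing, and group counts are all illustrative):

```python
def wavelength_grid(lambda0, d_lambda, n_groups, n_per_group, M):
    """Center wavelengths of the programmed multi-wavelength source:
    groups of lines spaced d_lambda apart, with WSBG = d_lambda * M
    between the last line of one group and the first of the next."""
    wsbg = d_lambda * M
    grid = []
    for g in range(n_groups):
        start = lambda0 + g * (wsbg + (n_per_group - 1) * d_lambda)
        grid.append([start + k * d_lambda for k in range(n_per_group)])
    return grid

grid = wavelength_grid(lambda0=1550.0, d_lambda=0.4, n_groups=2,
                       n_per_group=3, M=3)  # all values illustrative
```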
Figure 4: (a) Architecture of the convolutional neural network in TensorFlow (TCNN) with 3 convolution layers, and the PCNN with the same architecture as the TCNN. (b) Comparison of the feature map extracted by the TCNN and the reshaped feature vector extracted by the PCNN.
Figure 5: (a) Confusion matrix of the TCNN for the 200-sample test. (b) Confusion matrix of the PCNN at 10G baud rate. (c) Confusion matrix of the TCNN with the weighting bank error separated from the PCNN.
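For readers reconstructing the accuracies from Fig. 5, test accuracy is the trace of the confusion matrix divided by its sum; a toy 3-class matrix stands in here for the 10-class MNIST matrices of the figure:

```python
import numpy as np

# Toy 3-class confusion matrix (rows: true label, columns: prediction);
# the values are illustrative, not the paper's results.
cm = np.array([[50, 2, 1],
               [3, 45, 2],
               [0, 4, 43]])

# Test accuracy = correctly classified samples / all samples.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.92
```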
Figure 6: (a) T−V curve of the MRR with different Fineness from 100 to 250. (b) Comparison of the weighting precision of the MRR with different Fineness. (c) Comparison of the weighting precision of the MRR at different levels of ADC precision. (d) The PCNN point, which is equal to the convergence point of the TCNN with error, may lie closer to the optimal point than that of the TCNN, leading to higher test accuracy. (e) Comparison of test accuracy of the MRR with different Fineness at distinct ADC precision levels when w_{i,j} is mapped into T from 0 to 1, whereas in (f) w_{i,j} is mapped into T in the quasi-linear region.
Figure 7: (a) to (e): distribution statistics of the Error at baud rates of 5, 10, 15, 20, and 25G, respectively; (f) Gaussian fit curves of the probability density function (PDF) of the Error at different baud rates.

Figure 8: Performance of the PCNN at different baud rates; the standard deviation is adopted here. Note that the Error of the TCNN and of the TCNN with Weighting Error (WE) are equal to , i.e., the std for the TCNN and Weighting Error cases is 0.
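The Gaussian-fit analysis of Figs. 7 and 8 can be sketched as follows; the synthetic error samples and their 0.05 spread are stand-ins for the measured PCNN-vs-TCNN differences, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic computing errors standing in for the measured differences
# between PCNN and TCNN outputs at one baud rate (spread is assumed).
errors = rng.normal(loc=0.0, scale=0.05, size=10_000)

# A Gaussian fit of the error PDF reduces to its mean and std.
mu, sigma = errors.mean(), errors.std()

# For the reference TCNN (no analog impairments) every error term
# vanishes, so its std is identically 0, as noted for Fig. 8.
```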
Figure 9: PCNN based on a C × K MWB (MRR weighting bank) mesh; the MWBs in each column serve the same input channel in different kernels, and the MWBs in each row combine into one kernel with C channels.

[Figure graphic: (a) times of memory access (TMA) of 2Dconv and GeMM versus the width of the (square) image; (b) comparison of memory cost (in elements) among the theoretical limit, the PCNN in our work, and the PCNN with electric buffer.]
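A back-of-envelope check of the 1.2 TOPS figure quoted in the abstract; the count of 30 parallel MACs per symbol is an assumption chosen to reproduce that number, with each MAC counted as two operations:

```python
# Each multiply-accumulate (MAC) is counted as two operations
# (one multiply, one add); 30 parallel MACs is an assumed mesh size.
baud_rate = 20e9          # 20G baud symbol rate from the abstract
macs_per_symbol = 30      # assumption chosen to reproduce 1.2 TOPS
ops_per_second = 2 * macs_per_symbol * baud_rate
print(ops_per_second / 1e12)  # 1.2 TOPS at 100% utilization
```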