A Winograd-based Integrated Photonics Accelerator for Convolutional Neural Networks

Armin Mehrabian, Member, IEEE, Mario Miscuglio, Member, OSA, Yousra Alkabani, Member, IEEE, Volker J. Sorger, Senior Member, IEEE, and Tarek El-Ghazawi, Fellow, IEEE

JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS
Abstract—Neural networks (NNs) have become the mainstream technology in the artificial intelligence (AI) renaissance of the past decade. Among the different types of neural networks, convolutional neural networks (CNNs) have been widely adopted, as they have achieved leading results in many fields such as computer vision and speech recognition. This success is due in part to the widespread availability of capable underlying hardware platforms. In parallel, hardware specialization can expose us to novel architectural solutions that outperform general-purpose computers for the task at hand. Although different applications demand different performance measures, they all share speed and energy efficiency as high priorities. Meanwhile, photonic processing has seen a resurgence due to its inherently high-speed and low-power nature. Here, we investigate the potential of using photonics in CNNs by proposing a CNN accelerator design based on the Winograd filtering algorithm. Our evaluation results show that while a photonic accelerator can compete with current state-of-the-art electronic platforms in terms of both speed and power, it has the potential to improve the energy efficiency by up to three orders of magnitude.
Index Terms—Convolutional Neural Networks, Photonics, Winograd
I. INTRODUCTION

The field of AI has undergone revolutionary progress over the past decade. Wide availability of data and cheaper-than-ever compute resources have contributed immensely to this growth. At the same time, advancements in the field of modern neural networks, known as deep learning (DL), have attracted the attention of academia and industry. This popularity is mainly owed to neural networks' success in a large gamut of AI applications, including but not limited to computer vision, speech recognition, and natural language processing. Among the different types of neural networks, CNNs are considered the most viable architecture for AI applications, and they are remarkably versatile across most AI tasks. However, all of this comes at the price of high computational cost.

In the meantime, the use of integrated photonics to implement neuron functionality in neural networks has shaped up to be an attainable near-future alternative technology for limiting power consumption and increasing operating speed [1][2][3]. Photonics benefits from the coherent nature of electromagnetic waves, which interfere while propagating through a photonic integrated circuit (PIC). Central to many AI techniques and algorithms is the implementation of hardware solutions that mimic the multiply-and-accumulate (MAC) function. The main advantage of photonic neural networks over electronics is that the energy consumed to perform a series of multiplications and additions does not scale with MAC speed. The training of an optical neural network necessitates active modulation of the optical signal in a hybrid optical-electronic configuration [4]. For this reason, these architectures face significant hurdles when compared to their electronic counterparts. To be competitive, they are expected to have low power consumption and high-speed electro-optic modulation [5][6][7]. Additionally, they need to be paired with electrical-to-optical (EO) and optical-to-electrical (OE) converters and I/O interfaces.
However, once trained, photonic neural networks do not rely on any additional energy for active switching. Therefore, architectures that perform tasks such as weighting can be realized completely passively, and the computations happen without consuming any dynamic power [8][9][10]. In this panorama, all-optical neural networks (AONNs) represent a promising future. Current all-optical implementations in free space [11] and in integrated photonics [12][13][14] can outperform their electronic counterparts, promising great energy efficiency and speed enhancements for learning tasks.

In this manuscript, we explore the potential of using high-speed, low-power photonics in a CNN accelerator by exploiting coherent all-optical matrix multiplication in wavelength division multiplexing (WDM), using microring resonator (MRR) weight banks. Our architecture is inspired by [15][16], where the Winograd filtering algorithm is adopted to perform convolution, speeding up the execution time and reducing the computational complexity. We investigate the performance of our proposed architecture in terms of speed and power. Since our proposed architecture is analog at its core, we also investigate the robustness of neural networks executed on our proposed design in terms of tolerance against noise.

We summarize the main contributions of this work as:
• the first proposed photonic CNN architecture based on the Winograd filtering algorithm;
• an analytical framework to evaluate the speed performance of our proposed accelerator;
• an in-house simulator, based on a modified Google Tensorflow tool, to simulate the performance of our proposed photonic accelerator with power and noise awareness;
• a modified training process to enhance robustness to inevitable hardware noise sources during the inference stage.
II. CONVOLUTIONAL NEURAL NETWORKS (CNNS)

A CNN is a neural network comprised of one or more convolutional layers. CNNs are best known for their performance on image data; however, their application extends to many other data types with local features. At a very high level, each convolutional layer uses a collection of feature detectors, known as filters, that scan the input data for the presence or absence of a particular set of features. Hence, in a CNN layer, inputs and outputs are referred to as feature maps (fmaps). By cascading multiple of these convolutional layers, a hierarchy of feature detectors is formed. In this hierarchy, feature detectors closer to the input detect primitive features; as we move towards the final layers, the features detected become more abstract. Conventionally, each filter in a CNN is three-dimensional, with the first two dimensions being the height and width of the filter and the last dimension, known as the channel dimension, spanning the input channels. The use of convolutional filters to scan input data had been practiced well before the rise of deep learning and CNNs. However, in traditional signal processing such filters are hand-engineered by experts, which can be costly, only suited for specific purposes, and vulnerable to designer bias. In a modern CNN, these filters are learned through the training process. Figure 1 shows the overall architecture of a CNN layer.

Fig. 1: A single layer of a CNN. Each of the N filters (left) scans the input feature maps (middle) for features. This results in output feature maps with N channels, equal to the total number of filters.

III. PHOTONIC REALIZATION OF CNNS

In data communication and computation, photonics has the potential to offer practical solutions to overcome some of the limitations currently facing electronic systems. In a neuromorphic system, processing elements (PEs) are arranged in a distributed fashion with an ideally large number of incoming (fan-in) and outgoing (fan-out) connections. Inspired by biological neural systems, some of these connections are required to connect neurons across the farthest parts of the brain. In addition, neuromorphic PEs are in large part special-purpose processors, in contrast to general-purpose processors.

Neuromorphic processing can benefit from photonics in three major ways. First, photonics can significantly reduce the energy consumed in interconnects among PEs by avoiding the energy dissipated in charging and discharging electrical wires. Second, current neuromorphic algorithms, and neural networks in particular, including CNNs, heavily rely on the multiply-and-accumulate (MAC) operation, which can be realized with very low energy budgets in photonics. Third, photonics can increase communication and computation bandwidth by exploiting WDM. The adoption of WDM allows for a higher density of computation and communication between PEs by packing more channels and parallel computations into a neuromorphic processor.
A. Photonic Convolution Kernels and MAC Operation
One major advantage of a photonic MAC operation is that it can be performed with almost zero energy [17]. However, if the signal is converted from optical to electrical, the conversion and successive electronic manipulations impose additional energy loss. To build a photonic convolutional filter, we use the microring resonator (MRR) network proposed in [18]. Figure 2 depicts a single MRR neuron. In this scheme, input WDM signals are weighted through tunable MRRs. The weighted inputs are then incoherently summed by a photodetector, which amounts to a MAC operation. Thus, by the use of N wavelengths, it is possible to establish up to N² independent connections. The maximum N achievable with current technologies is estimated to be around 108 channels, resulting in a total of about 10k connections [19]. It should be noted that having closely spaced wavelengths as multiple laser sources, while tuning the rings to match both resonance and FSR, is a very challenging task. However, on the source side, a set of phase-locked, equally spaced laser frequency lines can be generated using tunable optical resonators in a chip-based frequency comb generator [20]. Moreover, on the MRR side, our system can leverage dense WDM (DWDM). This is achievable thanks to the strong optical confinement of silicon waveguides, using tunable MRRs with a large free spectral range (FSR) and high quality factors Q, which together allow a large number of channels [21]. With sufficiently narrow channel spacing, the resonance bandwidth can be broadened while maintaining a low cross-talk level [22].

Most modern neural networks have one or more fully connected layers, which create on the order of N² synaptic connections. On the other hand, 10k connections are barely sufficient to implement even miniature fully connected neural networks on simple benchmark datasets such as MNIST, with its 784 neurons in the input layer alone.
In contrast, CNNs benefit from sparse connections between local input regions and filters. A common CNN architecture usually has filters of small spatial size that connect receptive fields to the filters. From a functional point of view, smaller filters are favored over larger filters, as they are capable of detecting finer local patterns. Larger and global patterns are detected in the layers closer to the output of the CNN; these features are more abstract and are built on top of the previous low-level features.

Fig. 2: A broadcast-and-weight neuron. Inputs X_i modulate lasers of different wavelengths. The modulated beams are then bundled through WDM.

Fig. 3: Microring resonator (MRR) operation for performing point-wise multiplication.

We use the scheme of Figure 2 to perform two heuristic Winograd transformations and one element-wise matrix multiplication (EWMM) on each wavelength. Figure 3 shows the details of an MRR weighting function operating on a single wavelength λ₀. The MRR acts as a tunable analog filter centred at λ₀, in which the voltage applied to the EOM module lets only a portion of the light travel through the waveguide. The modulation can be triggered by an analog electric field generated by a memristor. In this work we use a memristor device that can store the weights with 6 bits of resolution [23]. The transmission spectrum (T) of the ring has a natural resonant wavelength λ₀. When WDM light passes through the coupled waveguide, the component with wavelength λ₀ is coupled into the ring. By raising the bias voltage from V₀ to V₁, the resonance shifts to λ₁ due to the change in the effective refractive index of the ring. The difference between V₀ and V₁ controls the difference between λ₀ and λ₁, i.e., the transmission change ∆T_i. The variation of the transmission at λ₀ represents, in our scheme, the point-wise multiplication.
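The weight-and-sum behaviour described above can be captured in a small numerical sketch. This is a toy model of ours under assumed ideal conditions (lossless waveguides, perfectly incoherent detection); the function name and values are illustrative, not from the paper.

```python
# Toy model of a broadcast-and-weight MAC (illustrative, idealized).
# Each input x_i rides on its own wavelength; an MRR sets a transmission
# coefficient t_i in [0, 1]; the photodetector incoherently sums the
# weighted optical powers, realizing a multiply-and-accumulate.

def mrr_mac(inputs, transmissions):
    """Photodetector summation of MRR-weighted WDM channels."""
    if len(inputs) != len(transmissions):
        raise ValueError("one transmission coefficient per wavelength channel")
    if not all(0.0 <= t <= 1.0 for t in transmissions):
        raise ValueError("passive MRR transmission must lie in [0, 1]")
    return sum(x * t for x, t in zip(inputs, transmissions))

# Example: a flattened 3x3 receptive field carried on 9 WDM channels.
x = [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.7, 0.6, 0.8]
t = [0.5, 0.25, 1.0, 0.0, 0.75, 0.5, 0.25, 1.0, 0.5]
y = mrr_mac(x, t)
```

Note that a passive transmission can only realize non-negative weights; signed weights require, e.g., balanced photodetection, which this sketch omits.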
The most commonly used MRR modulator has a silicon-based p-i-n junction side-coupled to a waveguide, as described in [24], or a p-n junction, as reported in [25]. Current silicon-based MRR modulators [26][27][28], as well as foundry-level implementations, exhibit speeds up to 50 GHz, with a driving voltage of usually a few volts and an efficiency (VπL) of a few tenths of V·cm. Experimental results that corroborate our estimation are reported in [29], where silicon-based electro-optic MRRs exhibit high-speed modulation over a wide working spectrum with very low insertion loss. This is by no means a limiting factor in the inference stage, considering that the network has been trained and the weights are set; the latency of the network is therefore given by the time-of-flight of the photons.

Besides the uncertainty due to fabrication imperfections, which can be compensated, the main source of noise that affects an MRR modulator is electrical noise and, in this case, eventual non-idealities in setting the analog voltages with the memristive device, which could vary over time. Moreover, at high data rates, the intra-channel cross-talk becomes relevant, and power penalties need to be considered [30][31]. Regarding the operating dynamic power, the maximum allowed optical power flowing in each physical channel of the photonic accelerator is bounded above by the optical power that would produce nonlinearities in the silicon waveguides, and below by the minimum power that the photodetector can distinguish from noise (SNR = 1). This sets the upper and lower operating range of the photodiode, which we refer to as the dynamic range of the photodiode. Foundry-level [32] integrated germanium photodiodes can reach up to 40 GHz, with a responsivity of a fraction of an A/W and a noise-equivalent power (NEP) of around 1 pW/√Hz when operating in reverse bias. Research-level photodetectors working in the 100s of GHz range have also been demonstrated [33][34][35].
However, the dynamic range of the photodiode needs to be accurately set to avoid saturation and to account for the bit resolution [12]; for this scheme, the dynamic range is estimated according to the bit resolution. The speed of the optical part of the accelerator, without considering the I/O interface, is, according to [36], given by the total number of MRRs and their pitch. Photodetection and phase cross-talk are expected to be the main sources of error in the proposed scheme.

Another issue in using MRRs is attributed to variations in device fabrication, which can result in a spectral shift of the resonance frequency. The resonance frequency of MRRs can be tuned in multiple ways. Due to the high optical sensitivity of materials such as silicon to temperature, thermal tuning is the most widely adopted tuning technique for MRRs. This can be achieved by placing micro-heaters on top of each MRR [37][38].
B. Memristors as Analogue Weight Storage
Neuromorphic systems inspired by the human brain rely upon two major principles, namely massively distributed processing and the proximity of local memory to the processing elements. These memory units demand some level of programmability (plasticity), with their programming speed requirements being in the MHz regime. At this time, almost all state-of-the-art neural networks perform the training and inference phases separately. This means that once the weights are trained and set, one does not need to change them during inference. In addition, weights in our proposed system are represented by the analog bias voltages of MRRs. A potential weight storage for our system should therefore be analog and non-volatile, with a long retention time.

Having said that, memristive memory devices have attracted the attention of researchers due to their interesting characteristics, including but not limited to non-volatility, long state retention time, and ultra-low power consumption [39][40][23]. Over the past few years the bit resolution of such memristive memory devices has risen monotonically [41][42][43][44]. Recently, the authors in [23] proposed, fabricated, and evaluated an analog multibit memristive memory with a bit resolution of up to 6 bits. Each memristive device is micrometer-scale in area and can retain its resistance state for hours. In AlexNet, the 3rd convolutional layer has the largest number of convolutional filter weights. Even assuming that overhead circuitry increases the footprint considerably, the memristive memory required for the largest layer of AlexNet can be realized in well under a square centimeter.

IV. FAST ALGORITHMS FOR THE CONVOLUTION OPERATION
As the name suggests, the convolution operation accounts for the bulk of all operations in a CNN during both the training and inference stages. However, the training and inference stages each demand a different type of performance. During training, the emphasis is on throughput rather than latency. This is mainly because the model under training needs to observe a large "ensemble" of data, the batch, as fast as possible; time is therefore amortized over many inputs. On the other hand, during the inference stage, applications are mostly latency sensitive. For instance, a self-driving car application only needs a few input image scenes to be processed per second, but each at a very low latency. Having said that, a neuromorphic processor designed for inference is expected to satisfy stringent timing requirements.

An important parameter that is shown to have a significant impact on the latency of CNNs is the size of their filters. It is generally known, from a functional point of view, that CNNs with smaller filters are preferred over CNNs with large filters [45][46][47]. Table I shows the breakdown of filter sizes for some state-of-the-art CNN architectures. This preference is mainly due to the fact that small filters are better at finding local features without sacrificing resolution. More abstract and more global features can be detected in higher layers of a CNN, built on the previous layers' local features.

TABLE I: Kernel size breakdown in state-of-the-art CNNs. It can be seen that large filters comprise only a minute fraction of the total filters.

CNN          | 1×1 (%) | 3×3 (%) | Small 1D filters (%) | 5×5 (%)
GoogLeNet    | 64.9    | 17.5    | 1.7                  | 15.9
Inception V3 | 43.2    | 17.9    | 35.7                 | 3.2
Inception V4 | 40.9    | 16.1    | 43                   | 0
MobileNet    | 93.3    | 6.7     | 0                    | 0
ResNet50     | 68.5    | 29.6    | 1.9                  | 0
VGG16        | 0       | 100     | 0                    | 0

As we discussed in Section III, a physical implementation of photonic MRRs favors small filter sizes due to the limited number of available wavelength bands.
This synergy between the functional and photonic realizations of CNNs is the primary motivation behind this work.

At the time of writing this paper, there are three major ways to speed up the convolution operation. The first is the General Matrix Multiplication (GEMM) approach, in which the convolution is converted to a matrix multiplication using a Toeplitz matrix. The downside of this method is that the Toeplitz conversion expands the input by a factor of r × r, where r is the size of the filter. The second method uses the Fast Fourier Transform (FFT) to perform tiled convolution operations. From the Fourier theorem we know that cyclic convolution can be performed by transforming the input and filters into the Fourier domain; an element-wise multiplication (also known as Hadamard multiplication) then results in an equivalent of convolution, but in the Fourier domain, and an inverse FFT transforms the calculated convolution back into the original domain. FFT-based convolution had been the method of choice for the convolution operation [48][49][50] until the recent past; lately, it has been shown that FFT-based convolution is better suited to larger filter sizes [15]. The third method uses the Winograd filtering algorithm, which we explain in detail in the following section.

A. Winograd Algorithm
In a 2D convolution, a single output component of the convolution is calculated by

y_{n,k,p,q} = \sum_{c=1}^{C} \sum_{x=1}^{r} \sum_{y=1}^{r} x_{n,c,p+x,q+y} \cdot w_{k,c,x,y}   (1)

The operation in equation 1 is repeated for all output convolution components. In a brute-force convolution, the total number of multiplications required to perform a full convolution is equal to

(m × r)²   (2)

where m is the size of the output feature map channel and r is the size of the filter. At the time of writing this paper, Winograd convolution is the most efficient convolution algorithm being used for CNNs [15]. Winograd convolution is based on minimal filtering principles. The algorithm
states that in order to calculate m outputs with a finite impulse response (FIR) filter of size r, denoted by F(m, r), the number of required multiplications is

n = m + r − 1   (3)

While equation 3 is derived for the 1D convolution operation, one can nest it with itself to acquire a 2D convolution. Therefore, the number of multiplications needed for the same 2D convolution is given by

(m + r − 1)²   (4)

From equations 2 and 4 we can infer that Winograd results in a reduction in complexity by a factor of

(mr)² / (m + r − 1)²   (5)

It should be noted that in our proposed photonic accelerator, multiplication operations are carried out by MRRs. Any reduction in the total number of multiplication operations, and thus MRRs, saves us not only in the footprint of the design, but also in the design complexity. Now, in order to understand how minimal Winograd filtering works, let us first consider the case of 1D convolution.

Fig. 4: High-level flow diagram of the Winograd filtering technique for the convolution operation. Unlike conventional convolution, which computes a single output at a time, the Winograd algorithm computes a tile of outputs, here of size m × m. In order to generate an output tile, Winograd fetches an input tile of size n × n. Both the input tile and the filter are transformed into the Winograd domain, where they are multiplied in an element-wise fashion. Finally, the output of the element-wise multiplication is transformed back into the original domain, and channels are collapsed into a single value per tile element.
Let W be the matrix of weights and D the data matrix. Winograd computes the F(2, 3) convolution as follows:

[ d_0  d_1  d_2 ] [ w_0 ]   [ m_1 + m_2 + m_3 ]
[ d_1  d_2  d_3 ] [ w_1 ] = [ m_2 − m_3 − m_4 ]   (6)
                  [ w_2 ]

where the values m_i are intermediate values found by

m_1 = (d_0 − d_2) · w_0
m_2 = (d_1 + d_2) · (w_0 + w_1 + w_2)/2
m_3 = (d_2 − d_1) · (w_0 − w_1 + w_2)/2
m_4 = (d_1 − d_3) · w_2

The above equations show that with only 4 multiplications between inputs and weights, Winograd can compute an F(2, 3) convolution. All terms involving only the w_i can be pre-computed after the training stage. In other words, during inference, while the data values d_i change with the inputs, the w_i values remain the same. The 1D Winograd convolution can be expressed in closed matrix form as

Y = A^T [ (G · w) ⊙ (B^T · d) ]   (7)

where A^T, B^T, and G are three heuristic transforms described by equations 8, 9, and 10, w is the weight vector, and d is the input vector.

A^T = [ 1   1   1   0 ]
      [ 0   1  −1  −1 ]   (8)

B^T = [ 1   0  −1   0 ]
      [ 0   1   1   0 ]
      [ 0  −1   1   0 ]
      [ 0   1   0  −1 ]   (9)

G = [  1     0    0  ]
    [ 1/2  1/2  1/2 ]
    [ 1/2 −1/2  1/2 ]
    [  0     0    1  ]   (10)

One conclusion from equation 6 is that to compute a single output tile of the 1D convolution, only a window of (m + r − 1) input values is needed. In a modern CNN the bulk of convolution operations are 2D convolutions. Equation 7 can be easily extrapolated to 2D convolution by nesting two 1D Winograd convolutions. The resulting 2D Winograd convolution is

Y = A^T [ (G · w · G^T) ⊙ (B^T · d · B) ] A   (11)
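As a sanity check, the F(2, 3) transforms and the 2D nesting of equation 11 can be verified numerically against direct FIR filtering. The sketch below is ours (not from the paper) and assumes the CNN-style correlation convention (no filter flip), as in [15].

```python
# Numerical check of the F(2,3) Winograd transforms (Eqs. 8-10) and of the
# 2D nesting Y = A^T[(G w G^T) (.) (B^T d B)]A (Eq. 11).

A_T = [[1, 1, 1, 0],
       [0, 1, -1, -1]]
B_T = [[1, 0, -1, 0],
       [0, 1, 1, 0],
       [0, -1, 1, 0],
       [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def winograd_1d(d, w):
    """F(2,3): two FIR outputs from four inputs, using only 4 multiplications."""
    u = matmul(G, [[wi] for wi in w])            # G w      (4x1)
    v = matmul(B_T, [[di] for di in d])          # B^T d    (4x1)
    m = [[u[i][0] * v[i][0]] for i in range(4)]  # element-wise (the 4 mults)
    return [row[0] for row in matmul(A_T, m)]    # A^T m -> 2 outputs

def winograd_2d(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile (Eq. 11)."""
    U = matmul(matmul(G, g), transpose(G))       # G g G^T
    V = matmul(matmul(B_T, d), transpose(B_T))   # B^T d B
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(A_T, M), transpose(A_T))  # A^T M A

def direct_1d(d, w):
    return [sum(d[i + j] * w[j] for j in range(3)) for i in range(2)]

def direct_2d(d, g):
    return [[sum(d[i + x][j + y] * g[x][y] for x in range(3) for y in range(3))
             for j in range(2)] for i in range(2)]
```

Running both paths on random tiles reproduces the direct convolution to floating-point precision, while the Winograd path spends only 4 (1D) or 16 (2D) multiplications on the data-dependent product.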
From [15], for the case of F(4×4, 3×3), the matrices A^T, B^T, and G have the forms

A^T = [ 1  1   1  1   1  0 ]
      [ 0  1  −1  2  −2  0 ]
      [ 0  1   1  4   4  0 ]
      [ 0  1  −1  8  −8  1 ]   (12)

B^T = [ 4   0  −5   0  1  0 ]
      [ 0  −4  −4   1  1  0 ]
      [ 0   4  −4  −1  1  0 ]
      [ 0  −2  −1   2  1  0 ]
      [ 0   2  −1  −2  1  0 ]
      [ 0   4   0  −5  0  1 ]   (13)

G = [  1/4     0     0  ]
    [ −1/6   −1/6  −1/6 ]
    [ −1/6    1/6  −1/6 ]
    [ 1/24   1/12   1/6 ]
    [ 1/24  −1/12   1/6 ]
    [   0      0     1  ]   (14)

The number of additions and multiplications required for the Winograd transforms themselves, not the element-wise multiplications, increases quadratically with the tile size. Thus, Winograd is expected to perform most efficiently for smaller filter sizes, and thus smaller input tiles.
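The complexity reduction of equation 5 is easy to tabulate. A small sketch of ours, with the tile and filter sizes as free parameters:

```python
# Multiplication counts per output tile: direct convolution (Eq. 2) vs.
# Winograd minimal filtering (Eq. 4), and the reduction factor (Eq. 5).

def direct_mults(m, r):
    return (m * r) ** 2          # brute-force 2D: (m * r)^2

def winograd_mults(m, r):
    return (m + r - 1) ** 2      # F(m x m, r x r): (m + r - 1)^2

def reduction(m, r):
    return direct_mults(m, r) / winograd_mults(m, r)

# F(2x2, 3x3): 36/16 = 2.25x fewer multiplications (and thus MRRs);
# F(4x4, 3x3): 144/36 = 4x fewer.
```

In our setting the reduction factor translates directly into fewer MRRs, since each element-wise multiplication is carried out by one ring.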
Algorithm 1: Winograd for 2D convolution

for row = 0; row < H; row += m do
    for column = 0; column < H; column += m do
        for channel = 0; channel < c; channel += 1 do
            for filter = 0; filter < N; filter += 1 do
                load input tile;
                transform input tile;
                load transformed filter;
                perform EWMM;
            end
        end
    end
end
Output Winograd convolution result
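A concrete single-channel, single-filter rendering of Algorithm 1's loop structure (our sketch; the per-tile routine here is a plain correlation standing in for the transform/EWMM/inverse-transform pipeline):

```python
# Tiled convolution in the style of Algorithm 1, specialized to one channel
# and one filter for brevity. conv_tile stands in for the Winograd
# transform -> EWMM -> inverse-transform pipeline of the text.

def conv_tile(tile, g, m, r):
    # Produces one m x m output tile from an n x n input tile (n = m + r - 1).
    return [[sum(tile[i + x][j + y] * g[x][y] for x in range(r) for y in range(r))
             for j in range(m)] for i in range(m)]

def tiled_conv(d, g, m=2, r=3):
    """Assumes the output size (H - r + 1) is a multiple of m."""
    H = len(d)
    n = m + r - 1                              # input tile size (Eq. 3)
    size = H - r + 1
    out = [[0.0] * size for _ in range(size)]
    for row in range(0, H - n + 1, m):         # slide the tile with stride m
        for col in range(0, H - n + 1, m):
            tile = [r_[col:col + n] for r_ in d[row:row + n]]
            y = conv_tile(tile, g, m, r)
            for i in range(m):                 # stitch the m x m output tile
                for j in range(m):
                    out[row + i][col + j] = y[i][j]
    return out

def direct_conv(d, g, r=3):
    size = len(d) - r + 1
    return [[sum(d[i + x][j + y] * g[x][y] for x in range(r) for y in range(r))
             for j in range(size)] for i in range(size)]
```

The channel and filter loops of Algorithm 1 simply wrap these two spatial loops and are independent across iterations, which is what the parallel hardware paths of Section V exploit.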
V. ARCHITECTURE DESIGN
In this paper we propose a photonic CNN accelerator based on the Winograd algorithm, realized using the photonic neuron introduced in [19]. Figure 4 depicts the architecture of a single Winograd PE. Our proposed accelerator processes a single layer of a CNN at a time. This is mainly because, in a CNN, different tiles of the output feature maps are computed sequentially and thus arrive at different times, while in order to initiate the processing of the next layer, all the inputs from the previous layer need to be available and synchronized. Our approach of processing one layer at a time enforces this synchronization. Furthermore, implementing multiple layers of a CNN would result in large area overheads. At the input of our accelerator, an input tile of shape n × n × c, along with filters of size r × r × c, are transformed into the Winograd domain. The input and filter transforms are then multiplied element by element. The output of this multiplication is transformed back using an inverse Winograd transform. The signals at this stage are digitized using an array of ADCs and placed onto the output line buffers to be stored back in the off-chip memory.

Figure 5 presents an overview of our proposed architecture, which runs on two clock domains. The first is a high-speed 5 GHz clock domain, which accommodates the low-latency components of the accelerator, including the photonic components. In Section VI-A we explain how we arrive at the 5 GHz high-speed clock frequency. The rest of the accelerator, including the input feature map buffers, filter buffers, filter Winograd DSP module, and filter-path DAC, runs on a slower clock domain, because there is no time sensitivity in the filter path or in data transfers from/to the off-chip memory. At the heart of our accelerator is an element-wise matrix multiplication unit, which we implement in photonics using photonic neurons. We store the input feature maps and filters in an off-chip memory.
Both the input feature maps and the filters need to go through the Winograd transformation, which consists of the matrix multiplications described in equations 13 and 14. It should be noted that while the input feature maps change from tile to tile, the filters are fixed for each layer. For that reason, we implement the input feature map transformations in photonics and the filter Winograd transformations in an electronic DSP. This way, we do not pay the overhead associated with a photonic implementation, including the conversion of the electronic filters to photonics. The transformed filters and input feature map tiles are then converted into analog signals to modulate the laser beams. However, as the filters are fixed over the processing time of the layer, the analog filter signals need to be maintained for that time. Thus, we propose to use a non-volatile analog memristive memory bank, which maintains these voltages in their analog form with long retention times.

In Winograd convolution, a tile of size n × n is processed in each iteration. In order to process an entire feature map, the transformed filter tile needs to move across the input feature map. In this paradigm, two successive input tiles share (r − 1) × n elements. This introduces data reuse opportunities, allowing us to avoid multiple queries of the same data block; our goal is to exploit this opportunity at the front end of our accelerator. Our design is inspired by the work in [16], where the authors utilize line buffers to avoid redundant queries to the off-chip memory. Figure 6 shows an example line buffer design to load and hold an input feature map tile. Input tiles are fetched from the off-chip memory and loaded into the line buffer. Buffered tiles are then passed to the digital-to-analog converters (DACs) over parallel channels. In parallel to the input data stream, the transformed filter weights are converted into analog signals to program the analog memristive memory.
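The tile-overlap reuse described above can be mimicked with a simple software line buffer (our sketch; the sizes are illustrative). After the initial fill, only m = n − (r − 1) fresh columns are fetched per step:

```python
# Toy line buffer: successive n x n tiles along a row overlap in (r - 1)
# columns, so after the first tile only m = n - r + 1 new columns need to
# be fetched from (off-chip) memory per step.

def stream_tiles(rows, n, m):
    """rows: n input rows at full width. Yields (tile, cols_fetched) pairs."""
    width = len(rows[0])
    buf = [row[:n] for row in rows]           # initial fill: n columns
    yield [r[:] for r in buf], n
    col = n
    while col + m <= width:
        for i, row in enumerate(rows):        # shift left by m, append m new
            buf[i] = buf[i][m:] + row[col:col + m]
        col += m
        yield [r[:] for r in buf], m          # only m fresh columns this step
```

For n = 4 and m = 2, for example, each step after the first reuses half of the buffered tile, halving the memory traffic for input data.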
We then use the voltages generated from the stored analog signals to modulate the laser sources for the filters.

Fig. 5: High-level architecture of our proposed photonic accelerator. Input feature maps and filters are initially stored in an off-chip memory. Input tiles of size n × n are loaded into the input line buffer one at a time. Kernel weights do not change once the CNN is trained; thus, we perform the filters' Winograd transform in electronics, and the cost is amortized over many input tiles. The Winograd transform for input feature map tiles is computed in photonics. The photonic element-wise matrix multiplication (EWMM) unit performs the core element-wise Winograd multiplications. Outputs are digitized and placed onto the output line buffer. Finally, processed layer outputs are stored in the off-chip memory.

Each output signal generated by a DAC is used to modulate a laser beam of a particular wavelength λ_i. It is worth noting that for each set of filters modulated by the laser source, the input line goes through multiple iterations corresponding to different input tiles. Once both the input tile laser beams and the filter laser beams are ready, the EWMM unit multiplies each element of the Winograd input feature map tile with its corresponding Winograd filter value. The output of the EWMM unit must then be transformed back from the Winograd domain into the original domain by the inverse Winograd transform. The result contains output feature map tiles for multiple channels c. Lastly, the output feature map values are digitized and stored back into the off-chip memory.

A key principle in HPC is to minimize the I/O and other communication latencies relative to the computation time, to avoid under-utilization of compute units. From Algorithm 1 we can see that the two innermost loops iterate over the different channels of the input feature map tile and the different filters. Moreover, the operations within these two loops are independent from one another. This provides parallelization opportunities at the cost of additional hardware.
In other words, the amount of parallelization and speedup we can achieve scales linearly with the number of pipeline replications in our system. This linear scaling plateaus as soon as the computation bandwidth approaches the data transfer bandwidth. Our envisioned design uses an arbitrary number of 100 parallel paths; our evaluation results in the next section justify this selection.

VI. EVALUATION
In this section we evaluate the performance of our accelerator for the 3×3 filters of the VGG16 network against recent FPGA [51][16][52][53][54] and GPU [16] implementations.

Fig. 6: An example line buffer design to load and hold an input tile of size n × n.

A. Speed
Here we develop a model to estimate the execution time of our accelerator. First we model the time required to convolve one input tile with one filter, which we call T_tile-filter. Following that, we generalize the model to the case where we parallelize the process based on the available resources. For one input feature map tile and one filter, both the input branch and the filter branch of Figure 5 are fully pipelined. Therefore, the execution time of a layer is determined by the longer of the two paths, the filter path and the input path. For each iteration of the filter path, the input data path goes through multiple iterations, because a single filter operates on many input data tiles. That said, the input data path sets the upper bound on the delay. Our execution time model is comprised of two major components, namely the input/output time (T_IO) and the computation time (T_Comp). We define T_IO as

T_IO = max(T_load, T_offload)   (15)

where T_load is the time it takes to transfer data from the off-chip memory to the input of the laser sources. We can implement a total of P input DACs to speed up the data transfer; we used an array of 16 DACs in this work. Similarly, T_offload is the time to store the computed outputs from the inverse Winograd transform back to the off-chip memory. The goal is to match the rate of the ADCs at the output with the input DACs to avoid any speed mismatch, and thus congestion, in the pipeline. At the time of this review, both on-chip DACs and ADCs are capable of operating at sampling rates of more than 18 GS/s at a bit resolution of at least 8 bits [55][24].
Furthermore, with recent advances in memory technology, current memories are able to transfer data at I/O bandwidths of more than 128 GB/s [56]. This high memory bandwidth allows us to buffer data and filters from off-chip memory at high transfer rates and feed them to our input line buffers. However, for our line buffers we need memories with high clock frequencies and short access times; current reported memory technologies have access times in the picosecond range.

At the photonic core of our accelerator, T_compute is

T_compute = T_laser + T_Winograd + T_EWMM + T_iWinograd    (16)

where T_Winograd is the time to compute the Winograd transform, T_EWMM is the time to perform the element-wise matrix multiplication, and T_iWinograd is the time to compute the inverse Winograd transform. Once the laser is set up, input signals only incur a time delay equivalent to the flight time of the light before they are fed into the ADC. The clock frequency of the pipeline is therefore bounded by

f_clock ≤ 1 / max(T_load, T_offload, T_compute)    (17)

Based on equation 17, we picked a clock frequency of 5 GHz. From equation 17, T_tile-filter is simply

T_tile-filter = 1/(5 GHz) = 200 ps    (18)

For an F(4 × 4, 3 × 3) Winograd transform, each T_tile-filter returns an output block of size 9, equivalent to 9 convolution operations. At a clock frequency of 5 GHz, our proposed accelerator thus performs at 9 × 5 G = 45 GOP/s. Figure 7 shows the average convolution speed comparison of our proposed accelerator against the state-of-the-art FPGA and GPU implementations.
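The timing model of Eqs. (16)–(18) reduces to a few lines of arithmetic. The sketch below reproduces the numbers above; the 5 GHz clock and 9 operations per output block are taken from the text, and `pipeline_clock` is a hypothetical helper expressing the bound of Eq. (17).

```python
# The pipeline clock is bounded by the slowest stage (Eq. 17):
def pipeline_clock(t_load: float, t_offload: float, t_compute: float) -> float:
    """Maximum clock frequency in Hz given the per-stage latencies."""
    return 1.0 / max(t_load, t_offload, t_compute)

f_clock = 5e9                  # Hz, the clock chosen in the text
t_tile_filter = 1.0 / f_clock  # Eq. (18): 200 ps per tile-filter operation
ops_per_block = 9              # convolutions recovered per output block
throughput = ops_per_block * f_clock  # 9 x 5 GHz = 45 GOP/s
print(t_tile_filter, throughput)
```

Replicating the pipeline P times, as discussed earlier, multiplies this throughput by P until the memory transfer bandwidth becomes the binding constraint in Eq. (17).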
B. Power
In order to estimate the dynamic power consumption of our proposed system, we built an in-house estimator by augmenting the standard Google TensorFlow tool. While primarily used for the training and inference stages of neural networks, at its core TensorFlow is a symbolic mathematical graph processing platform. TensorFlow enables users to express arbitrary computations as a dataflow graph, which is extremely useful in the context of neural networks. However, out-of-the-box TensorFlow is completely agnostic to the physical realization of the neural networks being implemented. Thus, we augmented the TensorFlow high-level API with mathematical models of electro-optical components. Figure 8 depicts the native TensorFlow toolkit hierarchy against our augmented version. In our estimator, each primitive mathematical operation is given two physical models: a power model and a noise model. While the noise model can impact the functionality, and thus the accuracy, of a neural network, the power model only measures consumed power. Table II shows some of these mathematical operations mapped to their physical realizations.

Fig. 7: Comparison of convolution operation speed for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the speed of the photonic core in the absence of electronics.

Fig. 8: High-level TensorFlow toolkit hierarchy vs. augmented TensorFlow.

TABLE II: Mapping of primitive math operations to their hardware realization.

Math Operation        | Photonic Representation
Addition              | Photodiode
Multiplication        | MRR
Connection            | Waveguide
Non-linear Activation | Electro-absorption Modulator

Photodiode power can be simply derived from its responsivity equation:

R = I_ph / P_in = η qλ/(hc)  [A/W]    (19)

where I_ph is the photocurrent, P_in is the optical input signal power, q is the electron charge, λ is the wavelength, h is Planck's constant, c is the speed of light, and η is the quantum efficiency. It should be noted that the signal is encoded in the optical input power P_in. In this work we use AIM Photonics PDK values [57]. Similarly, we modeled both the thermal noise and the shot noise of the photodiode:

I_sn = sqrt(2 q (I_ph + I_D) Δf)    (20)

I_tn = sqrt(4 k_B T Δf / R_SH)    (21)

where I_D is the dark current of the photodetector, Δf is the noise measurement bandwidth, k_B is the Boltzmann constant, T is the temperature in Kelvin, and R_SH is the total equivalent shunt resistance of the photodiode. For MRRs we accounted for per-unit-length propagation loss.

Fig. 9: Comparison of convolution operation power for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the power consumption of the photonic core in the absence of electronics.

Figure 9 depicts the power comparison results. Finally, we plotted the energy efficiency figure of merit, defined as the ratio of speed to power, in Figure 10.

VII. TRAINING, INFERENCE, AND NOISE
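The photodiode models of Eqs. (19)–(21), which supply the noise sources examined in this section, can be sketched numerically as follows. The physical constants are standard; the unity quantum efficiency and the 1550 nm wavelength in the usage line are illustrative placeholders, not AIM PDK values.

```python
import math

Q = 1.602176634e-19   # electron charge [C]
H = 6.62607015e-34    # Planck constant [J*s]
C = 2.99792458e8      # speed of light [m/s]
KB = 1.380649e-23     # Boltzmann constant [J/K]

def responsivity(wavelength: float, eta: float = 1.0) -> float:
    """Photodiode responsivity R = eta * q * lambda / (h c), in A/W (Eq. 19)."""
    return eta * Q * wavelength / (H * C)

def shot_noise(i_ph: float, i_dark: float, bw: float) -> float:
    """RMS shot-noise current sqrt(2 q (I_ph + I_D) df) (Eq. 20)."""
    return math.sqrt(2 * Q * (i_ph + i_dark) * bw)

def thermal_noise(temp: float, bw: float, r_sh: float) -> float:
    """RMS thermal-noise current sqrt(4 kB T df / R_SH) (Eq. 21)."""
    return math.sqrt(4 * KB * temp * bw / r_sh)

# Ideal (eta = 1) responsivity at 1550 nm is roughly 1.25 A/W:
print(responsivity(1550e-9))
```

In the augmented simulator, noise currents of this form are injected at each photodiode output; the experiments below study how such injected noise affects inference accuracy.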
We initially trained our neural network offline on a conventional digital computer. Later, during the inference stage, we loaded the trained weights into our in-house simulator, which is equipped with noise source models. Our hypothesis was that inference on a noisy neural network would result in some loss in accuracy. This is mostly because the network used during training is noiseless, with 32-bit floating-point resolution, while during inference the weights suddenly face a noisy network. In other words, the network performing inference experiences unseen noise behavior that results in accuracy loss. We tested our hypothesis by sweeping a range of inference noise levels and observing the effect on accuracy. For that purpose, we identified two major noise sources: the neuron output noise and the weight noise. The neuron output noise represents the noise introduced at the output of each neuron by the photodiode and the nonlinear activation function. The first plot in Figure 12 shows how accuracy is impacted by noise during inference when the network was trained free of any noise source.

Fig. 10: Comparison of energy efficiency for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the energy efficiency of the photonic core in the absence of electronics. The results show that using photonics as an accelerator has the potential of improving energy efficiency by more than three orders of magnitude.

Fig. 11: Visualization of an augmented convolutional layer of the VGG16 network using power and noise models.

Our next hypothesis was that, if we allow a certain amount of noise during training, the model becomes more robust to noise during the inference stage. To that end, we trained the network with the output noise source on. We only added the output noise, and left the weight noise off, because weights must be calculated with maximum precision during training. In fact, we observed that even a minute amount of noise added to the weights during training could destroy the accuracy of the network, reducing it to its baseline level of about 10%. We swept the training noise in logarithmic steps. Figure 12 depicts the effect of adding output noise at fractions of the maximum signal swing at the output of the neurons. In our experiments we observed that adding about 0.1% noise during training may result in a slight accuracy loss for low levels of inference noise; however, the model becomes more robust to higher levels of inference noise. This shows that adding noise during training can fine-tune the network for a physically noisy realization, as shown in Figure 12 (middle). Lastly, we noticed that increasing the training noise beyond the initial 0.1% resulted in very significant inference accuracy losses, as shown in Figure 12 (bottom).

VIII. CONCLUSION
In this paper we presented a photonic CNN accelerator based on the Winograd filtering convolution algorithm. Winograd filtering reduces the total number of multiplications, and thus hardware, required to perform the convolution operation. We evaluated the speed of our accelerator by developing an analytical framework. Our results show that a photonic accelerator can compete with state-of-the-art Winograd-based FPGA and GPU implementations. Such a photonic accelerator has the potential of improving energy efficiency by up to three orders of magnitude. However, the overall speed is bound by the limitations of I/O and of the conversions in the DAC and ADC. To evaluate power performance, we augmented the native hardware-agnostic Google TensorFlow tool with power models of our hardware components. As with speed performance, electronic I/O and converters are the major consumers of power in our proposed design. However, the photonic core, without the electronic interface, can operate while consuming up to two orders of magnitude less power. In addition, we modeled noise in our TensorFlow-based simulator to investigate the effect of hardware noise sources, such as photodiode noise and MRR noise, on the functionality (accuracy) of our CNN. We found that training the CNN with a small noise component, 0.1% of the signal swing in our experiment, can make the CNN more robust to inference-time noise introduced by noisy photodiodes and MRRs.

REFERENCES

[1] P. R. Prucnal and B. J. Shastri,
Neuromorphic Photonics. CRC Press, May 2017.
[2] I. Chakraborty, G. Saha, A. Sengupta, and K. Roy, “Toward fast neural computing using all-photonic phase change spiking neurons,” Scientific Reports, vol. 8, no. 1, p. 12980, 2018.
[3] J. Feldmann, N. Youngblood, C. Wright, H. Bhaskaran, and W. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature, vol. 569, no. 7755, p. 208, 2019.
[4] J. K. George, A. Mehrabian, R. Amin, J. Meng, T. F. de Lima, A. N. Tait, B. J. Shastri, T. El-Ghazawi, P. R. Prucnal, and V. J. Sorger, “Neuromorphic photonics with electro-absorption modulators,”
Optics Express
Nature
Fig. 12: The evaluation of the effect of physical photodiode and MRR noise on inference accuracy. This effect can be partially compensated through the introduction of an artificial noise source during the training stage. In the absence of a training noise source (top), inference accuracy deteriorates quickly as we sweep the photodiode and MRR noise. By introducing an equivalent of 0.1% Gaussian noise, the network becomes more robust to inference noise. Further increases in the training noise level (bottom) hinder the network from proper training.
[6] A. Liu, R. Jones, L. Liao, D. Samara-Rubio, D. Rubin, O. Cohen, R. Nicolaescu, and M. Paniccia, “A high-speed silicon optical modulator based on a metal-oxide-semiconductor capacitor,”
Nature
APL Photonics, vol. 3, no. 12, p. 126104, Dec. 2018. [Online]. Available: https://aip.scitation.org/doi/10.1063/1.5052635
[8] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, and M. Soljacic, “On-chip optical convolutional neural networks,” arXiv preprint arXiv:1808.03303, 2018.
[9] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi, “PCNNA: A photonic convolutional neural network accelerator,” in . IEEE, 2018, pp. 169–173.
[10] W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, “HolyLight: A nanophotonic accelerator for deep learning in data centers,” in
Nature Photonics
Optica
Optical Materials Express
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
[16] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural networks on FPGAs,” in . IEEE, 2017, pp. 101–108.
[17] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund et al., “Deep learning with coherent nanophotonic circuits,”
Nature Photonics, vol. 11, no. 7, p. 441, 2017.
[18] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: an integrated network for scalable photonic spike processing,” Journal of Lightwave Technology, vol. 32, no. 21, pp. 3427–3439, 2014.
[19] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, “Microring weight banks,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 22, no. 6, pp. 312–325, 2016.
[20] H. Hu, F. Da Ros, M. Pu, F. Ye, K. Ingerslev, E. P. da Silva, M. Nooruzzaman, Y. Amma, Y. Sasaki, T. Mizuno et al., “Single-source chip-based frequency comb enabling extreme parallel data transmission,” Nature Photonics, vol. 12, no. 8, p. 469, 2018.
[21] Q. Xu, D. Fattal, and R. G. Beausoleil, “Silicon microring resonators with 1.5-µm radius,” Optics Express, vol. 16, no. 6, pp. 4309–4315, 2008.
[22] L. Chen, K. Preston, S. Manipatruni, and M. Lipson, “Integrated GHz silicon photonic interconnect with micrometer-scale modulators and detectors,” Optics Express, vol. 17, no. 17, pp. 15248–15256, 2009.
[23] S. Stathopoulos, A. Khiat, M. Trapatseli, S. Cortese, A. Serb, I. Valov, and T. Prodromakis, “Multibit memory operation of metal-oxide bi-layer memristors,” Scientific Reports, vol. 7, no. 1, p. 17532, 2017.
[24] B. Xu, Y. Zhou, and Y. Chiu, “A 23-mW 24-GS/s 6-bit voltage-time hybrid time-interleaved ADC in 28-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 1091–1100, 2017.
[25] M. Ziebell, D. Marris-Morini, G. Rasigade, P. Crozat, J.-M. Fédéli, P. Grosse, E. Cassan, and L. Vivien, “Ten Gbit/s ring resonator silicon modulator based on interdigitated PN junctions,”
Optics Express
Optics Express
Optics Express
Journal of Lightwave Technology, vol. 34, no. 12, pp. 2886–2896, Jun. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7272050/
[31] M. Bahadori, S. Rumley, H. Jayatilleka, K. Murray, N. A. F. Jaeger, L. Chrostowski, S. Shekhar, and K. Bergman, “Crosstalk Penalty in Microring-Based Silicon Photonic Interconnect Systems,”
Journal of Lightwave Technology
Optics Express
ACS Photonics, vol. 5, no. 8, pp. 3291–3297, Aug. 2018. [Online]. Available: http://pubs.acs.org/doi/10.1021/acsphotonics.8b00525
[35] P. Ma, Y. Salamin, B. Baeuerle, A. Emboras, Y. Fedoryshyn, W. Heni, B. Cheng, A. Josten, and J. Leuthold, “100 GHz Photoconductive Plasmonic Germanium Detector,” in Conference on Lasers and Electro-Optics (2018), paper SM2I.3
Scientific Reports, vol. 7, no. 1, p. 7430, Aug. 2017. [Online]. Available: https://doi.org/10.1038/s41598-017-07754-z
[37] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, “Silicon microring resonators,” Laser & Photonics Reviews, vol. 6, no. 1, pp. 47–73, 2012.
[38] X. Xue, Y. Xuan, C. Wang, P.-H. Wang, Y. Liu, B. Niu, D. E. Leaird, M. Qi, and A. M. Weiner, “Thermal tuning of Kerr frequency combs in silicon nitride microring resonators,” Optics Express, vol. 24, no. 1, pp. 687–698, 2016.
[39] C. Yoshida, K. Tsunoda, H. Noshiro, and Y. Sugiyama, “High speed resistive switching in Pt/TiO2/TiN film for nonvolatile memory application,” Applied Physics Letters, vol. 91, no. 22, p. 223510, 2007.
[40] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, “‘Memristive’ switches enable ‘stateful’ logic operations via material implication,” Nature, vol. 464, no. 7290, p. 873, 2010.
[41] I. Baek, M. Lee, S. Seo, M. Lee, D. Seo, D.-S. Suh, J. Park, S. Park, H. Kim, I. Yoo et al., “Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses,” in IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004. IEEE, 2004, pp. 587–590.
[42] E. J. Merced-Grafals, N. Dávila, N. Ge, R. S. Williams, and J. P. Strachan, “Repeatable, accurate, and high speed multi-level programming of memristor 1T1R arrays for power efficient analog computing applications,” Nanotechnology, vol. 27, no. 36, p. 365202, 2016.
[43] A. Prakash, D. Deleruyelle, J. Song, M. Bocquet, and H. Hwang, “Resistance controllability and variability improvement in a TaOx-based resistive memory for multilevel storage application,”
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[48] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
[49] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance evaluation,” arXiv preprint arXiv:1412.7580, 2014.
[50] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
[51] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[52] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–35.
[53] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[54] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[55] A. Nazemi, K. Hu, B. Catli, D. Cui, U. Singh, T. He, Z. Huang, B. Zhang, A. Momtaz, and J. Cao, “3.4 A 36 Gb/s PAM4 transmitter using an 8b 18 GS/s DAC in 28 nm CMOS,” in . IEEE, 2015, pp. 1–3.
[56] D. U. Lee, K. W. Kim, K. W. Kim, K. S. Lee, S. J. Byeon, J. H. Kim, J. H. Cho, J. Lee, and J. H. Chun, “A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits,”
IEEE Journal of Solid-State Circuits
Armin Mehrabian is a PhD candidate in Electrical Engineering at the George Washington University. His research interests include High Performance Computing (HPC), Neuromorphic Computing, and Artificial Intelligence (AI) from both software and hardware points of view. He received his B.S. degree in Electrical Engineering at Shahid Beheshti University of Tehran, Iran, focusing on analog electronics, and his M.S. degree at the George Washington University (GWU), DC, USA in computer engineering, focusing on VLSI and digital electronics design. His current research involves leveraging nanophotonics for HPC architecture designs.
Mario Miscuglio is a post-doctoral researcher in the Electrical Engineering department at the George Washington University. He received his Masters in Electrical and Computer Engineering from the Polytechnic of Turin, working as a researcher at Harvard/MIT. He completed his PhD in Optoelectronics from the University of Genova (IIT), working as a research fellow at the Molecular Foundry at LBNL. His interests extend across science and engineering, including photonic neuromorphic computing, nano-optics, and plasmonics.
Yousra Alkabani received the BSc and MSc degrees in computer and systems engineering from Ain Shams University, Cairo, Egypt, in 2003 and 2006, respectively. She received the PhD degree in computer science from Rice University, Houston, TX, USA, in December 2010. She has been an assistant professor of computer and systems engineering at Ain Shams University since May 2011 and a visiting assistant professor of computer science and engineering at the American University in Cairo since 2013. Her research interests include hardware security, low power design, and embedded systems. She is a member of the IEEE.
Volker J. Sorger is an Associate Professor in the Department of Electrical and Computer Engineering and the leader of the Orthogonal Physics Enabled Nanophotonics (OPEN) lab at the George Washington University. He received his PhD from the University of California, Berkeley. His research areas include opto-electronic devices, plasmonics and nanophotonics, and photonic analog information processing and neuromorphic computing. Among his breakthroughs are the first demonstration of a semiconductor plasmon laser, attojoule-efficient modulators, and PMAC/s-fast photonic neural networks and near real-time analog signal processors. Dr. Sorger has received multiple awards, among them the Presidential Early Career Award for Scientists and Engineers (PECASE), the AFOSR Young Investigator Award (YIP), the Hegarty Innovation Prize, and the National Academy of Sciences award of the year. Dr. Sorger is the editor-in-chief of Nanophotonics, the OSA Division Chair for Photonics and Opto-electronics, and serves on the board of meetings at OSA & SPIE and on the scholarship committee. He is a senior member of IEEE, OSA & SPIE.