A Winograd-based Integrated Photonics Accelerator for Convolutional Neural Networks

Armin Mehrabian, Member, IEEE, Mario Miscuglio, Member, OSA, Yousra Alkabani, Member, IEEE, Volker J. Sorger, Senior Member, IEEE, and Tarek El-Ghazawi, Fellow, IEEE

JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS
Abstract—Neural networks (NNs) have become the mainstream technology in the artificial intelligence (AI) renaissance of the past decade. Among the different types of neural networks, convolutional neural networks (CNNs) have been widely adopted, as they have achieved leading results in many fields such as computer vision and speech recognition. This success is due in part to the widespread availability of capable underlying hardware platforms. In parallel, hardware specialization can expose us to novel architectural solutions that outperform general-purpose computers for the task at hand. Although different applications demand different performance measures, they all share speed and energy efficiency as high priorities. Meanwhile, photonic processing has seen a resurgence due to its inherently high-speed and low-power nature. Here, we investigate the potential of using photonics in CNNs by proposing a CNN accelerator design based on the Winograd filtering algorithm. Our evaluation results show that while a photonic accelerator can compete with current state-of-the-art electronic platforms in terms of both speed and power, it has the potential to improve the energy efficiency by up to three orders of magnitude.
Index Terms—Convolutional Neural Networks, Photonics, Winograd
I. INTRODUCTION

The field of AI has undergone revolutionary progress over the past decade. Wide availability of data and cheaper-than-ever compute resources have contributed immensely to this growth. At the same time, advancements in the field of modern neural networks, known as deep learning (DL), have attracted the attention of academia and industry. This popularity is mainly owed to neural networks' success in a large gamut of AI applications, including but not limited to computer vision, speech recognition, and natural language processing. Among the different types of neural networks, CNNs are considered the most viable architecture for AI applications, and they are remarkably versatile across most AI tasks. However, all of this comes at the price of high computational cost.

In the meantime, the use of integrated photonics to implement neuron functionality in neural networks has shaped up to be an attainable near-future alternative technology for limiting power consumption and increasing operating speed [1][2][3]. Photonics benefits from the coherent nature of electromagnetic waves, which interfere while propagating through a photonic integrated circuit (PIC). Central to many AI techniques and algorithms is the implementation of hardware solutions that mimic the multiply-and-accumulate (MAC) function. The main advantage of photonic neural networks over electronics is that the energy consumed to perform a series of multiplications and additions does not scale with MAC speed. The training of an optical neural network necessitates active modulation of the optical signal in a hybrid optical-electronic configuration [4]. For this reason, these architectures face significant hurdles when compared to their electronic counterparts. To be competitive, they are expected to have low power consumption and high-speed electro-optic modulation [5][6][7]. Additionally, they need to be paired with electrical-to-optical (EO) and optical-to-electrical (OE) converters and I/O interfaces.
However, once trained, photonic neural networks do not rely on any additional energy for active switching. Therefore, architectures that perform tasks such as weighting can be realized completely passively, and the computations happen without consuming any dynamic power [8][9][10]. In this panorama, all-optical neural networks (AONNs) represent a promising future. Current all-optical implementations in free space [11] and in integrated photonics [12][13][14] can outperform their electronic counterparts, promising great energy efficiency and speed enhancements for learning tasks.

In this manuscript, we explore the potential of using high-speed, low-power photonics in a CNN accelerator by exploiting coherent all-optical matrix multiplication in wavelength division multiplexing (WDM), using microring resonator (MRR) weight banks. Our architecture is inspired by [15][16], where the Winograd filtering algorithm is adopted to perform convolution, speeding up the execution time and reducing the computational complexity. We investigate the performance of our proposed architecture in terms of speed and power. Since our proposed architecture is analog at its core, we also investigate the robustness of neural networks executed on our proposed design in terms of tolerance against noise.

We summarize the main contributions of this work as:
• the first proposed photonic CNN architecture based on the Winograd filtering algorithm;
• an analytical framework to evaluate the speed performance of our proposed accelerator;
• an in-house simulator, based on a modified Google Tensorflow tool, to simulate the performance of our proposed photonic accelerator with power and noise awareness;
• a modified training process to enhance robustness to inevitable hardware noise sources during the inference stage.
II. CONVOLUTIONAL NEURAL NETWORKS (CNNS)

A CNN is a neural network comprised of one or more convolutional layers. CNNs are best known for their performance on image data; however, their application extends to many other data types with local features. At a very high level, each convolutional layer uses a collection of feature detectors, known as filters, that scan the input data for the presence or absence of a particular set of features. Hence, in a CNN layer, inputs and outputs are referred to as feature maps (fmaps). By cascading multiple of these convolutional layers, a hierarchy of feature detectors is formed. In this hierarchy, feature detectors closer to the input detect primitive features; as we move towards the final layers, the features detected become more abstract. Conventionally, each filter in a CNN is three-dimensional, with the first two dimensions being the height and width of the filter and the last dimension, known as the channel dimension, spanning the input channels. The use of convolutional filters to scan input data had been practiced well before the rise of deep learning and CNNs. However, in traditional signal processing such filters are hand-engineered by experts, which can be costly, only suited for specific purposes, and vulnerable to designer bias. In a modern CNN, these filters are learned through the training process. Figure 1 shows the overall architecture of a CNN layer.

Fig. 1: A single layer of a CNN. Each of the N filters (left) scans the input feature maps (middle) for features. This results in output feature maps with N channels, equal to the total number of filters.

III. PHOTONIC REALIZATION OF CNNS

In data communication and computation, photonics has the potential to offer practical solutions to overcome some of the limitations currently facing electronic systems. In a neuromorphic system, processing elements (PEs) are arranged in a distributed fashion with an ideally large number of incoming (fan-in) and outgoing (fan-out) connections. Inspired by biological neural systems, some of these connections are required to connect neurons across the farthest parts of the brain. In addition, neuromorphic PEs are in large part special-purpose processors, in contrast to general-purpose processors.

Neuromorphic processing can benefit from photonics in three major ways. First, photonics can significantly reduce the energy consumed in interconnects among PEs by avoiding the energy dissipated in charging and discharging electrical wires. Second, current neuromorphic algorithms, and neural networks in particular, including CNNs, heavily rely on the multiply-and-accumulate (MAC) operation, which can be realized with very low energy budgets in photonics. Third, photonics can increase communication and computation bandwidth by exploiting WDM. The adoption of WDM allows for a higher density of computation and communication between PEs by packing more channels and parallel computations into a neuromorphic processor.
A. Photonic Convolution Kernels and MAC Operation
One major advantage of a photonic MAC operation is that it can be performed with almost zero energy [17]. However, if the signal is converted from optical to electrical, the conversion and successive electronic manipulations impose additional energy loss. To build a photonic convolutional filter, we use the microring resonator (MRR) network proposed in [18]. Figure 2 depicts a single MRR neuron. In this scheme, input WDM signals are weighted through tunable MRRs. The weighted inputs are then incoherently summed by a photodetector, which amounts to a MAC operation. Thus, by the use of N wavelengths, it is possible to establish up to N² independent connections. The maximum N achievable with current technologies is estimated to be around 108 channels, resulting in a total of about 10k connections [19]. It should be noted that having closely spaced wavelengths as multiple laser sources, while tuning the rings to match both resonance and FSR, is a very challenging task. However, on the source side, a set of phase-locked, equally spaced laser frequency lines can be generated using tunable optical resonators in a chip-based frequency comb generator [20]. Moreover, on the MRR side, our system can leverage dense WDM (DWDM). This is achievable thanks to the strong optical confinement of silicon waveguides, using tunable MRRs with a large free spectral range (FSR) and high quality factors Q, which together allow a large number of channels [21]. With sufficiently narrow channel spacing, the resonance bandwidth can be broadened while maintaining a low cross-talk level [22].

Most modern neural networks have one or more fully connected layers, which create on the order of N² synaptic connections. On the other hand, 10k connections are barely sufficient to implement even miniature fully connected neural networks on simple benchmark datasets such as MNIST, with its 784 neurons in the input layer alone.
In contrast, CNNs benefit from sparse connections between local input regions and filters. A common CNN architecture usually has filters of small spatial size that connect receptive fields to the filters. From a functional point of view, smaller filters are favored over larger filters, as they are capable of detecting finer local patterns. Larger and global patterns are detected in the layers closer to the output of the CNN; these features are more abstract and are built on top of the previous low-level features.

Fig. 2: A broadcast-and-weight neuron. Inputs X_i modulate lasers of different wavelengths. The modulated beams are then bundled through WDM.

Fig. 3: Microring resonator (MRR) operation for performing point-wise multiplication.

We use the scheme of Figure 2 to perform two heuristic Winograd transformations and one element-wise matrix multiplication (EWMM) on each wavelength. Figure 3 shows the details of an MRR weighting function operating on a single wavelength λ₀. The MRR acts as a tunable analog filter centred at λ₀, in which the voltage applied to the EOM module lets only a portion of the light travel through the waveguide. The modulation can be triggered by an analog electric field generated by a memristor. In this work we use a memristor device that can store the weights with 6 bits of resolution [23]. The transmission spectrum (T) of the ring has a natural resonant wavelength λ₀. When WDM light passes through the coupled waveguide, the component with wavelength λ₀ is coupled into the ring. By raising the bias voltage from V₀ to V₁, the resonance shifts to λ₁ due to the change in the effective refractive index of the ring. The difference between V₀ and V₁ controls the difference between λ₀ and λ₁, i.e., the transmission change ∆T_i. The variation of the transmission at λ₀ represents, in our scheme, the point-wise multiplication.
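The weight-and-sum behaviour described above can be captured in a small numerical sketch. This is a toy model of ours under assumed ideal conditions (lossless waveguides, perfectly incoherent detection); the function name and values are illustrative, not from the paper.

```python
# Toy model of a broadcast-and-weight MAC (illustrative, idealized).
# Each input x_i rides on its own wavelength; an MRR sets a transmission
# coefficient t_i in [0, 1]; the photodetector incoherently sums the
# weighted optical powers, realizing a multiply-and-accumulate.

def mrr_mac(inputs, transmissions):
    """Photodetector summation of MRR-weighted WDM channels."""
    if len(inputs) != len(transmissions):
        raise ValueError("one transmission coefficient per wavelength channel")
    if not all(0.0 <= t <= 1.0 for t in transmissions):
        raise ValueError("passive MRR transmission must lie in [0, 1]")
    return sum(x * t for x, t in zip(inputs, transmissions))

# Example: a flattened 3x3 receptive field carried on 9 WDM channels.
x = [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.7, 0.6, 0.8]
t = [0.5, 0.25, 1.0, 0.0, 0.75, 0.5, 0.25, 1.0, 0.5]
y = mrr_mac(x, t)
```

Note that a passive transmission can only realize non-negative weights; signed weights require, e.g., balanced photodetection, which this sketch omits.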
The most commonly used MRR modulator has a silicon-based p-i-n junction side-coupled to a waveguide, as described in [24], or a p-n junction, as reported in [25]. Current silicon-based MRR modulators [26][27][28], as well as foundry-level implementations, exhibit speeds up to 50 GHz, with a driving voltage of usually a few volts and an efficiency (VπL) of a few tenths of V·cm. Experimental results that corroborate our estimation are reported in [29], where silicon-based electro-optic MRRs exhibit high-speed modulation over a wide working spectrum with very low insertion loss. This is by no means a limiting factor in the inference stage, considering that the network has been trained and the weights are set; the latency of the network is therefore given by the time-of-flight of the photons.

Besides the uncertainty due to fabrication imperfections, which can be compensated, the main source of noise that affects an MRR modulator is electrical noise and, in this case, eventual non-idealities in setting the analog voltages with the memristive device, which could vary over time. Moreover, at high data rates, the intra-channel cross-talk becomes relevant, and power penalties need to be considered [30][31]. Regarding the operating dynamic power, the maximum allowed optical power flowing in each physical channel of the photonic accelerator is bounded above by the optical power that would produce nonlinearities in the silicon waveguides, and below by the minimum power that the photodetector can distinguish from noise (SNR = 1). This sets the upper and lower operating range of the photodiode, which we refer to as the dynamic range of the photodiode. Foundry-level [32] integrated germanium photodiodes can reach up to 40 GHz, with a responsivity of a fraction of an A/W and a noise-equivalent power (NEP) of around 1 pW/√Hz when operating in reverse bias. Research-level photodetectors working in the 100s of GHz range have also been demonstrated [33][34][35].
However, the dynamic range of the photodiode needs to be accurately set to avoid saturation and to account for the bit resolution [12]; for this scheme, the dynamic range is estimated according to the bit resolution. The speed of the optical part of the accelerator, without considering the I/O interface, is, according to [36], given by the total number of MRRs and their pitch. Photodetection and phase cross-talk are expected to be the main sources of error in the proposed scheme.

Another issue in using MRRs is attributed to variations in device fabrication, which can result in a spectral shift of the resonance frequency. The resonance frequency of MRRs can be tuned in multiple ways. Due to the high optical sensitivity of materials such as silicon to temperature, thermal tuning is the most widely adopted tuning technique for MRRs. This can be achieved by placing micro-heaters on top of each MRR [37][38].
B. Memristors as Analogue Weight Storage
Neuromorphic systems inspired by the human brain rely upon two major principles, namely massively distributed processing and the proximity of local memory to the processing elements. These memory units demand some level of programmability (plasticity), with their programming speed requirements being in the MHz regime. At this time, almost all state-of-the-art neural networks perform the training and inference phases separately. This means that once the weights are trained and set, one does not need to change them during inference. In addition, weights in our proposed system are represented by the analog bias voltages of MRRs. A potential weight storage for our system should therefore be analog and non-volatile, with a long retention time.

Having said that, memristive memory devices have attracted the attention of researchers due to their interesting characteristics, including but not limited to non-volatility, long state retention time, and ultra-low power consumption [39][40][23]. Over the past few years the bit resolution of such memristive memory devices has risen monotonically [41][42][43][44]. Recently, the authors in [23] proposed, fabricated, and evaluated an analog multibit memristive memory with a bit resolution of up to 6 bits. Each memristive device is micrometer-scale in area and can retain its resistance state for hours. In AlexNet, the 3rd convolutional layer has the largest number of convolutional filter weights. Even assuming that overhead circuitry increases the footprint considerably, the memristive memory required for the largest layer of AlexNet can be realized in well under a square centimeter.

IV. FAST ALGORITHMS FOR THE CONVOLUTION OPERATION
As the name suggests, the convolution operation accounts for the bulk of all operations in a CNN during both the training and inference stages. However, the training and inference stages each demand a different type of performance. During training, the emphasis is on throughput rather than latency. This is mainly because the model under training needs to observe a large "ensemble" of data, the batch, as fast as possible; time is therefore amortized over many inputs. On the other hand, during the inference stage, applications are mostly latency sensitive. For instance, a self-driving car application only needs a few input image scenes to be processed per second, but each at a very low latency. Having said that, a neuromorphic processor designed for inference is expected to satisfy stringent timing requirements.

An important parameter that is shown to have a significant impact on the latency of CNNs is the size of their filters. It is generally known, from a functional point of view, that CNNs with smaller filters are preferred over CNNs with large filters [45][46][47]. Table I shows the breakdown of filter sizes for some state-of-the-art CNN architectures. This preference is mainly due to the fact that small filters are better at finding local features without sacrificing resolution. More abstract and more global features can be detected in higher layers of a CNN, built on the previous layers' local features.

TABLE I: Kernel size breakdown in state-of-the-art CNNs. It can be seen that large filters comprise only a minute fraction of the total filters.

CNN          | 1×1 (%) | 3×3 (%) | Small 1D filters (%) | 5×5 (%)
GoogLeNet    | 64.9    | 17.5    | 1.7                  | 15.9
Inception V3 | 43.2    | 17.9    | 35.7                 | 3.2
Inception V4 | 40.9    | 16.1    | 43                   | 0
MobileNet    | 93.3    | 6.7     | 0                    | 0
ResNet50     | 68.5    | 29.6    | 1.9                  | 0
VGG16        | 0       | 100     | 0                    | 0

As we discussed in Section III, a physical implementation of photonic MRRs favors small filter sizes due to the limited number of available wavelength bands.
This synergy between the functional and photonic realizations of CNNs is the primary motivation behind this work.

At the time of writing this paper, there are three major ways to speed up the convolution operation. The first is the General Matrix Multiplication (GEMM) approach, in which the convolution is converted to a matrix multiplication using a Toeplitz matrix. The downside of this method is that the Toeplitz conversion expands the input by a factor of r × r, where r is the size of the filter. The second method uses the Fast Fourier Transform (FFT) to perform tiled convolution operations. From the Fourier theorem we know that cyclic convolution can be performed by transforming the input and filters into the Fourier domain; an element-wise multiplication (also known as Hadamard multiplication) then results in an equivalent of convolution, but in the Fourier domain, and an inverse FFT transforms the calculated convolution back into the original domain. FFT-based convolution had been the method of choice for the convolution operation [48][49][50] until the recent past; lately, it has been shown that FFT-based convolution is better suited to larger filter sizes [15]. The third method uses the Winograd filtering algorithm, which we explain in detail in the following section.

A. Winograd Algorithm
In a 2D convolution, a single output component of the convolution is calculated by

y_{n,k,p,q} = \sum_{c=1}^{C} \sum_{x=1}^{r} \sum_{y=1}^{r} x_{n,c,p+x,q+y} \cdot w_{k,c,x,y}   (1)

The operation in equation 1 is repeated for all output convolution components. In a brute-force convolution, the total number of multiplications required to perform a full convolution is equal to

(m × r)²   (2)

where m is the size of the output feature map channel and r is the size of the filter. At the time of writing this paper, Winograd convolution is the most efficient convolution algorithm being used for CNNs [15]. Winograd convolution is based on minimal filtering principles. The algorithm
states that in order to calculate m outputs with a finite impulse response (FIR) filter of size r, denoted by F(m, r), the number of required multiplications is

n = m + r − 1   (3)

While equation 3 is derived for the 1D convolution operation, one can nest it with itself to acquire a 2D convolution. Therefore, the number of multiplications needed for the same 2D convolution is given by

(m + r − 1)²   (4)

From equations 2 and 4 we can infer that Winograd results in a reduction in complexity by a factor of

(mr)² / (m + r − 1)²   (5)

It should be noted that in our proposed photonic accelerator, multiplication operations are carried out by MRRs. Any reduction in the total number of multiplication operations, and thus MRRs, saves us not only in the footprint of the design, but also in the design complexity. Now, in order to understand how minimal Winograd filtering works, let us first consider the case of 1D convolution.

Fig. 4: High-level flow diagram of the Winograd filtering technique for the convolution operation. Unlike conventional convolution, which computes a single output at a time, the Winograd algorithm computes a tile of outputs, here of size m × m. In order to generate an output tile, Winograd fetches an input tile of size n × n. Both the input tile and the filter are transformed into the Winograd domain, where they are multiplied in an element-wise fashion. Finally, the output of the element-wise multiplication is transformed back into the original domain, and channels are collapsed into a single value per tile element.
Let W be the matrix of weights and D the data matrix. Winograd computes the F(2, 3) convolution as follows:

[ d_0  d_1  d_2 ] [ w_0 ]   [ m_1 + m_2 + m_3 ]
[ d_1  d_2  d_3 ] [ w_1 ] = [ m_2 − m_3 − m_4 ]   (6)
                  [ w_2 ]

where the values m_i are intermediate values found by

m_1 = (d_0 − d_2) · w_0
m_2 = (d_1 + d_2) · (w_0 + w_1 + w_2)/2
m_3 = (d_2 − d_1) · (w_0 − w_1 + w_2)/2
m_4 = (d_1 − d_3) · w_2

The above equations show that with only 4 multiplications between inputs and weights, Winograd can compute an F(2, 3) convolution. All terms involving only the w_i can be pre-computed after the training stage. In other words, during inference, while the data values d_i change with the inputs, the w_i values remain the same. The 1D Winograd convolution can be expressed in closed matrix form as

Y = A^T [ (G · w) ⊙ (B^T · d) ]   (7)

where A^T, B^T, and G are three heuristic transforms described by equations 8, 9, and 10, w is the weight vector, and d is the input vector.

A^T = [ 1   1   1   0 ]
      [ 0   1  −1  −1 ]   (8)

B^T = [ 1   0  −1   0 ]
      [ 0   1   1   0 ]
      [ 0  −1   1   0 ]
      [ 0   1   0  −1 ]   (9)

G = [  1     0    0  ]
    [ 1/2  1/2  1/2 ]
    [ 1/2 −1/2  1/2 ]
    [  0     0    1  ]   (10)

One conclusion from equation 6 is that to compute a single output tile of the 1D convolution, only a window of (m + r − 1) input values is needed. In a modern CNN the bulk of convolution operations are 2D convolutions. Equation 7 can be easily extrapolated to 2D convolution by nesting two 1D Winograd convolutions. The resulting 2D Winograd convolution is

Y = A^T [ (G · w · G^T) ⊙ (B^T · d · B) ] A   (11)
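As a sanity check, the F(2, 3) transforms and the 2D nesting of equation 11 can be verified numerically against direct FIR filtering. The sketch below is ours (not from the paper) and assumes the CNN-style correlation convention (no filter flip), as in [15].

```python
# Numerical check of the F(2,3) Winograd transforms (Eqs. 8-10) and of the
# 2D nesting Y = A^T[(G w G^T) (.) (B^T d B)]A (Eq. 11).

A_T = [[1, 1, 1, 0],
       [0, 1, -1, -1]]
B_T = [[1, 0, -1, 0],
       [0, 1, 1, 0],
       [0, -1, 1, 0],
       [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def winograd_1d(d, w):
    """F(2,3): two FIR outputs from four inputs, using only 4 multiplications."""
    u = matmul(G, [[wi] for wi in w])            # G w      (4x1)
    v = matmul(B_T, [[di] for di in d])          # B^T d    (4x1)
    m = [[u[i][0] * v[i][0]] for i in range(4)]  # element-wise (the 4 mults)
    return [row[0] for row in matmul(A_T, m)]    # A^T m -> 2 outputs

def winograd_2d(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile (Eq. 11)."""
    U = matmul(matmul(G, g), transpose(G))       # G g G^T
    V = matmul(matmul(B_T, d), transpose(B_T))   # B^T d B
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(A_T, M), transpose(A_T))  # A^T M A

def direct_1d(d, w):
    return [sum(d[i + j] * w[j] for j in range(3)) for i in range(2)]

def direct_2d(d, g):
    return [[sum(d[i + x][j + y] * g[x][y] for x in range(3) for y in range(3))
             for j in range(2)] for i in range(2)]
```

Running both paths on random tiles reproduces the direct convolution to floating-point precision, while the Winograd path spends only 4 (1D) or 16 (2D) multiplications on the data-dependent product.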
From [15], for the case of F(4×4, 3×3), the matrices A^T, B^T, and G have the forms

A^T = [ 1  1   1  1   1  0 ]
      [ 0  1  −1  2  −2  0 ]
      [ 0  1   1  4   4  0 ]
      [ 0  1  −1  8  −8  1 ]   (12)

B^T = [ 4   0  −5   0  1  0 ]
      [ 0  −4  −4   1  1  0 ]
      [ 0   4  −4  −1  1  0 ]
      [ 0  −2  −1   2  1  0 ]
      [ 0   2  −1  −2  1  0 ]
      [ 0   4   0  −5  0  1 ]   (13)

G = [  1/4     0     0  ]
    [ −1/6   −1/6  −1/6 ]
    [ −1/6    1/6  −1/6 ]
    [ 1/24   1/12   1/6 ]
    [ 1/24  −1/12   1/6 ]
    [   0      0     1  ]   (14)

The number of additions and multiplications required for the Winograd transforms themselves, not the element-wise multiplications, increases quadratically with the tile size. Thus, Winograd is expected to perform most efficiently for smaller filter sizes, and thus smaller input tiles.
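The complexity reduction of equation 5 is easy to tabulate. A small sketch of ours, with the tile and filter sizes as free parameters:

```python
# Multiplication counts per output tile: direct convolution (Eq. 2) vs.
# Winograd minimal filtering (Eq. 4), and the reduction factor (Eq. 5).

def direct_mults(m, r):
    return (m * r) ** 2          # brute-force 2D: (m * r)^2

def winograd_mults(m, r):
    return (m + r - 1) ** 2      # F(m x m, r x r): (m + r - 1)^2

def reduction(m, r):
    return direct_mults(m, r) / winograd_mults(m, r)

# F(2x2, 3x3): 36/16 = 2.25x fewer multiplications (and thus MRRs);
# F(4x4, 3x3): 144/36 = 4x fewer.
```

In our setting the reduction factor translates directly into fewer MRRs, since each element-wise multiplication is carried out by one ring.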
Algorithm 1: Winograd for 2D convolution

for row = 0; row < H; row += m do
    for column = 0; column < H; column += m do
        for channel = 0; channel < c; channel += 1 do
            for filter = 0; filter < N; filter += 1 do
                load input tile;
                transform input tile;
                load transformed filter;
                perform EWMM;
            end
        end
    end
end
Output Winograd convolution result
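A concrete single-channel, single-filter rendering of Algorithm 1's loop structure (our sketch; the per-tile routine here is a plain correlation standing in for the transform/EWMM/inverse-transform pipeline):

```python
# Tiled convolution in the style of Algorithm 1, specialized to one channel
# and one filter for brevity. conv_tile stands in for the Winograd
# transform -> EWMM -> inverse-transform pipeline of the text.

def conv_tile(tile, g, m, r):
    # Produces one m x m output tile from an n x n input tile (n = m + r - 1).
    return [[sum(tile[i + x][j + y] * g[x][y] for x in range(r) for y in range(r))
             for j in range(m)] for i in range(m)]

def tiled_conv(d, g, m=2, r=3):
    """Assumes the output size (H - r + 1) is a multiple of m."""
    H = len(d)
    n = m + r - 1                              # input tile size (Eq. 3)
    size = H - r + 1
    out = [[0.0] * size for _ in range(size)]
    for row in range(0, H - n + 1, m):         # slide the tile with stride m
        for col in range(0, H - n + 1, m):
            tile = [r_[col:col + n] for r_ in d[row:row + n]]
            y = conv_tile(tile, g, m, r)
            for i in range(m):                 # stitch the m x m output tile
                for j in range(m):
                    out[row + i][col + j] = y[i][j]
    return out

def direct_conv(d, g, r=3):
    size = len(d) - r + 1
    return [[sum(d[i + x][j + y] * g[x][y] for x in range(r) for y in range(r))
             for j in range(size)] for i in range(size)]
```

The channel and filter loops of Algorithm 1 simply wrap these two spatial loops and are independent across iterations, which is what the parallel hardware paths of Section V exploit.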
V. ARCHITECTURE DESIGN
In this paper we propose a photonic CNN accelerator based on the Winograd algorithm, realized using the photonic neuron introduced in [19]. Figure 4 depicts the architecture of a single Winograd PE. Our proposed accelerator processes a single layer of a CNN at a time. This is mainly because, in a CNN, different tiles of the output feature maps are computed sequentially and thus arrive at different times, while in order to initiate the processing of the next layer, all the inputs from the previous layer need to be available and synchronized. Our approach of processing one layer at a time enforces this synchronization. Furthermore, implementing multiple layers of a CNN would result in large area overheads. At the input of our accelerator, an input tile of shape n × n × c, along with filters of size r × r × c, are transformed into the Winograd domain. The input and filter transforms are then multiplied element by element. The output of this multiplication is transformed back using an inverse Winograd transform. The signals at this stage are digitized using an array of ADCs and placed onto the output line buffers to be stored back in the off-chip memory.

Figure 5 presents an overview of our proposed architecture, which runs on two clock domains. The first is a high-speed 5 GHz clock domain, which accommodates the low-latency components of the accelerator, including the photonic components. In Section VI-A we explain how we arrive at the 5 GHz high-speed clock frequency. The rest of the accelerator, including the input feature map buffers, filter buffers, filter Winograd DSP module, and filter-path DAC, runs on a slower clock domain, because there is no time sensitivity in the filter path or in data transfers from/to the off-chip memory. At the heart of our accelerator is an element-wise matrix multiplication unit, which we implement in photonics using photonic neurons. We store the input feature maps and filters in an off-chip memory.
Both the input feature maps and the filters need to go through the Winograd transformation, which consists of the matrix multiplications described in equations 13 and 14. It should be noted that while the input feature maps change from tile to tile, the filters are fixed for each layer. For that reason, we implement the input feature map transformations in photonics and the filter Winograd transformations in an electronic DSP. This way, we do not pay the overhead associated with a photonic implementation, including the conversion of the electronic filters to photonics. The transformed filters and input feature map tiles are then converted into analog signals to modulate the laser beams. However, as the filters are fixed over the processing time of the layer, the analog filter signals need to be maintained for that time. Thus, we propose to use a non-volatile analog memristive memory bank, which maintains these voltages in their analog form with long retention times.

In Winograd convolution, a tile of size n × n is processed in each iteration. In order to process an entire feature map, the transformed filter tile needs to move across the input feature map. In this paradigm, two successive input tiles share (r − 1) × n elements. This introduces data reuse opportunities, allowing us to avoid multiple queries of the same data block; our goal is to exploit this opportunity at the front end of our accelerator. Our design is inspired by the work in [16], where the authors utilize line buffers to avoid redundant queries to the off-chip memory. Figure 6 shows an example line buffer design to load and hold an input feature map tile. Input tiles are fetched from the off-chip memory and loaded into the line buffer. Buffered tiles are then passed to the digital-to-analog converters (DACs) over parallel channels. In parallel to the input data stream, the transformed filter weights are converted into analog signals to program the analog memristive memory.
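The tile-overlap reuse described above can be mimicked with a simple software line buffer (our sketch; the sizes are illustrative). After the initial fill, only m = n − (r − 1) fresh columns are fetched per step:

```python
# Toy line buffer: successive n x n tiles along a row overlap in (r - 1)
# columns, so after the first tile only m = n - r + 1 new columns need to
# be fetched from (off-chip) memory per step.

def stream_tiles(rows, n, m):
    """rows: n input rows at full width. Yields (tile, cols_fetched) pairs."""
    width = len(rows[0])
    buf = [row[:n] for row in rows]           # initial fill: n columns
    yield [r[:] for r in buf], n
    col = n
    while col + m <= width:
        for i, row in enumerate(rows):        # shift left by m, append m new
            buf[i] = buf[i][m:] + row[col:col + m]
        col += m
        yield [r[:] for r in buf], m          # only m fresh columns this step
```

For n = 4 and m = 2, for example, each step after the first reuses half of the buffered tile, halving the memory traffic for input data.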
We then use the voltages generated from the stored analog signals to modulate the laser sources for the filters.

Fig. 5: High-level architecture of our proposed photonic accelerator. Input feature maps and filters are initially stored in an off-chip memory. Input tiles of size n × n are loaded into the input line buffer one at a time. Kernel weights do not change once the CNN is trained; thus, we perform the filters' Winograd transform in electronics, and the cost is amortized over many input tiles. The Winograd transform for input feature map tiles is computed in photonics. The photonic element-wise matrix multiplication (EWMM) unit performs the core element-wise Winograd multiplications. Outputs are digitized and placed onto the output line buffer. Finally, processed layer outputs are stored in the off-chip memory.

Each output signal generated by a DAC is used to modulate a laser beam of a particular wavelength λ_i. It is worth noting that for each set of filters modulated by the laser source, the input line goes through multiple iterations corresponding to different input tiles. Once both the input tile laser beams and the filter laser beams are ready, the EWMM unit multiplies each element of the Winograd input feature map tile with its corresponding Winograd filter value. The output of the EWMM unit must then be transformed back from the Winograd domain into the original domain by the inverse Winograd transform. The result contains output feature map tiles for multiple channels c. Lastly, the output feature map values are digitized and stored back into the off-chip memory.

A key principle in HPC is to minimize the I/O and other communication latencies relative to the computation time, to avoid under-utilization of compute units. From Algorithm 1 we can see that the two innermost loops iterate over the different channels of the input feature map tile and the different filters. Moreover, the operations within these two loops are independent from one another. This provides parallelization opportunities at the cost of additional hardware.
In other words, the amount of parallelization and speedup we can achieve scales linearly with the number of pipeline replications in our system. This linear scaling plateaus as soon as the computation bandwidth approaches the data transfer bandwidth. Our envisioned design uses an arbitrary number of 100 parallel paths; our evaluation results in the next section justify this selection.

VI. EVALUATION
In this section we evaluate the performance of our accelerator for the 3×3 filters of the VGG16 network against recent FPGA [51][16][52][53][54] and GPU [16] implementations.

Fig. 6: An example line buffer design to load and hold an input tile of size n × n.

A. Speed
Here we develop a model to estimate the execution time of our accelerator. First we model the time required to convolve one input tile with one filter, which we call T_tile-filter. Following that, we generalize the model to the case where we parallelize the process based on the available resources. For one input feature map tile and one filter, both the input branch and the filter branch of Figure 5 are fully pipelined. Therefore, the execution time of a layer is determined by the longer of the two paths, the filter path and the input path. For each iteration of the filter path, the input data path goes through multiple iterations, because a single filter operates on many input data tiles. That said, the input data path sets the upper bound on the delay. Our execution time model is comprised of two major components, namely the input/output time (T_IO) and the computation time (T_Comp). We define T_IO as

T_IO = max(T_load, T_offload)   (15)

where T_load is the time it takes to transfer data from the off-chip memory to the input of the laser sources. We can implement a total of P input DACs to speed up the data transfer; we used an array of 16 DACs in this work. Similarly, T_offload is the time to store the computed outputs from the inverse Winograd transform back to the off-chip memory. The goal is to match the rate of the ADCs at the output with the input DACs to avoid any speed mismatch, and thus congestion, in the pipeline. At the time of this review, both on-chip DACs and ADCs are capable of operating at sampling rates of more than 18 GS/s at a bit resolution of at least 8 bits [55][24].
Furthermore, with recent advances in memory technology, current memories are able to transfer data at I/O bandwidths of more than 128 GB/s [56]. This high memory bandwidth allows us to buffer data and filters from off-chip memory at high transfer rates and feed them to our input line buffers. However, for our line buffers we need memories with high clock frequencies and short access times; current reported memory technologies have access times in the picosecond range.

At the photonic core of our accelerator, T_compute is

T_compute = T_laser + T_Winograd + T_EWMM + T_iWinograd    (16)

where T_Winograd is the time to compute the Winograd transform, T_EWMM is the time to perform the element-wise matrix multiplication, and T_iWinograd is the time to compute the inverse Winograd transform. Once the laser is set up, input signals only incur a time delay equivalent to the flight time of the light before they are fed into the ADC. The clock frequency of the pipeline is therefore bounded by

f_clock ≤ 1 / max(T_load, T_offload, T_compute)    (17)

Based on equation 17, we picked a clock frequency of 5 GHz. From equation 17, T_tile-filter is simply

T_tile-filter = 1/(5 GHz) = 200 ps    (18)

For an F(4 × 4, 3 × 3) Winograd transform, each T_tile-filter returns an output block of size 9, equivalent to 9 convolution operations. At a clock frequency of 5 GHz, our proposed accelerator thus performs at 9 × 5 G = 45 GOP/s. Figure 7 shows the average convolution speed comparison of our proposed accelerator against the state-of-the-art FPGA and GPU implementations.
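The timing model of Eqs. (16)–(18) reduces to a few lines of arithmetic. The sketch below reproduces the numbers above; the 5 GHz clock and 9 operations per output block are taken from the text, and `pipeline_clock` is a hypothetical helper expressing the bound of Eq. (17).

```python
# The pipeline clock is bounded by the slowest stage (Eq. 17):
def pipeline_clock(t_load: float, t_offload: float, t_compute: float) -> float:
    """Maximum clock frequency in Hz given the per-stage latencies."""
    return 1.0 / max(t_load, t_offload, t_compute)

f_clock = 5e9                  # Hz, the clock chosen in the text
t_tile_filter = 1.0 / f_clock  # Eq. (18): 200 ps per tile-filter operation
ops_per_block = 9              # convolutions recovered per output block
throughput = ops_per_block * f_clock  # 9 x 5 GHz = 45 GOP/s
print(t_tile_filter, throughput)
```

Replicating the pipeline P times, as discussed earlier, multiplies this throughput by P until the memory transfer bandwidth becomes the binding constraint in Eq. (17).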
B. Power
In order to estimate the dynamic power consumption of our proposed system, we built an in-house estimator by augmenting the standard Google TensorFlow tool. While primarily used for the training and inference stages of neural networks, at its core TensorFlow is a symbolic mathematical graph processing platform. TensorFlow enables users to express arbitrary computations as a dataflow graph, which is extremely useful in the context of neural networks. However, out-of-the-box TensorFlow is completely agnostic to the physical realization of the neural networks being implemented. Thus, we augmented the TensorFlow high-level API with mathematical models of electro-optical components. Figure 8 depicts the native TensorFlow toolkit hierarchy against our augmented version. In our estimator, each primitive mathematical operation is given two physical models: a power model and a noise model. While the noise model can impact the functionality, and thus the accuracy, of a neural network, the power model only measures consumed power. Table II shows some of these mathematical operations mapped to their physical realizations.

Fig. 7: Comparison of convolution operation speed for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the speed of the photonic core in the absence of electronics.

Fig. 8: High-level TensorFlow toolkit hierarchy vs. augmented TensorFlow.

TABLE II: Mapping of primitive math operations to their hardware realization.

Math Operation        | Photonic Representation
Addition              | Photodiode
Multiplication        | MRR
Connection            | Waveguide
Non-linear Activation | Electro-absorption Modulator

Photodiode power can be simply derived from its responsivity equation:

R = I_ph / P_in = η qλ/(hc)  [A/W]    (19)

where I_ph is the photocurrent, P_in is the optical input signal power, q is the electron charge, λ is the wavelength, h is Planck's constant, c is the speed of light, and η is the quantum efficiency. It should be noted that the signal is encoded in the optical input power P_in. In this work we use AIM Photonics PDK values [57]. Similarly, we modeled both the thermal noise and the shot noise of the photodiode:

I_sn = sqrt(2 q (I_ph + I_D) Δf)    (20)

I_tn = sqrt(4 k_B T Δf / R_SH)    (21)

where I_D is the dark current of the photodetector, Δf is the noise measurement bandwidth, k_B is the Boltzmann constant, T is the temperature in Kelvin, and R_SH is the total equivalent shunt resistance of the photodiode. For MRRs we accounted for per-unit-length propagation loss.

Fig. 9: Comparison of convolution operation power for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the power consumption of the photonic core in the absence of electronics.

Figure 9 depicts the power comparison results. Finally, we plotted the energy efficiency figure of merit, defined as the ratio of speed to power, in Figure 10.

VII. TRAINING, INFERENCE, AND NOISE
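The photodiode models of Eqs. (19)–(21), which supply the noise sources examined in this section, can be sketched numerically as follows. The physical constants are standard; the unity quantum efficiency and the 1550 nm wavelength in the usage line are illustrative placeholders, not AIM PDK values.

```python
import math

Q = 1.602176634e-19   # electron charge [C]
H = 6.62607015e-34    # Planck constant [J*s]
C = 2.99792458e8      # speed of light [m/s]
KB = 1.380649e-23     # Boltzmann constant [J/K]

def responsivity(wavelength: float, eta: float = 1.0) -> float:
    """Photodiode responsivity R = eta * q * lambda / (h c), in A/W (Eq. 19)."""
    return eta * Q * wavelength / (H * C)

def shot_noise(i_ph: float, i_dark: float, bw: float) -> float:
    """RMS shot-noise current sqrt(2 q (I_ph + I_D) df) (Eq. 20)."""
    return math.sqrt(2 * Q * (i_ph + i_dark) * bw)

def thermal_noise(temp: float, bw: float, r_sh: float) -> float:
    """RMS thermal-noise current sqrt(4 kB T df / R_SH) (Eq. 21)."""
    return math.sqrt(4 * KB * temp * bw / r_sh)

# Ideal (eta = 1) responsivity at 1550 nm is roughly 1.25 A/W:
print(responsivity(1550e-9))
```

In the augmented simulator, noise currents of this form are injected at each photodiode output; the experiments below study how such injected noise affects inference accuracy.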
We initially trained our neural network offline on a conventional digital computer. Later, during the inference stage, we loaded the trained weights into our in-house simulator, which is equipped with noise source models. Our hypothesis was that inference on a noisy neural network would result in some loss in accuracy. This is mostly because the network used during training is noiseless, with 32-bit floating-point resolution, while during inference the weights suddenly face a noisy network. In other words, the network performing inference experiences unseen noise behavior that results in accuracy loss. We tested our hypothesis by sweeping a range of inference noise levels and observing the effect on accuracy. For that purpose, we identified two major noise sources: the neuron output noise and the weight noise. The neuron output noise represents the noise introduced at the output of each neuron by the photodiode and the nonlinear activation function. The first plot in Figure 12 shows how accuracy is impacted by noise during inference when the network was trained free of any noise source.

Fig. 10: Comparison of energy efficiency for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the energy efficiency of the photonic core in the absence of electronics. The results show that using photonics as an accelerator has the potential of improving energy efficiency by more than three orders of magnitude.

Fig. 11: Visualization of an augmented convolutional layer of the VGG16 network using power and noise models.

Our next hypothesis was that, if we allow a certain amount of noise during training, the model becomes more robust to noise during the inference stage. To that end, we trained the network with the output noise source on. We only added the output noise, and left the weight noise off, because weights must be calculated with maximum precision during training. In fact, we observed that even a minute amount of noise added to the weights during training could destroy the accuracy of the network, reducing it to its baseline level of about 10%. We swept the training noise in logarithmic steps. Figure 12 depicts the effect of adding output noise at fractions of the maximum signal swing at the output of the neurons. In our experiments we observed that adding about 0.1% noise during training may result in a slight accuracy loss for low levels of inference noise; however, the model becomes more robust to higher levels of inference noise. This shows that adding noise during training can fine-tune the network for a physically noisy realization, as shown in Figure 12 (middle). Lastly, we noticed that increasing the training noise beyond the initial 0.1% resulted in very significant inference accuracy losses, as shown in Figure 12 (bottom).

VIII. CONCLUSION
In this paper we presented a photonic CNN accelerator based on the Winograd filtering convolution algorithm. Winograd filtering reduces the total number of multiplications, and thus hardware, required to perform the convolution operation. We evaluated the speed of our accelerator by developing an analytical framework. Our results show that a photonic accelerator can compete with state-of-the-art Winograd-based FPGA and GPU implementations. Such a photonic accelerator has the potential of improving energy efficiency by up to three orders of magnitude. However, the overall speed is bound by the limitations of I/O and of the conversions in the DAC and ADC. To evaluate power performance, we augmented the native hardware-agnostic Google TensorFlow tool with power models of our hardware components. As with speed performance, electronic I/O and converters are the major consumers of power in our proposed design. However, the photonic core, without the electronic interface, can operate while consuming up to two orders of magnitude less power. In addition, we modeled noise in our TensorFlow-based simulator to investigate the effect of hardware noise sources, such as photodiode noise and MRR noise, on the functionality (accuracy) of our CNN. We found that training the CNN with a small noise component, 0.1% of the signal swing in our experiment, can make the CNN more robust to inference-time noise introduced by noisy photodiodes and MRRs.

REFERENCES

[1] P. R. Prucnal and B. J. Shastri,
Neuromorphic Photonics. CRC Press, May 2017.
[2] I. Chakraborty, G. Saha, A. Sengupta, and K. Roy, “Toward fast neural computing using all-photonic phase change spiking neurons,” Scientific Reports, vol. 8, no. 1, p. 12980, 2018.
[3] J. Feldmann, N. Youngblood, C. Wright, H. Bhaskaran, and W. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature, vol. 569, no. 7755, p. 208, 2019.
[4] J. K. George, A. Mehrabian, R. Amin, J. Meng, T. F. de Lima, A. N. Tait, B. J. Shastri, T. El-Ghazawi, P. R. Prucnal, and V. J. Sorger, “Neuromorphic photonics with electro-absorption modulators,”
Optics Express
Nature
Fig. 12: The evaluation of the effect of physical photodiode and MRR noise on inference accuracy. This effect can be partially compensated through the introduction of an artificial noise source during the training stage. In the absence of a training noise source (top), inference accuracy deteriorates quickly as we sweep the photodiode and MRR noise. By introducing an equivalent of 0.1% Gaussian noise, the network becomes more robust to inference noise. Further increases in the training noise level (bottom) hinder the network from proper training.
[6] A. Liu, R. Jones, L. Liao, D. Samara-Rubio, D. Rubin, O. Cohen, R. Nicolaescu, and M. Paniccia, “A high-speed silicon optical modulator based on a metal-oxide-semiconductor capacitor,”
Nature
APL Photonics, vol. 3, no. 12, p. 126104, Dec. 2018. [Online]. Available: https://aip.scitation.org/doi/10.1063/1.5052635
[8] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, and M. Soljacic, “On-chip optical convolutional neural networks,” arXiv preprint arXiv:1808.03303, 2018.
[9] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi, “PCNNA: A photonic convolutional neural network accelerator,” in . IEEE, 2018, pp. 169–173.
[10] W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, “HolyLight: A nanophotonic accelerator for deep learning in data centers,” in
Nature Photonics
Optica
Optical Materials Express
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
[16] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural networks on FPGAs,” in . IEEE, 2017, pp. 101–108.
[17] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund et al., “Deep learning with coherent nanophotonic circuits,”
Nature Photonics, vol. 11, no. 7, p. 441, 2017.
[18] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: an integrated network for scalable photonic spike processing,” Journal of Lightwave Technology, vol. 32, no. 21, pp. 3427–3439, 2014.
[19] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, “Microring weight banks,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 22, no. 6, pp. 312–325, 2016.
[20] H. Hu, F. Da Ros, M. Pu, F. Ye, K. Ingerslev, E. P. da Silva, M. Nooruzzaman, Y. Amma, Y. Sasaki, T. Mizuno et al., “Single-source chip-based frequency comb enabling extreme parallel data transmission,” Nature Photonics, vol. 12, no. 8, p. 469, 2018.
[21] Q. Xu, D. Fattal, and R. G. Beausoleil, “Silicon microring resonators with 1.5-µm radius,” Optics Express, vol. 16, no. 6, pp. 4309–4315, 2008.
[22] L. Chen, K. Preston, S. Manipatruni, and M. Lipson, “Integrated GHz silicon photonic interconnect with micrometer-scale modulators and detectors,” Optics Express, vol. 17, no. 17, pp. 15248–15256, 2009.
[23] S. Stathopoulos, A. Khiat, M. Trapatseli, S. Cortese, A. Serb, I. Valov, and T. Prodromakis, “Multibit memory operation of metal-oxide bi-layer memristors,” Scientific Reports, vol. 7, no. 1, p. 17532, 2017.
[24] B. Xu, Y. Zhou, and Y. Chiu, “A 23-mW 24-GS/s 6-bit voltage-time hybrid time-interleaved ADC in 28-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 1091–1100, 2017.
[25] M. Ziebell, D. Marris-Morini, G. Rasigade, P. Crozat, J.-M. Fédéli, P. Grosse, E. Cassan, and L. Vivien, “Ten Gbit/s ring resonator silicon modulator based on interdigitated PN junctions,”
Optics Express
Optics Express
Optics Express
Journal of Lightwave Technology, vol. 34, no. 12, pp. 2886–2896, Jun. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7272050/
[31] M. Bahadori, S. Rumley, H. Jayatilleka, K. Murray, N. A. F. Jaeger, L. Chrostowski, S. Shekhar, and K. Bergman, “Crosstalk Penalty in Microring-Based Silicon Photonic Interconnect Systems,”
Journal of Lightwave Technology
Optics Express
ACS Photonics, vol. 5, no. 8, pp. 3291–3297, Aug. 2018. [Online]. Available: http://pubs.acs.org/doi/10.1021/acsphotonics.8b00525
[35] P. Ma, Y. Salamin, B. Baeuerle, A. Emboras, Y. Fedoryshyn, W. Heni, B. Cheng, A. Josten, and J. Leuthold, “100 GHz Photoconductive Plasmonic Germanium Detector,” in Conference on Lasers and Electro-Optics (2018), paper SM2I.3
Scientific Reports, vol. 7, no. 1, p. 7430, Aug. 2017. [Online]. Available: https://doi.org/10.1038/s41598-017-07754-z
[37] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, “Silicon microring resonators,” Laser & Photonics Reviews, vol. 6, no. 1, pp. 47–73, 2012.
[38] X. Xue, Y. Xuan, C. Wang, P.-H. Wang, Y. Liu, B. Niu, D. E. Leaird, M. Qi, and A. M. Weiner, “Thermal tuning of Kerr frequency combs in silicon nitride microring resonators,” Optics Express, vol. 24, no. 1, pp. 687–698, 2016.
[39] C. Yoshida, K. Tsunoda, H. Noshiro, and Y. Sugiyama, “High speed resistive switching in Pt/TiO2/TiN film for nonvolatile memory application,” Applied Physics Letters, vol. 91, no. 22, p. 223510, 2007.
[40] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, “‘Memristive’ switches enable ‘stateful’ logic operations via material implication,” Nature, vol. 464, no. 7290, p. 873, 2010.
[41] I. Baek, M. Lee, S. Seo, M. Lee, D. Seo, D.-S. Suh, J. Park, S. Park, H. Kim, I. Yoo et al., “Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses,” in IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004. IEEE, 2004, pp. 587–590.
[42] E. J. Merced-Grafals, N. Dávila, N. Ge, R. S. Williams, and J. P. Strachan, “Repeatable, accurate, and high speed multi-level programming of memristor 1T1R arrays for power efficient analog computing applications,” Nanotechnology, vol. 27, no. 36, p. 365202, 2016.
[43] A. Prakash, D. Deleruyelle, J. Song, M. Bocquet, and H. Hwang, “Resistance controllability and variability improvement in a TaOx-based resistive memory for multilevel storage application,”
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[48] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
[49] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance evaluation,” arXiv preprint arXiv:1412.7580, 2014.
[50] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
[51] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[52] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–35.
[53] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[54] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[55] A. Nazemi, K. Hu, B. Catli, D. Cui, U. Singh, T. He, Z. Huang, B. Zhang, A. Momtaz, and J. Cao, “3.4 A 36 Gb/s PAM4 transmitter using an 8b 18 GS/s DAC in 28 nm CMOS,” in . IEEE, 2015, pp. 1–3.
[56] D. U. Lee, K. W. Kim, K. W. Kim, K. S. Lee, S. J. Byeon, J. H. Kim, J. H. Cho, J. Lee, and J. H. Chun, “A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits,”
IEEE Journal of Solid-State Circuits
Armin Mehrabian is a PhD candidate in Electrical Engineering at the George Washington University. His research interests include High Performance Computing (HPC), Neuromorphic Computing, and Artificial Intelligence (AI) from both software and hardware points of view. He received his B.S. degree in Electrical Engineering at Shahid Beheshti University of Tehran, Iran, focusing on analog electronics, and his M.S. degree at the George Washington University (GWU), DC, USA in computer engineering, focusing on VLSI and digital electronics design. His current research involves leveraging nanophotonics for HPC architecture designs.
Mario Miscuglio is a post-doctoral researcher in the Electrical Engineering department at the George Washington University. He received his Masters in Electrical and Computer Engineering from the Polytechnic of Turin, working as a researcher at Harvard/MIT. He completed his PhD in Optoelectronics from the University of Genova (IIT), working as a research fellow at the Molecular Foundry at LBNL. His interests extend across science and engineering, including photonic neuromorphic computing, nano-optics, and plasmonics.
Yousra Alkabani received the BSc and MSc degrees in computer and systems engineering from Ain Shams University, Cairo, Egypt, in 2003 and 2006, respectively. She received the PhD degree in computer science from Rice University, Houston, TX, USA, in December 2010. She has been an assistant professor of computer and systems engineering at Ain Shams University since May 2011 and a visiting assistant professor of computer science and engineering at the American University in Cairo since 2013. Her research interests include hardware security, low power design, and embedded systems. She is a member of the IEEE.
Volker J. Sorger is an Associate Professor in the Department of Electrical and Computer Engineering and the leader of the Orthogonal Physics Enabled Nanophotonics (OPEN) lab at the George Washington University. He received his PhD from the University of California, Berkeley. His research areas include opto-electronic devices, plasmonics and nanophotonics, and photonic analog information processing and neuromorphic computing. Among his breakthroughs are the first demonstration of a semiconductor plasmon laser, attojoule-efficient modulators, and PMAC/s-fast photonic neural networks and near real-time analog signal processors. Dr. Sorger has received multiple awards, among them the Presidential Early Career Award for Scientists and Engineers (PECASE), the AFOSR Young Investigator Award (YIP), the Hegarty Innovation Prize, and the National Academy of Sciences award of the year. Dr. Sorger is the editor-in-chief of Nanophotonics, the OSA Division Chair for Photonics and Opto-electronics, and serves on the board of meetings at OSA & SPIE and on the scholarship committee. He is a senior member of IEEE, OSA & SPIE.