Freely scalable and reconfigurable optical hardware for deep learning
Liane Bernstein, Alexander Sludds, Ryan Hamerly, Vivienne Sze, Joel Emer, Dirk Englund
Research Laboratory of Electronics, Massachusetts Institute of Technology, 50 Vassar St, Cambridge, MA 02139, USA
NTT Research Inc., Physics & Informatics Laboratories, 1950 University Ave
NVIDIA, 2 Technology Park Drive, Westford, MA 01886, USA
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, USA
† These authors contributed equally to this work
e-mail: [email protected]
e-mail: [email protected]
e-mail: [email protected]
Abstract:
As deep neural network (DNN) models grow ever larger, they can achieve higher accuracy and solve more complex problems. This trend has been enabled by an increase in available compute power; however, efforts to continue to scale electronic processors are impeded by the costs of communication, thermal management, power delivery and clocking. To improve scalability, we propose a digital optical neural network (DONN) with intralayer optical interconnects and reconfigurable input values. The near path-length-independence of optical energy consumption enables information locality between a transmitter and arbitrarily arranged receivers, which allows greater flexibility in architecture design to circumvent scaling limitations. In a proof-of-concept experiment, we demonstrate optical multicast in the classification of 500 MNIST images with a 3-layer, fully-connected network. We also analyze the energy consumption of the DONN and find that optical data transfer is beneficial over electronics when the spacing of computational units is on the order of >10 µm.

Introduction
Machine learning has become ubiquitous in modern data analysis, decision-making, and optimization. A prominent subset of machine learning is the artificial deep neural network (DNN), which has revolutionized many fields, including classification [1], translation and prediction [2, 3]. An important step toward unlocking the full potential of DNNs is improving the energy consumption and speed of DNN tasks. To this end, emerging DNN-specific hardware optimizes data access, reuse and communication for mathematical operations: most importantly, general matrix-matrix multiplication (GEMM) and convolution [4]. One approach is to use a specialized memory hierarchy to store and reuse data near an array of computation units, which minimizes reliance on expensive large-scale data distribution networks [5–7]. Another option for GEMM is a large array of electronic multipliers with fewer intermediate memory tiers [8]. The large multiplier array reduces overhead, and if the DNN is sizable enough to keep all the processing elements occupied, this design can be more efficient in energy consumption and throughput, thanks to the ability to perform more parallel operations. However, despite these advances, a central challenge in the field is scaling hardware to keep up with exponentially growing DNN models (see Fig. 1 and Ref. [9]). Many popular DNN models comprise matrices exceeding the GEMM capacity of leading DNN processors (e.g., Google's Tensor Processing Unit (TPU) [8]), and therefore, matrices must be computed in multiple 'tiles'. Tiling requires many inputs and intermediate values to be stored rather than streamed, which increases data movement. Thus, tiling restricts the use of DNNs in high-throughput applications such as the observation of new phenomena in fundamental physics [10–14], and reduces the throughput of large DNN models such as recommender systems [15], vision [8] and natural language processing [16].
Though the current trend is to scale up conventional electronic hardware, these efforts are impeded by communication [17], clocking [18], thermal management [19] and power delivery [20]. Parallel processing with multiple chips [21] or partitioned chips [22, 23] can ease these constraints and improve performance over a monolithic equivalent through greater mapping flexibility [24], at the cost of increased communication energy.
Fig. 1. Number of parameters, i.e., weights, in recent landmark neural networks [1, 16, 25–36] (references dated by first release, e.g., on arXiv). The number of multiplications (not always reported) is not equivalent to the number of parameters, but larger models tend to require more compute power, notably in fully-connected layers. The two outlying nodes (pink) are AlexNet and VGG16, now considered over-parameterized. Subsequently, efforts have been made to reduce DNN sizes, but there remains an exponential growth in model sizes to solve increasingly complex problems with higher accuracy.
In this Article, we introduce an optical DNN accelerator that encodes data into reconfigurable on-off optical pulses for transmission and passive copying (or fan-out) to large-scale electronic multiplier arrays. The near length-independence of optical data routing enables freely scalable systems, where single transmitters are fanned out to many arbitrarily arranged receivers with fast and energy-efficient links. Optics has previously been proposed for analog DNN accelerators, with potential orders-of-magnitude reductions in energy consumption and improved throughput [37–41]. In contrast, we propose an entirely digital system, where we replace electrical on-chip interconnects with optical paths for data transmission, but not computation, thus with the capability to preserve accuracy. This 'digital optical neural network' (DONN) performs large-scale data distribution from memory to an arbitrary set of electronic multipliers. We first illustrate the DONN architecture and discuss possible implementations. Then, in a proof-of-concept experiment, we demonstrate that digital optical transmission and fan-out with cylindrical lenses has little effect on the classification accuracy of the MNIST handwritten digit dataset (<0.6%). Crosstalk is the primary cause of this drop in accuracy, and because it is deterministic, it can be compensated: with a simple crosstalk correction scheme, we reduce our bit error rates by two orders of magnitude. Alternatively, crosstalk can be greatly reduced through optimized optical design. Since shot and thermal noise are negligible (see Discussion), the accuracy of the DONN can therefore be equivalent to an all-electronic DNN accelerator. We also compare the energy consumption of optical interconnects (including light source energy) against that of electronic interconnects over distances representative of logic, memory, and multi-chiplet interconnects in a 7 nm CMOS node.
Our calculations show an advantage in data transmission costs for distances ≥10 µm (roughly the size of the basic computation unit: an 8-bit multiply-and-accumulate (MAC), with length 5-8 µm). Moreover, the DONN scales favorably with respect to very large DNN accelerators that require partitioning into multiple chiplets: the DONN's optical communication cost remains nearly constant at ∼0.2 fJ/bit, whereas electrical inter-chiplet interconnects cost far more (∼90 fJ/bit). Thus, the efficient optical data distribution provided by the DONN architecture will become critical for continued growth of DNN performance through increased model sizes and greater connectivity.
Results
Problem statement
A DNN consists of a sequence of layers, in which input activations from one layer are connected to the next layer via weighted paths (weights), as shown in Fig. 2a. We focus on inference tasks in this paper (where weights are known from prior training), which, in addition to the energy consumption problem, places stringent requirements on latency and throughput. Modern inference accelerators expend the majority of energy on data movement. In a fully-connected (FC) layer, a vector x of input values ('input activations', of length K) is multiplied by a matrix W_{K×N} of weights (Fig. 2b). This matrix-vector product yields a vector of output activations (y, of length N). Most DNN accelerators process vectors in B-sized batches, where the inputs are represented by a matrix X_{B×K}. The FC layer then becomes a matrix-matrix multiplication (X_{B×K} · W_{K×N}). CONV layers can also be processed as matrix multiplications, e.g., with a Toeplitz matrix [4]. In matrix multiplication, fan-out, where data is read once from main memory (DRAM) and used multiple times, can greatly reduce data movement and memory access. This amortization of read cost across numerous operations is critical for overall efficiency, since retrieving a single matrix element from DRAM requires two to three orders of magnitude more energy than the MAC [17]. A simple input-weight product illustrates the benefit of fan-out, since activation and weight elements appear repeatedly, as highlighted by the repetition of X and W:

\begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} = \begin{pmatrix} X_{11}W_{11} + X_{12}W_{21} & X_{11}W_{12} + X_{12}W_{22} \\ X_{21}W_{11} + X_{22}W_{21} & X_{21}W_{12} + X_{22}W_{22} \end{pmatrix} \quad (1)

Consequently, DNN hardware design focuses on optimizing data transfer and input and weight matrix element reuse. Accelerators based on conventional electronics use efficient memory hierarchies, a large array of tightly packed processing elements (PEs, i.e., multipliers with or without local storage), or some combination of these approaches. Memory hierarchies optimize temporal data reuse in memory blocks near the PEs to boost performance under the constraint of chip area [4].
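The reuse pattern in equation (1) can be made concrete with a short sketch (ours, not from the paper): in a naive B×K by K×N product, every activation element is read N times and every weight element B times, which is exactly the reuse that fan-out amortizes.

```python
import numpy as np

# Count operand reads in a naive B x K by K x N matrix product.
# Each activation X[b, k] is read N times and each weight W[k, n]
# is read B times -- the reuse that fan-out exploits.
B, K, N = 4, 8, 16
X = np.random.rand(B, K)
W = np.random.rand(K, N)

x_reads = np.zeros((B, K), dtype=int)
w_reads = np.zeros((K, N), dtype=int)
Y = np.zeros((B, N))
for b in range(B):
    for n in range(N):
        for k in range(K):
            Y[b, n] += X[b, k] * W[k, n]
            x_reads[b, k] += 1
            w_reads[k, n] += 1

assert np.allclose(Y, X @ W)                       # same result as GEMM
assert (x_reads == N).all() and (w_reads == B).all()
```

Broadcasting a single read of X[b, k] to N multipliers (optically or electrically) turns N memory reads into one.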
This strategy can enable high throughput in CONV layers [5]. With fewer intermediate memory levels, a larger array of PEs (e.g., TPU v1 [8]) can further increase throughput and lower energy consumption on workloads with a high-utilization mapping due to
Fig. 2. Digital fully-connected neural network (FC-NN) and hardware implementations. (a) FC-NN with input activations (red, vector length K) connected to output activations (vector length N) via weighted paths, i.e., weights (blue, matrix size K × N). (b) Matrix representation of one layer of an FC-NN with B-sized batching. (c) Example bit-serial multiplier array, with output-stationary accumulation across k. Fan-out of X across n ∈ {1 ... N}; fan-out of W across b ∈ {1 ... B}. Bottom panel: all-electronic version with fan-out by copper wire (for clarity, fan-out of W not illustrated). Top panel: digital optical neural network version, where X and W are fanned out passively using optics, and transmitted to an array of photodetectors. Each pixel contains two photodetectors, where the activations and weights can be separated by, e.g., polarization or wavelength filters. Each photodetector pair is directly connected to a multiplier in close proximity.

potentially reduced overall memory accesses and a greater number of parallel multipliers (spatial reuse). Therefore, for workloads with large-scale matrix multiplication such as those mentioned in the Introduction, if we maximize the number of available PEs, we can improve efficiency.

Digital optical neural network architecture
Our DONN architecture replaces electrical interconnects with optical links to relax the design constraints of reducing inter-multiplier spacing or colocating multipliers with memory. Specifically, optical elements transfer and fan out activation and weight bits to electronic multipliers to reduce communication costs in matrix multiplication, where each element X_{bk} is fanned out N times, and W_{kn} is fanned out B times. The DONN scheme shown in Fig. 2c spatially encodes the first column of X_{B×K} activations into a column of on-off optical pulses. At the first time step, the activation matrix transmitters fan out the first bit of each of the matrix elements X_{b1}, ∀b ∈ {1 ... B}, to the PEs (here, k = 1), while the weight transmitters fan out the first bit of W_{1n} to each PE. The photons from these activation and weight bits generate photoelectrons in the detectors, producing the voltages required at the inputs of electronic multipliers (either 0 V for a '0' or 0.8 V for a '1'). After 8 time steps, a multiplier has received 2 × 8 bits, i.e., one full 8-bit activation and one full 8-bit weight; accumulation then proceeds over the K index while each partial sum stays local to its PE. This dataflow is commonly called 'output stationary'. Instead of this bit-serial implementation, bits can be encoded spatially, using a bus of parallel transmitters and receivers. The trade-off between added energy and latency in bit-serial multiplication versus increased area from photodetectors for a parallel multiplier can be analyzed for specific applications and CMOS nodes.
Fig. 3. Possible implementations of digital optical neural network. (a) Free-space version. Digital inputs and weights are transmitted electronically to an array of light sources (red and blue, respectively, illustrating different paths). Single-mode light from a source is collimated by a spherical lens (Lens), then focused to a 1D spot array by a diffractive optical element (DOE). A 50:50 beamsplitter brings light from the inputs and weights into close proximity on a custom CMOS receiver. (b) Waveguide or chip-integrated implementation with scatterers above each processing element (PE). (c) Example circuit with 2 photodetectors per PE: 1 for activations; 1 for weights. Received bits proceed to multiplier, then memory or next layer.
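The bit-serial transfer described above can be sketched in a few lines (our illustration, not the paper's circuit): one activation bit and one weight bit arrive per time step, and after 8 steps a processing element holds both 8-bit operands and can multiply-accumulate, with the partial sum staying local (output stationary).

```python
# Sketch of bit-serial operand delivery to an output-stationary PE.
# One activation bit and one weight bit arrive per time step (LSB first);
# after 8 steps the PE multiplies and accumulates into a local register.

def to_bits(value, width=8):
    """LSB-first bit stream of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]

def bit_serial_dot(xs, ws, width=8):
    acc = 0
    for x, w in zip(xs, ws):              # k = 1 .. K
        x_reg = w_reg = 0
        for t, (xb, wb) in enumerate(zip(to_bits(x, width), to_bits(w, width))):
            x_reg |= xb << t              # reassemble operands bit by bit
            w_reg |= wb << t
        acc += x_reg * w_reg              # accumulate after 8 time steps
    return acc

xs, ws = [3, 200, 17], [5, 2, 255]
assert bit_serial_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

The same bits could instead be delivered on a parallel bus of 8 transmitters per operand, trading photodetector area for latency, as noted in the text.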
We illustrate two exemplary experimental DONN implementations in Fig. 3. In the free-space version (Fig. 3a), each source in a linear array of vertical-cavity surface-emitting lasers (VCSELs) or µLEDs emits a cone of light into free space, which is collimated by a spherical lens. A diffractive optical element (DOE) focuses the light to a 1D spot array on a 2D receiver, where the activations and weights are brought into close proximity using a beamsplitter. Figure 3b shows a waveguide or chip-integrated alternative, where each light source is coupled into an optical waveguide. The waveguides are low-loss, except at the scattering elements above each detector pixel. These scattering elements are tuned such that, along one row, an equal amount of light enters each photodetector for a '1' (similar concepts have been experimentally demonstrated [42]). In both the free-space and integrated implementations, 'receiverless' photodiodes [43] convert the optical signals to the electrical domain (Fig. 3c). An electronic multiplier then multiplies the values. The output is either saved to memory, or routed directly to another DONN that implements the next layer of computation. Note that the data distribution pattern is not confined to regular rows and columns. A spatial light modulator (SLM), an array of micromirrors, scattering waveguides or a DOE can route and fan out bits to arbitrary locations. There will be some length-dependent optical loss that will vary based on the implementation. Since free-space propagation is lossless and mirrors, SLMs and diffractive elements are highly efficient (>95%), most length- or receiver-number-dependent losses can be attributed to imperfect focusing, e.g., from optical aberrations far from the optical axis. These effects can be mitigated through judicious optical design. There are also waveguide losses in integrated photonics, but these can be very low, e.g., 3 dB/m with mm-scale bend radii in silicon nitride [44].
(The DONN does not require any active components, which makes silicon nitride a good choice here.) Therefore, even if we design a meter-length chip with mm-scale bends in the waveguides, we can compensate optical losses by increasing the number of photons generated at the sources (for example, in silicon nitride, by a factor of 2). We assume for the remainder of our analysis that energy is length-independent.
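The factor-of-2 compensation quoted above follows directly from the decibel loss budget; a back-of-envelope sketch (ours, using the 3 dB/m silicon nitride figure from the text):

```python
# The source must emit 10**(loss_dB / 10) times more photons to deliver
# the same optical energy after a lossy path of the given length.

def source_boost(loss_db_per_m, length_m):
    total_loss_db = loss_db_per_m * length_m
    return 10 ** (total_loss_db / 10)

# A meter-long silicon nitride path at 3 dB/m needs ~2x more photons
# at the source, matching the factor of 2 quoted in the text.
assert abs(source_boost(3.0, 1.0) - 2.0) < 0.01
```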
Bit error rate and inference experiments
We used a DONN implementation similar to Fig. 3a to test optical digital data transmission and fan-out for DNNs, as described in Methods. In our first experiment, we determined the bit error rate of our system. Figure 4a shows an example of a background-subtracted and normalized image, captured on the camera when the DMDs displayed random vectors of '1's and '0's. The camera's Bayer filter (described in Methods), as well as optical aberrations and misalignment, caused some crosstalk between pixels (see Fig. 4b). Using a region of 357 × 477 superpixels on the camera, we calculated bit error rates (in a single shot) of 1.…×10⁻… and 2.…×10⁻… for the blue and red channels, respectively. When we confined the region of interest to 151 × … superpixels, the bit error rates were …×10⁻… and 4.…×10⁻… for the blue and red arms. See Supplementary Note 1 for more details on bit error rate and error maps. Because crosstalk is deterministic, and not a source of random noise, we can compensate for it. We applied a simple crosstalk correction scheme that assumes uniform crosstalk on the detector and subtracts a fixed fraction of an element's nearest neighbors from the element itself (see Supplementary Note 2). The bit error rates for the blue and red channels then respectively dropped to 2.…×10⁻… and 0 for the 357 × 477 region, and …×10⁻… and 0 for the 151 × … region.
Fig. 4. Background-subtracted and normalized receiver output from free-space digital optical neural network experiment with random vectors of '1's and '0's displayed on DMDs. (a) Full 2D image. (b) One column: pixels received as '1' in red and '0' in black.
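The nearest-neighbor crosstalk correction described above can be sketched as follows (our illustration; the leakage coefficient alpha is an assumed value, not the experimentally fitted one):

```python
import numpy as np

# Subtract a fixed fraction alpha of each element's four nearest
# neighbors, assuming uniform crosstalk across the detector.

def correct_crosstalk(image, alpha=0.1):
    corrected = image.copy()
    corrected[1:, :]  -= alpha * image[:-1, :]   # neighbor above
    corrected[:-1, :] -= alpha * image[1:, :]    # neighbor below
    corrected[:, 1:]  -= alpha * image[:, :-1]   # neighbor left
    corrected[:, :-1] -= alpha * image[:, 1:]    # neighbor right
    return corrected

# Simulate uniform leakage onto random bits, then undo it and threshold.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(64, 64)).astype(float)
blurred = correct_crosstalk(bits, alpha=-0.1)    # negative alpha adds leakage
recovered = (correct_crosstalk(blurred, alpha=0.1) > 0.5).astype(float)
assert (recovered == bits).all()
```

Because the correction is linear, the residual after one pass is second order in alpha, which is why a deterministic, uniform crosstalk can be suppressed so effectively.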
Next, we experimentally tested the DONN's effect on the classification accuracy of 500 MNIST images using a three-layer (i.e., two-hidden-layer), fully-connected neural network (FC-NN), with the dataset and training steps described in Supplementary Note 3. We compared our experimental classification results with inference performed entirely on CPU (ground truth) in two ways. The simplest analysis, reported in Table 1, shows a 0.6% drop in classification accuracy for the DONN versus the ground truth values (or 3 additional incorrectly classified images). Figure 5 illustrates more detailed results, where we analyzed the network output scores. An output score is roughly equivalent to the assigned likelihood that an input image belongs to a given class, and is defined as the normalized (via the softmax function) output vector of a DNN. We found that, along the matrix diagonal, the first and third quartiles in the difference in output scores between the DONN and the ground truth have a magnitude <3%. The absolute difference in average output scores is also <3%. We also performed this experiment with a single hidden layer ('2-layer' case), and achieved similar results (a 0.4% drop in accuracy, or 2 misclassified images). No crosstalk error correction was applied to these results.

Fig. 5. Experimentally measured 3-layer FC-NN output scores (otherwise known as confusion matrix) for 500 MNIST images from test dataset. The values along the diagonal represent correct classification by the model. Each column is an average of ∼50 vectors. (a) DONN output scores. (b) Ground-truth (all-electronic) output scores. (c)-(d) Box plots of the diagonals of (a)-(b). (e) Difference in diagonals of DONN versus ground-truth output scores.
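The output score defined above is simply the softmax of the network's final layer; a minimal sketch:

```python
import numpy as np

# Output score: softmax-normalized final-layer vector, which sums to one
# and can be read as per-class likelihoods.

def softmax(logits):
    z = logits - np.max(logits)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = softmax(np.array([2.0, 1.0, 0.1]))
assert abs(scores.sum() - 1.0) < 1e-9
assert scores.argmax() == 0        # highest logit gives highest score
```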
Energy analysis: DONN compared with all-electronic hardware
In this section, we compare the theoretical interconnect energy consumption of the DONN with its all-electronic equivalent, where interconnects are illustrated in green in Fig. 6. The interconnect energy, which must include any source inefficiencies, is the energy required to charge the parasitic wire, detector, and inverter capacitances, where a CMOS inverter is representative of the input to a multiplier. See Methods for full energy calculations. In the electronic case, a long wire transports the data.

Table 1. MNIST classification accuracy of DONN versus all-electronic hardware with custom fully-connected neural network models.

The photon energy hν/e must be greater than or equal to the bandgap E_g of the detector material (here, we have chosen silicon as an example, and set hν/e = E_g). C_det is a theoretical approximation for a cubic, µm-scale detector [43], and the optical source power conversion efficiency (wall-plug efficiency, i.e., WPE) is a measured value for VCSELs [45, 46]. C_T is an approximation for the capacitance of an inverter in a state-of-the-art node [43, 47], and L_wire is the distance between MAC units in various scenarios. We find that the optical interconnect energy is independent of length at 0.2 fJ/bit, while the electrical interconnect energy scales from 0.2-0.3 fJ/bit for inter-multiplier communication for abutted MAC units to 90 fJ/bit for inter-chiplet interconnects. The crossover point where the optical interconnect energy drops below the electrical energy occurs when L_wire ≳ 10 µm. The DONN therefore provides an improvement in the interconnect energy for data transmission and can scale to greatly decrease the energy consumption of data distribution with regular distribution patterns. Additionally, advanced technologies are emerging which could lower its energy consumption, such as plasmonic photodetectors with ultra-low capacitance [48] and more efficient VCSELs.
Fig. 6. Fan-out of one bit from memory (Mem) to multiple processing elements (PEs). (a) Fan-out by electrical wire to a row of PEs in a monolithic chip. (b) DONN equivalent of monolithic chip, where the green wire is replaced by optical paths. (c) Fan-out by electrical wire to blocks of PEs divided into chiplets, or separated by memory and logic. (d) DONN equivalent of fan-out to PEs in multiple blocks (energetically equivalent to (b)).
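A numerical sketch of the interconnect energy model (equations (3)-(5) in Methods) reproduces the qualitative crossover described above. The supply voltage (0.8 V) and the silicon bandgap photon energy follow the text; the capacitances and wall-plug efficiency are illustrative assumptions, not the paper's exact table entries.

```python
# Interconnect energy per bit: electrical (wire) versus optical (DONN).
E_CHARGE = 1.602e-19      # C
V_DD = 0.8                # V (from the text)
C_WIRE_PER_UM = 0.2e-15   # F/um, assumed wire capacitance
C_T = 0.1e-15             # F, assumed inverter input capacitance
C_DET = 0.1e-15           # F, assumed receiverless detector capacitance
H_NU = 1.12 * E_CHARGE    # J, photon energy set to the silicon bandgap
WPE = 0.3                 # assumed source wall-plug efficiency

def e_elec_per_bit(l_wire_um):
    # Charge the wire plus inverter; 1/4 switching-activity factor.
    return 0.25 * (C_WIRE_PER_UM * l_wire_um + C_T) * V_DD**2

def e_donn_per_bit():
    n_p = (C_DET + C_T) * V_DD / E_CHARGE      # photons per bit
    return 0.5 * H_NU * n_p / WPE              # 1/2 switching activity

# Optical cost is length-independent; electrical cost grows with L_wire,
# so a crossover length exists (here on the order of 10 um).
assert e_elec_per_bit(1000) > e_donn_per_bit() > e_elec_per_bit(1)
crossover_um = (e_donn_per_bit() / (0.25 * V_DD**2) - C_T) / C_WIRE_PER_UM
assert 5 < crossover_um < 15
```

With these assumed parameters the crossover lands near 10 µm, consistent with the scaling argument in the text, though the exact value shifts with the capacitance and efficiency figures used.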
Discussion
With minimal impact on accuracy, the DONN yields an energy advantage over all-electronic accelerators with long wire lengths. In our proof-of-concept experiment, we performed inference on 500 MNIST images with 2- and 3-layer FC-NNs and found a <0.6% drop in accuracy and a <3% absolute difference in average output scores with respect to the ground truth implementation on CPU. We attributed these errors to crosstalk due to imperfect alignment and blurring from the camera's Bayer filter. In fact, a simple crosstalk correction scheme lowered measured bit error rates by two orders of magnitude. We could thus transmit bits with 100% measured fidelity in the activation arm (better aligned than the weight arm), which illustrates that crosstalk can be mitigated and possibly eliminated either through post-processing, charge sharing at the transmitters, greater spacing of receivers, or optimized design of optical elements and receiver pixels.

Table 2. Interconnect energies over three distances: inter-MAC, inter-SRAM, and inter-chiplet.

Global parameters: C_wire/µm ∼ … fF/µm [43, 49, 50]; C_T ∼ …; C_det ∼ …; hν/e ∼ …; V_DD = 0.80 V.
Inter-MAC: L_wire = 5-8 µm†; E_elec/bit = 0.2-0.3 fJ/bit; E_DONN/bit = 0.2 fJ/bit.
Inter-SRAM (7 nm SRAM macro): L_wire ∼ … µm [52]; E_elec/bit = 2 fJ/bit; E_DONN/bit = 0.2 fJ/bit.
Inter-chiplet [22]: L_wire ∼ … µm; E_elec/bit = 90 fJ/bit; E_DONN/bit = 0.2 fJ/bit.

† We assume a square multiplier and scale reported 8-bit multiplier areas [53–55] from a 45 nm to a 7 nm node (the current state of the art) with the scaling factors from Ref. [51]. A MAC unit comprises both an 8-bit multiplier and a 32-bit adder, so we are placing a lower bound on the minimum length of L_wire. Recent work [56] optimizes MAC units for DNNs, and reports a 337 µm² area in a 28 nm node, where the MAC unit comprises an 8-bit multiplier and a 32-bit adder. Extrapolated to a 7 nm node with a fourth-order polynomial fit of the scaling factors from Ref. [51], the MAC unit is of size (7 µm)², which falls within the 5-8 µm range.
* Input-output voltage and core logic voltage can differ in CMOS. In optics, however, since the data delivery mechanism does not vary with distance travelled, we assume V_DD remains constant at 0.80 V.
In the hypothetical regime where error due to crosstalk is negligible, the remaining noise sources are shot and thermal noise. Intuitively, shot and thermal noise are also present in an all-electronic system, and the number of photoelectrons at the input to an inverter in the DONN is equal to the number of electrons at the input to an inverter in electronics. Therefore, if these noise sources do not limit accuracy in the all-electronic case, the same can be said for the DONN [43]. For mathematical validation that shot and thermal noise have a trivial impact on bit error rate in the DONN, see Supplementary Note 7. These analyses demonstrate that the fundamental limit to the accuracy of the DONN is no different than the accuracy of electronics, and thus, we do not expect accuracy to hinder DONN scaling in an optimized system. In our theoretical energy calculations, we compared the nearly length-independent data delivery costs of the DONN with those of an all-electronic system. We found that in the worst case, when multipliers are abutted in a multiplier array, optical transmitters have a similar interconnect energy cost compared to copper wires in a 7 nm node (∼0.2 fJ/bit versus 0.2-0.3 fJ/bit). The optical advantage grows with distance: electrical costs rise to ∼2 fJ/bit between SRAM macros and ∼90 fJ/bit between chiplets, while the DONN's cost remains ∼0.2 fJ/bit. In the multi-chiplet case, the cost to transmit two 8-bit values in electronics (∼90 fJ/bit × 16 bits) therefore dominates, and optical data delivery becomes beneficial once the spacing of computational units exceeds ∼10 µm. More broadly, further gains can be expected through the relaxation of electronic system architecture constraints.

Methods
Digital optical neural network implementation for bit error rate and inference experiments
We performed bit error rate and inference experiments with optical data transfer and fan-out of point sources using cylindrical lenses. Two digital micromirror devices (DMDs, Texas Instruments DLP3000, DLP4500) illuminated by spatially-filtered and collimated LEDs (Thorlabs M625L3, M455L3) acted as stand-ins for the two linear source arrays. For the input activations/weights, each 10.8 µm-long mirror in one DMD column/row either reflected the red/blue light toward the detector ('1') or a beam dump ('0'). Then, for each of the DMDs, an f = 100 mm spherical lens followed by an f = 100 mm cylindrical achromatic lens imaged one DMD pixel to an entire row/column of superpixels of a color camera (Thorlabs DCC3240C). Each camera superpixel is made up of four pixels of size (5.3 µm)²: two green, one red and one blue. The camera acquisition program applies a 'de-Bayering' filter to automatically extract color information for each sub-pixel; this filter causes blurring, and therefore it increased crosstalk in our system. In a future version of the DONN, a specialized receiver will reduce this crosstalk and also operate at a higher speed. To process the image received on the camera, we subtracted the background, normalized, then thresholded. (We acquired normalization and background curves with all DMD pixels in the 'on' and 'off' states, respectively. This background subtraction and normalization could be implemented on-chip by precharacterizing the system, and biasing each receiver pixel by some fixed voltage.) If the detected intensity was above the threshold value, it was labeled a '1'; below threshold, a '0'. For the bit error rate experiments, we compared the parsed values from the camera with the known values transmitted by the DMDs, and defined the bit error rate as the number of incorrectly received bits divided by the total number of bits. In the inference experiments, the DMDs displayed the activations and pre-trained weights, which propagated through the optical system to the camera. After background subtraction and normalization, the CPU multiplied each activation with each weight, and applied the nonlinear function (ReLU after the hidden layers and softmax at the output). We did not correct for crosstalk here, to illustrate the worst-case scenario of impact on accuracy. The CPU then fed the outputs back to the input activation DMD for the next layer of computation. We used a DNN model with two hidden layers with 100 activations each and a 10-activation output layer. We also tested a model with a single hidden layer with 100 activations.

MNIST preprocessing
For the inputs to the network, a bilinear interpolation algorithm resized the 28 × 28 pixel MNIST images; the resulting data was quantized to 8-bit values:

Quantized = QuantizedMin + (Input − FloatingMin) / Scale   (2)

where Quantized is the returned value, QuantizedMin is the minimum value expressible in the quantized datatype (here, always 0), Input is the input data to be quantized, FloatingMin is the minimum value in Input, and Scale is the scaling factor to map between the two datatype ranges, (FloatingMax − FloatingMin)/(QuantizedMax − QuantizedMin). See gemmlowp documentation [61] for more information on implementations of this quantization. (In practice, 8-bit representations are widely used in DNNs, since 8-bit MACs are generally sufficient to maintain accuracy in inference [8, 62, 63].)
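Equation (2) can be sketched directly (our illustration of the gemmlowp-style affine scheme, with QuantizedMin = 0 and QuantizedMax = 255 as in the text):

```python
import numpy as np

# Affine 8-bit quantization following equation (2); names mirror the text.

def quantize(inputs, qmin=0, qmax=255):
    fmin, fmax = inputs.min(), inputs.max()
    scale = (fmax - fmin) / (qmax - qmin)
    q = np.round(qmin + (inputs - fmin) / scale).astype(np.uint8)
    return q, fmin, scale

def dequantize(q, fmin, scale, qmin=0):
    return fmin + (q.astype(float) - qmin) * scale

x = np.linspace(-1.0, 1.0, 11)
q, fmin, scale = quantize(x)
assert q.min() == 0 and q.max() == 255          # full range is used
assert np.allclose(dequantize(q, fmin, scale), x, atol=scale)
```

The round-trip error is bounded by half a quantization step, which is why 8-bit representations usually preserve inference accuracy.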
Electronic and optical interconnect energy calculations
When an electronic wire transports data over a distance L_wire to the gate of a CMOS inverter (representative of a full-adder's input, full-adders being the building blocks of multipliers), the energy consumption per bit is:

E_elec/bit = (1/4) · (C_wire/µm · L_wire + C_T) · V_DD²   (3)

where V_DD is the supply voltage, L_wire is the wire length between two multipliers and C_T is the inverter capacitance. Interconnects consume energy predominantly when a load capacitance, such as a wire, is charged from a low (0 V) to a high (∼V_DD) state, i.e., on a 0 → 1 transition. Maintaining a value of '1' (1 → 1) consumes little additional energy. To switch a wire from a '1' to a '0', the wire is discharged to the ground for free (Supplementary Note 4). Lastly, maintaining a value of '0' simply keeps the voltage at 0 V, at no cost. Assuming a random distribution of '0' and '1' bits, we therefore include a factor of 1/4 in equation (3) to account for this dependence on switching activity. In the DONN, a light source replaces the wire for fan-out. The low capacitances of the receiverless detectors in the DONN allow for the removal of receiving amplifiers [43]. Thus, the DONN's minimum energy consumption corresponds to the optical energy required to generate a voltage swing of 0.8 V on the load capacitance (i.e., the photodetector (C_det) and an inverter (C_T)), all divided by the source's power conversion efficiency (called wall-plug efficiency, WPE). Subsequent transistors in the multiplier are powered by the off-chip voltage supply, as in the all-electronic architecture. Assuming a detector responsivity of ∼1 A/W:

E_DONN/bit = (1/2) · hν · n_p / WPE   (4)

where hν is the photon energy and the number of photons per bit, n_p, is determined by:

n_p = (C_det + C_T) · V_DD / e   (5)

As in the all-electronic case, we assume low leakage on the receiverless photodetector. Photons are received for every '1' and therefore, to avoid charge buildup, charge on the output capacitor must be reset after every clock cycle. In Supplementary Note 5, we propose a CMOS discharge circuit that actively resets the receiver. (Another possible method is a dual-rail encoding scheme [43].) Thus, the switching activity factor is 1/2 instead of 1/4: as for the all-electronic case, we assume a random distribution of bits, but here, both 1 → 1 and 0 → 1 transitions require optical energy.

References
1. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1097–1105 (2012).
2. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 115–118 (2017).
3. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature, 436–444 (2015).
4. Sze, V., Chen, Y., Yang, T. & Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 2295–2329 (2017).
5. Chen, Y., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 127–138 (2017).
6. Chen, Y.-H., Yang, T.-J., Emer, J. & Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 292–308 (2019).
7. Yin, S. et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE Journal of Solid-State Circuits, 968–982 (2018).
8. Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In 1–12 (2017).
9. Xu, X. et al. Scaling for edge inference of deep neural networks. Nature Electronics, 216–222 (2018).
10. Duarte, J. et al. Fast inference of deep neural networks in FPGAs for particle physics. Journal of Instrumentation, P07027 (2018).
11. Petrillo, C. E. et al. LinKS: discovering galaxy-scale strong lenses in the kilo-degree survey using convolutional neural networks. Monthly Notices of the Royal Astronomical Society, 3879–3896 (2019).
12. Ćiprijanović, A., Snyder, G., Nord, B. & Peek, J. DeepMerge: Classifying high-redshift merging galaxies with deep neural networks. Astronomy and Computing, 1–10 (2019).
14. Huerta, E. A. et al. Enabling real-time multi-messenger astrophysics discoveries with deep learning. Nature Reviews Physics, 600–608 (2019).
15. Gupta, U. et al. The architectural implications of Facebook's DNN-based personalized recommendation. In 488–501 (2020).
16. Lan, Z. et al. ALBERT: A lite BERT for self-supervised learning of language representations. Preprint at arXiv:1909.11942 (2019).
17. Horowitz, M. Computing's energy problem (and what we can do about it). In 10–14 (2014).
18. Poulton, J. W. et al. A 1.17-pJ/b, 25-Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator. IEEE Journal of Solid-State Circuits, 43–54 (2019).
19. Shrivastava, M. et al. Physical insight toward heat transport and an improved electrothermal modeling framework for FinFET architectures. IEEE Transactions on Electron Devices, 1353–1363 (2012).
20. Gupta, M. S., Oatley, J. L., Joseph, R., Wei, G. & Brooks, D. M. Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In 1–6 (2007).
21. Fowers, J. et al. A configurable cloud-scale DNN processor for real-time AI. In 1–14 (2018).
22. Shao, Y. S. et al. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture - MICRO '52, 14–27 (2019).
23. Yin, J. et al. Modular routing design for chiplet-based systems. In 726–738 (2018).
24. Samajdar, A. et al. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim. In 304–315 (IEEE, 2020).
25. Simonyan, K. & Zisserman, A. In International Conference on Learning Representations (2015).
26. Szegedy, C. et al. Going deeper with convolutions. Preprint at arXiv:1409.4842 (2014).
27. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature, 529–533 (2015).
28. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2818–2826 (2016).
29. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 770–778 (2016).
30. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In 1800–1807 (2017).
31. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30, 5998–6008 (2017).
32. Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. Learning transferable architectures for scalable image recognition. In 8697–8710 (2018).
33. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In 7132–7141 (2018).
34. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv:1810.04805 (2018).
35. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog (2019). https://openai.com/blog/better-language-models.
36. Dai, Z. et al. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988 (Association for Computational Linguistics, Florence, Italy, 2019).
37. Hamerly, R., Bernstein, L., Sludds, A., Soljačić, M. & Englund, D. Large-scale optical neural networks based on photoelectric multiplication. Physical Review X, 021032 (2019).
38. Tait, A. N. et al. Neuromorphic photonic networks using silicon photonic weight banks. Scientific Reports, 1–10 (2017).
39. Lin, X. et al. All-optical machine learning using diffractive deep neural networks. Science, 1004–1008 (2018).
40. Shen, Y. et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 441–446 (2017).
41. Feldmann, J. et al. Parallel convolution processing using an integrated photonic tensor core. Preprint at arXiv:2002.00281 (2020).
42. Sun, J., Timurdogan, E., Yaacobi, A., Hosseini, E. S. & Watts, M. R. Large-scale nanophotonic phased array. Nature, 195 (2013).
43. Miller, D. A. B. Attojoule optoelectronics for low-energy information processing and communications. Journal of Lightwave Technology, 346–396 (2017).
44. Bauters, J. F. et al. Ultra-low loss silica-based waveguides with millimeter bend radius. In 1–3 (2010).
45. Iga, K. Vertical-cavity surface-emitting laser: its conception and evolution. Japanese Journal of Applied Physics, 1 (2008).
46. Jäger, R. et al. 57% wallplug efficiency oxide-confined 850 nm wavelength GaAs VCSELs. Electronics Letters, 330–331 (1997).
47. Zheng, P., Connelly, D., Ding, F. & Liu, T.-J. K. FinFET evolution toward stacked-nanowire FET for CMOS technology scaling. IEEE Transactions on Electron Devices, 3945–3950 (2015).
48. Tang, L. et al. Nanometre-scale germanium photodetector enhanced by a near-infrared dipole antenna. Nature Photonics, 226–229 (2008).
49. Keckler, S. W., Dally, W. J., Khailany, B., Garland, M. & Glasco, D. GPUs and the future of parallel computing. IEEE Micro, 7–17 (2011).
50. Dally, W. J. et al. Hardware-enabled artificial intelligence. In 3–6 (2018).
51. Stillmaker, A. & Baas, B. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration, 74–81 (2017).
52. Chang, J. et al. A 7nm 256Mb SRAM in high-k metal-gate FinFET technology with write-assist circuitry for low-VMIN applications. In 206–207 (IEEE, 2017).
53. Saadat, H., Bokhari, H. & Parameswaran, S. Minimally biased multipliers for approximate integer and floating-point multiplication. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2623–2635 (2018).
54. Shoba, M. & Nakkeeran, R. Energy and area efficient hierarchy multiplier architecture based on Vedic mathematics and GDI logic. Engineering Science and Technology, an International Journal, 321–331 (2017).
55. Ravi, S., Patel, A., Shabaz, M., Chaniyara, P. M. & Kittur, H. M. Design of low-power multiplier using UCSLA technique. In Artificial Intelligence and Evolutionary Algorithms in Engineering Systems, 119–126 (2015).
56. Johnson, J. Rethinking floating point for deep learning. Preprint at arXiv:1811.01721 (2018).
57. Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
58. IEEE Transactions on Knowledge and Data Engineering.
59. Mattson, P. et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro, 8–16 (2020).
60. Parashar, A. et al. Timeloop: A systematic approach to DNN accelerator evaluation. In 304–315 (IEEE, 2019).
61. Jacob, B., Warden, P. et al. gemmlowp: a small self-contained low-precision GEMM library. https://github.com/google/gemmlowp (2015, accessed 2020).
62. Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M. & Moshovos, A. Stripes: Bit-serial deep neural network computing. In 1–12 (2016).
63. Albericio, J. et al. Bit-pragmatic deep neural network computing. In 382–394 (2017).
64. Coimbatore Balram, K., Audet, R. & Miller, D. Nanoscale resonant-cavity-enhanced germanium photodetectors with lithographically defined spectral response for improved performance at telecommunications wavelengths. Optics Express, 10228–10233 (2013).

Acknowledgements
Thanks to Christopher Panuski for helpful discussions about µLEDs, and to Angshuman Parashar and Yannan (Nellie) Wu for insights into all-electronic DNN accelerators. We would also like to thank Mohamed Ibrahim for useful discussions on receiver discharging circuits. Anthony Pennes helped with several machining tasks. Thanks to Ronald Davis III and Zhen Guo for manuscript revisions. We also thank the NVIDIA Corporation for the donation of the Tesla K40 GPU used for training the fully-connected networks. Equipment was purchased thanks to the U.S. Army Research Office through the Institute for Soldier Nanotechnologies (ISN) at MIT under grant no. W911NF-18-2-0048. L.B. is supported by a Postgraduate Scholarship from the Natural Sciences and Engineering Research Council of Canada, National Science Foundation (NSF) E2CDA grant no. 1640012 and the aforementioned ISN grant. A.S. is supported by an NSF Graduate Research Fellowship under grant no. 1122374, NTT Research Inc., NSF EAGER program grant no. 1946967, and the NSF/SRC E2CDA and ISN grants mentioned above. R.H. was supported by an Intelligence Community Postdoctoral Research Fellowship at MIT, administered by ORISE through the U.S. DoE/ODNI.
Author contributions
D.E. and R.H. developed the original concept. L.B. designed and performed the hardware experiments with the support of A.S. and D.E. A.S. developed the data acquisition, training, and confusion matrix analysis software. L.B. developed the output image processing software and performed the bit error rate calculations. L.B. and A.S. performed the energy calculations, with critical insights from R.H. J.E. and V.S. provided critical insights into all-electronic hardware comparisons. L.B. and A.S. wrote the manuscript with input from all authors. R.H., J.E., V.S. and D.E. supervised the project.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence
Correspondence and requests for materials should be addressed to L.B., A.S. or D.E.
Supplementary information
Supplementary information is available for this paper.

Freely scalable and reconfigurable optical hardware for deep learning: supplementary material

June 25, 2020

SUPPLEMENTARY NOTE 1: BIT ERROR RATE DUE TO CROSSTALK
Here, we show experimental bit error rate maps for both the blue and red channels. Each DMD pixel is fanned out to a row (column) of superpixels on the camera for the input activations (weights). The Bayer filter allows the discrimination of the input activations from the weights into red and blue channels, respectively. Since the camera has four sub-pixels per superpixel, we bin the sub-pixels into 2 × 2 groups.

SUPPLEMENTARY NOTE 2: CROSSTALK CORRECTION
The bit error rate described in the previous section is mainly attributable to optical crosstalk at the detector, due to imperfect lenses and alignment. Since this error is deterministic (as opposed to random fluctuations), it can be compensated by post-processing. To illustrate this principle, we performed simple crosstalk correction: we multiplied each line of an image detected on the camera by a tridiagonal crosstalk reduction matrix, per equation (S1), where $I'_n$ is the corrected line of the camera image and $\xi$ is the estimated crosstalk coefficient. $I'_n$ is renormalized after this matrix multiplication. We show the effects of crosstalk reduction in Fig. S2.

$$\begin{pmatrix} I'_1 \\ I'_2 \\ \vdots \end{pmatrix} = \begin{pmatrix} 1 & -\xi & & \\ -\xi & 1 & -\xi & \\ & -\xi & 1 & \ddots \\ & & \ddots & \ddots \end{pmatrix} \begin{pmatrix} I_1 \\ I_2 \\ \vdots \end{pmatrix} \quad (\mathrm{S1})$$

To maximize energy efficiency and throughput, the final version of this system (with a custom CMOS chip that integrates detection with digital MAC computation) will not perform any post-processing. Instead, we might use a charge-sharing scheme at the transmitters to implement a version of equation (S1). Alternatively, we could simply reduce crosstalk by changing the system design; for example, we could space the PEs further apart or shrink the active region of the detectors to improve the ratio of signal at the current pixel to noise from neighboring pixels.

SUPPLEMENTARY NOTE 3: TRAINING AND TEST SETS
In our proof-of-concept experiment, we performed inference on 500 images using a two-hidden-layer, fully-connected neural network, where each hidden layer had 100 activations. We used TensorFlow's built-in dataset importer to download the first 500 images in the test set of the MNIST handwritten digit dataset [1], as downloaded from the TensorFlow 2 Keras database. Relevant code can be found in the GitHub repository of user Alexander Sludds (alexsludds): https://github.com/alexsludds/Digital-Optical-Neural-Network-Code
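The forward pass of the network described above can be sketched in plain numpy: 49 inputs (from 7 × 7 downsampled images), two 100-unit hidden layers, and a 10-class output. The random weights, ReLU hidden activations, and softmax output layer here are illustrative assumptions; the trained model itself lives in the linked repository.

```python
import numpy as np

# Sketch of the proof-of-concept classifier (illustrative weights only).
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in parameters with the shapes of the described network.
W1, b1 = rng.normal(0.0, 0.1, (49, 100)), np.zeros(100)
W2, b2 = rng.normal(0.0, 0.1, (100, 100)), np.zeros(100)
W3, b3 = rng.normal(0.0, 0.1, (100, 10)), np.zeros(10)

def forward(x):
    # Each matrix product below is a GEMM of the kind the DONN evaluates
    # by optically fanning out activations and weights to the receivers.
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)

batch = rng.random((5, 49))   # five stand-in downsampled images
probs = forward(batch)        # shape (5, 10); each row sums to 1
```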
The model's weights were pre-trained on an NVIDIA K40 GPU using the entire MNIST training set. Categorical cross-entropy was used as the loss function. Dropout regularized the model's weights in each layer to prevent overfitting. Input images were downsized from 28 × 28 to 7 × 7 (49 pixels).

Fig. S1. Bit error rates in proof-of-concept experiment. (a) Blue channel: errors when a random vector of '0's and '1's is displayed on the DMD (single shot). Blue: incorrectly transmitted bit; white: correct. (b) Region of interest selected for the experiment. Error in the blue channel averaged over 100 frames (a different vector displayed on the DMD at each frame). (c)-(d) Same as (a)-(b), but for the red channel.

Fig. S2. One line of the receiver image after background subtraction and normalization, with random vectors of '1's and '0's displayed on the DMDs. (a) Column 100 in the red channel (same as Fig. 4b in the main text). (b) Same as (a), after crosstalk correction. (c) Row 100 in the blue channel. (d) Same as (c), after crosstalk correction.
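The crosstalk correction of equation (S1) amounts to multiplying each camera line by a tridiagonal matrix (1 on the diagonal, −ξ off it) and then renormalizing. A minimal numpy sketch, with an illustrative ξ and line length (the experimental values are not reproduced here):

```python
import numpy as np

# Per-line crosstalk correction per equation (S1); xi and n are illustrative.
xi = 0.1
n = 8
C = np.eye(n) - xi * (np.eye(n, k=1) + np.eye(n, k=-1))  # tridiagonal matrix

def correct_line(line):
    """Apply the crosstalk-reduction matrix and renormalize the line."""
    out = C @ line
    return out / out.max()

raw = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # one camera line
clean = correct_line(raw)
```

In the corrected line, each '1' pixel has the bleed-through from its two neighbors subtracted, which restores contrast between adjacent on and off superpixels.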
SUPPLEMENTARY NOTE 4: ELECTRONIC INTERCONNECT SWITCHING ENERGY IN 0 TO 1 TRANSITIONS
The dynamic switching energy of CMOS devices is the amount of energy required to charge the output capacitance of a CMOS gate. In a CMOS inverter, energy is consumed only on low-to-high transitions of the output. Consider the toy circuit model shown in Fig. S3. On the left is a CMOS inverter; on the right are a low-to-high and a high-to-low transition, respectively. In the low-to-high transition, the PMOS switches closed, shorting the output to the supply rail and charging the load capacitance. In the high-to-low transition, the NMOS already has a sufficient drain-to-source voltage from the load capacitance charge, so it can discharge the output without consuming any power from the supply. To summarize, for an output that switches from low to high and back to low again, the PMOS initially turns on, taking $C V_{DD}^2$ of energy from the supply; the NMOS then turns on, discharging the $\frac{1}{2} C V_{DD}^2$ stored on the load capacitor (the other $\frac{1}{2} C V_{DD}^2$ is dissipated as heat in the resistive load).

Fig. S3. A demonstration of where dynamic energy is consumed during switching of a CMOS inverter. The circuit, shown left, consists of a stacked NMOS and PMOS device. During an output low-to-high transition, shown center, charge is deposited on the lumped output capacitance. During an output high-to-low transition, shown right, that charge on the lumped output capacitance is discharged through the NMOS into ground.

Fig. S4. A proposed circuit for resetting the receiver lumped-capacitor model.

SUPPLEMENTARY NOTE 5: RESETTING A 'RECEIVERLESS' CIRCUIT
There are several circuit methods by which the accumulated charge on the input capacitor can be reset. In the method shown in Fig. S4, we place an NMOS device, NMOS_Discharge, between the photodetector and ground, and drive its gate with an external reference voltage V_ref. The benefit of this solution is that it consumes no dynamic energy when there is no optical input power. However, it has the tradeoff that it requires additional area on chip and, because it is ratioed logic, requires careful design to ensure functionality. The width of NMOS_Discharge is set such that the accumulated charge on the capacitor still generates a voltage high enough to overcome the input threshold of the load (modeled here as a CMOS inverter), while remaining large enough to dissipate that charge within a single clock cycle. One problem that arises from receiverless photodetection is that a constant stream of '1's arriving at a photodetector without a sufficiently strong NMOS_Discharge causes additional charge to slowly build on the load capacitance. To compensate, we propose a P-N junction diode (Clamp Diode).
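The reset requirement on NMOS_Discharge can be checked with a back-of-the-envelope calculation: the charge deposited by a '1' bit, Q = (C_det + C_T)·V_DD, must be sunk within one clock cycle. The capacitance, supply, and clock values below are illustrative assumptions, not the device parameters of Table 2 in the main text.

```python
# Back-of-the-envelope sizing check for the reset circuit of Fig. S4.
# All numbers below are illustrative assumptions.
C_det = 0.1e-15   # photodetector capacitance in farads (assumed)
C_T   = 0.1e-15   # inverter input capacitance in farads (assumed)
V_DD  = 0.8       # supply swing in volts
f_clk = 1e9       # clock frequency in hertz (assumed)

Q = (C_det + C_T) * V_DD   # charge per received '1' bit
I_min = Q * f_clk          # average discharge current needed per cycle
```

Because the capacitances are in the attofarad-to-femtofarad range, the required discharge current is well below a microampere, which is why a single narrow NMOS device suffices for the reset.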
SUPPLEMENTARY NOTE 6: ELECTRONIC REPEATERS
A naive implementation of a repeater is a double inverter. The energy required per flip is $C_T V_{DD}^2$, since in any transition one inverter makes a low-to-high transition and the other a high-to-low transition; as a result, in any 'flip' of a repeater, one of the two inverters does not consume energy. Using the values in Table 2 of the main text, the cost of a repeater is 0.06 fJ/bit for an output low-to-high transition. Therefore, even in the worst-case scenario where we place a repeater between every multiplier in an array of abutted 8-bit MAC units, the inter-multiplier interconnect energy cost is larger than that of the repeater.

SUPPLEMENTARY NOTE 7: SHOT AND THERMAL NOISE
In a hypothetical crosstalk-free DONN, the remaining noise sources are thermal (Johnson) noise and shot noise. To gain insight into whether they would affect classification accuracy, we estimate the ensuing bit error rates (BERs). The detector registers a '1' when $q \geq q_D$ photoelectrons are generated, and a '0' when $q < q_D$, where we assume the threshold charge is set by $q_D = n_p/2$ electrons. Fig. S5 illustrates the probability distributions of the number of photoelectrons, as well as the probabilities that a '0' is received when a '1' is sent (BER$_1$), and vice-versa (BER$_0$).

Fig. S5.
Schematic representation of the probability density functions of received charge (curves) and bit error rates (shaded regions); not to scale.

In a receiverless photodetector scheme, thermal noise can be approximated as 'kT/C' noise [2], with:

$$\sigma_V = \sqrt{k_B T / (C_{det} + C_T)} \quad (\mathrm{S2})$$

where $\sigma_V$ is the standard deviation of the voltage, $k_B$ is the Boltzmann constant, $T$ is the temperature in Kelvin, $C_{det}$ is the capacitance of the photodetector, and $C_T$ is the capacitance of the inverter. The temperature depends on the quality of heat sinking and the proximity to hot spots; following Ref. [3], we assume it lies within a bounded range. Using the values from Table 2 of the main text, we find $\sigma_V \ll V_{DD}$. We can further verify whether thermal noise is likely to cause bit errors by approximating the probability distribution due to thermal noise, $p_J(q)$, by a Gaussian:

$$p_J(q) = \frac{1}{\sigma_J \sqrt{2\pi}} \, e^{-q^2 / 2\sigma_J^2} \quad (\mathrm{S3})$$

with $\sigma_J = \sqrt{k_B T (C_{det} + C_T)}/e$, since for a transmitted '0' the number of photons is $n_p = 0$. Thus:

$$\mathrm{BER}_0 = \sum_{q=q_D}^{\infty} p(q) = \sum_{q=q_D}^{\infty} p_J(q) \quad (\mathrm{S4})$$

$$\approx \frac{1}{2} \, \mathrm{erfc}\!\left(\frac{q_D}{\sqrt{2}\,\sigma_J}\right) \quad (\mathrm{S5})$$

BER$_0$ for different $n_p = 2 q_D$ is reported in Table S1. We assume shot noise follows a Poissonian probability distribution:

$$p_{shot}(q) = \frac{e^{-n_p} \, n_p^q}{q!} \quad (\mathrm{S6})$$

where $n_p$ is the number of photons per detector per clock cycle. For ease of computation with large $n_p$, we take the natural logarithm:

$$\ln(p_{shot}(q)) = \ln\!\left(\frac{e^{-n_p} \, n_p^q}{q!}\right) \quad (\mathrm{S7})$$

$$= \ln(e^{-n_p}) + q \ln(n_p) - \ln(q!) \quad (\mathrm{S8})$$

$$= -n_p + q \ln(n_p) - \sum_{m=1}^{q} \ln(m) \quad (\mathrm{S9})$$

so that

$$p_{shot}(q) = \exp\!\left(-n_p + q \ln(n_p) - \sum_{m=1}^{q} \ln(m)\right) \quad (\mathrm{S11})$$

The BER$_1$ due to shot noise is therefore:

$$\mathrm{BER}_1^{shot} = \sum_{q=0}^{q_D - 1} p_{shot}(q) \quad (\mathrm{S12})$$

Results of this computation for various $n_p$ are shown in Table S1.

Table S1. Expected values for the BER due to shot and thermal noise for different numbers of transmitted photons per bit, $n_p$ (columns: $n_p$, BER$_0$*, BER$_1^{shot}$, BER$_1^{total}$; rows for $n_p$ = 10, 100, ...).
*BER$_0$ = BER$_0^{thermal}$; we report a range since thermal noise, and therefore BER$_0$, depends on the quality of heat sinking.
†Too small for MATLAB to compute.

Thermal noise will also contribute to BER$_1$; we convolve the probability distributions to find the total bit error rate:

$$\mathrm{BER}_1^{total} = \sum_{q=0}^{q_D - 1} p(q) = \sum_{q=0}^{q_D - 1} \left(p_{shot} \ast p_J\right)(q) \quad (\mathrm{S13})$$

From equation (5) in the main text, we find $n_p$.

REFERENCES
1. LeCun, Y., Cortes, C. & Burges, C. J. C. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/ (1998).
2. Miller, D. A. B. Attojoule optoelectronics for low-energy information processing and communications. Journal of Lightwave Technology, 346–396 (2017).
3. Shrivastava, M. et al. Physical insight toward heat transport and an improved electrothermal modeling framework for FinFET architectures. IEEE Transactions on Electron Devices 59, 1353–1363 (2012).