Neural Sampling Machine with Stochastic Synapse allows Brain-like Learning and Inference
Sourav Dutta, Georgios Detorakis, Abhishek Khanna, Benjamin Grisafe, Emre Neftci, Suman Datta
Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Department of Cognitive Sciences, University of California Irvine, Irvine, CA 92697, USA
*Corresponding author: [email protected]
Many real-world mission-critical applications require continual online learning from noisy data and real-time decision making with a defined confidence level. Probabilistic models and stochastic neural networks can explicitly handle uncertainty in data and allow adaptive learning on the fly, but their implementation in a low-power substrate remains a challenge. In this work, we introduce a novel hardware fabric that implements a new class of stochastic neural network called the Neural Sampling Machine (NSM), which exploits stochasticity in its synaptic connections for approximate Bayesian inference. Harnessing the inherent non-linearities and stochasticity occurring at the atomic level in emerging materials and devices allows us to capture the synaptic stochasticity occurring at the molecular level in biological synapses. We experimentally demonstrate an in silico hybrid stochastic synapse by pairing a ferroelectric field-effect transistor (FeFET)-based analog weight cell with a two-terminal stochastic selector element. Such a stochastic synapse can be integrated within the well-established crossbar array architecture for compute-in-memory (CIM). We experimentally show that the inherent stochastic switching of the selector element between the insulating and metallic states introduces a multiplicative stochastic noise within the synapses of the NSM that samples the conductance states of the FeFET, both during learning and inference. Using experimentally calibrated models, we perform network-level simulations to highlight the automatic weight normalization introduced by the stochastic synapses of the NSM, which paves the way for continual online learning without any offline batch normalization. We also showcase the Bayesian inferencing capability introduced by the stochastic synapse during inference mode, thus accounting for uncertainty in data.
We report a high accuracy of 98.25% on a standard image classification task, as well as the estimation of data uncertainty in original vs. rotated samples. Building such stochastic NSM hardware will allow using inspiration from neuroscience to design an ML architecture that can learn and report uncertainty.
Harnessing the intricate dynamics at the microscopic level in emerging materials and devices has opened up new possibilities for brain-inspired computing, such as building analog multi-bit synapses [1]–[10] and bio-inspired neuronal circuits [10]–[12]. Such emerging materials and devices also exhibit inherent stochasticity at the atomic level, which is often regarded as a nuisance for information processing. In contrast, variability is a prominent feature of biological neural networks at the molecular level and is believed to contribute to the computational strategies of the brain [13]. Such variability has been reported in recordings of biological neurons and as unreliability of the synaptic connections. Typically, a presynaptic neuronal spike causes the release of neurotransmitters at the synaptic release site, as illustrated in Fig. 1(a). Borst [14] reported that synaptic vesicle release in the brain can be extremely unreliable: the transmission rate measured in vivo at a given synapse can be as high as 50% and as low as 10%. Synaptic noise has the distinguishing feature of being multiplicative, which plays a key role in learning and probabilistic inference dynamics. In this work, we propose a novel stochastic synapse that harnesses the inherent variability present in emerging devices and mimics the dynamics of noisy biological synapses. This allows us to realize a novel neuromorphic hardware fabric that can support a recently proposed class of stochastic neural network called the Neural Sampling Machine (NSM) [15]. While the functional role of this multiplicative stochasticity in the brain is still under debate, the biologically inspired stochasticity can be exploited in certain machine learning algorithms.
In particular, NSMs build on the idea of introducing stochasticity at various levels in a neural network to allow (a) escaping local minima during learning and inference [16], (b) regularization in neural networks [17], [18], (c) approximate Bayesian inference with Monte Carlo sampling [19], [20] and (d) energy-efficient communication and computation [21], [22]. The NSM draws inspiration from regularization techniques such as Dropout [17] and DropConnect [18], which randomly drop a subset of neural activations or weights in the neural network during the forward pass of training. In contrast to DropConnect, where stochasticity is switched off during inference, the synaptic stochasticity is always present in an NSM. This always-on stochasticity confers probabilistic inference capabilities to the network [20] and is consistent with the idea of continual learning and lifelong learning machines, while improving energy efficiency [21], [22]. Neural networks equipped with always-on stochasticity have been shown to match or surpass the performance of contemporary machine learning algorithms. Together with the multiplicative noise incorporated in stochastic synapses, this new class of NSM provides an important pathway towards realizing probabilistic inference [23] and active learning [24], [25]. In this work, we propose a hardware implementation of the NSM using hybrid stochastic synapses consisting of an embedded non-volatile memory (eNVM) in series with a two-terminal stochastic selector element. We experimentally demonstrate in silico such a hybrid stochastic synapse by pairing a doped HfO2 FeFET-based analog weight cell with a two-terminal Ag/HfO2 stochastic selector. Such hybrid synapses can be integrated within the prevailing crossbar array architecture for CIM, which provides a promising energy-efficient pathway for building neuromorphic hardware by reducing data movement [26].
We exploit the inherent stochastic switching of the selector element between the insulating and metallic states to perform Bernoulli sampling of the conductance states of the FeFET during both learning and inference. A remarkable side effect of the multiplicative noise dynamics is a self-normalizing effect that performs automatic weight normalization and prevents internal covariate shift in an online fashion. Furthermore, the always-on stochasticity of the NSM during inference mode allows Bayesian inferencing.

Theoretical Model of NSM
Neural Sampling Machines (NSMs) are stochastic neural networks that exploit neuronal and/or synaptic noise to perform learning and inference [15]. Fig. 1(b) shows a schematic illustration in which synaptic stochasticity injects a multiplicative Bernoulli or "blank-out" noise into the model. Such noise can be incorporated in the model as a continuous DropConnect [18] mask on the synaptic weights, such that a subset of the weights is continuously forced to zero, as shown in Fig. 1(b). Next, we lay down a theoretical description of the NSM. We use binary threshold neurons with the following activation function

$$z_i = \operatorname{sgn}(u_i) = \begin{cases} -1, & \text{if } u_i < 0 \\ 1, & \text{if } u_i \geq 0 \end{cases} \tag{1}$$

where $u_i$ is the pre-activation of neuron $i$, given by

$$u_i = \sum_{j=1}^{N} (\xi_{ij} + a_i)\, w_{ij} z_j + b_i \tag{2}$$

where $w_{ij}$ represents the weight of the synaptic connection between neurons $i$ and $j$, and $\xi_{ij}$ is the multiplicative Bernoulli noise, modeled as an independent and identically distributed (iid) random variable with parameter $p$ such that $\xi_{ij} \sim \mathrm{Bernoulli}(p)$ with $\xi_{ij} \in \{0, 1\}$. $b_i$ is a bias term applied per neuron $i$. An additional term $a_i$ is added per neuron $i$ to counter the scaling factor issue due to the multiplicative noise [27]. It can be further shown that for such binary threshold neurons, the probability of a neuron firing is given by

$$P(z_i = 1 \mid \mathbf{z}) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\mathbb{E}(u_i \mid \mathbf{z})}{\sqrt{2\,\mathrm{Var}(u_i \mid \mathbf{z})}}\right)\right] \tag{3}$$

where $\mathbb{E}(u_i)$ and $\mathrm{Var}(u_i)$ are the expectation and variance of $u_i$. For Bernoulli-type noise, the probability of the neuron firing becomes [27]

$$P(z_i = 1 \mid \mathbf{z}) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{(p + a_i)\sum_j w_{ij} z_j}{\sqrt{2p(1-p)\sum_j w_{ij}^2}}\right)\right] = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{(p + a_i)\sum_j w_{ij} z_j}{\sqrt{2p(1-p)}\,\|\mathbf{w}_i\|}\right)\right] = \frac{1}{2}\left[1 + \operatorname{erf}(\mathbf{v}_i \cdot \mathbf{z})\right] \tag{4}$$

with $\mathbf{v}_i = \beta_i \frac{\mathbf{w}_i}{\|\mathbf{w}_i\|}$ and $\beta_i = \frac{p + a_i}{\sqrt{2p(1-p)}}$.

Recent years have seen extensive research on building dedicated hardware for accelerating deep neural networks (DNNs) using the CIM approach. The core computing kernel consists of a crossbar array with perpendicular rows and columns, with eNVMs placed at each cross-point as shown in Fig. 1(c). The weights in the DNN are mapped to the conductance states of the eNVM.
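The closed-form firing probability of Eqn. (4) can be checked numerically against direct sampling of Eqns. (1)-(2). The following is a minimal sketch in plain Python, assuming zero bias ($b_i = 0$) and inputs $z_j \in \{-1, +1\}$. The function names, the layer size of 200 inputs and the random weights are illustrative assumptions and are not taken from the authors' code.

```python
import math
import random


def firing_probability(weights, inputs, p, a=0.0):
    """Closed-form P(z_i = 1 | z) of Eqn. (4), assuming zero bias."""
    drive = sum(w * z for w, z in zip(weights, inputs))
    norm = math.sqrt(sum(w * w for w in weights))
    beta = (p + a) / math.sqrt(2.0 * p * (1.0 - p))
    return 0.5 * (1.0 + math.erf(beta * drive / norm))


def sample_neuron(weights, inputs, p, a, rng):
    """One stochastic evaluation of Eqns. (1)-(2) with b_i = 0."""
    u = 0.0
    for w, z in zip(weights, inputs):
        xi = 1.0 if rng.random() < p else 0.0  # Bernoulli(p) blank-out noise
        u += (xi + a) * w * z
    return 1 if u >= 0 else -1


def monte_carlo_probability(weights, inputs, p, a=0.0, trials=10000, seed=0):
    """Fraction of stochastic passes in which the neuron outputs +1."""
    rng = random.Random(seed)
    fired = sum(sample_neuron(weights, inputs, p, a, rng) == 1
                for _ in range(trials))
    return fired / trials


if __name__ == "__main__":
    rng = random.Random(1)
    n_inputs = 200  # illustrative layer size (assumption)
    w = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    z = [rng.choice([-1, 1]) for _ in range(n_inputs)]
    print("closed form :", firing_probability(w, z, p=0.5))
    print("Monte Carlo :", monte_carlo_probability(w, z, p=0.5))
```

Since the erf form corresponds to a Gaussian approximation of the pre-activation $u_i$, the two estimates should agree more closely as the number of inputs grows.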
The crossbar array performs row-wise weight update and column-wise summation operations in parallel as follows: the input (or read) voltages $\mathbf{V}_{in}$ from the input neuron layer are applied to all the rows and are multiplied by the conductance $\mathbf{G}$ of the eNVM at each cross-point to create a weighted sum current in each column, $\mathbf{I}_{out} = \sum \mathbf{G}\mathbf{V}_{in}$. The output neuron layer placed at the end of the columns converts these analog currents into digital neuronal outputs. Implementing an NSM with the same hardware architecture requires selectively sampling or reading the synaptic weights $G_{ij}$ with some degree of uncertainty, based on random binary variables $\xi_{ij}$ generated for each synapse. We show that this can be easily realized by placing a two-terminal stochastic selector element in series with the eNVM at each cross-point, as illustrated in Fig. 1(c). We choose a selector device that operates as a switch, stochastically switching between an ON state (representing $\xi_{ij} = 1$) and an OFF state ($\xi_{ij} = 0$). A detailed description of such a selector is given later. Fig. 1(d) shows a scenario where an input voltage $V_{in,3}$ is applied to the third row of the synaptic array while the conductances of the synapses are set to $\mathbf{G} = \{G_1, G_2, G_3, G_4, \ldots, G_N\}$. Depending on the state of the selectors at the cross-points, an output weighted sum current $\mathbf{I}_{out} = \{0, G_2 V_{in,3}, 0, G_4 V_{in,3}, \ldots, 0\}$ is generated. This is exactly the same as multiplying the weighted sum of $w_{ij} z_j$ by the multiplicative noise $\xi_{ij}$, as described in Eqn. 2.

Building Blocks for Stochastic Synapse: FeFET-based Analog Weight Cell
The idea of voltage-dependent partial polarization switching in ferroelectric Hf$_x$Zr$_{1-x}$O$_2$ can be leveraged to implement a non-volatile FeFET-based analog synapse.
The FeFET-based synapse can be integrated into a pseudo-crossbar array that is suitable for row-wise weight update and column-wise summation [6], [10], [29]. Fig. 2(a) shows the schematic of a FeFET-based analog synapse without any additional stochastic selector element. The channel conductance $G$ of the FeFET can be modulated by applying write voltage pulses $\pm V_{write}$ to the gate of the FeFET. To read out the conductance state, a small read voltage $V_{read}$ is applied to the gate terminal. With an input voltage $V_{in}$ applied to the drain of the FeFET, the output (drain) current becomes $I_{out} = G V_{in}$. Fig. 2(b) shows the experimentally measured conductance modulation in a 500 nm x 500 nm high-K metal gate FeFET fabricated at the 28 nm technology node [30]. For online learning on crossbar arrays, potentiation and depression pulse schemes with identical pulse amplitudes and widths are typically preferred. Nonetheless, as a proof of concept, we used an amplitude modulation scheme in which write voltage pulses $V_{write}$ of increasing amplitude from 2.8 V to 4 V and a pulse width of 1 μs are applied to modulate the conductance of the FeFET. Applying progressively increasing negative pulses causes the FeFET to transition from the initial low resistance state (LRS), with lower threshold voltage ($V_T$), to the high resistance state (HRS), as shown by the current-voltage characteristics in Fig. 2(b). Similarly, applying progressively increasing positive pulses changes the conductance from HRS to LRS. Fig. 2(c) shows a continuous change in the conductance state of the FeFET upon applying multiple potentiation and depression pulses of varying amplitude and a constant pulse width of 1 μs. The cycle-to-cycle variation in the measured conductance states observed in Fig. 2(c) arises from the inherent stochastic switching dynamics of the individual ferroelectric domains [31].
Such inherent stochasticity also results in a device-to-device variation of the conductance states. To incorporate such variability, we measured the conductance modulation for both potentiation and depression across ten devices, as shown in Fig. 2(d). We incorporate the model of the FeFET-based analog weight cell in the NSM by fitting the conductance update scheme for both potentiation and depression with the closed-form expression

$$\Delta G = \alpha + \beta\left(1 - e^{-(|V_{write}| - V_0)/\gamma}\right)$$

where $\alpha$, $\beta$, $\gamma$ and $V_0$ are the fitting parameters.

Building Blocks for Stochastic Synapse: Ag/HfO2 Stochastic Selector
Next, we describe the characteristics of our stochastic selector device. Fig. 3(a) shows a schematic and a transmission electron microscopy (TEM) image of a fabricated [Ag/TiN/HfO2/Pt] stack with 3 nm TiN and 4 nm HfO2. A stochastic synapse is realized by placing this stochastic selector in series with the FeFET-based analog weight cell, as shown in Fig. 2(b). The [Ag/TiN/HfO2/Pt] metal ion threshold switch device, from here on referred to as the Ag/HfO2 selector device, operates on the principle of metal ion migration through a metal oxide medium, similar to conducting bridge RAM (CBRAM). Starting from an initial OFF state, under an applied external bias, Ag atoms ionize and respond to the electric field, migrating via interstitial hopping from the top electrode to the bottom electrode until a continuous Ag+ filament bridges the top and bottom electrodes. This is accompanied by a change in conductivity of several orders of magnitude as the device turns ON, as shown by the measured current-voltage characteristics in Fig. 3(c). As the field is reduced, the tendency of Ag atoms to form clusters with other Ag atoms, rather than linear chains of atoms in contact with Pt, allows for the spontaneous rupture of the atomic filament, turning OFF the device [32].
The role of TiN in the stack is to limit the initial migration of Ag during the electroforming sweep, such that device reliability is enhanced [33]. Interestingly, upon repeated measurement of the switching characteristic of the selector device, we see a considerable variation in the threshold voltage $V_T$ that triggers the spontaneous formation of the Ag+ filament through the HfO2 insulating matrix and turns ON the device, as shown in Fig. 3(d). Such stochastic switching can be exploited by applying the input voltage $V_{in}$ within the variation window of $V_T$, as shown in Fig. 3(c). This allows stochastic sampling of the conductance state of the FeFET in series. Figs. 3(e) and (f) show two examples of stochastically reading an LRS and an HRS of the FeFET through the stochastic selector. Overall, this validates the proposed idea of using such a hybrid structure as a truly stochastic synapse for implementing the NSM in hardware. The stochastic switching of the selector device is incorporated in the NSM by modeling it as an Ornstein-Uhlenbeck (OU) process. The dynamics of $V_T$ can be described as

$$dV_T = \theta(\mu - V_T)\,dt + \sigma\,dW \tag{5}$$

where $W$ is the Wiener process, $\theta$ describes the magnitude of the mean-reverting force towards the mean $\mu$, and $\sigma$ captures the magnitude of the random fluctuations that drive $V_T$ away from the mean. The calibrated OU process shows excellent agreement with our experimental results, as shown in Figs. 3(g)-(i), in terms of the cycle-to-cycle variation of $V_T$, the overall distribution of $V_T$ and the autocorrelation. Details of the OU calibration are included in the Methods section.

Hardware NSM and Image Classification Task
We test the performance of our hardware NSM, which incorporates the FeFET-based analog weight cell and the stochastic selector as the hybrid stochastic synapse, on an image classification task using the MNIST handwritten digit dataset as an example. Fig.
4(a) shows the network architecture consisting of an input layer with 784 neurons, three fully connected hidden layers with 300 neurons and a softmax output layer of 10 neurons for 10-way classification. For comparison, we chose three networks with the same architecture: (a) a deterministic feedforward multilayer perceptron (MLP), (b) the theoretical NSM model with full-precision synaptic weights and a Bernoulli multiplicative noise for the stochastic synapses and (c) the simulated hardware-NSM using the FeFET-based analog weight cell and the stochastic selector. The hardware NSM is trained using backpropagation and a softmax layer with cross-entropy loss and a minibatch size of 100. During the backward pass of training, the weight update is applied using the derivative of Eqn. (4) and the closed-form equation in Fig. 2(d). During the forward pass, for both learning and inference, the weights are stochastically accessed. This involves calculating the $V_T$ of each selector device at the cross-points in every iteration using the OU process described by Eqn. 5 and constructing a Boolean matrix $\Xi$ such that $\xi_{ij} = 1$ if $V_T \geq V_{T,mean}$ and $\xi_{ij} = 0$ otherwise. Subsequently, we evaluate Eqns. 1-2. The exact nature of the multiplicative noise injected by the stochastic selector is understood by comparing the measured switching probability with the theoretically predicted probability of switching for a Bernoulli process. Fig. 4(b) shows a close match between the measured and theoretically predicted probability, highlighting that our stochastic selector device can inject Bernoulli multiplicative noise. Figs. 4(c) and (d) show the performance of the hardware NSM in terms of the test accuracy, compared with the theoretical NSM model and the conventional MLP network. The theoretical model outperforms the conventional MLP network, as highlighted in [27].
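The stochastic weight access described above (drawing a threshold voltage per selector, building the Boolean matrix $\Xi$ and reading the crossbar) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' simulation code. The threshold update is the one-step solution of an OU process over a time step `dt`, and all parameter values (`theta`, `mu`, `sigma`, `dt`, the conductances and the input voltages) are hypothetical placeholders rather than calibrated values.

```python
import math
import random


def ou_step(v_t, theta, mu, sigma, dt, rng):
    """Advance one selector threshold voltage by one OU time step."""
    decay = math.exp(-theta * dt)
    noise_sd = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * theta))
    return v_t * decay + mu * (1.0 - decay) + noise_sd * rng.gauss(0.0, 1.0)


def sample_mask(thresholds, v_mean):
    """Boolean matrix Xi: xi_ij = 1 if V_T >= V_T,mean, else 0 (rule as stated in the text)."""
    return [[1 if v >= v_mean else 0 for v in row] for row in thresholds]


def crossbar_read(conductances, mask, v_in):
    """Column currents I_out[j] = sum_i xi_ij * G_ij * V_in[i]."""
    n_rows, n_cols = len(conductances), len(conductances[0])
    return [sum(mask[i][j] * conductances[i][j] * v_in[i] for i in range(n_rows))
            for j in range(n_cols)]


if __name__ == "__main__":
    rng = random.Random(0)
    theta, mu, sigma, dt = 1.0, 0.2, 0.02, 1.0         # hypothetical OU parameters
    conductances = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # hypothetical, arbitrary units
    thresholds = [[mu] * 3 for _ in range(2)]          # start every selector at the mean
    v_in = [0.1, 0.1]                                  # hypothetical read voltages
    for step in range(3):
        thresholds = [[ou_step(v, theta, mu, sigma, dt, rng) for v in row]
                      for row in thresholds]
        mask = sample_mask(thresholds, mu)
        print(step, mask, crossbar_read(conductances, mask, v_in))
```

Thresholding at the mean of a symmetric stationary distribution gives an ON probability close to 0.5 in this sketch; other sampling probabilities would require a different comparison level, which is not specified here.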
The simulated hardware-NSM shows test accuracy comparable with the conventional MLP, with the performance mainly limited by the dynamic range and non-idealities of the FeFET-based synaptic weight cell.

Inherent Weight Normalization and Robustness to Weight Fluctuations
As explained earlier, the NSM allows decoupling the weight matrix as $\mathbf{v}_i = \beta_i \frac{\mathbf{w}_i}{\|\mathbf{w}_i\|}$, which provides several advantages. Firstly, an inherent weight normalization can be effectively achieved without resorting to any batch normalization technique by performing gradient descent (calculating derivatives) with respect to the variables $\beta$ in addition to the weights $\mathbf{w}$ as [27]

$$\frac{\partial \mathcal{L}}{\partial \beta_i} = \frac{\sum_j w_{ij}\, \frac{\partial \mathcal{L}}{\partial v_{ij}}}{\|\mathbf{w}_i\|} \tag{6}$$

$$\frac{\partial \mathcal{L}}{\partial w_{ij}} = \frac{\beta_i}{\|\mathbf{w}_i\|} \frac{\partial \mathcal{L}}{\partial v_{ij}} - \frac{\beta_i\, w_{ij}}{\|\mathbf{w}_i\|^2} \frac{\partial \mathcal{L}}{\partial \beta_i} \tag{7}$$

This allows the distribution of the weights in the NSM to remain more stable than in a conventional MLP without any additional weight regularization applied. Fig. 4(e) shows the evolution of the weights of the third layer during learning for three cases: (a) an MLP without any regularization, (b) an MLP with additional regularization and (c) the hardware NSM. The distribution of the NSM weights is narrower and remains concentrated around its mean (low variance). On the other hand, the variance of the weight distribution is larger for the MLP network without weight regularization.

Mitigation of Internal Covariate Shift
The self-normalizing feature of the NSM also prevents the internal covariate shift caused by changes in the input distribution, similar to what is achieved using batch normalization. To highlight this, we next compare the 15th, 50th and 85th percentiles of the input distributions to the last hidden layer during training for all three networks, as shown in Fig. 4(f). The internal covariate shift is clearly visible in the conventional MLP without any normalization, as the input distributions change significantly during learning.
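Returning to the gradients of Eqns. (6) and (7): they can be checked numerically for a single neuron by comparing them with finite differences of a loss evaluated on $\mathbf{v}_i = \beta_i \mathbf{w}_i / \|\mathbf{w}_i\|$. The sketch below uses plain Python and a quadratic toy loss. The loss, the target vector and all numerical values are assumptions made only for illustration.

```python
import math


def normalized_weights(w, beta):
    """v = beta * w / ||w|| for one neuron."""
    norm = math.sqrt(sum(x * x for x in w))
    return [beta * x / norm for x in w]


def toy_loss(v, target):
    """Quadratic toy loss (an assumption, used only to obtain dL/dv)."""
    return 0.5 * sum((vj - tj) ** 2 for vj, tj in zip(v, target))


def analytic_grads(w, beta, target):
    """Gradients with respect to beta and w following Eqns. (6)-(7)."""
    norm = math.sqrt(sum(x * x for x in w))
    v = normalized_weights(w, beta)
    grad_v = [vj - tj for vj, tj in zip(v, target)]
    grad_beta = sum(x * g for x, g in zip(w, grad_v)) / norm        # Eqn. (6)
    grad_w = [beta / norm * g - beta * x / norm ** 2 * grad_beta    # Eqn. (7)
              for x, g in zip(w, grad_v)]
    return grad_beta, grad_w


def numeric_grads(w, beta, target, h=1e-6):
    """Central finite differences of the same loss, for comparison."""
    def loss(w_, beta_):
        return toy_loss(normalized_weights(w_, beta_), target)

    grad_beta = (loss(w, beta + h) - loss(w, beta - h)) / (2 * h)
    grad_w = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grad_w.append((loss(up, beta) - loss(down, beta)) / (2 * h))
    return grad_beta, grad_w


if __name__ == "__main__":
    w, beta, target = [0.3, -1.2, 0.8], 1.5, [0.1, 0.2, -0.4]  # illustrative values
    print("analytic:", analytic_grads(w, beta, target))
    print("numeric :", numeric_grads(w, beta, target))
```

One property that follows directly from Eqn. (7) is that the gradient with respect to $\mathbf{w}_i$ has no component along $\mathbf{w}_i$ itself, which the included check on the dot product illustrates.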
In contrast, the evolution of the input distribution in the hardware NSM remains stable, suggesting that NSMs prevent internal covariate shift through the self-normalizing effect that inherently performs weight normalization.

Bayesian Inferencing and Capturing Uncertainty in Data
Next, we showcase the ability of our simulated hardware-NSM to perform Bayesian inferencing and produce classification confidence. For this, we train our hardware NSM on the full MNIST dataset. During inference, we evaluate the performance of the trained NSM on continuously rotated images of the digits 1 and 2, as shown in Figs. 5(a) and (f). For each of the rotated images, we perform 100 stochastic forward passes and record the softmax input (the output of the last fully connected hidden layer in Fig. 4(a)) as well as the softmax output. We highlight the response of three representative neurons (1, 2 and 4), which show the highest activity among the 10 output neurons. When the softmax input of a particular neuron is larger than that of all the other neurons, the NSM predicts the class corresponding to that neuron. For example, in Figs. 5(b)-(d), for the first seven images, the softmax input for neuron 1 is the largest. Consequently, the softmax output for neuron 1 remains close to 1 and the NSM predicts the images as belonging to class 1. However, as the images are rotated further, even though the softmax output can be arbitrarily high for neuron 2 or 4, predicting that the image belongs to class 2 or 4, respectively, the uncertainty in the softmax output is high (the output covers the entire range from 0 to 1). This signifies that the NSM can account for the uncertainty in the prediction. We quantify the uncertainty of the NSM by looking at the entropy of the prediction, defined as $H = -\sum p \log(p)$, where $p$ is the probability distribution of the prediction. As shown in Figs.
5(d) and (e), when the NSM makes a correct prediction (classifying image 1 as belonging to class 1), the uncertainty measured in terms of the entropy remains 0. However, in the case of wrong predictions (classifying a rotated image of 1 as belonging to class 2 or 4), the uncertainty associated with the prediction becomes large. This is in contrast to the results obtained from a conventional MLP network, where the network cannot account for any uncertainty in the data, as shown in Fig. 5. Similar results are obtained when presenting the NSM with rotated images of digit 2, as shown in Figs. 5(f)-(j).

Conclusion
Stochasticity works as a powerful mechanism for introducing many computational features of a deep neural network, such as regularization and Monte Carlo sampling. This work builds upon the inherent weight normalization feature exhibited by a stochastic neural network, specifically the Neural Sampling Machine (NSM). Such normalization acts as a powerful feature in most modern deep neural networks [28], [34], [35], mitigating internal covariate shift and providing an alternative mechanism for divisive normalization in bio-inspired neural networks [36]. Our proposed theoretical NSM model provides several advantages: (a) it is an online alternative to the otherwise used batch normalization and dropout techniques, (b) it can mitigate saturation at the boundaries of fixed-range weight representations, and (c) it provides robustness against spurious fluctuations affecting the rows of the weight matrix. We demonstrate that the required stochastic nature of the theoretical NSM model can be realized in emerging stochastic devices. This allows seamless implementation of the NSM in hardware using the compute-in-memory architecture. We demonstrate the capability of our proposed hardware NSM to perform an image recognition task on the standard MNIST dataset with high accuracy (98.25%), comparable to a state-of-the-art deterministic neural network.
We also showcase the ability of our hardware NSM to perform probabilistic inferencing and quantify the uncertainty in data. Note that while this work focuses on using the FeFET as the analog weight cell and Ag/HfO2 as the stochastic selector, a hardware NSM can also be realized using other emerging devices. For example, one can use emerging memory candidates such as PCM and RRAM instead of the FeFET as the analog weight cell. For the stochastic selector, other candidates can be explored, including the Ovonic Threshold Switch (OTS) [37], Mixed Ionic Electronic Conductor (MIEC) [38], and Insulator Metal Transition (IMT) oxides [39] such as vanadium dioxide (VO2) [40], [41] and niobium oxide (NbOx) [42], [43].

Methods
Fabrication of Ag/HfO2 stochastic selector
Ag/TiN/HfO2/Pt devices are fabricated on 250 nm SiO2/Si substrates. Bottom electrodes are patterned with e-beam lithography, and 15 nm/60 nm Ti/Pt is deposited via e-beam evaporation. A 4 nm thick HfO2 film is deposited using atomic layer deposition of TDMAH and H2O at 120 °C, followed directly by 3 nm thick TiN deposition with TiCl4 and N2 at 120 °C without breaking vacuum. The 150 nm thick Ag top electrode is then patterned and deposited using e-beam evaporation, followed by a blanket TiN isolation etch in CHF3 and electrical testing.

Calibration of Ornstein-Uhlenbeck (OU) Process
We calibrate the parameters of Eqn. (5) using the experimentally measured threshold voltage $V_T$ of 18 selector devices, such as shown in Fig. 3(d). We use the method of linear regression, which has been established in [44], to recast Eqn. (5) as

$$y = ax + b + \epsilon \tag{8}$$

where $a$ is the slope, $b$ is the intercept and $\epsilon$ is a white noise term. The solution of Eqn. (5) after discretization using the Euler-Maruyama method is given by

$$V_{T,t+1} = V_{T,t}\, e^{-\theta \Delta t} + \mu\left(1 - e^{-\theta \Delta t}\right) + \sigma\sqrt{\frac{1 - e^{-2\theta \Delta t}}{2\theta}}\; \mathcal{N}(0,1) \tag{9}$$

By comparing Eqns.
(8) and (9), we have $a = e^{-\theta \Delta t}$, $b = \mu\left(1 - e^{-\theta \Delta t}\right)$ and $\mathrm{sd}(\epsilon) = \sigma\sqrt{\frac{1 - e^{-2\theta \Delta t}}{2\theta}}$. Solving for $a$, $b$ and $\mathrm{sd}(\epsilon)$, we obtain the OU parameters $\mu = \frac{b}{1 - a}$, $\theta = -\frac{\ln a}{\Delta t}$ and $\sigma = \mathrm{sd}(\epsilon)\sqrt{\frac{-2 \ln a}{\Delta t\,(1 - a^2)}}$. We therefore have to compute $a$, $b$ and the variance of the error of the linear regression in order to calibrate the OU parameters $\mu$, $\theta$ and $\sigma$. The least-squares regression terms are $S_x = \sum_{i=1}^{n} S_{i-1}$, $S_y = \sum_{i=1}^{n} S_i$, $S_{xx} = \sum S_{i-\ldots}$ …

The multilayer perceptron (MLP) network described in Fig. 4(a) was trained with the backpropagation algorithm [45], cross-entropy as the loss function and an adapted version of the Adam optimizer with a learning rate of 0.0003 and betas (0.9, 0.999). We adapted the Adam optimizer to accommodate the updates of the conductance in the FeFET model (see the section Building Blocks for Stochastic Synapse: FeFET-based Analog Weight Cell). The training and testing batch sizes were both set to 100. We trained the network for 200 epochs, and at each epoch we used the full MNIST training set of 60,000 samples. The learning rate was linearly decreased after 100 epochs at a rate that depends on $x/100$, where $x$ is the epoch number. Every two epochs we measured the accuracy of the network using the full MNIST test set of 10,000 samples over an ensemble of 100 samples of the forward pass of the neural network. The accuracy was measured as the ratio of successfully classified digits to the total number of samples in the MNIST test set (10,000). All the experiments ran on an Nvidia Titan X GPU with 12 GB of physical memory and a host machine equipped with an Intel i9 and 64 GB of physical memory running Arch Linux. The source code is written in Python (Pytorch, Numpy, Sklearn) and it will [be freely available on-line upon acceptance for publication].
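The OU calibration procedure described earlier in this section (regressing consecutive threshold-voltage samples and inverting $a$, $b$ and $\mathrm{sd}(\epsilon)$) can be sketched in plain Python. The synthetic trace and the parameter values below are assumptions chosen only to exercise the code. They are not the measured selector data, and the ordinary least-squares fit is written out by hand rather than taken from the authors' implementation.

```python
import math
import random


def simulate_ou(theta, mu, sigma, dt, n_steps, v0, seed=0):
    """Generate a synthetic V_T trace with the discrete update of Eqn. (9)."""
    rng = random.Random(seed)
    a = math.exp(-theta * dt)
    sd = sigma * math.sqrt((1.0 - a * a) / (2.0 * theta))
    trace = [v0]
    for _ in range(n_steps):
        trace.append(trace[-1] * a + mu * (1.0 - a) + sd * rng.gauss(0.0, 1.0))
    return trace


def calibrate_ou(trace, dt):
    """Fit y = a*x + b + eps on consecutive samples, then invert for (mu, theta, sigma)."""
    x, y = trace[:-1], trace[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx                      # regression slope
    b = mean_y - a * mean_x            # regression intercept
    if not 0.0 < a < 1.0:
        raise ValueError("slope outside (0, 1): trace is not mean-reverting")
    resid_var = sum((yi - a * xi - b) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    sd_eps = math.sqrt(resid_var)
    mu = b / (1.0 - a)
    theta = -math.log(a) / dt
    sigma = sd_eps * math.sqrt(-2.0 * math.log(a) / (dt * (1.0 - a * a)))
    return mu, theta, sigma


if __name__ == "__main__":
    # Hypothetical ground-truth parameters, used only to generate test data.
    trace = simulate_ou(theta=2.0, mu=0.5, sigma=0.1, dt=0.1, n_steps=20000, v0=0.5)
    print(calibrate_ou(trace, dt=0.1))
```

Running the fit on a synthetic trace with known parameters, as in this sketch, is one way to check a calibration routine before applying it to measured threshold voltages.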
Data availability The data that support the findings of this study are available from the corresponding author upon request. Code availability The simulation codes used for this study are available from the corresponding author upon request. References [1] S. Yu, P. Y. Chen, Y. Cao, L. Xia, Y. Wang, and H. Wu, “Scaling-up resistive synaptic arrays for neuro-inspired architecture: Challenges and prospect,” in Technical Digest - International Electron Devices Meeting, IEDM , 2015. [2] L. Gao et al. , “Fully parallel write/read in resistive synaptic array for accelerating on-chip learning,” Nanotechnology , 2015. [3] S. Ambrogio et al. , “Equivalent-accuracy accelerated neural-network training using analogue memory,” Nature , 2018. [4] D. Kuzum, R. G. D. Jeyasingh, B. Lee, and H. S. P. Wong, “Nanoelectronic programmable synapses based on phase change materials for brain-inspired computing,” Nano Lett. , 2012. [5] G. W. Burr et al. , “Experimental Demonstration and Tolerancing of a Large-Scale Neural Network (165 000 Synapses) Using Phase-Change Memory as the Synaptic Weight Element,” IEEE Trans. Electron Devices , 2015. [6] M. Jerry et al. , “Ferroelectric FET analog synapse for acceleration of deep neural network training,” in Technical Digest - International Electron Devices Meeting, IEDM , 2018. [7] X. Sun, P. Wang, K. Ni, S. Datta, and S. Yu, “Exploiting Hybrid Precision for Training and Inference: A 2T-1FeFET Based Analog Synaptic Weight Cell,” in Technical Digest - International Electron Devices Meeting, IEDM , 2019. [8] Y. Luo, P. Wang, X. Peng, X. Sun, and S. Yu, “Benchmark of Ferroelectric Transistor Based Hybrid Precision Synapse for Neural Network Accelerator,” IEEE J. Explor. Solid-State Comput. Devices Circuits , 2019. [9] M. Jerry et al. , “Ferroelectric FET based Non-Volatile Analog Synaptic Weight Cell,” University of Notre Dame Notre Dame United States, 2019. [10] S. Dutta, C. Schafer, J. Gomez, K. Ni, S. Joshi, and S. 
Datta, “Supervised Learning in All FeFET-Based Spiking Neural Network: Opportunities and Challenges,” Front. Neurosci. , 2020. [11] T. Tuma, A. Pantazi, M. Le Gallo, A. Sebastian, and E. Eleftheriou, “Stochastic phase-change neurons,” Nat. Nanotechnol. , 2016. [12] S. Dutta et al. , “Programmable coupled oscillators for synchronized locomotion,” Nat. Commun. , 2019. [13] D. C. Knill and A. Pouget, “The Bayesian brain: The role of uncertainty in neural coding and computation,” Trends Neurosci. , 2004. [14] J. G. G. Borst, “The low synaptic release probability in vivo,” Trends Neurosci. , 2010. [15] E. O. Neftci, B. U. Pedroni, S. Joshi, M. Al-Shedivat, and G. Cauwenberghs, “Stochastic synapses enable efficient brain-inspired learning machines,” Front. Neurosci. , 2016. [16] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm for boltzmann machines,” Cogn. Sci. , 1985. [17] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv Prepr. arXiv1207.0580 , 2012. [18] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in , 2013. [19] L. Buesing, J. Bill, B. Nessler, and W. Maass, “Neural dynamics as sampling: A model for stochastic computation in recurrent networks of spiking neurons,” PLoS Comput. Biol. , 2011. [20] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in , 2016. [21] W. B. Levy and R. A. Baxter, “Energy-Efficient Neuronal Computation via Quantal Synaptic Failures,” J. Neurosci. , 2002. [22] J. J. Harris, R. Jolivet, and D. Attwell, “Synaptic Energy Use and Supply,” Neuron . 2012. [23] K. Doya, S. Ishii, A. Pouget, and R. P. N. Rao, Bayesian brain: Probabilistic approaches to neural coding . MIT press, 2007. [24] K. Friston, “The free-energy principle: A unified brain theory?,” Nature Reviews Neuroscience . 
2010.
[25] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” J. Artif. Intell. Res., 1996.
[26] S. Yu, “Neuro-Inspired Computing with Emerging Nonvolatile Memorys,” Proc. IEEE, 2018.
[27] G. Detorakis, S. Dutta, A. Khanna, M. Jerry, S. Datta, and E. Neftci, “Inherent Weight Normalization in Stochastic Neural Networks,” in Advances in Neural Information Processing Systems, 2019, pp. 3286–3297.
[28] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016.
[29] M. Jerry et al., “A Ferroelectric field effect transistor based synaptic weight cell,” J. Phys. D. Appl. Phys., vol. 51, no. 43, p. 434001, 2018.
[30] M. Trentzsch et al., “A 28nm HKMG super low power embedded NVM technology based on ferroelectric FETs,” in Technical Digest - International Electron Devices Meeting, IEDM, 2017.
[31] K. Ni, W. Chakraborty, J. Smith, B. Grisafe, and S. Datta, “Fundamental Understanding and Control of Device-to-Device Variation in Deeply Scaled Ferroelectric FETs,” 2019.
[32] N. Shukla, R. K. Ghosh, B. Grisafe, and S. Datta, “Fundamental mechanism behind volatile and non-volatile switching in metallic conducting bridge RAM,” in Technical Digest - International Electron Devices Meeting, IEDM, 2018.
[33] B. Grisafe, M. Jerry, J. A. Smith, and S. Datta, “Performance Enhancement of Ag/HfO2 Metal Ion Threshold Switch Cross-Point Selectors,” IEEE Electron Device Lett., 2019.
[34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in , 2015.
[35] M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel, “Normalizing the normalizers: Comparing and extending network normalization schemes,” in , 2017.
[36] D. Querlioz, O. Bichler, A. F. Vincent, and C. Gamrat, “Bioinspired Programming of Memory Devices for Implementing an Inference Engine,” Proc.
IEEE, 2015.
[37] D. Kau et al., “A stackable cross point phase change memory,” in Technical Digest - International Electron Devices Meeting, IEDM, 2009.
[38] R. S. Shenoy et al., “MIEC (mixed-ionic-electronic-conduction)-based access devices for non-volatile crossbar memory arrays,” Semiconductor Science and Technology, 2014.
[39] M. Imada, A. Fujimori, and Y. Tokura, “Metal-insulator transitions,” Rev. Mod. Phys., 1998.
[40] C. N. Berglund and H. J. Guggenheim, “Electronic Properties of VO2 near the Semiconductor-Metal Transition,” Phys. Rev., 1969.
[41] R. M. Wentzcovitch, W. W. Schulz, and P. B. Allen, “VO2: Peierls or Mott-Hubbard? A view from band theory,” Phys. Rev. Lett., 1994.
[42] E. Cha, J. Park, J. Woo, D. Lee, A. Prakash, and H. Hwang, “Comprehensive scaling study of NbO2 insulator-metal-transition selector for cross point array application,” Appl. Phys. Lett., 2016.
[43] W. G. Kim et al., “NbO2-based low power and cost effective 1S1R switching for high density cross point ReRAM Application,” in Digest of Technical Papers - Symposium on VLSI Technology, 2014.
[44] A. K. Dixit and R. S. Pindyck, Investment under uncertainty. 2012.
[45] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, 1986.
Acknowledgements
We are grateful to M. Trentzsch, S. Dunkel, S. Beyer, and W. Taylor at Globalfoundries Dresden, Germany for providing 28 nm HKMG FeFET test devices. This project was supported by the National Science Foundation (NSF), and the Nanoelectronics Research Corporation (NERC), a subsidiary of the Semiconductor Research Corporation (SRC), through Extremely Energy Efficient Collective Electronics (EXCEL).
Author Contribution
S. Dutta, G.D., E.N. and S. Datta developed the main idea. S. Dutta and A.K. performed all the measurements. B.G. helped with fabrication of the selector devices. G.D. and E.N. performed the simulations for NSM.
All authors discussed the results, agreed to the conclusions of the paper and contributed to the writing of the manuscript.
Figures
Figure 1. (a) Synaptic stochasticity occurring at the molecular level in biological neural networks. The presynaptic neuronal spike causes the release of neurotransmitters at the synaptic release site with a probability of around 0.1. (b) Schematic of a Neural Sampling Machine (NSM) incorporating a Bernoulli or “blank-out” multiplicative noise in the synapse. This acts as a continuous DropConnect mask on the synaptic weights such that a subset of the weights is continuously forced to be zero. (c) Illustration of an NSM implemented in hardware using a crossbar array architecture for compute-in-memory. Analog weight cells implemented using eNVMs are placed at each cross-point and are augmented with a stochastic selector element. This allows selectively sampling or reading the synaptic weights $G_{ij}$ with some degree of uncertainty, based on random binary variables $\xi_{ij}$ generated for each synapse. (d) Illustration of a scenario where an input voltage $V_{in,3}$ is applied to a row of the synaptic array with conductance states $\mathbf{G} = \{G_1, G_2, G_3, G_4, \ldots, G_N\}$. Depending on the state of the selectors in the cross-points, an output weighted sum current $\mathbf{I}_{out} = \{0, G_2 V_{in,3}, 0, G_4 V_{in,3}, \ldots, 0\}$ is generated, which is exactly the same as multiplying the weighted sum of $w_{ij} z$ by a multiplicative noise $\xi_{ij}$.
Figure 2. (a) Schematic of a stand-alone FeFET-based analog synapse. The channel conductance can be modulated by applying write pulses $\pm V_{write}$ to the gate of the FeFET, while the conductance state is read out by applying a small read voltage $V_{read}$ to the gate terminal. (b) Experimentally measured conductance modulation in a 500 nm x 500 nm high-K metal gate FeFET fabricated at the 28 nm technology node.
An amplitude modulation scheme is used where positive and negative write voltage pulses $V_{write}$ of increasing amplitude from 2.8 V to 4 V and pulse widths of 1 µs are applied to modulate the conductance of the FeFET. (c) Measured continuous change in the conductance state of the FeFET upon applying multiple potentiation and depression pulses of varying amplitude. (d) The FeFET-based analog weight cell is modeled in the NSM by fitting the conductance update scheme for both potentiation and depression with the closed-form expression shown in the figure.
Figure 3. (a) Schematic and TEM of a fabricated stack of [Ag/TiN/HfO2/Pt] with 3 nm TiN and 4 nm HfO2. (b) A stochastic synapse is realized by placing this stochastic selector in series with the FeFET-based analog weight cell. (c) Measured current-voltage characteristics showing an abrupt electronic transition from the insulating state to the metallic state due to the formation of a continuous filament of Ag+ atoms bridging the top and bottom electrodes. A wide window of variation is observed in the threshold voltage $V_T$ that triggers the spontaneous formation of the Ag+ filament. The stochasticity can be exploited by applying the input voltage $V_{in}$ within the variation window of $V_T$. (d) Measured threshold voltage $V_T$ over multiple cycles. (e, f) Stochastically reading an LRS and an HRS of the FeFET through the stochastic selector. (g-i) The stochastic switching of the selector device is modeled using an Ornstein-Uhlenbeck (OU) process. The model shows excellent agreement with the experimental data.
Figure 4. (a) Network architecture of the NSM consisting of an input layer, three hidden fully connected layers and an output layer. (b) Exact match observed between the measured switching probability of the stochastic selector device and the theoretically predicted probability for a Bernoulli distribution, highlighting that our stochastic selector device can inject Bernoulli multiplicative noise.
(c) Evolution of the test accuracy as a function of the epochs for the simulated hardware-NSM using the FeFET-based analog weight cell and the stochastic selector. (d) Comparison of the performance of the simulated hardware-NSM with a deterministic feedforward multilayer perceptron (MLP) and the theoretical NSM model with full-precision synaptic weights and a Bernoulli multiplicative noise for the stochastic synapses. (e) Evolution of the weights of the third layer during learning for three different networks: an MLP without any regularization, an MLP with additional regularization and the simulated hardware-NSM. (f) Evolution of the 15th, 50th and 85th percentiles of the input distributions to the last hidden layer during training for all three networks. Overall, the NSM exhibits a tighter distribution of the weights and activations concentrated around the mean, highlighting the inherent self-normalizing feature.
Figure 5. Comparison of Bayesian inferencing and uncertainty in data between the simulated hardware-NSM and a conventional MLP network. (a, f) Continuously rotated images of the digits 1 and 2 from the MNIST dataset, used for performing Bayesian inferencing. We perform 100 stochastic forward passes during the inference mode for each rotated image of digits 1 and 2 and record the distribution of the (b, g) softmax input and (c, h) softmax output for a few representative output neurons. (d, i) Classification result produced by the NSM for each rotated image. (e, j) The uncertainty of the NSM associated with the prediction, calculated in terms of the entropy $H = -\sum p \log(p)$, where $p$ is the probability distribution of the prediction. When the NSM makes a correct prediction (classifying images 1 and 2 as belonging to classes 1 and 2, respectively), the uncertainty measured in terms of the entropy remains 0. However, in the case of wrong predictions, the uncertainty associated with the prediction becomes large.
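The entropy measure in the Figure 5 caption can be illustrated with a small sketch. This is a toy example and not the simulation code used for the study: it assumes a single layer of weights, assumes that $p$ is the empirical distribution of the predicted class over the stochastic forward passes, and uses hypothetical names (`stochastic_pass`, `vote_entropy`, `p_on`) and a hypothetical selector ON probability of 0.5.

```python
import numpy as np

def stochastic_pass(x, W, p_on, rng):
    """One forward pass with Bernoulli 'blank-out' noise on every synapse.

    Assumption: each weight is read with probability p_on, else it contributes 0.
    """
    xi = rng.random(W.shape) < p_on      # binary mask, True with probability p_on
    logits = x @ (W * xi)                # only the sampled weights contribute
    return int(np.argmax(logits))        # predicted class for this pass

def vote_entropy(x, W, p_on=0.5, n_passes=100, seed=0):
    """Predicted class and entropy H = -sum p log(p) over repeated passes.

    Assumption: p is the fraction of passes voting for each class.
    """
    rng = np.random.default_rng(seed)
    votes = np.array([stochastic_pass(x, W, p_on, rng) for _ in range(n_passes)])
    p = np.bincount(votes, minlength=W.shape[1]) / n_passes
    nz = p > 0                           # treat 0 * log(0) as 0
    entropy = float(-np.sum(p[nz] * np.log(p[nz]))) + 0.0
    return int(np.argmax(p)), entropy

if __name__ == "__main__":
    x = np.array([1.0, 1.0])
    W = np.eye(2)                        # two equally plausible classes
    print(vote_entropy(x, W))
```

With this definition the entropy is exactly 0 when all passes agree on one class and grows as the votes spread over several classes, which matches the qualitative behavior described in the caption.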