Hardware-aware in-situ Boltzmann machine learning using stochastic magnetic tunnel junctions
Jan Kaiser, William A. Borders, Kerem Y. Camsari, Shunsuke Fukami, Hideo Ohno, Supriyo Datta
Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47906 USA
Laboratory for Nanoelectronics and Spintronics, Research Institute of Electrical Communication, Tohoku University, Sendai, Japan
Department of Electrical and Computer Engineering, University of California Santa Barbara, Santa Barbara, CA, 93106 USA
Center for Innovative Integrated Electronic Systems, Tohoku University, Sendai, Japan
Center for Spintronics Research Network, Tohoku University, Sendai, Japan
Center for Science and Innovation in Spintronics, Tohoku University, Sendai, Japan
WPI-Advanced Institute for Materials Research, Tohoku University, Sendai, Japan
(Dated: 9 February 2021)
a) Electronic mail: [email protected]
b) Electronic mail: [email protected]
One of the big challenges of current electronics is the design and implementation of hardware neural networks that perform fast and energy-efficient machine learning. Spintronics is a promising catalyst for this field, with the capabilities of nanosecond operation and compatibility with existing microelectronics. For large-scale, viable neuromorphic systems, however, the variability of device properties is a serious concern. In this paper, we show an autonomously operating circuit that performs hardware-aware machine learning utilizing probabilistic neurons built with stochastic magnetic tunnel junctions. We show that in-situ learning of weights and biases in a Boltzmann machine can counter device-to-device variations and learn the probability distribution of meaningful operations such as a full adder. This scalable, autonomously operating learning circuit using spintronics-based neurons could be of particular interest for standalone artificial-intelligence devices capable of fast and efficient learning at the edge.
Conventional computers use deterministic bits to operate and encode information. While this approach is effective for well-defined tasks like arithmetic operations, there are many difficult tasks, like stochastic optimization, sampling, and probabilistic inference, which instead are readily addressed by utilizing stochasticity. A promising approach for solving these difficult tasks is to use computers that are naturally probabilistic. In a well-known piece, Feynman suggested that, in the same way that quantum computers are important for simulating quantum phenomena, a probabilistic computer could be a natural solution to problems which are intrinsically probabilistic. Recently, utilizing spintronics technology, Borders et al. demonstrated such an autonomously running probabilistic computer consisting of probabilistic bits (p-bits) built with stochastic magnetic tunnel junctions (s-MTJs), which can perform computationally hard tasks like integer factorization. Machine learning is another important field in which probabilistic computation and large quantities of random numbers could be highly beneficial. It holds promise for various tasks like image recognition, medical applications and autonomous driving. For these applications, conventional von Neumann computers are inefficient, and novel computing architectures inspired by information processing in the human brain are of interest. Boltzmann machines offer a promising route for hardware learning due to their local learning rule and tolerance to stochasticity. Boltzmann machines are generative stochastic recurrent neural networks with a large application space ranging from optimization to generative machine learning. This suggests that building a compact hardware implementation in the form of a probabilistic computer that resembles a Boltzmann machine could be highly beneficial in terms of energy consumption and training speed.
While some hardware implementations have been presented for restricted Boltzmann machines (RBMs), in this paper we focus on fully connected, unrestricted Boltzmann machines. The usual problem in learning unrestricted Boltzmann machines is that they are hard to train, since equilibrium samples of the network are harder to extract. In this work, we show a system that performs this sampling naturally and could hence make it possible to train unrestricted Boltzmann machines more efficiently using the natural physics of s-MTJs. A common concern for the development of neuromorphic systems based on emerging devices like s-MTJs is the inevitable device variability. This poses an obstacle to deploying these systems for real-world applications on a large scale while preserving high reliability. Several approaches have been proposed to overcome these challenges on a device level, for example by applying external magnetic fields, performing a calibration phase, or by post-processing. Another interesting approach to counter the effect of variability and realize high performance in neuromorphic systems is to perform training and inference on the same hardware system. In this paper, we present a proof-of-concept demonstration of a probabilistic computer that can perform in-situ learning, allowing it to counter device-to-device variations naturally as part of its learning process. Here, device variability is addressed on a system level. We show that after weights and biases are learned, even devices with non-ideal characteristics can be used to perform given tasks successfully. Such natural variation tolerance could enable large-scale implementations of MTJ-based probabilistic computers.

Hardware-aware learning with MTJ-based p-bits
The main building block of a probabilistic computer is the p-bit, analogous to a binary stochastic neuron (BSN). Its activation function can be described by

$$m_i(t + \tau_N) = \mathrm{sgn}\{\tanh[I_i(t)] - r\}. \quad (1)$$

Here, $m_i$ is a bipolar random variable, $\tau_N$ is the time the p-bit takes to perform the activation operation, $I_i$ is the dimensionless input to p-bit $i$, and $r$ is a uniformly distributed random number between $-1$ and $+1$. Eq. (1) can also be written in binary notation with a unit step function and a sigmoid function. To connect multiple p-bits, a synaptic function computes the input of every p-bit $I_i$ by taking the weighted sum of all p-bit outputs $m_i$,

$$I_i(t + \tau_S) = \sum_j W_{ij} m_j(t), \quad (2)$$

where $\tau_S$ is the synapse execution time and $W_{ij}$ is the weight matrix that couples p-bit $i$ and p-bit $j$. Here, the bias to p-bit $i$ is subsumed into $W_{ij}$. Given a particular weight matrix, every p-bit configuration has a defined probability given by the Boltzmann distribution, where $P(m) \propto \exp[-E(m)]$ with energy
$$E(m) = -\sum_{i<j} W_{ij} m_i m_j.$$

To find a fitting weight matrix for a given data distribution, the weights are trained by performing gradient ascent on the log-likelihood. It is well known that the ideal Boltzmann machine algorithm based on log-likelihood learning is generally intractable, since the learning time scales exponentially with the size of the system. However, it has been shown that approximate versions of the Boltzmann learning rule, like the contrastive divergence algorithm, can be used to perform approximate learning for large Boltzmann machine systems. This algorithmic scaling motivates the use of domain-specific, efficient, and fast hardware accelerators like the p-bit building block, which naturally represents the neuron function of the Boltzmann machine, in order to accelerate the learning process. To map the Boltzmann machine learning algorithm to our hardware system, we use a continuous learning rule similar to the persistent contrastive divergence algorithm, given by

$$\frac{dW_{ij}}{dt} = \frac{\langle v_i v_j \rangle - m_i m_j - \lambda W_{ij}}{\tau_L}, \quad (3)$$

that can be implemented by making use of resistor-capacitor (RC) circuits where each weight is stored as a voltage. Here, $\langle v_i v_j \rangle$ is the average correlation between two neurons in the data distribution, $m_i m_j$ is the correlation of the p-bit outputs, and $\tau_L$ is the learning time constant. The discharging of the capacitor through resistor $R$ is used to regularize the weights, parametrized by $\lambda$. Regularization ensures that the weights do not become too large and helps the algorithm converge to a solution. This learning rule only requires the correlation between two p-bits $m_i m_j$ for updating weight $W_{ij}$, which makes this learning algorithm attractive for hardware implementations. Eq. (3) does not change when the system becomes larger.
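As a minimal software illustration of Eqs. (1)-(3) — a sketch, not the authors' implementation; all parameter values and the margin of convergence below are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def pbit(I):
    """Eq. (1): m = sgn(tanh(I) - r), with r uniform in [-1, 1]."""
    return 1.0 if np.tanh(I) > rng.uniform(-1.0, 1.0) else -1.0

def synapse(W, m):
    """Eq. (2): I_i = sum_j W_ij * m_j (biases folded into W)."""
    return W @ m

def learn_step(W, data_corr, model_corr, lam=0.1, tau_L=10.0, dt=1.0):
    """Forward-Euler step of Eq. (3):
    dW/dt = (<v_i v_j> - m_i m_j - lam * W) / tau_L."""
    return W + dt * (data_corr - model_corr - lam * W) / tau_L

# Two coupled p-bits sampled by sequential (Gibbs-like) updates
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
m = np.array([1.0, -1.0])
corr = []
for _ in range(2000):
    for i in range(2):
        m[i] = pbit(synapse(W, m)[i])
    corr.append(m[0] * m[1])
avg_model_corr = float(np.mean(corr))  # positive for this coupling
```

For ferromagnetic coupling the sampled correlation approaches $\tanh(1) \approx 0.76$, the Boltzmann-distribution value for this two-p-bit network, illustrating that the sequential p-bit updates collect equilibrium statistics.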
Another important advantage of the presented hardware implementation of the Boltzmann machine is that the computationally expensive part of obtaining the equilibrium samples for the correlation term $m_i m_j$ needed for learning is performed naturally. More information about the learning rule is presented in the supplementary information. Note that while we have chosen an RC network in this proof-of-concept experiment to conveniently represent analog voltages as weights, the synaptic functionality in our system could also be implemented with memristor crossbar arrays to support in-situ learning, by mapping the weight update rule (Eq. (3)) to an equation for a changing conductance $G_{ij}$ instead of a changing voltage $V_{ij}$. The use of memristor crossbars would have the main advantage that the weight storage becomes non-volatile. Eqs. (1)-(3) are implemented in hardware to build a probabilistic circuit that performs learning naturally. Fig. 1a shows the block diagram of the learning circuit. The neurons (Eq. (1)) are implemented with an s-MTJ in series with a transistor and a resistor $R_S$. The drain voltage is thresholded using a comparator. The synapse (Eq. (2)) is implemented using a microcontroller in conjunction with a digital-to-analog converter (DAC). To compute the correlation of the p-bit outputs $m_i m_j$, an XNOR gate is needed between the p-bits and the learning block (Eq. (3)), where the weights are updated using an RC array. Fig. 1b shows the printed circuit board (PCB) with the 5 p-bits and the RC array with 15 RC elements used in the experiment. More details about the experimental implementation are presented in the methods section.

Variation-tolerant learning of a full adder
We demonstrate the learning of the hardware circuit using the data distribution of a full adder (FA). A FA has 3 inputs and 2 outputs, resulting in $N = 5$ p-bits. To connect these p-bits, 10 weights and 5 biases have to be learned (in total 15 RC elements, as shown in Fig. 1b). For the FA, the binary inputs $[ABC_{\mathrm{in}}]$ get added, and the outputs are given by the sum $S$ and the carry-out $C_{\mathrm{out}}$. This corresponds to a data distribution given by 8 out of the 32 ($2^N$) possible configurations. In the methods section, the truth table and the mapping from truth table to analog voltages $V_{v,ij}$ are explained in detail. For the FA, the learning is performed for a total of 3000 s. In the supplementary information, learning examples for an AND, OR and XOR gate with fewer p-bits are shown.

Full adder learning with ideal MTJ.
Fig. 2a shows the normalized, time-averaged p-bit response of every p-bit using the ideal s-MTJ implementation when the input voltage $V_{\mathrm{IN}}$ is swept. These s-MTJs are emulated in hardware with two resistances that are randomly selected by a multiplexer (MUX) to obtain nearly ideal p-bit response characteristics (see methods section for more details). Due to variations in the circuit, every curve is slightly shifted from the ideal 50/50 point at $V_{\mathrm{IN}} = 1.95$ V. Even though we are using the MUX model here, it has been shown by Borders et al. that near-ideal p-bit responses can be obtained with real s-MTJs. In previous hardware p-circuit implementations, lateral shifts of the p-bit response had to be eliminated by adjusting synaptic biases to calibrate the experiment. By contrast, in this demonstration no calibration phase is necessary, since the biases are learned during operation. This is a significant advantage, since learning can account for transistor and s-MTJ variations between p-bits. After obtaining the response of all p-bits, the learning experiment is performed (see methods section for more detail about the experimental procedure). The goal of the learning process is that the p-bits fluctuate according to a set data distribution. Since at every point in time the p-bits can only be in one bipolar state, to monitor the training progress, the distribution of the p-bits $P_{\mathrm{Exp}}(t)$ is observed over a fixed time window of 60 s, normalized to 1, and compared to the ideal distribution of a full adder given by the 8 lines of the truth table (see Table I). To compare two probability distributions, the Kullback-Leibler divergence (KL divergence), defined by
$$KL(P_{\mathrm{Ideal}} \,\|\, P_{\mathrm{Exp}}) = \sum_m P_{\mathrm{Ideal}}(m) \log\!\left[\frac{P_{\mathrm{Ideal}}(m)}{P_{\mathrm{Exp}}(m, t)}\right],$$

is commonly used. Fig. 2d shows the learning performance measured by the KL divergence versus time $t$. The difference between the ideal data distribution and the experimental distribution decreases significantly in the first 500 s of learning. At the end of learning, the KL divergence reaches a value of around 0.2. The experimental distribution at $t = 0$, $P_{\mathrm{Exp}}(t = 0)$, is shown in Fig. 2b. At the start of learning, the weights and biases are small and the distribution is close to a uniform random distribution. However, due to slight mismatches in the responses of the individual p-bits (Fig. 2a), some peaks are more prominent than others. The distribution at the end of learning, $P_{\mathrm{Exp}}(t = 3000\,\mathrm{s})$, is shown in Fig. 2c, where the highest peaks correspond to the correct distribution for the FA, demonstrating the circuit's ability to learn the given data distribution. We note that as long as the learned peaks are about equal, the KL divergence can be reduced further by increasing all weight values equally, i.e., decreasing the temperature of the Boltzmann machine. In Fig. 3, the 10 weight voltages across the capacitors, $V_{ij} = V_{v,ij} - V_{C,ij}$, extracted from the circuit are shown. The weights are measured throughout the whole learning process. The blue lines show the weight voltages for the ideal MTJ. After around 500 s the weights saturate and do not change anymore. In the supplementary material, the weight values are compared to the weight matrix commonly used for the FA in logic applications.
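The evaluation described above — the target FA distribution over the $2^5$ states and the KL divergence against a measured histogram — can be sketched as follows. This is an illustrative reimplementation, not the experiment's analysis code; the `eps` guard for empty histogram bins is an assumption:

```python
import numpy as np
from itertools import product

def fa_ideal_distribution():
    """Uniform distribution over the 8 valid full-adder lines
    [A, B, Cin, Cout, S] out of the 2^5 = 32 possible configurations."""
    p = {}
    for a, b, cin in product([0, 1], repeat=3):
        total = a + b + cin
        s, cout = total % 2, total // 2
        p[(a, b, cin, cout, s)] = 1.0 / 8.0
    return p

def kl_divergence(p_ideal, p_exp, eps=1e-12):
    """KL(P_Ideal || P_Exp) = sum_m P_Ideal(m) log[P_Ideal(m)/P_Exp(m)]."""
    kl = 0.0
    for state, p in p_ideal.items():
        q = p_exp.get(state, 0.0) + eps
        kl += p * np.log(p / q)
    return kl
```

For a perfectly learned histogram the KL divergence vanishes; for a uniform histogram over all 32 states (roughly the situation at $t = 0$) it evaluates to $\log(32/8) = \log 4 \approx 1.39$.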
Full adder learning with non-ideal MTJ.

To examine the effects of variability, we investigate the learning experiment implemented with fabricated s-MTJs (see methods section for more details). Fig. 2e shows the $V_{\mathrm{OUT}}$ versus $V_{\mathrm{IN}}$ characteristics for the 5 MTJ-based p-bits averaged over 15 s. At the transition points between the stochastic and the deterministic regions of the response curve, the slope of the response is sharper than at the center of the curve, which shows a gradual increase. The combination of these two characteristics leads to a non-ideal p-bit response that deviates from the ideal response described by Eq. (1). The reason for the distorted shape of the p-bit response is that the MTJs show stochastic behavior over a large window of current flow, on the order of $> 10\ \mu$A. The change of the current flow in the MTJ/transistor branch due to the changing voltage at the gate of the transistor is not large enough to pin the MTJ to the $R_P$ or $R_{AP}$ state. This leads to the distorted shape of the p-bit response in Fig. 2e. For best MTJ characteristics, the stochastic range of current flow should be substantially narrower in the design used here. Fig. 2f and g show the histogram of $P_{\mathrm{Exp}}$ during the first and last 60 s of learning. At the end of learning, the 8 desired peaks are the largest, showing that even though the learning algorithm is based on an ideal p-bit response derived from the Boltzmann distribution, the circuit can still learn the desired functionality. Despite the noted non-idealities, the KL divergence saturates to a comparable level for the ideal and non-ideal MTJs, as shown in Fig. 2d. This can be explained by the fact that in-situ learning can counter device-to-device variations by adjusting weights and biases to fit the system (see supplementary information for more details on the learned bias voltages). In Fig. 3, the red lines show the weight voltages of the non-ideal MTJ over the duration of the learning process. It can be clearly seen that the weights differ significantly between the ideal and non-ideal p-bit implementations while achieving similar performance in the KL divergence, leading to the conclusion that the feedback in the system between data and p-bit outputs is able to learn around variations, a crucial ingredient for achieving a high level of performance under device variability. In the supplementary information, a system simulation on the MNIST dataset is presented to show that the variation tolerance persists when the proposed circuit is scaled up. The fact that the circuit can learn around variations can be useful not just for classical machine learning tasks like classification or unsupervised learning, but also for tasks that have been demonstrated on probabilistic computers, like optimization, inference, or invertible logic. Instead of externally setting the coupling between p-bits, an additional learning task could improve the performance of the p-circuit by ensuring that the coupling between the p-bits is adjusted to the exact hardware p-bit response. In addition, the proposed hardware can be used to represent many different probability distributions by adjusting the coupling between p-bits accordingly. For the particular combination of MTJ and transistor, a voltage change at the input can change the output of the p-bit on a transistor response time scale. Because the transistor response can be faster than the implemented synapse, for this particular experiment each p-bit was updated sequentially through the microcontroller instead of autonomously, to preserve functionality (see Ref. for more details).

Weight extraction.
In the previous sections, we compared the distribution of the output configurations of the hardware p-bits, averaged over 60 s, with the ideal distribution by taking the Kullback-Leibler divergence. In this section, we examine how the weights, extracted as voltages across the capacitors in the circuit, would perform on an ideal platform, i.e., under the Boltzmann distribution, where
$P(m) \propto \exp[-\beta E(m)]$ and $\beta$ is the inverse temperature of the system. The temperature in a Boltzmann machine is a constant factor by which all weights and biases are multiplied, and it represents how strongly the p-bits are coupled with each other. The comparison has particular relevance since the non-ideal effects during learning should have an effect on the weights compared to the weights that would be learned on an ideal machine. Fig. 4 shows the Boltzmann distribution with the weights of Fig. 3. The conversion factor between the voltages $V$ across the capacitors and the dimensionless weights $W$ of the Boltzmann distribution, represented by the temperature factor $\beta$, was chosen such that the relative difference between the peaks of the distribution can be seen clearly. To reduce the effect of noise, the weight values are averaged over the last 10 s of learning. For the example of the FA, it is known from the truth table that an ideal system has no bias. Hence, we do not use the extracted bias but set it to 0 for the Boltzmann distribution. In Fig. 4a it can be clearly seen that, compared to Fig. 2c, the learned distribution differs more from the ideal distribution, since the peaks are not as uniform. The peaks for the configurations $[ABC_{\mathrm{in}}] = 000$, $[C_{\mathrm{out}}S] = 00$ and $[ABC_{\mathrm{in}}] = 111$, $[C_{\mathrm{out}}S] = 11$ are not as prominent as the other 6 peaks that have been learned. This discrepancy becomes even more visible in Fig. 4b, compared to Fig. 2g, where the weights used in the Boltzmann distribution were learned using a less ideal response of the p-bits. Here, only the peaks $[ABC_{\mathrm{in}}] = 000$, $[C_{\mathrm{out}}S] = 00$ and $[ABC_{\mathrm{in}}] = 111$, $[C_{\mathrm{out}}S] = 11$ are prominent. This shows that the learned weights fit the activation of the hardware p-bits but not the ideal Boltzmann distribution. Hence, we can conclude that the probabilistic computer adapted to the non-ideal p-bit response during the in-situ learning process.
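The check described above — plugging extracted weights into the ideal Boltzmann distribution — can be sketched by exhaustive enumeration, which is feasible for $N = 5$. The weight matrix below is a placeholder for illustration, not the measured values:

```python
import numpy as np
from itertools import product

def boltzmann_distribution(W, beta=1.0):
    """P(m) ∝ exp(-beta * E(m)) with E(m) = -sum_{i<j} W_ij m_i m_j,
    i.e. E = -(1/2) m^T W m for symmetric W with zero diagonal,
    enumerated over all 2^N bipolar configurations."""
    N = W.shape[0]
    states = [np.array(s) for s in product([-1, 1], repeat=N)]
    E = np.array([-0.5 * m @ W @ m for m in states])
    p = np.exp(-beta * E)
    return states, p / p.sum()

# Placeholder 2-p-bit example: ferromagnetic coupling favors aligned states
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
states, p = boltzmann_distribution(W)
```

Increasing `beta` (the inverse temperature) sharpens the peaks of the distribution, mirroring the statement above that uniformly larger weights reduce the KL divergence.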
The results presented in this section suggest that learning and inference must be performed on the same hardware to operate reliably. Alternatively, initially training on this non-ideal machine and then transferring the weight values to an ideal system to complete convergence and perform the programmed task could allow for a hardware-based speed-up of the typically time-consuming weight training step. This is similar in spirit to using pre-trained weights in a neural network. While the need for same-hardware operation can be a disadvantage, the advantages of an efficient and compact learning circuit that can be used for both training and inference should outweigh the problems of transferability between platforms. In this section, we have shown that device-to-device variations can be countered by performing hardware-aware in-situ learning, by comparing the learning performance of two systems: one with ideal p-bit responses and the other with non-ideal p-bit responses that differ significantly from Eq. (1). We have shown that the overall performance is the same for both systems after training is finished, while the learned weights (Fig. 3) are different. However, we have also shown that if the weights are extracted from the learning circuit and used to calculate the Boltzmann distribution, the obtained distribution differs substantially from the desired data distribution (Fig. 4b). These observations show clearly that the circuit can learn around device-to-device variations.
Discussion
In this paper, we have presented for the first time a proof-of-concept demonstration of an autonomously operating, fully connected Boltzmann machine using MTJ-based p-bits. Furthermore, we have shown how device-to-device variations can be countered by performing hardware-aware in-situ learning. In the following paragraphs, we compare the presented probabilistic computer with other platforms like conventional CMOS architectures. On the device level, the closest digital CMOS alternative to the MTJ-based p-bit is a linear feedback shift register (LFSR), without considering the analog tunability of the p-bit. A detailed comparison between the p-bit and the LFSR has been performed by Borders et al. The compact MTJ-based p-bit uses around 10x less energy per random bit and has about 300x less area than a 32-bit LFSR. Besides these advantages, a standard LFSR is not tunable like the hardware p-bit and relies on pseudo-randomness. The p-bit based on an s-MTJ relies on thermal noise and is, hence, a true random number generator. This can be significant for applications for which the quality of the randomness is important. On the system level, the p-bits in combination with the synapse (Eqs. (1) and (2)) are utilized to collect samples of the distribution given by the current weights, in order to update the weights according to the correct gradient. Collecting statistics by sampling drives the learning process, since every sample is directly utilized to update the weight voltages (Eq. (3)). Thus, the number of samples per unit time is significant for the speed of the learning process. The MTJ fluctuation time of the p-bit, $\tau_N$, is a significant time scale for the generation of samples, since it describes how fast Eq. (1) can be computed in hardware. The learning time constant $\tau_L$ has to be larger than the MTJ fluctuation time $\tau_N$ to collect enough statistics to ensure convergence of the learning process.
To ensure that every p-bit input is correctly calculated based on the state of the other p-bits, it is important that the synapse time $\tau_S$ is smaller than $\tau_N$. In this experiment, since the synapse time defined by the microcontroller is on the order of 100 µs to 1 ms, $\tau_N$ is on the order of 10-100 ms, which results in slow training on the order of seconds. However, it has to be noted that the time scales of the circuit can be reduced significantly in an integrated version of the proposed circuit, where a synapse based on crossbar architectures can operate at GHz speeds with execution times down to 10 ps, and the fluctuation time of s-MTJs can be on the order of 100 ps. This would allow a substantial decrease of $\tau_L$ and an increase of the learning speed by up to 9 orders of magnitude. Regarding the energy consumption of the synapse block, the efficient p-bit building block presented here can be combined with whichever synapse option provides the most power efficiency. The RC array used here to represent weights as voltages requires a constant memory refresh, similar to mainstream dynamic random-access memory (DRAM). To save energy during the learning process, the presented p-bit building block could be combined with non-volatile synapse implementations like memristive crossbar arrays. The overall power consumption can be estimated using numbers from the literature. The MTJ-based p-bit consumes about 20 µW. In a memristive crossbar, each memristor consumes about 1.25 µW, with additional consumption from the operational amplifiers. The XNOR operation consumes 10 µW. The overall circuit, with 5 p-bits, 15 XNOR gates and memristors, and 5 operational amplifiers, would take approximately
294 µW. This is the projected power consumption of the fully connected Boltzmann machine hardware shown in this work. For specific applications where fewer weight connections between neurons are needed (for example, restricted Boltzmann machines in digital computers), the number of components can be reduced, which results in improved power consumption. In this regard, the estimated power consumption of 294 µW in our work can also be significantly reduced by employing a higher-level approach. Another significant advantage of the probabilistic circuit is that, due to the compactness and area savings of the p-bit, many more p-bits can be put on a chip when scaling up compared to CMOS alternatives like LFSRs. In addition, the p-bit hardware implementation does not rely on any clocking in order to function and hence operates autonomously. This has the advantage that many autonomously operating p-bits can function in parallel, leading to an overall acceleration of the operation. In this context, it has to be noted that the information of the current state of a p-bit has to be propagated to all other p-bits connected to it on a time scale $\tau_S$ that is much shorter than the neuron time $\tau_N$ for the probabilistic circuit to function properly. When the p-bit fluctuation time varies between different p-bits, it has to be ensured that even the fastest p-bit, with fluctuation time $\tau_{N,f}$, fluctuates more slowly than $\tau_S$. Depending on the sparsity of the weight matrix and the ratio of $\tau_S$ to $\tau_N$, the number of parallel operating p-bits has to be adjusted to ensure fidelity of the operation. In a recent paper by Sutton et al., an FPGA design was implemented that emulates a probabilistic circuit where the MTJ-based p-bit is envisioned as a drop-in replacement.
In this complete system-level hardware realization of a p-computer, which can only perform inference and not learning, a drastic reduction in the area footprint of the compact p-bit design compared to digital implementations was confirmed. This shows that an integrated version of the proposed learning circuit based on the p-computer architecture could be very beneficial. While we have shown that device-to-device variations of the shape and shift of the p-bit response can be accounted for by hardware-aware learning, it is important to note that rate variation of the stochastic MTJs between p-bits cannot be reduced by this approach. The system will in the worst case learn as fast as the fluctuation rate of the slowest p-bit, $\tau_{N,s}$, which can slow down the overall operation. However, in the case of p-bits with stochastic MTJs where the thermal barrier of the magnet in the free layer is on the order of $k_B T$, the fluctuation rate does not scale exponentially with the size of the magnet, making the system less susceptible to rate variations. It has to be noted that a way to reduce rate variation in probabilistic circuits based on stable MTJs that are biased using voltages and magnetic fields has been presented by Lv et al. As overall design criteria for the autonomous circuit, the following conditions have to be met: $\tau_S \ll \tau_{N,f}$ and $\tau_{N,s} \ll \tau_L$. In conclusion, we have shown a proof-of-concept demonstration of a fully connected probabilistic computer built with MTJ-based p-bits that can perform learning. We have presented multiple learning examples with up to 5 p-bits and 15 learning parameters. The learning is robust and can operate even with strong device-to-device variations thanks to hardware-aware learning.
This shows that, when scaled up and with faster fluctuating building blocks, probabilistic computers could accelerate computation while reducing energy cost for a wide variety of tasks in the machine learning field, such as generative learning or sampling, as well as for tasks that could benefit from variation tolerance, like optimization or invertible logic.
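The timing hierarchy discussed in this paper ($\tau_S \ll \tau_{N,f}$ and $\tau_{N,s} \ll \tau_L$) can be captured in a small design-check sketch. The numeric margin encoding "much smaller than" is an assumption, as is the example value for $\tau_L$:

```python
def autonomous_design_ok(tau_S, tau_N_fast, tau_N_slow, tau_L, margin=10.0):
    """Design criteria for the autonomous learning circuit:
    the synapse must outpace even the fastest p-bit (tau_S << tau_N,f),
    and learning must average over many fluctuations of the slowest
    p-bit (tau_N,s << tau_L). 'margin' quantifies '<<' (assumed here)."""
    return (margin * tau_S <= tau_N_fast) and (margin * tau_N_slow <= tau_L)

# Time scales quoted for the experiment: tau_S < 1 ms, tau_N = 10-100 ms;
# tau_L of 10 s is an assumed value on the order of the RC constant
ok = autonomous_design_ok(1e-3, 10e-3, 100e-3, 10.0)
```

The same check applied to the projected integrated version (synapse execution down to 10 ps, MTJ fluctuation near 100 ps) shows why $\tau_L$, and with it the learning time, can shrink by orders of magnitude while preserving the hierarchy.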
Methods

MTJ fabrication & characterization.
The MTJs used in this work are fabricated with a stack structure as follows, from the substrate side: Ta(5)/ Pt(5)/ [Co(0.4)/Pt(0.4)]6/ Co(0.4)/ Ru(0.4)/ [Co(0.4)/Pt(0.4)]/ Co(0.4)/ Ta(0.2)/ (CoFe)B(1)/ MgO/ (CoFe)B(1.7)/ Ta(5)/ Ru(5)/ Ta(50). The numbers in parentheses are the nominal thicknesses in nanometers. All films are deposited on a thermally oxidized silicon substrate by dc and rf magnetron sputtering at room temperature. The stacks are then processed into circular MTJs with a nominal junction size of 20-25 nm in diameter by electron-beam lithography and argon ion milling. The samples are annealed at 300°C in vacuum for an hour. MTJs are then cut out from the wafers and wire-bonded to IC sockets to be placed on the p-bit circuit board. To select non-ideal MTJs with suitable characteristics, the MTJ resistance is measured by sweeping the current from negative to positive values, and the time-averaged and high-frequency signals are read across a voltmeter and an oscilloscope, respectively. We measure an approximate tunnel magnetoresistance ratio of 65%, fluctuating between an average $R_P = 18$ kΩ and $R_{AP} = 30$ kΩ. The current at which the resistance switches by half is determined to be $I_{50/50}$, which is the bias current at which the MTJs spend equal time in the P and AP states. The $I_{50/50}$ used in this work ranges from 3 to 5 µA. We measure the average fluctuation time $\tau_N$ by performing retention-time measurements when the MTJ is in either the high (AP) or the low (P) state, using voltage readings from the oscilloscope. To ensure reliable collection of data, the oscilloscope sampling rate is set ten times faster than the fastest recorded fluctuation time of the MTJ. The retention times used in this work range from 10 ms to 100 ms.

Hardware implementation of the p-bit.
Eq. (1) is implemented with the s-MTJ-based p-bit proposed by Camsari et al. and experimentally demonstrated by Borders et al. The p-bit implementation in this paper follows Ref. and is built with an s-MTJ in series with a transistor (2N7000, TO-92-3 package) and a source resistor $R_S$. The supply voltage is set to $V_{\mathrm{DD}} = 200$ mV. The source resistance $R_S$ is chosen so that $I_{50/50}$ flows through the circuit when $V_{\mathrm{IN},0} = 1.95$ V. The voltage at the drain of the transistor is then thresholded using a comparator (AD8694, 16-SOIC package). The reference voltage is chosen to be $V_{\mathrm{REF}} = V_{\mathrm{DD}} - I_{50/50}(R_P + R_{AP})/2$. We have used a comparator to add another node where we can fine-tune $V_{\mathrm{REF}}$. However, in an integrated circuit the transistor should be chosen so that $V_{\mathrm{REF}} = V_{\mathrm{DD}}/2$, as simulated in the references. The overall p-bit is then built with just 1 MTJ and 3 transistors. For the experiment with ideal MTJs, the s-MTJ is emulated by a multiplexer (MUX) model that includes all major characteristics of a real s-MTJ and has been developed by Pervaiz et al., as illustrated in Fig. 5. The s-MTJ is emulated by providing a noise signal to the MUX, where the statistics of the noise depend on $V_{\mathrm{IN}}$ and are generated using a microcontroller that switches between resistors $R_P$ and $R_{AP}$ representing the two resistive states of the s-MTJ. Here, the resistor values are chosen to be $R_P = 11$ kΩ and $R_{AP} = 22$ kΩ. The advantage of this approach is that MTJ parameters like the stochastic range and resistance can be easily manipulated in this model. For the MUX, a MAX394 quad analog multiplexer is used.

Implementation of the synapse.
The synapse is implemented with an Arduino MEGA microcontroller and an 8-channel PMOD DA4 digital-to-analog converter (DAC). The digital output voltages of the p-bits {V_OUT} are fed into the microcontroller together with the analog weight voltages {V_C} of the learning circuit. The internal analog-to-digital converter (ADC) of the microcontroller is used to sense the weight voltages. Eq. (2) is then computed, and the analog input voltages {V_IN} are wired back to the neurons through the DAC. To reduce the synapse time, in every iteration of the synapse operation only one of the 15 analog voltages is read out and updated. This does not affect the circuit performance since the capacitor voltages V_C change slowly. The synapse operation time τ_S is < 1 ms, which is faster than the MTJ fluctuation time. The condition τ_S ≪ τ_N has to be satisfied to ensure fidelity of the autonomous operation of the p-circuit. Implementation of weight updating.
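The synapse step can be sketched as a weighted sum mapped back onto input voltages. The exact form of Eq. (2) and the mapping around the response center V_IN,0 are assumptions of this sketch, as are all names below:

```python
import numpy as np

V_IN0 = 1.95   # V, center of the p-bit response

def synapse(W, h, m, V0):
    """Compute the dimensionless inputs I_i = sum_j W_ij m_j + h_i and map
    them onto analog input voltages, assuming I_i = (V_IN,i - V_IN,0)/V0."""
    I = W @ m + h
    return V_IN0 + V0 * I
```

With all weights and biases at zero, every p-bit receives the response-center voltage V_IN,0, so each fluctuates 50/50 — the starting point of learning.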
Eq. (3) can be written in circuit parameters as

C dv_ij/dt = (v_v,ij − v_m,ij − v_ij)/R   (4)

with ⟨v_i v_j⟩ = v_v,ij/(V_DD/2) and m_i m_j = v_m,ij/(V_DD/2). Eqs. (3) and (4) can be converted into each other by setting W_ij = A_v v_ij/V_0, λ = V_0/(A_v V_DD/2) and τ_L = λRC, where A_v is a voltage gain factor between the voltage across the capacitor and the weight value used for the weighted sum in Eq. (2). This voltage gain is used to adjust the regularization parameter λ of the update rule Eq. (3); a high λ produces smaller weight values during learning. For the FA experiment, A_v is chosen to be 3, which turned out to be a reasonable value for achieving a good degree of regularization while still reaching high peaks in the learned distribution. The voltage V_0 is the reference voltage of the p-bit response defined by I_i(t) = V_IN,i(t)/V_0. The full RC element is depicted in Fig. 1. For proper operation it is important that the learning time constant τ_L is much larger than the neuron time τ_N. To achieve this, a high RC constant is chosen with a 1 MΩ resistor and a 10 µF capacitor. Since this circuit has a high resistance in series with the capacitor, a buffer stage is used between the capacitor and the synapse to ensure that reading the weight voltage does not discharge the capacitor. The buffer is implemented with an operational amplifier (AD8694, 16-SOIC package). The correlations m_i m_j, represented by the voltage v_m,ij, are crucial for learning. To obtain the current correlation between neurons m_i and m_j, their product has to be computed, which is done here by another microcontroller. Since the output m is bipolar (m ∈ {−1, 1}), only negative or positive correlation is possible. Voltage v_m,ij is limited by the output range of the DAC, which spans 0 V to 2.5 V; v_m,ij is hence calculated as v_m,ij = (m_i m_j + 1)/2 · 2.5 V and fed back to the corresponding RC element through another DAC.
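The RC dynamics of Eq. (4) can be checked with a simple forward-Euler integration (illustrative only; component values taken from the experiment, constant drive assumed for clarity):

```python
# Forward-Euler sketch of the RC weight dynamics of Eq. (4):
#   C dv_ij/dt = (v_v_ij - v_m_ij - v_ij) / R
R, C = 1e6, 10e-6      # 1 MOhm, 10 uF -> RC = 10 s
dt = 0.01              # s, step size much smaller than RC

def step(v, v_v, v_m):
    return v + dt * (v_v - v_m - v) / (R * C)

v = 0.0
for _ in range(int(10 * R * C / dt)):   # integrate over 10 RC constants
    v = step(v, v_v=1.0, v_m=0.2)       # constant drive for illustration
# v relaxes toward v_v - v_m = 0.8 V
```

Under a constant difference between the data and model correlation voltages, the capacitor voltage relaxes exponentially toward that difference, which is the analog embodiment of the weight update.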
The described operation is the same as computing the XNOR of two binary variables. Hence, the operation is straightforward, and the programmability of the microcontroller is not essential for the operation of the circuit. Experimental procedure.
Before the start of training, the capacitor is fully discharged so that v_C,ij(t = 0) = v_v,ij, corresponding to W_ij = 0 (compare Fig. 3). At t = 0 the training starts, and the voltages v_C and the p-bit output voltages are measured at sampling frequency f_S. The training is run for T = 3000 s. The data is collected with an NI USB-6351 X SERIES DAQ, which has analog inputs for the 15 weights and biases and digital inputs for the 5 p-bit outputs. The software
LabVIEW is used to record data with a sampling frequency of f_S = 1 kHz. In this paper, the bias arising from the mismatch of the p-bit responses is trained together with the bias needed to learn the data distribution. In principle, these could be separated to obtain a bias value that transfers better to other platforms. However, this separation of calibration and learning is only possible for the bias of each p-bit and not for the weights connecting them, since the calibration cannot be performed with ideal p-bit responses in the hardware system. Mapping of the truth table to node voltages for learning.
For a fully visible Boltzmann machine with N neurons, N(N + 1)/2 weights and biases have to be learned. Independent of the number of data vectors to be learned, a fully integrated version of this circuit needs N(N + 1)/2 RC elements/memristors, N(N + 1)/2 XNOR gates, N p-bits (3 transistors and 1 s-MTJ each), and N operational amplifiers for current summation. The goal of learning is that the fully trained network has the same distribution as the data distribution. For a FA, the data distribution is given by the truth table shown in Table I. The data distribution can be described by a matrix in which the number of columns is equal to the number of neurons N and the number of rows is equal to the number of training examples M. For the biases, another neuron unit with value 1 is added so that there are (N + 1) columns. For the example of a full adder (FA), N = 5 and M = 8 for the 8 lines of the truth table. The matrix V_FA is then an 8×6 matrix where all 0s of the truth table are converted to −1s, since we are using the bipolar representation:

V_FA =
[ −1 −1 −1 −1 −1  1
  −1 −1  1  1 −1  1
  −1  1 −1  1 −1  1
  −1  1  1 −1  1  1
   1 −1 −1  1 −1  1
   1 −1  1 −1  1  1
   1  1 −1 −1  1  1
   1  1  1  1  1  1 ]   (5)

The density matrix is then calculated as D = V^T V/M, which is a 6×6 matrix for the FA:

D_FA = V_FA^T V_FA / M =
[  1    0    0    0    0.5   0
   0    1    0    0    0.5   0
   0    0    1    0    0.5   0
   0    0    0    1   −0.5   0
   0.5  0.5  0.5 −0.5  1     0
   0    0    0    0    0     1 ]   (6)

with M = 8. The values in the last column of the density matrix correspond to the average value of every neuron in the data distribution and are used to learn the biases. Only the terms above the diagonal of D are needed; they are converted to the voltages v_v,ij in the circuit. Since the DAC operates with positive voltages in the range of 0 V to 2.5 V, v_v,ij = (D_ij + 1)/2 · 2.5 V. Data availability
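The construction of V_FA and D_FA can be reproduced in a few lines (a sketch; the variable names are ours):

```python
import numpy as np
from itertools import product

rows = []
for a, b, cin in product([0, 1], repeat=3):        # truth-table order of Eq. (5)
    s = a ^ b ^ cin                                # sum bit
    cout = int(a + b + cin >= 2)                   # carry-out bit
    rows.append([a, b, cin, s, cout, 1])           # last entry: bias unit
V_FA = 2 * np.array(rows) - 1                      # 0/1 -> -1/+1 (bias stays +1)
D_FA = V_FA.T @ V_FA / len(rows)                   # density matrix of Eq. (6)
```

The off-diagonal entries reproduce Eq. (6): each input correlates with the carry-out with strength 0.5, the sum anti-correlates with it, and all column-six averages vanish because the FA distribution is symmetric.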
The datasets generated and analyzed during this study are available from the corresponding authors on reasonable request.
References
1. Feynman, R. P. Simulating physics with computers. Int. J. Theor. Phys. (1982).
2. Borders, W. A. et al. Integer factorization using stochastic magnetic tunnel junctions. Nature, 390–393 (2019).
3. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature, 436 (2015).
4. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med., 24–29 (2019).
5. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw., 85–117 (2015).
6. Big data needs a hardware revolution. Nature, 145 (2018).
7. Sze, V., Chen, Y.-H., Emer, J., Suleiman, A. & Zhang, Z. Hardware for machine learning: Challenges and opportunities.
8. Grollier, J. et al. Neuromorphic spintronics. Nat. Electron. (2020).
9. Merolla, P. A. et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 668–673 (2014).
10. Davies, M. et al. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro, 82–99 (2018).
11. Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Comput., 1771–1800 (2002).
12. Carreira-Perpinan, M. A. & Hinton, G. E. On contrastive divergence learning. in AISTATS vol. 10, 33–40 (Citeseer, 2005).
13. Ernoult, M., Grollier, J. & Querlioz, D. Using Memristors for Robust Local Learning of Hardware Restricted Boltzmann Machines. Sci. Rep. (2019).
14. Bojnordi, M. N. & Ipek, E. Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
15. IEEE Electron Device Lett., 345–348 (2019).
16. Aarts, E. H. L. & Korst, J. H. M. Boltzmann machines and their applications. in PARLE Parallel Architectures and Languages Europe (eds. Bakker, J. W., Nijman, A. J. & Treleaven, P. C.) vol. 258, 34–50 (Springer Berlin Heidelberg, 1987).
17. Osborn, T. R. Fast Teaching of Boltzmann Machines with Local Inhibition. in International Neural Network Conference: July 9–13, 1990, Palais des Congrès, Paris, France.
18. Artificial intelligence and statistics.
19. Advances in Neural Information Processing Systems 25 (eds. Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 2222–2230 (Curran Associates, Inc., 2012).
20. Eryilmaz, S. B. et al. Training a Probabilistic Graphical Model With Resistive Switching Electronic Synapses. IEEE Trans. Electron Devices, 5004–5011 (2016).
21. Tsai, C.-H., Yu, W.-J., Wong, W. H. & Lee, C.-Y. A 41.3/26.7 pJ per Neuron Weight RBM Processor Supporting On-Chip Learning/Inference for IoT Applications. IEEE J. Solid-State Circuits, 2601–2612 (2017).
22. Salakhutdinov, R. Learning and Evaluating Boltzmann Machines.
23. De Rose, R. et al. Variability-Aware Analysis of Hybrid MTJ/CMOS Circuits by a Micromagnetic-Based Simulation Framework. IEEE Trans. Nanotechnol., 160–168 (2017).
24. Lv, Y., Bloom, R. P. & Wang, J.-P. Experimental Demonstration of Probabilistic Spin Logic by Magnetic Tunnel Junctions. IEEE Magn. Lett., 1–5 (2019).
25. Qu, Y. et al. Variation-Resilient True Random Number Generators Based on Multiple STT-MTJs. IEEE Trans. Nanotechnol., 1270–1281 (2018).
26. Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. Nat. Commun., 1–8 (2018).
27. Dalgaty, T., Castellani, N., Querlioz, D. & Vianello, E. In-situ learning harnessing intrinsic resistive memory variability through Markov Chain Monte Carlo Sampling. arXiv:2001.11426 (2020).
28. Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cogn. Sci., 147–169 (1985).
29. Camsari, K. Y., Faria, R., Sutton, B. M. & Datta, S. Stochastic p-bits for invertible logic. Phys. Rev. X, 031014 (2017).
30. Koller, D. & Friedman, N. Probabilistic Graphical Models: Principles and Techniques (MIT Press, 2009).
31. Nair, V. & Hinton, G. E. Implicit Mixtures of Restricted Boltzmann Machines.
32. Hamilton, K. E. et al. Accelerating Scientific Computing in the Post-Moore's Era. ACM Trans. Parallel Comput., 6:1–6:31 (2020).
33. Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. in Proceedings of the 25th International Conference on Machine Learning (ICML '08).
34. Kaiser, J., Faria, R., Camsari, K. Y. & Datta, S. Probabilistic Circuits for Autonomous Learning: A simulation study. Front. Comput. Neurosci. (2020).
35. Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature, 60 (2018).
36. Mahmoodi, M. R., Prezioso, M. & Strukov, D. B. Versatile stochastic dot product circuits based on nonvolatile memories for high performance neurocomputing and neurooptimization. Nat. Commun., 1–10 (2019).
37. Camsari, K. Y., Salahuddin, S. & Datta, S. Implementing p-bits With Embedded MTJ. IEEE Electron Device Lett., 1767–1770 (2017).
38. Pervaiz, A. Z., Datta, S. & Camsari, K. Y. Probabilistic Computing with Binary Stochastic Neurons.
39. Ann. Math. Stat., 79–86 (1951).
40. Sutton, B., Camsari, K. Y., Behin-Aein, B. & Datta, S. Intrinsic optimization using stochastic nanomagnets. Sci. Rep. (2017).
41. Faria, R., Kaiser, J., Camsari, K. Y. & Datta, S. Hardware Design for Autonomous Bayesian Networks. arXiv:2003.01767 (2020).
42. Faria, R., Camsari, K. Y. & Datta, S. Implementing Bayesian networks with embedded stochastic MRAM. AIP Adv., 045101 (2018).
43. Sutton, B. et al. Autonomous Probabilistic Coprocessing with Petaflips per Second. IEEE Access.
44. Advances in Neural Information Processing Systems.
45. Technological exploration of RRAM crossbar array for matrix-vector multiplication. in The 20th Asia and South Pacific Design Automation Conference.
47. Harnessing Intrinsic Noise in Memristor Hopfield Neural Networks for Combinatorial Optimization.
48. Kaiser, J. et al. Subnanosecond Fluctuations in Low-Barrier Nanomagnets. Phys. Rev. Appl., 054056 (2019).
49. Hassan, O., Faria, R., Camsari, K. Y., Sun, J. Z. & Datta, S. Low-Barrier Magnet Design for Efficient Hardware Binary Stochastic Neurons. IEEE Magn. Lett., 1–5 (2019).
50. Pufall, M. et al. Large-angle, gigahertz-rate random telegraph switching induced by spin-momentum transfer. Phys. Rev. B (2004).
51. Li, B. et al. Memristor-based approximated computation. in International Symposium on Low Power Electronics and Design (ISLPED).
52. Brown, W. F., Jr. Thermal Fluctuations of a Single-Domain Particle. Phys. Rev., 1677–1686 (1963).
53. Coffey, W. T. & Kalmykov, Y. P. Thermal fluctuations of magnetic nanoparticles: Fifty years after Brown. J. Appl. Phys., 121301 (2012).
Acknowledgments
J.K. thanks A.Z. Pervaiz for helpful discussions. This work was supported in part by ASCENT, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and in part by JST-CREST JPMJCR19K3, JSPS Kakenhi 19J12206, and Cooperative Research Projects of RIEC. K.Y.C. gratefully acknowledges support from the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant CCF-0939370.
Author contribution
K.Y.C., S.F., H.O. and S.D. planned the study. J.K., K.Y.C. and S.D. developed the mapping of the learning algorithm and the experimental setup. W.A.B. and S.F. prepared and characterized the MTJ devices. J.K. and K.Y.C. conducted the learning experiment and collected results. All authors contributed to the writing of the manuscript. All authors discussed the results.
Competing interests
The authors declare no competing interests.
Additional information Supplementary information is available for this paper.
Figure 1:
Probabilistic Learning Circuit: a Block diagram of the learning circuit. b A photograph of the PCB with the 5 p-bits (each consisting of an s-MTJ, an NMOS transistor and a source resistor R_S), 15 RC elements and 20 operational amplifiers (5 used as comparators and 15 as buffers). The p-bits are interconnected with the RC array as shown in a. Figure 2:
Full Adder (FA) learning: a Response of the ideal MTJ for the 5 p-bits used in the FA; every point is averaged over 15 s. b Experimental distribution of the ideal MTJ for the first 60 s of learning. c Experimental distribution of the ideal MTJ over the last 60 s of learning. d KL divergence between the ideal and experimental distributions vs. time for the ideal and non-ideal MTJ; each experimental distribution is obtained over 60 s of learning. e Response of the non-ideal MTJ for the 5 p-bits used in the FA; every point is averaged over 15 s. f Experimental distribution of the non-ideal MTJ for the first 60 s of learning. g Experimental distribution of the non-ideal MTJ over the last 60 s of learning. Figure 3:
Weight voltages during FA learning:
The 10 weight voltages are shown during the 3000 s of learning. Blue lines are the weights learned with the ideal MTJ; red lines show the weights for the non-ideal MTJ. The solid lines in the middle are the moving average of the actual weights taken over a window of 10 s.
Figure 4:
Boltzmann distribution obtained from learned weights: a Boltzmann distribution computed by using the learned weights of the FA with the ideal s-MTJ p-bits. b Boltzmann distribution computed by using the learned weights of the FA with the real s-MTJ p-bits. Figure 5:
MUX model:
The s-MTJ based p-bit on the left is modeled by a multiplexer that switches randomly between R_P and R_AP as a function of V_IN so that the correct statistics are preserved.

A  B  C_in  S  C_out  P_Ideal
0  0  0  0  0  0.125
0  0  1  1  0  0.125
0  1  0  1  0  0.125
0  1  1  0  1  0.125
1  0  0  1  0  0.125
1  0  1  0  1  0.125
1  1  0  0  1  0.125
1  1  1  1  1  0.125
Table I:
Truth Table of Full Adder: A and B are inputs, C_in is the carry-in, S the sum and C_out the carry-out. In the Boltzmann machine context, all visible units are equivalent, so that inputs and outputs can equally be written as v. P_Ideal is the ideal probability distribution, where every line has probability p = 1/8 = 0.125. Supplementary Information: Hardware-aware in-situ Boltzmann machine learning using stochastic magnetic tunnel junctions
LEARNED WEIGHTS AND BIASES
In the main manuscript, the learned probability distribution of the full adder is analyzed. In this section the actual weight voltages across the capacitors (Fig. 3 in the main manuscript) are compared to the ideal FA weights. The weight matrix of a FA for an ideal p-computer with ideal sigmoidal p-bit responses has been part of several works and is given by

W_FA =
[  0 −1 −1  1  2
  −1  0 −1  1  2
  −1 −1  0  1  2
   1  1  1  0 −2
   2  2  2 −2  0 ]   (S1)

Since the ideal FA probability distribution is symmetric, the bias vector is 0 and can be disregarded here.
In Fig. 3 of the main manuscript the weight voltages v_ij = v_v,ij − v_C,ij extracted from the RC circuit are shown. Since the p-bit response in the circuit has units of voltage whereas the ideal p-bit response is unitless, there is a constant conversion factor between W_FA and the weight voltages in Fig. 3. Since the p-bit responses differ for the two cases, the learned weight voltages are not identical; the weights are learned to fit the given non-ideal response of each p-bit. However, at the end of the learning process the general structure of the weight-voltage matrix extracted from the experiment is clearly similar to W_FA for both the ideal and the non-ideal MTJ system; the learned voltages reproduce the pattern of the ±1 and ±2 entries in Eq. (S1). This makes the point that even though the weights learned in this experiment are not ideal due to the non-ideal p-bit responses, they are related to the weights of an ideal p-computer. Initializing with the weights learned on a hardware probabilistic computer could hence reduce learning time when trying to learn based on an ideal Boltzmann distribution, as mentioned in the main manuscript. In Fig. S1 the learned biases are shown. Since the ideal learned biases are 0, the biases learned in this experiment account for p-bit responses shifted away from the ideal response center at V_IN = 1.95 V. Since the p-bit responses for the non-ideal MTJ in Fig. 3a of the main manuscript are shifted to the left, all biases are negative and larger in magnitude than the biases needed for the ideal MTJ.
BOLTZMANN MACHINE LEARNING ALGORITHM
For learning probability distributions in the context of energy-based models like Boltzmann machines, the common learning algorithm is gradient ascent of the log-likelihood, given by

L(W; V) = Σ_{n=1..M} log( exp[−E(v^n; W)] / Z )   (S2)

where Z is the partition function and the data distribution is given by V = {v^n}, n = 1..M. The gradient-ascent update rule is

W_ij(t + 1) = W_ij(t) + ε ∂L(W; V)/∂W_ij |_{W(t)}   (S3)

with the learning rate ε. Evaluating the derivative of L(W; V) gives

W_ij(t + 1) = W_ij(t) − ε ⟨∂E(m)/∂W_ij⟩_data + ε ⟨∂E(m)/∂W_ij⟩_model   (S4)

The data term in the derivative arises from exp[−E(v^n; W)] and the model term from the partition function Z in Eq. (S2). With the energy given by E(m) = −Σ_{i<j} W_ij m_i m_j, the Boltzmann machine learning rule becomes

W_ij(t + 1) = W_ij(t) + ε(⟨v_i v_j⟩ − ⟨m_i m_j⟩)   (S5)

Eq. (3) of the main manuscript is a time-continuous version of Eq. (S5) where the averaged correlation ⟨m_i m_j⟩ is replaced with the sampled correlation m_i m_j (compare Ref. ). A similar formula can be derived for the biases. It has to be noted that the learning rule in Eq. (S5) assumes ideal sigmoidal p-bit responses since it is derived from the Boltzmann law. In this paper, however, the same learning rule is also applied when the p-bit responses are non-sigmoidal with significant variations, and good learning results are still achieved. LEARNING OF AND, OR AND XOR GATE
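As an illustration of Eq. (S5), a tiny fully visible Boltzmann machine can be trained with exact model averages, which are feasible to enumerate for 3 units. This is our sketch (not the hardware procedure), using the AND-gate data distribution as the target:

```python
import numpy as np
from itertools import product

# AND-gate data in bipolar form (A, B, out)
data = np.array([[-1, -1, -1], [-1, 1, -1], [1, -1, -1], [1, 1, 1]])
states = np.array(list(product([-1, 1], repeat=3)))   # all 8 configurations

def model_stats(W, b):
    """Exact <m_i m_j> and <m_i> under the Boltzmann distribution."""
    E = -0.5 * np.einsum('si,ij,sj->s', states, W, states) - states @ b
    p = np.exp(-E)
    p /= p.sum()
    return states.T @ (p[:, None] * states), states.T @ p, p

W, b, eps = np.zeros((3, 3)), np.zeros(3), 0.05
d_corr = data.T @ data / len(data)
d_mean = data.mean(axis=0)
for _ in range(2000):
    corr, mean, p = model_stats(W, b)
    W += eps * (d_corr - corr)          # Eq. (S5) with exact model averages
    np.fill_diagonal(W, 0)
    b += eps * (d_mean - mean)
# after training, the four AND states carry most of the probability
```

Because the fully visible log-likelihood is concave, this exact-gradient version converges monotonically; the in-situ circuit replaces the exact model average with a single sampled correlation per iteration.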
In this section learning examples with smaller numbers of p-bits are presented. The same PCB is utilized but only 3 p-bits and 6 RC elements are used for the AND and OR gate and 4 p-bits and 10 RC elements are used for the XOR gate. Here, the ideal MUX model is used.
Learning the AND-Gate
For an AND-Gate, the truth-table matrix in the bipolar representation V_AND, with an added column of +1 for the bias, is given by

V_AND =
[ −1 −1 −1  1
  −1  1 −1  1
   1 −1 −1  1
   1  1  1  1 ]   (S6)
The density matrix is then given by

D_AND = V_AND^T V_AND / M =
[  1    0    0.5   0
   0    1    0.5   0
   0.5  0.5  1    −0.5
   0    0   −0.5   1 ]   (S7)

with M = 4. In total, 6 parameters have to be learned. Fig. S2a shows the p-bit responses of the 3 p-bits used for AND gate learning. Fig. S2b shows the KL divergence, and Figs. S2c and d show the histograms at the start and at the end of learning. For the AND gate, after learning, the 4 desired states out of the 8 possible configurations become most and equally probable. Learning the OR-Gate
For an OR-Gate, the truth-table matrix in the bipolar representation V_OR, with an added column of +1 for the bias, is given by

V_OR =
[ −1 −1 −1  1
  −1  1  1  1
   1 −1  1  1
   1  1  1  1 ]   (S8)

The density matrix is then given by

D_OR = V_OR^T V_OR / M =
[  1    0    0.5  0
   0    1    0.5  0
   0.5  0.5  1    0.5
   0    0    0.5  1 ]   (S9)

with M = 4. In total, 6 parameters have to be learned. Fig. S3a shows the p-bit responses of the 3 p-bits used for OR gate learning. Fig. S3b shows the KL divergence, and Figs. S3c and d show the histograms at the start and at the end of learning. For the OR gate, after learning, the 4 desired states out of the 8 possible configurations become most and equally probable. Learning the XOR-Gate
For an XOR-Gate, even though there are just 2 inputs and 1 output, an auxiliary neuron is needed to be able to learn the XOR functionality. Without an additional p-bit, all non-diagonal entries of the density matrix are 0, which corresponds to no learning at all. Here, we choose the auxiliary neuron to occupy the first column of the V_XOR matrix and to be 1 for the first entry and −1 for the last 3 entries of the XOR truth-table matrix:

V_XOR =
[  1 −1 −1 −1  1
  −1 −1  1  1  1
  −1  1 −1  1  1
  −1  1  1 −1  1 ]   (S10)
The density matrix is then given by

D_XOR = V_XOR^T V_XOR / M =
[  1   −0.5 −0.5 −0.5 −0.5
  −0.5  1    0    0    0
  −0.5  0    1    0    0
  −0.5  0    0    1    0
  −0.5  0    0    0    1 ]   (S11)

with M = 4. It can be clearly seen that without the first column of V_XOR, all off-diagonal terms of D_XOR would be 0. In total, 10 parameters have to be learned. Fig. S4a shows the p-bit responses of the 4 p-bits used for XOR gate learning. Fig. S4b shows the KL divergence, and Figs. S4c and d show the histograms at the start and at the end of learning. The figure shows the histogram of only 3 of the 4 p-bits (2 inputs, 1 output), without plotting the states of the auxiliary p-bit. For the XOR gate, after learning, the 4 desired states out of the 8 possible configurations become most and equally probable.
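The need for the auxiliary unit can be verified directly (a sketch; the bias column is omitted for brevity):

```python
import numpy as np

# XOR truth table in bipolar form (A, B, out): all pairwise correlations vanish
V_xor = np.array([[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]])
D = V_xor.T @ V_xor / 4          # off-diagonal terms are all zero

# prepending the auxiliary column of Eq. (S10) restores nonzero correlations
aux = np.array([1, -1, -1, -1])
V_aug = np.column_stack([aux, V_xor])
D_aug = V_aug.T @ V_aug / 4
```

A pairwise learning rule driven only by second-order correlations sees nothing in the bare XOR data; the auxiliary unit converts the third-order structure into learnable pairwise terms.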
SIMULATIONS OF THE PROPOSED CIRCUIT FOR LARGER NETWORKS
In this section we use a behavioral model on the MNIST dataset to show that the variation tolerance observed in our proof-of-concept experiment can be transferred to larger scales. It has to be noted that the circuit implemented in our proof-of-concept experiment is a fully visible Boltzmann machine that does not make use of any hidden neurons. This means that the states of all nodes of the Boltzmann machine are given by the data distribution. Hidden neurons add representational power to a Boltzmann machine and are needed for reaching high absolute accuracy on image recognition tasks like MNIST. The MNIST dataset has 60000 training images and 10000 test images of 28x28 pixels with digits from 0 to 9. The fully visible Boltzmann network used here consists of 794 p-bits (28x28 = 784, plus 10 p-bits used as labels). The MNIST dataset is transformed into bipolar values, and Algorithm 1, which emulates the circuit's behavior, is used for learning. For every iteration of the p-bit update procedure, the behavioral model proposed by Faria et al. for the hardware p-bit implementation is utilized, a model that has been benchmarked against SPICE simulations. In addition, the activation function is changed to account for device-to-device variations. To model the behavior of the proposed circuit we use the formula

act(x, s) = tanh[(1 − s) x + s ⌊x⌉]   (S12)

where s ∈ [0, 1] parameterizes how ideal the response of the p-bit is and ⌊x⌉ denotes rounding to the nearest integer. In Fig. S5a, Eq. (S12) is compared to a non-ideal p-bit response observed in the experiment. For s = 0 the ideal p-bit response is recovered, whereas for s = 1 the p-bit response looks like a staircase. It can be clearly seen that the model is very close to the observed experimental behavior of the p-bits. To simulate the variation behavior, the factor s is drawn from a Gaussian distribution with mean μ_s and standard deviation σ_s for every p-bit. In Fig. S5b the accuracy of the circuit is shown for every iteration of Algorithm 1 for different distributions of s across the p-bits. To obtain test results, the 784 p-bits that correspond to the pixels are clamped to the bipolar test data while the label p-bits fluctuate freely; the label p-bit with the highest probability of being "1" gives the classified digit. The learning is performed for different values of μ_s and σ_s. The accuracy saturates to about 81% for all 3 curves shown, while the learned weights differ (Fig. S5c,d). This shows that the learning can account for non-ideal p-bit responses by finding weights adapted to them and still obtain similar accuracy. The behavioral model simulation suggests that the learning duration of the task shown in Fig. S5 could be around 100 ns with Δt = 1 ps in an ideally optimized integrated circuit using MTJ-based p-bits. The 81% accuracy is due to the chosen fully visible network structure without any hidden units; this low performance is not due to the hardware components but due to the low representational power of the fully visible Boltzmann machine. The same circuit with hidden nodes could, for example, be implemented by time-sharing the p-bit circuit for collecting data and model statistics, but this is out of the scope of this paper. REFERENCES
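The varied activation and the device-to-device draw can be sketched as follows. Note that reading the staircase term of Eq. (S12) as nearest-integer rounding, and the clipping of s to [0, 1], are assumptions of this sketch:

```python
import numpy as np

def act(x, s):
    """Varied p-bit activation in the spirit of Eq. (S12); the staircase
    term is taken as s * round(x) (an assumption of this sketch)."""
    return np.tanh((1 - s) * x + s * np.round(x))

rng = np.random.default_rng(1)

def draw_shape_factors(n, mu, sigma):
    """One shape factor s per p-bit, drawn from N(mu, sigma), clipped to [0, 1]."""
    return np.clip(rng.normal(mu, sigma, n), 0.0, 1.0)
```

For s = 0 the function reduces to the ideal tanh response, while for s = 1 the argument is quantized, producing the staircase-like response observed for non-ideal devices.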
1. Hassan, O., Camsari, K. Y. & Datta, S. Voltage-Driven Building Block for Hardware Belief Networks. IEEE Des. Test, 15–21 (2019).
2. Pervaiz, A. Z., Sutton, B. M., Ghantasala, L. A. & Camsari, K. Y. Weighted p-Bits for FPGA Implementation of Probabilistic Circuits. IEEE Trans. Neural Netw. Learn. Syst., 1920–1926 (2019).
3. Carreira-Perpinan, M. A. & Hinton, G. E. On contrastive divergence learning. in AISTATS vol. 10, 33–40 (Citeseer, 2005).
4. Koller, D. & Friedman, N. Probabilistic Graphical Models: Principles and Techniques (MIT Press, 2009).
5. Kaiser, J., Faria, R., Camsari, K. Y. & Datta, S. Probabilistic Circuits for Autonomous Learning: A simulation study. Front. Comput. Neurosci. (2020).
6. LeCun, Y., Cortes, C. & Burges, C. J. C. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
7. Le Roux, N. & Bengio, Y. Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Neural Comput., 1631–1649 (2008).
8. Faria, R., Kaiser, J., Camsari, K. Y. & Datta, S. Hardware Design for Autonomous Bayesian Networks. arXiv:2003.01767 (2020).
Figure S1:
Bias voltages during FA learning: a Bias voltages across the capacitors for the ideal MTJ, b bias voltages for the non-ideal MTJ.
Figure S2:
AND-Gate: a Response for the 3 p-bits used in the AND-Gate. b KL-divergence between ideal and experimental distribution vs. time. The experimental distribution is obtained over 30 s of learning. c Experimental distribution for the first 30 s of learning. d Experimental distribution over the last 30 s of learning. The voltage gain π΄ v is set to 3. Figure S3:
OR-Gate: a Response for the 3 p-bits used in the OR-Gate. b KL-divergence between ideal and experimental distribution vs. time. The experimental distribution is obtained over 30 s of learning. c Experimental distribution for the first 30 s of learning. d Experimental distribution over the last 30 s of learning. The voltage gain π΄ v is set to 3. Figure S4:
XOR-Gate: a Response for 4 p-bits used in the XOR-Gate. b KL-divergence between ideal and experimental distribution vs. time. The experimental distribution is obtained over 60 s of learning. c Experimental distribution for the first 60 s of learning. d Experimental distribution over the last 60 s of learning. The voltage gain π΄ v is set to 4. Figure S5:
Learning with behavioral p-bit model on MNIST dataset: a Experimental p-bit response is compared to the model of Eq.(S12) for different values of π where π₯ is fitted to the input voltage π IN . b Test set accuracy on the MNIST dataset during training c,d
Example weights during training. The following parameters are used in the behavioral model: neuron time τ_N = 150 ps, synapse time τ_S = 10 ps, transistor time τ_T = 25 ps and Δt = 1 ps. The learning parameters used are ε = 10^−5 and λ = 0. Algorithm 1:
Behavioral model of the proposed learning circuit. Given a data set V, calculate the density matrix D = V^T V / M; initialize W to 0 and initialize m ∈ {−1, 1} randomly; for i = 1 : T (number of iterations) do: get m from the p-bit sampling procedure (Eqs. (1), (2)); calculate K = m m^T; update W_uv = W_uv + ε(D_uv − K_uv − λ W_uv) (Eq. (3)); set the diagonal terms of W to 0.
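Algorithm 1 can be made runnable as below. Sequential p-bit updates stand in for the autonomous hardware, and the function signature and default learning parameters are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

def learn(V, T=4000, eps=0.01, lam=0.0):
    """Algorithm 1: in-situ Boltzmann learning with sampled correlations.
    V: M x N bipolar data matrix whose last column is the +1 bias unit."""
    M, N = V.shape
    D = V.T @ V / M                      # density matrix
    W = np.zeros((N, N))
    m = rng.choice([-1.0, 1.0], N)
    m[-1] = 1.0                          # bias unit is clamped to +1
    for _ in range(T):
        for i in range(N - 1):           # p-bit sampling procedure (Eqs. (1), (2))
            m[i] = np.sign(rng.uniform(-1, 1) + np.tanh(W[i] @ m))
        K = np.outer(m, m)               # sampled correlations m_i m_j
        W += eps * (D - K - lam * W)     # update rule, Eq. (3)
        np.fill_diagonal(W, 0)
    return W
```

Running `learn` on the V_AND matrix of Eq. (S6), for example, drives the weights toward the correlation pattern of D_AND, with the single-sample correlation K playing the role of the XNOR feedback in the hardware.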