An Energy-Efficient Low-Voltage Swing Transceiver for mW-Range IoT End-Nodes
Hayate Okuhara∗, Ahmed Elnaqib∗, Davide Rossi∗, Alfio Di Mauro‡, Philipp Mayer‡, Pierpaolo Palestri†, Luca Benini∗‡
∗ DEI, University of Bologna, Italy
† DPIA, University of Udine, Italy
‡ Integrated Systems Laboratory, ETH Zurich, Switzerland
Abstract: As Internet-of-Things (IoT) applications become more and more pervasive, IoT end nodes require more and more computational power within a power envelope of a few mW, coupled with high-speed and energy-efficient inter-chip communication to deal with the growing input/output and memory bandwidth of emerging near-sensor analytics applications. While traditional interfaces such as SPI cannot cope with these tight requirements, low-voltage swing transceivers can tackle this challenge thanks to their capability to achieve several Gbps of bandwidth at extremely low power. However, recent research on high-speed serial links has addressed this challenge only partially, proposing partial or stand-alone designs without addressing their integration in real systems and the related implications. In this paper, we present for the first time a complete design and system-level architecture of a low-voltage swing transceiver integrated within a low-power (mW range) IoT end-node processor, and we compare it with existing microcontroller interfaces. The transceiver, implemented in a commercial 65-nm CMOS technology, achieves 10.2x higher energy efficiency at 15.7x higher performance than traditional microcontroller peripherals (single lane).
Index Terms: IoT, SerDes, energy-efficient peripheral, SPI, microcontroller.
I. INTRODUCTION
Pushed by IoT trends, the computational performance required in end-nodes has increased considerably in recent years. Nowadays, near-sensor applications, such as convolutional neural network (CNN) based image analysis and bio-potential processing, have to operate efficiently on large volumes of sensor data captured by microcontrollers (MCUs). In this scenario, state-of-the-art SoCs have already achieved performance in the order of several GOPS within a 10 mW power envelope [1], [2].

On the other hand, in modern embedded systems operating in the IoT context, overcoming the limitations imposed by low chip-to-chip communication bandwidth represents a major challenge. Conventional MCU peripherals, such as I2C, I2S, and SPI, provide data rates in the order of a few tens of Mbps, which are typically not sufficient to satisfy the expected bandwidth and energy efficiency demands of next-generation IoT applications. For example, according to the results reported in [21], the off-chip bandwidth required to perform MobileNetV2 inference [22] at 10 FPS on an MCU is larger than 500 Mbps. Although some solutions can meet this requirement (e.g., HyperBus or Octal SPI operating at high frequencies) [3], [18], their power consumption rapidly saturates the end-node power budget.

Serial link peripherals [6]–[8], [12], [19], relying on analog data transceivers, constitute a promising alternative to purely digital serial interfaces, from both the bandwidth and energy efficiency perspectives. In serial links, serialized data are sent at a high rate, while low power consumption is guaranteed by
exploiting low-voltage swing signals at the physical layer. State-of-the-art solutions [7], [8] can achieve over 1 Gbps of bandwidth while keeping the power consumption within a few mW.

TABLE I
REPORTED LOW-POWER SERIAL LINKS AND OUR SYSTEM

Reported work             [12]   [7], [8]   [19]   This work
Target bandwidth (Gbps)   1      1-6        25     0.8
Power consumption         < <

While various research efforts have been reported on optimizing serial links, their system-level integration, e.g. in microcontrollers, has not been extensively studied, to the best of the authors' knowledge. Also, in the IoT context, it is essential to minimize the data transmission power so as not to erode the power budget available for useful computation. Table I provides an overview of recent research efforts on low-power transceivers, positioning the proposed work with respect to state-of-the-art transceivers in terms of power and bandwidth, and highlighting the limitations of the latter with respect to system-level integration issues. These include the need for several external power supplies, forming an additional source of power consumption not considered in previous works.

From the observations above, the contributions of this paper are as follows:
• We designed a serializer-deserializer link (SerDes) system and integrated it into an open-source low-power microcontroller [9]. Detailed architectural and micro-architectural information is provided.
• We evaluated the energy efficiency of the implemented SerDes with post-layout simulations. This yields guidelines for its power management.
• We explored a duty-cycled operation of the SerDes for low bandwidth targets. We report on the trade-off between bandwidth and energy efficiency.

The energy efficiency of the SerDes was finally compared with conventional digital peripherals widely adopted in microcontrollers, such as SPI [2], [4], [5], and more advanced peripherals such as HyperBus [3]. The SerDes achieves 10.2x higher energy efficiency at 787 Mbps than a Single SPI operating at 50 Mbps.
Moreover, even when targeting a low bandwidth such as 10 Mbps, the SerDes efficiency is 8.3x higher than that of the SPI. We also show that the SerDes energy is 21x smaller than that of the HyperBus.

Fig. 1. High-level architectural block diagram of the system overview

II. SYSTEM OVERVIEW
Fig. 1 shows an overview of the System on a Chip (SoC) hosting the proposed serial link. The main building blocks of the SoC are a RISC-V core coupled to a multi-bank, word-level interleaved memory, and an autonomous input/output subsystem (µDMA) [10] that transfers data to and from the peripherals. The internal clocks are generated by frequency-locked loops (FLLs). Additionally, the SoC features a timer, a debug unit, and programmable GPIOs. The SerDes is composed of the transmitter (TX), the receiver (RX), and configuration registers mapped on the advanced peripheral bus (APB), which hold the enable signals as well as the address and size of the communicated data. The SerDes is connected to the µDMA, which provides high-speed data transfers between the L2 memory and the peripherals.

Data from the µDMA are transmitted to another chip via the TX module. Its enable signals (“Comm-En” and “Warm-En”) come from memory-mapped registers that are accessed via software. The transferred data are captured by the RX module and delivered to the µDMA. The µDMA writes the received data to the RX buffer, which is allocated in the global memory according to the configuration registers.

The SerDes operates in three modes: idle, warm-up, and data-comm. During the idle mode, all the digital circuits are deactivated, and the transceiver is in low-power mode. The data-comm mode sends and receives serialized data. However, to establish a communication, the RX has to be synchronized with the transmitted data generated by another chip, potentially operating at a different clock phase. Hence, during the warm-up mode, the TX sends a training sequence carrying its clock phase information to the RX. From this input, the RX recovers the transmitter clock. These three modes are selected through the “Comm-En” and “Warm-En” registers.

To start the actual inter-chip communication, the TX is first set to the warm-up mode. The RX in the other chip receives this information through a GPIO and enters the warm-up mode as well. Using the timer in Fig. 1, the processor in the RX chip waits a fixed amount of time until the RX clock is ready for communication. Then, through a GPIO, the RX chip notifies the TX chip that the clock is ready. In addition, the information required by the µDMA is stored in the configuration registers. When this is finished, the RX informs the TX through another GPIO that it is ready for data communication. Finally, the SerDes mode is changed to data-comm, and the TX starts to send the main data, marking its start and end with a communication header (Start flit) and a footer (Stop flit).
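The mode selection and bring-up order described above can be sketched as follows. This is a minimal behavioural sketch in Python, not the register map of the chip: the priority of “Comm-En” over “Warm-En”, the assumption that “Warm-En” stays set during data-comm, and the names `Mode`, `select_mode`, `BRING_UP_STEPS`, and `bring_up_trace` are all illustrative assumptions.

```python
from enum import Enum


class Mode(Enum):
    IDLE = "idle"
    WARM_UP = "warm-up"
    DATA_COMM = "data-comm"


def select_mode(comm_en: bool, warm_en: bool) -> Mode:
    """Decode the link mode from the two enable registers."""
    if comm_en:  # assumed to take priority over Warm-En
        return Mode.DATA_COMM
    if warm_en:
        return Mode.WARM_UP
    return Mode.IDLE


# Bring-up order from the text: (step, Comm-En, Warm-En)
BRING_UP_STEPS = [
    ("TX enters warm-up and sends the training sequence", False, True),
    ("RX chip is notified over a GPIO and enters warm-up", False, True),
    ("RX processor waits on the timer for the CDR to settle", False, True),
    ("RX signals clock-ready over a GPIO", False, True),
    ("uDMA address and size are written to the configuration registers", False, True),
    ("RX signals data-ready over another GPIO", False, True),
    ("Comm-En is set; Start flit, payload, and Stop flit are sent", True, True),
]


def bring_up_trace():
    """Return the link mode after each bring-up step."""
    return [select_mode(comm_en, warm_en) for _, comm_en, warm_en in BRING_UP_STEPS]
```

The trace stays in warm-up for the first six steps and switches to data-comm only once both chips have exchanged their GPIO notifications.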
Fig. 2. Architectural block diagram of the serial link: (a) TX, (b) RX
III. LOW-POWER SERIAL LINK
A. Link architecture
Fig. 2 shows a detailed block diagram of the SerDes. The TX is composed of 8b/10b encoders, a TX controller, a 40:1 serializer, a pre-driver, and a driver. The RX is equipped with analog comparators, timing synchronizers, a deserializer, an RX controller, a Clock Data Recovery (CDR) circuit, and 10b/8b decoders. Both the TX and RX operate at the same frequency. However, as previously described, the clock phase in the RX has to be adjusted. The CDR circuit performs the clock recovery so that the RX clock transitions occur at the mid-point of each received data bit. The data communication uses differential signaling. Hence, four analog pads are required, in addition to the three GPIOs used to synchronize the RX and TX in different chips.
B. TX design
At the TX, the 40-bit “Start flit”, “Stop flit”, and the main body of the communication are serialized and transmitted to the RX in another chip. The multiplexer in Fig. 2 (a) selects one of them and sends it to the serializer, which outputs serialized data at double data rate (DDR) with respect to the TX clock. Then, the driver transmits the data to the RX with low-voltage swing (200 mV) signals. Here we adopt the serializer and driver in [12].

The main body of the communication is encoded by four parallel 8b/10b encoders [11], which ensure that the serialized data is DC-balanced and that its running disparity stays bounded. The TX controller is a finite state machine that manages the timing of these functions according to the FIFO handshaking signals from the interface between the SerDes and the µDMA, and the enable signals from the configuration registers. The TX clock is provided by the FLL and divided by two and by four. “Clk_fll/4” drives the encoders, multiplexer, and controller to reduce power consumption. Since the µDMA operates at the system clock, the interface between the SerDes and the µDMA is implemented as an asynchronous FIFO.

Initially, the TX is set to the idle state by the state machine of the TX controller. Asserting “Warm-En” changes its state to the warm-up mode, which outputs a training sequence generated by the encoders. The “Start flit” is sent when the data to transfer is ready (Valid = “1”) and “Comm-En” is asserted. After the header is transferred, the state automatically changes to the data-comm mode, and the main part of the data communication starts. During this mode, the input of the serializer is updated every 20 cycles, since each 40-bit word is serialized at DDR (two bits per clock cycle). When the “Valid” signal is negated, the “Stop flit” is sent. Finally, the state returns to idle.
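The TX controller sequence above can be written as a small state machine. The sketch below is a behavioural model under stated assumptions, not the authors' RTL: each call to `tx_next_state` stands for one 40-bit serializer word, the state names are illustrative, and the fallback from warm-up to idle when “Warm-En” is cleared is an assumption that the text does not spell out.

```python
def tx_next_state(state: str, warm_en: bool, comm_en: bool, valid: bool) -> str:
    """Return the next TX state; one call models one 40-bit word slot."""
    if state == "idle":
        return "warm-up" if warm_en else "idle"
    if state == "warm-up":
        # The training sequence is sent until data is ready and Comm-En is set.
        if comm_en and valid:
            return "start-flit"
        return "warm-up" if warm_en else "idle"  # assumed fallback
    if state == "start-flit":
        return "data-comm"  # automatic transition after the header
    if state == "data-comm":
        return "data-comm" if valid else "stop-flit"
    if state == "stop-flit":
        return "idle"
    raise ValueError(f"unknown state: {state}")


def run_tx(inputs):
    """Apply a list of (warm_en, comm_en, valid) tuples starting from idle."""
    state, trace = "idle", []
    for warm_en, comm_en, valid in inputs:
        state = tx_next_state(state, warm_en, comm_en, valid)
        trace.append(state)
    return trace
```

Feeding `run_tx` a sequence in which “Valid” rises after warm-up and later falls walks through warm-up, the Start flit, data-comm, the Stop flit, and back to idle, in that order.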
C. RX design
At the RX, the input is first captured by the analog comparators [15], which restore the even and odd data bits from the channel. These bits are buffered by the timing synchronizers, then deserialized, decoded, and sent to the µDMA through the asynchronous FIFO interface. We employ the deserializer architecture reported in [12]. The “timing synchronizers” here are buffers that ensure the timing constraints between the digital and analog circuits are met.

Since the data communication begins with the “Start flit” and ends with the “Stop flit”, the sequence detector monitors whether or not they arrive. This is realized by checking for 11011111 (a K control character in [11]) for the “Start flit” and 10111111 (another K control character) for the “Stop flit”. According to the information from the detector, the RX controller manages the deserializer and 10b/8b decoders for the main body of the transferred data. The decoded data, together with the “Valid” signal, is sent to the FIFO when its “Ready” signal is asserted.

The clock generated by the CDR scheme is divided by four (“Clk_pi/4”) and by two (“Clk_pi/2”). The RX controller, decoders, and some parts of the CDR loop are clocked by “Clk_pi/4” to reduce power consumption. “Clk_pi/2” is used by the deserializer.
1) Sequence detector:
In the sequence detector, the even and odd bits captured by the analog comparators are checked to activate the entire RX when the start flit arrives. The detector is a finite state machine, as shown in Fig. 3. The state of the detector advances as the K character of the start flit arrives. In other words, when the first two bits of 11011111 (i.e., 11) are detected, the next state is “Check1”. After this, if the following two bits are 01, the state is updated to “Check2”. When all the bits of the K character are detected, the deserializer and decoders are enabled through the RX controller. During the data communication, the arrival of the stop flit is monitored with a similar procedure. When it is detected, the state of the detector returns to “Start”.

It is important to mention that the RX has to consider whether or not a bit shift occurs in the arriving data. In other words, even if a bit is sent as an even bit at the TX side, there is no guarantee that it is captured as an even bit at the RX. For example, the sequence 11011111 might be captured as x1 10 11 11 1x. To manage this, the state machine holds the bit shift information in the signal “Shift”. Since an additional 2 bits have to be checked when “Shift” is asserted, the “Check4” state is implemented.

Also, note that the timing synchronizer adjusts the bit shift according to the “Shift” signal from the sequence detector after the start flit is detected. Hence, the deserializer always receives the even and odd bits correctly.
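The effect of a bit shift on start-flit detection can be illustrated with a behavioural model. The sketch below is not the Fig. 3 state machine: it flattens the captured (even, odd) pairs into a bit stream and searches for the 11011111 pattern quoted above, reporting whether the match starts on the odd position of a pair, which plays the role of the “Shift” signal. The function name `detect_start` and its return format are assumptions made for illustration.

```python
START_PATTERN = [1, 1, 0, 1, 1, 1, 1, 1]  # 11011111, as quoted in the text


def detect_start(pairs):
    """Search (even, odd) bit pairs for the start pattern.

    Returns (pairs_spanned, shift) for the first match, or None.
    shift is 1 when the pattern begins on the odd bit of a pair.
    """
    bits = [bit for pair in pairs for bit in pair]
    n = len(START_PATTERN)
    for i in range(len(bits) - n + 1):
        if bits[i:i + n] == START_PATTERN:
            shift = i % 2
            first_pair = i // 2
            last_pair = (i + n - 1) // 2
            return last_pair - first_pair + 1, shift
    return None


aligned = detect_start([(1, 1), (0, 1), (1, 1), (1, 1)])
shifted = detect_start([(0, 1), (1, 0), (1, 1), (1, 1), (1, 0)])
```

With an aligned capture the pattern spans four pairs; with a one-bit shift it spans five, which is consistent with the extra check state needed when “Shift” is asserted.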
2) RX controller:
During the warm-up mode, the controller activates only parts of the CDR loop. After the loop has settled, an enable signal for the sequence detector is provided from the configuration registers. When the start flit arrives, the controller enters the data-comm mode, which enables the entire deserializer. The decoders update their output when [...] “Valid” signal is also generated after the latency of the decoders. When the stop flit arrives, the controller disables the 8:40 deserializer and decoders if “Warm-En” is still asserted. If all the enable signals for the RX are negated, the RX is in the idle mode.

Fig. 3. State machine of the sequence detector

Fig. 4. Architectural diagram of the CDR loop
3) Clock Data Recovery module:
The CDR scheme is composed of a phase detector, a digital filter, and a phase interpolator (PI) that adjusts the phase of the FLL clock (Fig. 4). The “Early-Late” module consists of seven parallel Alexander phase detectors [14] that compare the 8-bit “Data” captured by the normal clock (“Clk”) with the 8-bit “Edge” sampled by a quadrature clock (“Clkq”). Then, the number of “Late” decisions is subtracted from the number of “Early” decisions. The result is accumulated and scaled by 1/N (N = 1, 2, 4, 8, ..., 128) every 4 clock cycles. According to the divider output, the PI shifts the clock phase of both “Clk” and “Clkq”. The resolution of this adjustment is a fixed fraction of π in the current design. The PI is a charge-based interpolator based on [16].

IV. IMPLEMENTATION
We implemented a system-level layout including the SerDes in a 65-nm bulk CMOS technology [17]. The design includes three FLLs [13] as clock generators and a 128 KB L2 bank. Two of the FLLs serve the microcontroller and the peripherals other than the SerDes. The third is dedicated to the link for testing purposes. In an actual system, one of the other FLLs would be shared with the SerDes to save system power. The analog signals are connected to four library I/O cells featuring a built-in 50-ohm resistor. Two of them are for the RX and the other two are for the TX. Synopsys Design Compiler 2018.06-SP1 and Cadence Innovus v15.20 were employed for synthesis and P&R.

The nominal voltage and operating frequency of the SerDes are 1.2 V and 400 MHz, respectively. Hence, the target bandwidth of the current design is 0.8 Gbps, as the data transfer is performed at DDR. The same 1.2 V supply is used for both the digital and analog circuits, because adding another voltage source increases system cost, which should be avoided for embedded microcontrollers.
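The 0.8 Gbps figure follows directly from the clock frequency and DDR signaling. A small sketch of the arithmetic, with hypothetical helper names (`line_rate_bps`, `cycles_per_word`) chosen only for illustration:

```python
def line_rate_bps(clock_hz: float, bits_per_cycle: int = 2) -> float:
    """Raw serial line rate; DDR sends one bit on each clock edge."""
    return clock_hz * bits_per_cycle


def cycles_per_word(word_bits: int = 40, bits_per_cycle: int = 2) -> int:
    """Clock cycles needed to shift out one serializer word."""
    return word_bits // bits_per_cycle


rate = line_rate_bps(400e6)      # 400 MHz SerDes clock
word_cycles = cycles_per_word()  # 40-bit word at DDR
```

The second helper reproduces the 20-cycle update interval of the serializer input described in Section III-B.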
TABLE II
POWER CONSUMPTION OF THE SERDES @ 1.2V

Power consumption (analog parts)
  RX   2.85 mW
  TX   0.59 mW
Power consumption (digital parts)
  RX   0.591 mW (data-comm mode)
       0.367 mW (warm-up mode)
       0.433 µW (idle mode)
  TX   0.239 mW (data-comm & warm-up)
       32.7 µW (idle)

Fig. 5. Conceptual timing diagram of the duty-cycled operation
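The per-mode totals implied by Table II can be tallied with a few lines of Python. The sketch below assumes that the analog parts are switched off in idle (the condition under which the paper quotes its idle power) and counts bits at the 0.8 Gbps line rate. `duty_cycled_energy_pj_per_bit` is a hypothetical helper for the warm-up/active/idle cycle of Fig. 5; the three durations are left as free parameters because their values are not listed in the text.

```python
# Table II entries in mW
ANALOG = {"rx": 2.85, "tx": 0.59}
DIGITAL_RX = {"data-comm": 0.591, "warm-up": 0.367, "idle": 0.433e-3}
DIGITAL_TX = {"data-comm": 0.239, "warm-up": 0.239, "idle": 32.7e-3}
LINE_RATE_BPS = 0.8e9  # 400 MHz at DDR


def mode_power_mw(mode: str) -> float:
    """Total link power in a mode; analog parts assumed off in idle."""
    analog = 0.0 if mode == "idle" else sum(ANALOG.values())
    return analog + DIGITAL_RX[mode] + DIGITAL_TX[mode]


def energy_pj_per_bit(power_mw: float, rate_bps: float = LINE_RATE_BPS) -> float:
    """Convert mW at a given bit rate into pJ/bit."""
    return power_mw * 1e-3 / rate_bps * 1e12


def duty_cycled_energy_pj_per_bit(t_warm: float, t_act: float, t_idle: float) -> float:
    """Energy per transferred bit over one cycle (durations in seconds)."""
    energy_mj = (mode_power_mw("warm-up") * t_warm
                 + mode_power_mw("data-comm") * t_act
                 + mode_power_mw("idle") * t_idle)  # mW * s = mJ
    bits = LINE_RATE_BPS * t_act  # data moves only in the active phase
    return energy_mj * 1e-3 / bits * 1e12
```

Evaluating `mode_power_mw` gives 4.27 mW for data-comm, about 4.05 mW for warm-up, and about 33.1 µW for idle; dividing the data-comm power by the line rate gives about 5.34 pJ/bit. Any non-zero warm-up or idle time raises the energy per bit above that floor, which is the trade-off the duty-cycled operation explores.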
V. RESULTS
To evaluate the energy efficiency of the proposed SerDes system, post-layout simulations were conducted with Synopsys PrimeTime M-2016.12-SP3 for the digital part and Cadence Spectre 6.1 for the analog part. Table II shows the estimated power consumption at a V_DD of 1.2 V and an operating frequency of 400 MHz. Since the TX power is dominated by the analog part and the serializer, the other parts are omitted.

According to the results, the total power consumption of the SerDes is 4.27 mW when the serial link is in the data-comm mode. The energy efficiency of the implemented link is 5.34 pJ/bit. A power of 4.05 mW is consumed during the warm-up mode because most of the RX components need to be activated. If the analog parts are turned off via an off-chip power switch during the idle state, the total link power is 33.1 µW.

If the required bandwidth is lower than 0.8 Gbps, the power consumption can be lowered further. However, since the CDR loop is designed for 0.8 Gbps, lowering its operating frequency causes a loop convergence problem. Instead, a duty-cycled operation [20], which periodically turns on the SerDes, is adopted in this paper. Fig. 5 shows its conceptual timing diagram. Here, T_Cycle, T_Act, T_Warm, and T_Idle represent one cycle period and the durations of the data-comm, warm-up, and idle modes, respectively. The data communication is conducted until the RX buffer in the global memory is filled up. Then, the link returns to the idle mode. When it is activated again, the warm-up mode settles the CDR loop with an overhead of T_Warm.

Using these assumptions and the values in Table II, the SerDes energy efficiency during the duty-cycled operation is obtained (see Fig. 6). For comparison with other existing peripherals, the graph also depicts the average read/write energy consumption of a Single SPI (40-nm) and a HyperBus (65-nm) implementation with an I/O voltage of 1.8 V. The transferred data size of the HyperBus was 0.5 KB. The HyperBus is implemented with fast but power-hungry drivers, while the SPI adopts slow but low-power ones. Hence, the SPI and HyperBus operate up to 50 and 100 MHz, respectively. In other words, the maximum bandwidths of the former and the latter are 50 Mbps and 1.6 Gbps. As can be seen from the graph, the HyperBus consumes much more energy than the Single SPI due to the I/O drivers, even though it achieves a bandwidth over 1 Gbps. Thus, for conventional digital interfaces, there is a trade-off between the maximum bandwidth and energy efficiency. On the other hand, our SerDes achieves a high bandwidth and low energy consumption simultaneously. Indeed, the maximum bandwidth (BW_max) with the 16 KB RX buffer is 787 Mbps. Compared to the best case of the Single SPI (i.e., at 50 Mbps), the SerDes efficiency is 10.2x higher at 15.7x higher performance. Besides, even if the target bandwidth is lowered to 10 Mbps, the proposed SerDes achieves 8.3x smaller energy than the SPI. Moreover, although the HyperBus achieves about 2 times higher bandwidth, its energy efficiency is 21x lower than that of our SerDes operating at BW_max.

Based on the SPI measurement results (Fig. 6) and its switching activity, we estimated the energy efficiency of a Quad SPI and an Octal SPI operating at both DDR and SDR, which are also shown in Fig. 6. As can be seen from the graph, the parallel SPI lanes improve the energy efficiency, at the cost of additional pad usage (Table III), which is critical for small and often pad-limited microcontrollers. Nevertheless, the proposed SerDes still achieves lower energy consumption at a 3x smaller pad area cost. Indeed, the SerDes energy efficiency at BW_max is 2.56x higher than that of the DDR Octal SPI, combining the benefits of low pad-frame overhead, high bandwidth, and high energy efficiency, essential features for next-generation low-power architectures for near-sensor data analytics.

Fig. 6. Energy consumption compared to other peripherals (x-axis: Bandwidth (Mbit/s); y-axis: Energy per bit (pJ/bit); series: Single SPI, Quad SPI SDR, Quad SPI DDR, Octal SPI SDR, Octal SPI DDR, Hyper Bus, SerDes)

TABLE III
THE NUMBER OF DATA PADS NEEDED FOR EACH SOLUTION

Single SPI   Quad SPI   Octal SPI   Hyper Bus   This work
4            6          11          12          4

VI. CONCLUSION
In this paper, we presented the system architecture of a high-speed, low-power serial link. The proposed SerDes simultaneously provides high bandwidth and high energy efficiency for embedded systems, unlike traditional digital interfaces such as SPI and HyperBus. The evaluation results showed that, thanks to its low-voltage swing signaling, the SerDes achieves about 10.2x higher energy efficiency at 15.7x higher bandwidth than the Single SPI link. In addition, the duty-cycled operation allows the SerDes to achieve 8.3x higher energy efficiency than the Single SPI even at 10 Mbps, a low bandwidth requirement. Moreover, compared to the HyperBus, the SerDes energy is 21x smaller.

ACKNOWLEDGMENT

This work was supported in part by the WiPLASH (Architecting More Than Moore – Wireless Plasticity for Heterogeneous Massive Computer Architectures) project, funded by the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 863337.