An Energy-Efficient Low-Voltage Swing Transceiver for mW-Range IoT End-Nodes

Abstract

As Internet-of-Things (IoT) applications become more and more pervasive, IoT end nodes require increasing computational power within a power envelope of a few mW, coupled with high-speed and energy-efficient inter-chip communication to deal with the growing input/output and memory bandwidth of emerging near-sensor analytics applications. While traditional interfaces such as SPI cannot cope with these tight requirements, low-voltage swing transceivers can tackle this challenge thanks to their capability to achieve several Gbps of bandwidth at extremely low power. However, recent research on high-speed serial links has addressed this challenge only partially, proposing partial or stand-alone designs without addressing their integration in real systems and the related implications. In this paper, we present for the first time a complete design and system-level architecture of a low-voltage swing transceiver integrated within a low-power (mW range) IoT end-node processor, and we compare it with existing microcontroller interfaces. The transceiver, implemented in a commercial 65-nm CMOS technology, achieves 10.2x higher energy efficiency at 15.7x higher performance than traditional microcontroller peripherals (single lane).


Hayate Okuhara∗, Ahmed Elnaqib∗, Davide Rossi∗, Alfio Di Mauro‡, Philipp Mayer‡, Pierpaolo Palestri†, Luca Benini∗‡
∗DEI, University of Bologna, Italy
†DPIA, University of Udine, Italy
‡Integrated System Laboratory ETH, Zuerich, Switzerland


Index Terms: IoT, SerDes, Energy efficient peripheral, SPI, microcontroller.

I. INTRODUCTION

Pushed by IoT trends, the computational performance required of end-nodes has increased considerably in recent years. Nowadays, near-sensor applications, such as convolutional neural network (CNN) based image analysis and bio-potential processing, have to operate efficiently on large volumes of sensor data captured by microcontrollers (MCUs). In this scenario, state-of-the-art SoCs have already achieved performance in the order of several GOPS within a 10 mW power envelope [1], [2].

On the other hand, in modern embedded systems operating in the IoT context, overcoming the limitations imposed by low chip-to-chip communication bandwidths represents a major challenge. Conventional MCU peripherals, such as I2C, I2S, and SPI, provide data rates in the order of a few tens of Mbps, which are typically not sufficient to satisfy the expected bandwidth and energy efficiency demands of next-generation IoT applications. For example, according to the results reported in [21], the off-chip bandwidth required to perform MobileNetV2 inference [22] at 10 FPS on an MCU is larger than 500 Mbps. Although some solutions can meet this requirement (e.g. HyperBus or Octal SPI operating at high frequencies) [3], [18], their power consumption rapidly saturates the end-node power budget.

Serial link peripherals [6]–[8], [12], [19], relying on analog data transceivers, constitute a promising alternative to purely digital serial interfaces, from both the bandwidth and the energy efficiency perspective. In serial links, serialized data are sent at a high rate, while low power consumption is guaranteed by exploiting low-voltage swing signals at the physical layer. State-of-the-art solutions [7], [8] can achieve over 1 Gbps of bandwidth while keeping the power consumption in the range of a few mW.

TABLE I
REPORTED LOW POWER SERIAL LINKS AND OUR SYSTEM

Reported work             [12]   [7], [8]   [19]   This work
Target bandwidth (Gbps)   1      1-6        25     0.8
Power consumption         (values not recoverable from the extracted text)

While various research efforts have been reported on optimizing serial links, system-level integration, e.g. in microcontrollers, has not been extensively studied to the best of the authors' knowledge. Also, in the IoT context, it is essential to minimize the data transmission power so as not to erode the power budget available for useful computation. Table I provides an overview of recent research efforts on low-power transceivers, positioning the proposed work with respect to state-of-the-art transceivers in terms of power and bandwidth, and highlighting the limitations of the latter with respect to system-level integration issues. These include the need for several external power supplies, which form an additional source of power consumption not considered in previous works.

From the observations above, the contributions of this paper are as follows:

• We designed a serializer-deserializer link (SerDes) system and integrated it into an open-source low-power microcontroller [9]. Detailed architectural and micro-architectural information is shown.
• We evaluated the energy efficiency of the implemented SerDes with post-layout simulations. This provides guidelines for its power management.
• We explored a duty-cycled operation of the SerDes for a low bandwidth target. We report on the trade-off between bandwidth and energy efficiency.

Finally, the energy efficiency of the SerDes was compared with conventional digital peripherals widely adopted in microcontrollers, such as SPI [2], [4], [5], and more advanced peripherals such as HyperBus [3]. The SerDes achieves 10.2x higher energy efficiency at 787 Mbps than a Single SPI operating at 50 Mbps.
Moreover, even if we target a low bandwidth such as 10 Mbps with the SerDes, its efficiency is 8.3x higher than that of the SPI. Also, we show that the SerDes energy is 21x smaller than that of the Hyper Bus.

Fig. 1. High level architectural block diagram of the system overview

II. SYSTEM OVERVIEW

Fig. 1 shows an overview of the System on a Chip (SoC) hosting the proposed serial link. The main building blocks of the SoC are a RISC-V core coupled to a multi-bank word-level interleaved memory, and an autonomous input/output subsystem (µDMA) [10] to transfer data to the peripherals. The internal clock is generated by frequency-locked loops (FLLs). Additionally, the SoC features a timer, a debug unit, and programmable GPIOs. The SerDes is composed of the transmitter (TX), the receiver (RX), and configuration registers mapped on the advanced peripheral bus (APB), which are used to access the enable signals as well as the address and size of the communicated data. The SerDes is connected to the µDMA, which provides high-speed data transfers between the L2 memory and the peripherals.

Data from the µDMA are transmitted to another chip via the TX module. Its enable signals (“Comm-En” and “Warm-En”) come from memory-mapped registers that are accessed via software. The transferred data is captured by the RX module and delivered to the µDMA. The µDMA sends the received data to the RX buffer, which is allocated in the global memory according to the configuration registers.

The SerDes operates in three modes: idle, warm-up, and data-comm. During the idle mode, all the digital circuits are deactivated and the transceiver is in a low-power mode. The data-comm mode sends and receives serialized data. However, to establish a communication, the RX has to be synchronized with the transmitted data generated by another chip, which potentially operates at a different clock phase. Hence, during the warm-up mode, the TX sends a training sequence including its clock phase information to the RX. From this input, the RX recovers the transmitter clock. These three modes are selected through the “Comm-En” and “Warm-En” registers.

To start the actual inter-chip communication, the TX is first set to the warm-up mode. The RX in the other chip receives this information through a GPIO and enters the warm-up mode as well. Using the timer in Fig. 1, the processor in the RX chip waits a fixed amount of time until the RX clock is ready for the communication. Then, through a GPIO, the RX chip notifies the TX chip that the clock is ready. Also, the information required by the µDMA is stored in the configuration registers. When this is finished, through another GPIO, the RX also informs the TX that it is ready for the data communication. Finally, the SerDes mode is changed to the data-comm mode, and the TX starts to send the main data, marking its start and end points with a communication header (Start flit) and a footer (Stop flit).
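The mode selection and the start-up handshake described above can be sketched in a few lines of Python. This is only an illustrative model: the text does not give the exact encoding of “Comm-En” and “Warm-En”, so the priority used in `select_mode` is an assumption. The event strings and the function names are also hypothetical, as is the simplification that both chips raise their enables at the same step.

```python
from enum import Enum


class Mode(Enum):
    IDLE = "idle"
    WARM_UP = "warm-up"
    DATA_COMM = "data-comm"


def select_mode(comm_en, warm_en):
    """Map the two enable registers to a mode.

    Assumed encoding (not stated in the text): Comm-En takes priority
    over Warm-En, and both low means idle.
    """
    if comm_en:
        return Mode.DATA_COMM
    if warm_en:
        return Mode.WARM_UP
    return Mode.IDLE


def startup_sequence():
    """Return the ordered events of the start-up handshake."""
    events = []
    tx = {"comm_en": False, "warm_en": False}
    rx = {"comm_en": False, "warm_en": False}

    # 1. Software on the TX chip asserts Warm-En; a GPIO informs the RX chip.
    tx["warm_en"] = True
    events.append("gpio: tx -> rx, warm-up started")
    rx["warm_en"] = True

    # 2. The RX chip waits a fixed time for clock recovery, then reports it.
    events.append("timer: rx waits for clock recovery")
    events.append("gpio: rx -> tx, clock ready")

    # 3. The RX chip stores the uDMA settings, then reports it is ready.
    events.append("apb: rx writes buffer address and size")
    events.append("gpio: rx -> tx, ready for data")

    # 4. Both sides switch to data-comm (a simplification of the text).
    tx["comm_en"] = True
    rx["comm_en"] = True
    events.append("tx mode: " + select_mode(**tx).value)
    events.append("rx mode: " + select_mode(**rx).value)
    return events


if __name__ == "__main__":
    for event in startup_sequence():
        print(event)
```

The three GPIO events in the sketch correspond to the three GPIOs that the link uses for synchronization between the two chips.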

Fig. 2. Architectural block diagram of the serial link: (a) TX, (b) RX

III. LOW-POWER SERIAL LINK

A. Link architecture

Fig. 2 shows a detailed block diagram of the SerDes. The TX is composed of 8b/10b encoders, a TX controller, a 40:1 serializer, a pre-driver, and the driver. The RX is equipped with analog comparators, timing synchronizers, a deserializer, an RX controller, a Clock Data Recovery (CDR) circuit, and 10b/8b decoders. Both the TX and RX operate at the same frequency. However, as previously described, the clock phase in the RX has to be adjusted. The CDR circuit performs the clock recovery so that the RX clock transitions occur at the mid-point of each received data bit. The data communication is conducted over a differential signal. Hence, four analog pads are required in addition to the three GPIOs used to synchronize the RX and TX in different chips.

B. TX design

At the TX, the 40-bit “Start flit”, the “Stop flit”, and the main body of the communication are serialized and transmitted to the RX in another chip. The multiplexer in Fig. 2 (a) selects one of them and sends it to the serializer, which outputs serialized data at the double data rate (DDR) of the TX clock. Then, the driver transmits the data to the RX with low-voltage swing (200 mV) signals. Here we adopt the serializer and driver in [12].

The main body of the communication is encoded by four parallel 8b/10b encoders [11], which ensure that the serialized data is DC-balanced and its disparity is less than ± . The TX controller is a finite state machine that manages the timing of these functionalities according to the FIFO handshaking signals from the interface between the SerDes and the µDMA, and the enable signals from the configuration registers. The TX clock is provided by the FLL and divided by two and four. “Clk_fll/4” is utilized for the encoders, multiplexer, and controller to reduce the power consumption. Since the µDMA operates at the system clock, the interface between the SerDes and the µDMA is implemented by an asynchronous FIFO.

Initially, the TX is set to the idle state by the state machine of the TX controller. By asserting “Warm-En”, its state is changed to the warm-up mode, which outputs a training sequence generated by the encoders. The “Start flit” is sent when the transferred data is ready (Valid=“1”) and “Comm-En” is asserted. After the header is transferred, the state automatically changes to the data-comm mode, and the main part of the data communication starts. During this mode, the input of the serializer is updated every 20 cycles, since each 40-bit word is serialized at DDR (two bits per clock cycle). When the “Valid” signal is negated, the “Stop flit” is sent. Finally, the state returns to idle.
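The TX controller behavior described above can be illustrated with a small state-machine model. This is a sketch under stated assumptions, not the authors' RTL. The state names, the function names, and the choice to advance the model once per 40-bit word slot are hypothetical. The fallback from warm-up to idle when “Warm-En” is negated is also an assumption, since the text does not describe that transition.

```python
from enum import Enum, auto

FLIT_BITS = 40      # width of one serializer word (from the text)
BITS_PER_CYCLE = 2  # DDR: one bit on each clock edge


class TxState(Enum):
    IDLE = auto()
    WARM_UP = auto()
    START_FLIT = auto()
    DATA_COMM = auto()
    STOP_FLIT = auto()


def serializer_load_period():
    """Clock cycles between two serializer input updates."""
    return FLIT_BITS // BITS_PER_CYCLE


def tx_next_state(state, warm_en, comm_en, valid):
    """Next state of the TX controller, evaluated once per word slot."""
    if state is TxState.IDLE:
        return TxState.WARM_UP if warm_en else TxState.IDLE
    if state is TxState.WARM_UP:
        if not warm_en:
            return TxState.IDLE  # assumed; not described in the text
        if valid and comm_en:
            return TxState.START_FLIT
        return TxState.WARM_UP
    if state is TxState.START_FLIT:
        return TxState.DATA_COMM  # automatic after the header
    if state is TxState.DATA_COMM:
        return TxState.DATA_COMM if valid else TxState.STOP_FLIT
    return TxState.IDLE  # after the Stop flit


def run(inputs, state=TxState.IDLE):
    """Apply a list of (warm_en, comm_en, valid) tuples and trace states."""
    trace = []
    for warm_en, comm_en, valid in inputs:
        state = tx_next_state(state, warm_en, comm_en, valid)
        trace.append(state)
    return trace


if __name__ == "__main__":
    stimulus = [(1, 0, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 1, 0)]
    print([s.name for s in run(stimulus)])
    print("serializer update period:", serializer_load_period(), "cycles")
```

`serializer_load_period()` reproduces the 20-cycle update interval stated in the text: 40 bits at two bits per clock cycle.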

C. RX design

At the RX, the input is first captured by the analog comparators [15], which restore the even and odd bit data from the channel. These bits are buffered by the timing synchronizers, then deserialized, decoded, and sent to the µDMA through the asynchronous FIFO interface. We employ the deserializer architecture reported in [12]. The “timing synchronizers” here are buffers that ensure the timing constraints between the digital and analog circuits are met.

Since the data communication begins with the “Start flit” and ends with the “Stop flit”, the sequence detector monitors whether or not they arrive. This is realized by checking for 11011111 (K , in [11]) for the “Start flit” and 10111111 (K , ) for the “Stop flit”. According to the information from the detector, the RX controller manages the deserializer and 10b/8b decoders for the main body of the transferred data. The decoded data is sent to the FIFO together with the “Valid” signal when the FIFO's “Ready” signal is asserted.

The clock generated by the CDR scheme is divided by four (“Clk_pi/4”) and by two (“Clk_pi/2”). The RX controller, decoders, and some parts of the CDR loop are synchronized to “Clk_pi/4” to reduce the power consumption. “Clk_pi/2” is utilized by the deserializer.

1) Sequence detector:

In the sequence detector, the even and odd bits captured by the analog comparators are checked to activate the entire RX when the start flit arrives. The detector is composed of a finite state machine, as shown in Fig. 3. The state of the detector changes when the K , arrives. In other words, when the first two bits of 11011111 (i.e. 11) are detected, the next state is “Check1”. After this, if the following two bits are 01, the state is updated to “Check2”. When all the bits of K , are detected, the deserializer and decoder are enabled through the RX controller. Also, during the data communication, the detector monitors whether or not the stop flit arrives with a similar procedure. When it is detected, the state of the detector returns to “Start”.

It is important to mention that the RX has to consider whether or not a bit shift occurs in the arriving data. In other words, even if a bit is sent as an even bit at the TX side, there is no guarantee that it is captured as an even bit at the RX. For example, the sequence 11011111 might be captured as x1 10 11 11 1x. To manage this, the state machine holds the bit shift information as the signal “Shift”. Since an additional 2 bits have to be checked when “Shift” is asserted, the “Check4” state is implemented.

Also, note that the timing synchronizer adjusts the bit shift according to the “Shift” signal from the sequence detector after detecting the start flit. Hence, the deserializer always receives the even and odd bits correctly.
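The effect of the bit shift on detection can be illustrated with a short functional model. It is not the state machine of Fig. 3: the flat search over the bit stream is a simplification chosen here for clarity, and the function name and return format are hypothetical. The model only shows why a shifted start pattern spans five (even, odd) pairs instead of four.

```python
START_PATTERN = (1, 1, 0, 1, 1, 1, 1, 1)  # 11011111, from the text


def find_start(pairs, pattern=START_PATTERN):
    """Locate the start pattern in a stream of (even, odd) bit pairs.

    Returns (pair_index, shift), where shift is 0 if the pattern begins
    on an even bit and 1 if it begins on an odd bit, or None if the
    pattern is absent.
    """
    bits = [bit for pair in pairs for bit in pair]
    width = len(pattern)
    for start in range(len(bits) - width + 1):
        if tuple(bits[start:start + width]) == pattern:
            return start // 2, start % 2
    return None


if __name__ == "__main__":
    aligned = [(1, 1), (0, 1), (1, 1), (1, 1)]          # 11 01 11 11
    shifted = [(0, 1), (1, 0), (1, 1), (1, 1), (1, 0)]  # x1 10 11 11 1x
    print(find_start(aligned))
    print(find_start(shifted))
```

In the shifted case the pattern occupies five pairs, which matches the need for one additional check state when “Shift” is asserted.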

2) RX controller:

During the warm-up mode, the controller activates only the parts of the CDR loop. After the loop has settled, an enable signal for the sequence detector is provided from the configuration registers. When the start flit arrives, the controller enters the data-comm mode, which enables the entire deserializer. The decoders update their output when […] “Valid” signal is also generated after the latency of the decoders. When the stop flit arrives, the controller disables the 8:40 deserializer and decoders if “Warm-En” is still asserted. If all the enable signals for the RX are negated, the RX is in the idle mode.

Fig. 3. State machine of the sequence detector

Fig. 4. Architectural diagram of the CDR loop

3) Clock Data Recovery module:

The CDR scheme is composed of the phase detector, a digital filter, and a phase interpolator (PI), which adjusts the phase of the FLL clock (Fig. 4). The “Early-Late” module consists of seven parallel Alexander phase detectors [14] that compare the 8-bit “Data” captured by the normal clock (“Clk”) with the 8-bit “Edge” synchronized to a quadrature clock (“Clkq”). Then, the number of “Late” results is subtracted from the number of “Early” results. The result is accumulated and scaled by 1/N (N = 1, 2, 4, 8, ..., 128) every 4 clock cycles. According to the divider output, the PI shifts the clock phase of both “Clk” and “Clkq”. The resolution of this adjustment is set to π/ in the current design. The PI is a charge-based interpolator based on [16].

IV. IMPLEMENTATION

We implemented a system-level layout including the SerDes in a 65-nm bulk CMOS technology [17]. The design includes three FLLs [13] as clock generators and 128 KB of L2 memory. Two of the FLLs serve the microcontroller and the peripherals other than the SerDes. The last one is dedicated to the link for testing purposes. In an actual system, one of the other FLLs would be shared with the SerDes to save system power. The analog signals are connected to 4 library I/O cells featuring a built-in 50-ohm resistor. Two of them are for the RX and the rest are for the TX. Synopsys Design Compiler 2018.06-SP1 and Cadence Innovus v15.20 were employed for synthesis and P&R.

The nominal voltage and operating frequency of the SerDes are 1.2 V and 400 MHz, respectively. Hence, the target bandwidth of the current design is 0.8 Gbps, as the data transfer is performed at DDR. Also, 1.2 V is used for both the digital and analog circuits, because adding another voltage source increases system cost, which should be avoided for embedded microcontrollers.

TABLE II
POWER CONSUMPTION OF THE SERDES @ 1.2V

Power consumption (analog parts)
  RX   2.85 mW
  TX   0.59 mW

Power consumption (digital parts)
  RX   0.591 mW   data-comm mode
       0.367 mW   warm-up mode
       0.433 µW   idle mode
  TX   0.239 mW   data-comm & warm-up
       32.7 µW    idle

Fig. 5. Conceptual timing diagram of the duty-cycled operation
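The entries of Table II can be combined per operating mode with a short calculation. The sketch below sums the analog and digital contributions and converts the data-comm power into energy per bit at the 0.8 Gbps target bandwidth. The dictionary layout, the function names, and the `analog_on` flag (which models switching off the analog supply) are hypothetical choices made for this illustration.

```python
# Table II values, in watts.
ANALOG = {"rx": 2.85e-3, "tx": 0.59e-3}
DIGITAL_RX = {"data-comm": 0.591e-3, "warm-up": 0.367e-3, "idle": 0.433e-6}
DIGITAL_TX = {"data-comm": 0.239e-3, "warm-up": 0.239e-3, "idle": 32.7e-6}

BIT_RATE = 0.8e9  # bits per second: 400 MHz clock at double data rate


def link_power(mode, analog_on=True):
    """Total link power in watts for one operating mode."""
    analog = sum(ANALOG.values()) if analog_on else 0.0
    return analog + DIGITAL_RX[mode] + DIGITAL_TX[mode]


def energy_per_bit(power_w, bit_rate=BIT_RATE):
    """Energy per transferred bit in joules."""
    return power_w / bit_rate


if __name__ == "__main__":
    p_comm = link_power("data-comm")
    p_warm = link_power("warm-up")
    p_idle = link_power("idle", analog_on=False)
    print(f"data-comm: {p_comm * 1e3:.2f} mW")
    print(f"warm-up:   {p_warm * 1e3:.2f} mW")
    print(f"idle (analog off): {p_idle * 1e6:.1f} uW")
    print(f"energy per bit: {energy_per_bit(p_comm) * 1e12:.2f} pJ/bit")
```

With these inputs the sums come to about 4.27 mW in data-comm mode, 4.05 mW in warm-up mode, and 33.1 µW in idle mode with the analog parts switched off. The data-comm figure corresponds to roughly 5.34 pJ/bit.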

V. RESULTS

To evaluate the energy efficiency of the proposed SerDes system, post-layout simulations were conducted with Synopsys PrimeTime M-2016.12-SP3 for the digital part and Cadence Spectre 6.1 for the analog part. Table II shows the estimated power consumption at a V_DD of 1.2 V and an operating frequency of 400 MHz. Since the TX power is dominated by the analog part and the serializer, other parts are omitted.

According to the results, the total power consumption of the SerDes is 4.27 mW when the serial link is in the data-comm mode. The energy efficiency of the implemented link is 5.34 pJ/bit. A power of 4.05 mW is consumed during the warm-up mode because most of the RX components need to be activated. If the analog parts are turned off via an off-chip power switch during the idle state, the total link power is 33.1 µW.

If the required bandwidth is lower than 0.8 Gbps, the power consumption can be lowered further. However, since the CDR loop is designed for 0.8 Gbps, lowering its operating frequency causes a loop convergence problem. Instead, a duty-cycled operation [20], which periodically turns on the SerDes, is adopted in this paper. Fig. 5 shows its conceptual timing diagram. Here, T_Cycle, T_Act, T_Warm, and T_Idle represent one cycle period and the durations of the data-comm, warm-up, and idle modes, respectively. The data communication is conducted until the RX buffer in the global memory is filled up. Then, the link returns to the idle mode. When it is activated again, the warm-up mode settles the CDR loop with an overhead of T_Warm.

Using these assumptions and the values in Table II, the SerDes energy efficiency during the duty-cycled operation is obtained (see Fig. 6). For comparison with other existing peripherals, this graph also depicts the average read/write energy consumption of a single SPI (40-nm) and a Hyper Bus (65-nm) implementation with an I/O voltage of 1.8 V. The transferred data size of the Hyper Bus was 0.5 KB. The Hyper Bus is implemented with fast but power-hungry drivers, while the SPI adopts slow but low-power ones. Hence, the SPI and Hyper Bus operate up to 50 and 100 MHz, respectively. In other words, the maximum bandwidths of the former and the latter are 50 Mbps and 1.6 Gbps. As can be seen from the graph, the Hyper Bus consumes much more energy than the single SPI due to the I/O drivers, even though the Hyper Bus achieves a bandwidth over 1 Gbps. Thus, for conventional digital interfaces, there is a trade-off between the maximum bandwidth and energy efficiency. On the other hand, our SerDes achieves a high bandwidth and low energy consumption simultaneously. Indeed, the maximum bandwidth (BW_max) with the 16 KB RX buffer is 787 Mbps. Compared to the best case of the Single SPI (i.e. at 50 Mbps), the SerDes efficiency is 10.2x higher at 15.7x higher performance. Besides, even if the target bandwidth is lowered to 10 Mbps, the proposed SerDes achieves 8.3x lower energy than the SPI. Moreover, although the Hyper Bus achieves about 2 times higher bandwidth, its energy efficiency is 21x lower than that of our SerDes operating at BW_max.

Fig. 6. Energy consumption compared to other peripherals (x-axis: bandwidth in Mbit/s; y-axis: energy per bit in pJ/bit; curves: Single SPI, Quad SPI SDR, Quad SPI DDR, Octal SPI SDR, Octal SPI DDR, Hyper Bus, SerDes)

Based on the SPI measurement results (Fig. 6) and its switching activity, we estimated the energy efficiency of a Quad-SPI and an Octal-SPI operating at both DDR and SDR, which are also shown in Fig. 6. As can be seen from the graph, the parallel SPI lanes improve the energy efficiency, at the cost of additional overhead in terms of pad usage (Table III), which is critical for small and often pad-limited microcontrollers. Nevertheless, the proposed SerDes still achieves lower energy consumption, at a 3x smaller pad area cost. Indeed, the SerDes energy efficiency at BW_max is 2.56x higher than that of the DDR Octal SPI, combining the benefits of low pad frame overhead, high bandwidth, and high energy efficiency, which are essential features for next-generation near-sensor data analytics low-power architectures.

TABLE III
THE NUMBER OF DATA PADS NEEDED FOR EACH SOLUTION

Single SPI   Quad SPI   Octal SPI   Hyper Bus   This work
4            6          11          12          4

VI. CONCLUSION

In this paper, we presented the system architecture of a high-speed, low-power serial link. The proposed SerDes simultaneously provides high bandwidth and energy efficiency for embedded systems, unlike traditional digital interfaces such as SPIs and a Hyper Bus. The evaluation results showed that, thanks to the low-voltage swing property, the SerDes achieves about 10.2x higher energy efficiency at 15.7x higher bandwidth than the Single SPI link. Also, the duty-cycled operation allows the SerDes to achieve 8.3x higher energy efficiency than the Single SPI even at 10 Mbps, a low bandwidth requirement. Moreover, when compared to the Hyper Bus, the SerDes energy is 21x smaller.

ACKNOWLEDGMENT

This work was supported in part by the WiPLASH (Architecting More Than Moore – Wireless Plasticity for Heterogeneous Massive Computer Architectures) project funded by the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 863337.