A synchronous Gigabit Ethernet protocol stack for high-throughput UDP/IP applications
P. Födisch, B. Lange, J. Sandmann, A. Büchner, W. Enghardt, P. Kaever
Preprint typeset in JINST style - HYPER VERSION
P. Födisch,a,* B. Lange,a J. Sandmann,a A. Büchner,a W. Enghardtb,c,d and P. Kaevera

a Helmholtz-Zentrum Dresden - Rossendorf, Department of Research Technology, Bautzner Landstr. 400, 01328 Dresden, Germany
b OncoRay - National Center for Radiation Research in Oncology, Faculty of Medicine and University Hospital Carl Gustav Carus, Technische Universität Dresden, Fetscherstr. 74, PF 41, 01307 Dresden, Germany
c Helmholtz-Zentrum Dresden - Rossendorf, Institute of Radiooncology, Bautzner Landstr. 400, 01328 Dresden, Germany
d German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany

E-mail: [email protected]

ABSTRACT: State of the art detector readout electronics require high-throughput data acquisition (DAQ) systems. In many applications, e.g. for medical imaging, the front-end electronics are set up as separate modules in a distributed DAQ. A standardized interface between the modules and a central data unit is essential. The requirements on such an interface are varied, but almost always demand a high throughput of data. Beyond this challenge, a Gigabit Ethernet interface is predestined for the broad requirements of Systems-on-a-Chip (SoC) up to large-scale DAQ systems. We have implemented an embedded protocol stack for a Field Programmable Gate Array (FPGA) capable of high-throughput data transmission and clock synchronization. A versatile stack architecture for the User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP) over Internet Protocol (IP), as well as Address Resolution Protocol (ARP) and Precision Time Protocol (PTP), is presented. With a point-to-point connection to a host in a MicroTCA system we achieved the theoretical maximum data throughput limited by UDP for both 1000BASE-T and 1000BASE-KX links. Furthermore, we show that the random jitter of a synchronous clock over a 1000BASE-T link for a PTP application is below 60 ps.

KEYWORDS: Gigabit Ethernet; Synchronous Ethernet; Field Programmable Gate Array (FPGA); High-throughput Data Acquisition (DAQ); User Datagram Protocol (UDP); Precision Time Protocol (PTP); 1000BASE-T; 1000BASE-KX; MicroTCA.

* Corresponding author.

Contents
1. Introduction
2. Requirements for an embedded Gigabit Ethernet protocol stack
3. Implementation
4. Measurements and results
5. Summary
1. Introduction
Distributed data acquisition systems are common across different fields of application in nuclear physics or medical imaging. Depending on the application, there are various requirements for the interconnections of the submodules. The main challenge for an interface is the user acceptance with respect to handling and interoperability of different device types. In addition, the data throughput of the interface is an important criterion for usability and should not limit the performance of the whole data acquisition (DAQ) system. Even though proprietary interfaces can fulfill these requirements, standardized technologies benefit from industry-proven components and are essential for reliable applications. A popular and well accepted specification is the IEEE 802.3 Standard for Ethernet [1]. This standard specifies the physical layer used by Ethernet. Until now, connections up to 100 Gbit/s are specified and are being established by the industry. Nevertheless, for embedded systems a Gigabit Ethernet connection is the state of the art. A widespread technology is known as 1000BASE-T, which defines 1 Gbit/s Ethernet over twisted pair copper cables. The application of Gigabit Ethernet is not restricted to Local Area Networks; it also finds its way into board-to-board applications. For example, the backplane of a Micro Telecommunications Computing Architecture (MicroTCA) system should implement at least one port for an Ethernet connection [2], which is usually implemented as 1000BASE-KX on the physical layer. A link on the electrical backplane uses two differential pairs to establish a Gigabit Ethernet connection. With Gigabit Ethernet, the possibilities of applications range from high speed data transfer to clock synchronization in a distributed DAQ system [3].
This work covers the implementation and test of an embedded Gigabit Ethernet protocol stack for Field Programmable Gate Arrays (FPGAs). With a versatile stack architecture, we will demonstrate the performance of high-throughput data transfers with the User Datagram Protocol (UDP) and clock synchronization over the Precision Time Protocol (PTP). Our aim is to investigate the maximum achievable data throughput with an FPGA-based System-on-a-Chip (SoC) as data source and a PC as receiver. For our application we need a high-throughput DAQ to cope with the gamma rate expected for prompt gamma imaging in ion beam therapy [4]. This will be evaluated with 1000BASE-T and 1000BASE-KX links on a MicroTCA system. In addition, we will demonstrate the performance of a synchronized point-to-point connection as shown in [5] with a Xilinx FPGA and different hardware for the physical layer.
2. Requirements for an embedded Gigabit Ethernet protocol stack
Physical layer
The embedded Gigabit Ethernet protocol stack connects to the physical layer through the data link layer, following the Open Systems Interconnection (OSI) model as shown in fig. 1.

Figure 1. Layer stack according to the OSI model for our embedded Gigabit Ethernet protocol stack. The higher level protocols are implemented in the transport layer. For evaluation of a 1000BASE-T link we use the external ICs DP83865 from Texas Instruments and 88E1111 from Marvell. All other components are implemented in a single FPGA.

Higher level functions shall be implemented above the embedded protocol stack in an application layer. The IEEE 802.3 Standard for Ethernet defines different types of copper based connections between two transceivers over the physical layer. For 1000BASE-T, four pairs of cables are used for the signal transmission. The 1000BASE-KX technology uses two pairs for the transmission of data. The specific signaling and coding in the physical layer is done with industry-proven integrated circuits (ICs). For the 1000BASE-T signal coding we use ICs from Marvell (88E1111, [6]) and Texas Instruments (DP83865, [7]). To access the physical layer according to the 1000BASE-KX technology, we use a GTX transceiver of the Xilinx Kintex 7 FPGA [8] in combination with the "Xilinx 1G/2.5G BASE-X PCS/PMA Core" [9]. All physical layer transceivers (PHYs) have a common interface to the overlying data link layer and its Media Access Control (MAC). The MAC connects to the PHY via the Gigabit Media Independent Interface (GMII) and is implemented in the FPGA. Due to clear demands on high data throughput and hardware, we do not intend to provide compatibility to other PHYs with Reduced or Serial Gigabit Media Independent Interface (RGMII or SGMII) or even lower speeds as specified for 10BASE-T or 100BASE-T.
Data link layer
The data link layer with respect to fig. 1 contains the MAC and a management interface. The MAC controls the access to the PHY and transmits the data in an Ethernet packet. It processes the input and output signals of the GMII with a frequency of 125 MHz. An Ethernet packet encapsulates the Ethernet frame by adding the preamble and the start frame delimiter (SFD). The MAC composes (and also decomposes) the Ethernet packet with 8 bits per clock cycle (8 ns) from the Ethernet frame. This is essential for a maximum line rate of 1 Gbit/s. The standard for Ethernet demands that two consecutive Ethernet packets are separated by the interframe gap (IFG) of at least 96 bit times (96 ns).
Usually all PHYs provide a management interface for the configuration of their internal register banks. The Management Data Input/Output (MDIO) interface is used for a basic link configuration (e.g. autonegotiation advertisement).
Transport layer
The transport layer shall provide a stack for higher-level protocols encapsulated in the Ethernet frame. Its architecture must be easily extensible for any desired protocol in the layer stack. We target a maximized data throughput from the application layer for UDP. The theoretical data throughput of UDP with a payload of 1472 Byte, which corresponds to a Maximum Transfer Unit (MTU) of 1500 Byte for the Ethernet frame, is 114.09 MiB/s. If the host supports jumbo frames with an MTU of 9000 Byte, the maximum data throughput increases to 118.34 MiB/s. The embedded protocol stack should not limit the frame size of an Ethernet packet. Although various implementations of Gigabit Ethernet protocol stacks ([10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]) have been published, no solution achieves the theoretical maximum data throughput with UDP. Only [13] reached maximum performance, with a TCP/IP processor. A comparison of slice logic resources as done in [10], [14], [15], [16], [18] and [20] is not our intention, because each implementation is based on a different FPGA architecture. Whereas slice logic utilization is an important design criterion, it varies with generic configurations (e.g. FIFO depths) as well as the supported features (e.g. checksum calculations). Thus, a comparison to other implementations without the context of the application is difficult. In order to provide all the necessary functionality, we need a protocol stack that serves Address Resolution Protocol (ARP), Internet Control Message Protocol (ICMP), Precision Time Protocol (PTP) and UDP with the focus on maximum data throughput. A protocol's header should be partially configurable through a user interface but also calculated automatically (e.g. length fields). All stack layers support a checksum calculation if it is required by the protocol. In terms of Löfgren's classification proposed in [10], our requirements correspond to a "Medium UDP/IP" core.
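For reference, the theoretical figures above follow from the fixed per-packet overhead on a Gigabit Ethernet link (preamble and SFD, Ethernet, IP and UDP headers, FCS and interframe gap). The short sketch below reproduces them, assuming the 1 Gbit/s line rate of 125 MByte/s:

```python
# Theoretical UDP payload throughput on Gigabit Ethernet (sketch).
# Per-packet overhead: preamble+SFD (8) + Ethernet header (14) +
# IP header (20) + UDP header (8) + FCS (4) + interframe gap (12) = 66 Byte.
OVERHEAD = 8 + 14 + 20 + 8 + 4 + 12
LINE_RATE = 125_000_000  # Byte/s at 1 Gbit/s

def udp_throughput(payload: int) -> float:
    """Maximum UDP payload throughput in MiB/s for a given payload size."""
    return LINE_RATE * payload / (payload + OVERHEAD) / 2**20

print(f"{udp_throughput(1472):.2f} MiB/s")  # MTU 1500 -> 114.09 MiB/s
print(f"{udp_throughput(8972):.2f} MiB/s")  # MTU 9000 -> 118.34 MiB/s
```

This matches the 114.09 MiB/s and 118.34 MiB/s values quoted in the text.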
3. Implementation
Our implementation is designed as an Intellectual Property (IP) core in the hardware description language VHDL. It includes the MAC as well as the embedded protocol stack. For the 1000BASE-KX implementation the PHY is already included in the FPGA. As shown in fig. 2, the Gigabit Ethernet IP core gets its data from the application layer through a common First-In-First-Out (FIFO) interface. The asynchronous FIFO is designed to operate at frequencies of 125 MHz with a bit width of 32 bit and stores at least the payload of one packet. It is used to stream the application data with high throughput into the core's transport layer (UDP). Another interface to the core is built with a 32 bit microcontroller [23]. The microcontroller with its bus interface limits the data throughput for this interface to far below a protocol's limit, so it is used for slow-control applications over UDP, ICMP and ARP. The MDIO management interface is also handled by the microcontroller and is not shown. Fig. 2 shows the signals of the GMII and their directions between the MAC (embedded in the Gigabit Ethernet IP core) and the PHY. The same signals are used for the 1000BASE-KX implementation with the embedded Xilinx PHY. The system clocks as well as the necessary clocks for the PHY are generated by the FPGA's built-in phase-locked loop (PLL).

Figure 2. An overview of the SoC with the embedded Gigabit Ethernet IP core and its interfaces. In case of the 1000BASE-KX implementation, the PHY is included in the FPGA.
The functions of the MAC are restricted to the basic needs for interfacing the GMII. Fig. 3 shows the basic structure of the module for the transmission datapath of the MAC. For the transmission datapath, it composes the Ethernet packet with its preamble and SFD, which are initially stored in a shift register. In the following states, data is passed through this register and the arithmetic logic unit (ALU) for the checksum calculation. Finally, the 32 bit frame check sequence (FCS) is added at the end of the frame. The module's finite state machine (FSM) controls this dataflow and keeps the IFG at a programmable number of clock cycles. The module for the receiving datapath is built in the same way. It decomposes the Ethernet frame out of a received Ethernet packet and passes it to the transport layer. The MAC logic is capable of running at the speed of the transceiver clocks (125 MHz), so there is no need for additional FIFOs for clock domain crossing. This results in a deterministic latency for the complete datapath from the transport layer to the physical layer and vice versa. An example of a transmitted Ethernet packet is shown in fig. 4. The waveforms were captured with an integrated logic analyzer (Xilinx Chipscope).

Figure 3. Basic structure of the MAC for the transmission datapath: a simplified FSM, an 8x8 shift register, a 32 bit register and the CRC ALU. The output signals are connected to the GMII and the input is sourced by the transport layer.
Figure 4.
An example of a composed Ethernet packet through the MAC layer for GMII. The transmission starts at position 3 with the preamble (signal "phy_txd"). The MAC also adds the SFD (pos. 10), padding data (pos. 65-71) for a minimum payload length of 46 Byte and the FCS (pos. 71-75). The IFG is controlled with the tx_busy signal.
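The FCS appended by the MAC is the standard IEEE 802.3 CRC-32. As an illustration of what the CRC ALU computes (not the VHDL implementation itself), the same checksum can be obtained in software; the frame bytes below are an invented placeholder, not the frame from fig. 4:

```python
import zlib

def ethernet_fcs(frame: bytes) -> bytes:
    """IEEE 802.3 FCS: CRC-32 over the frame (without preamble and SFD),
    transmitted least significant byte first."""
    return zlib.crc32(frame).to_bytes(4, "little")

# Invented minimal frame content for illustration: broadcast destination,
# zero source address, EtherType 0x0800 (IP), 46 Byte of zero payload.
frame = bytes.fromhex("ffffffffffff") + bytes(6) + b"\x08\x00" + bytes(46)
packet = frame + ethernet_fcs(frame)  # what the MAC puts on the wire
```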
With a look at the OSI reference model and its layers for a network communication, the stack architecture implies a dataflow from the top layer to the bottom layer. That means that the application passes its data from the transport layer to the data link layer until it is transmitted by the physical layer. Data is thus "pushed" from the source to the sink, and we call this dataflow the "Data-Push" model, shown in fig. 5. The scheme in fig. 5 implies that the application layer has valid data which is transported through the UDP layer (layer 3). The underlying IP layer (layer 2) starts its transmission with one clock cycle delay, beginning with its own data for the IP header. The data coming from the upper layer has to be buffered in the underlying layer while this layer sends its own data. The same situation occurs when the IP layer passes its data to the Ethernet layer (layer 1). Finally, the dataflows initiated at the earlier time steps are encapsulated at the later time steps marked in fig. 5. To keep this data valid for the latency during transmission, data buffers are needed. As a consequence, a layer has to buffer at least the data of the overlying layer. One can also easily imagine the situation where two layers have valid data and pass it to a shared underlying layer. In this case, the number of data buffers doubles.

Figure 5. An example of a dataflow through the stack layers driven by the "Data-Push" model.

A consistent data flow through all layers with the "Data-Push" model is handled with the appropriate number of data buffers. This model consumes additional memory for redundant data. An alternative approach for a dataflow is shown in fig. 6. We call this the "Data-Pull" model. In contrast to the "Data-Push" model from fig. 5, the dataflow is initiated by the low-level layer.
Figure 6. An example of a dataflow through the stack layers driven by the "Data-Pull" model.
The data of the overlying layers is just passed through a single register stage at the time when it is encapsulated into the frame of the underlying layer. This reduces the amount of data buffers tremendously, to a single register at each interconnection. In the example shown in fig. 6, the latency from the UDP data in layer 3 to the time when it is encapsulated in layer 1 is reduced to two clock cycles. Each register stage in the underlying layer introduces one clock cycle delay. Of course the dataflow could be optimized to zero latency without additional register stages, but this would cause timing problems. Data buffers are needed in the application layers as well, but buffer redundancy in comparison to the "Data-Push" model is eliminated. The costs for this implementation are a simple arbiter and control logic and range far below those of the "Data-Push" approach. The basic scheme of the interconnections of layers is shown in fig. 7. All modules in the same layer N+1 pass their state to the arbiter logic. In the simplest case this is a FIFO state which indicates whether there is valid data to send or not. In case of valid data, the arbiter decides which module of a layer is served first and passes this information to the underlying layer N. The module from layer N controls the dataflow of the overlying layer with its control bus. After all, the data from layer N+1 is multiplexed to the receiving module in layer N.

Figure 7. Interconnections of layers with an arbiter and control logic. This architecture eliminates the need for redundant data buffers in a layer stack.

A real data transfer of the implemented "Data-Pull" model is shown in fig. 8. The example in fig. 8 shows at its initial clock cycle at position 1 that the UDP layer has valid data to send (signal "udp_fifo_empty" is low). In conjunction with the arbiter bus, the IP layer also reports that there is valid data to send (signal "ip_fifo_empty" is low). With this condition the Ethernet layer starts the transmission of data (signal "tx_en" is high) at pos. 2.

Figure 8. Example of the dataflow for a UDP packet through the transport layers with the "Data-Pull" model.

At position 14 the Ethernet layer pulls the data from the overlying layer by setting the signal "tx_next_eth" to high. The Ctrl Demux from the interconnection logic of the layers shown in fig. 7 switches this signal to the IP layer (signal "tx_start_ip" is high). At the next clock cycle (pos. 15), the IP layer transmits its data, occurring with an additional delay of one clock cycle in the frame of the Ethernet layer (signal "txd", pos. 16). The IP layer encapsulates the application data from the UDP layer in the same way into its frame. This can be seen by the control signals "tx_next_ip" and "tx_start_udp" at position 33 and the UDP data (signal "udp_txd") and the IP data (signal "ip_txd") at pos. 34 and 35, respectively. Finally, the MAC composes the entire packet as shown in fig. 4.
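The pull-driven encapsulation can be modelled in a few lines: each layer emits its own header and then pulls the remaining bytes from the overlying layer, so only one register stage sits between adjacent layers. This is a behavioural sketch in Python, not the VHDL implementation; the header and payload bytes are placeholders:

```python
from typing import Iterable, Iterator

def layer(header: bytes, upper: Iterable[int]) -> Iterator[int]:
    """Emit this layer's own header, then pull payload bytes from the
    overlying layer (the "tx_next" handshake). In hardware each such
    hop costs one clock cycle of latency."""
    yield from header
    yield from upper

# Placeholder (truncated) headers and payload for illustration only.
UDP_DATA = b"\x0c\x76\x98\x5a"
IP_HDR, ETH_HDR = b"\x45\x00", b"\x08\x00"

frame = bytes(layer(ETH_HDR, layer(IP_HDR, iter(UDP_DATA))))
assert frame == ETH_HDR + IP_HDR + UDP_DATA  # headers prepended outside-in
```

No layer ever buffers the upper layer's payload; it is consumed exactly when it is encapsulated, which is the point of the "Data-Pull" model.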
3.4 Clock synchronization

An important issue in a distributed DAQ is a uniform clock distribution. Although a dedicated clock line is a simple and precise solution, it cannot be used for an absolute synchronization of all timestamps in the system. For this purpose an additional data signal for the transmission of a known timestamp reference is needed. The PTP offers the possibility to synchronize the timestamps over a data link. Additionally, a Gigabit Ethernet link has the property that a transmission clock is embedded in the datastream, because the transferred data is synchronous to this reference clock. As a consequence, a receiver can recover this clock frequency. In a 1000BASE-T application, the slave recovers the master's clock out of the data stream. This task is done by the PHY (see fig. 9). So it is possible to synchronize the clock signals as well as the timestamps over a single Gigabit Ethernet link. It is also known that the clock offset between a master and a slave cannot be corrected with PTP below a resolution of 8 ns (corresponding to the transceiver clock of 125 MHz) without a phase alignment of the clocks. An accurate implementation already exists in the White Rabbit project [3], but it does not support a 1000BASE-T link by default. A synchronization over a 1000BASE-T link was done by [5]. They achieved a precision of 180 ps with the DP83865 from Texas Instruments and an FPGA from Altera. The limiting factor was the jitter of the FPGA's built-in PLL. Our implementation is based on Xilinx FPGAs with an improved jitter, so we want to determine the absolute precision achievable with these devices and different ICs for the physical layer. The implemented clocking scheme is shown in fig. 9. Each PHY is configured via the MDIO interface to act as a master or as a slave. During the autonegotiation procedure, these configurations are advertised.
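The absolute timestamp correction uses the standard PTP two-step exchange: from the Sync/Follow_Up and Delay_Req/Delay_Resp timestamps t1..t4, the slave computes the mean path delay and its clock offset, assuming a symmetric path. A minimal sketch of this standard calculation (the timestamp values are invented):

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Standard PTP delay request-response computation.
    t1: master sends Sync, t2: slave receives it,
    t3: slave sends Delay_Req, t4: master receives it.
    Assumes a symmetric path delay, as PTP does."""
    delay = ((t2 - t1) + (t4 - t3)) / 2   # one-way path delay
    offset = ((t2 - t1) - (t4 - t3)) / 2  # slave clock minus master clock
    return offset, delay

# Invented example in ns: slave runs 100 ns ahead, path delay is 40 ns.
offset, delay = ptp_offset_and_delay(t1=0, t2=140, t3=200, t4=140)
assert (offset, delay) == (100.0, 40.0)
```

The slave then subtracts the computed offset from its timestamps; the remaining sub-8-ns misalignment is what the recovered synchronous clock in fig. 9 addresses.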
Figure 9.
Scheme of the clock synchronization through a point-to-point connection over a 1000BASE-T link. One PHY acts as master and embeds the clock reference into the datastream. The slave recovers a synchronized clock signal with a frequency of 125 MHz. A PLL inside the FPGA is used to build up the clock tree. Our Ethernet IP core provides synchronized timestamps and a pulse per second for the test setup.
4. Measurements and results
For our performance test on a 1000BASE-T link we use the Xilinx evaluation board SP605 equipped with a Spartan 6 (LX45T) FPGA and the PHY 88E1111 from Marvell. We also use an FPGA Mezzanine Card (FMC) equipped with two PHYs from Texas Instruments (DP83865) attached to the SP605. The host is a MicroTCA crate equipped with an Advanced Mezzanine Card (AMC) CPU module from Concurrent Technologies (AM 900/412-42) and a MicroTCA Carrier Hub (MCH) from N.A.T. (NAT-MCH-PHYS). The CPU module provides two 1000BASE-T ports at the front and two 1000BASE-KX ports at the backplane. A 1000BASE-KX link to the CPU is established with a Kintex 7 (325T) on the HGF-AMC from DESY/KIT through the MicroTCA backplane and the switch from the MCH. The operating system on the host is Ubuntu.

To evaluate the MAC layer and the latency of the entire stack, we measured its output signals on the GMII. A maximum throughput is achieved if the transmit enable signal (see "phy_txen" in fig. 4) is high all the time except during the IFG. A constant latency is achieved if a transmission cycle and the arrival time of a packet at the receiver have a time deviation much smaller than a clock cycle. Both conditions could be experimentally verified, which indicates that the MAC layer is capable of transferring the maximum throughput with a constant latency (see fig. 10).

Figure 10. The transmission enable signal (signal "PHY1_TX_EN") of the transmitting MAC and the receive enable signal (signal "PHY2_RX_EN") of the receiving PHY. The oscilloscope measurement verifies that the MAC keeps the IFG at 96 ns while sending data with maximum throughput and constant latency. The packet length of 592 ns corresponds to a UDP payload of 20 Byte.

The measurement shown in fig. 10 was captured with an oscilloscope during a transmission of UDP packets with a fixed payload of 20 Byte. This test was chosen to verify a maximum throughput, a constant latency of the core (see measurements "packet length" and "IFG" in fig. 10) and the overall system latency between two PHYs (see measurement "Phy1-Phy2 latency" in fig. 10). For this setup we used the Marvell 88E1111 on both sides, connected by a cable of 50 cm length.
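The 592 ns packet length in fig. 10 follows directly from the byte count of the packet at 8 ns per byte on the GMII. A quick check (sketch):

```python
NS_PER_BYTE = 8  # GMII: 8 bit per 125 MHz clock cycle

def packet_duration_ns(udp_payload: int) -> int:
    """On-wire duration of one UDP packet, excluding the interframe gap."""
    ip_payload = 8 + udp_payload            # UDP header + payload
    eth_payload = max(46, 20 + ip_payload)  # IP header, padded to min. 46 Byte
    packet = 8 + 14 + eth_payload + 4       # preamble+SFD, MAC header, FCS
    return packet * NS_PER_BYTE

assert packet_duration_ns(20) == 592    # matches the measurement in fig. 10
assert packet_duration_ns(1472) == 12208  # full MTU 1500 packet
```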
To check the performance of our FPGA implementation with a 1000BASE-T PHY we directly established a point-to-point connection between the FPGA and the CPU module's front connector. The host serves a UDP socket where the incoming data throughput is measured. The data from the FPGA contains an increasing 32 bit counter value which is used to identify a missing packet or a corrupted datastream. For this measurement the throughput of UDP on a Gigabit Ethernet link is our reference. As mentioned in sec. 2, this value is 114.09 MiB/s for a payload of 1472 Byte. If the payload is decreased, the data throughput decreases as well because of the increasing share of protocol overhead. Table 1 shows the achievable data throughput in dependence of the UDP payload. Additional overhead in the Ethernet packet limits the line rate; thus the Ethernet Standard defines an upper limit for the data throughput.
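The host-side check can be done with a plain UDP socket. The sketch below separates the gap detection so it can be tested offline; the port number and the packet layout (a 32 bit big-endian counter at the start of the payload) are assumptions for illustration, not the published test software:

```python
import socket
import struct

def count_missing(counters):
    """Count packets lost in a stream of increasing 32 bit counter values."""
    missing, last = 0, None
    for c in counters:
        if last is not None:
            missing += (c - last - 1) & 0xFFFFFFFF  # tolerate wrap-around
        last = c
    return missing

def receive(port=1025, npackets=1000):
    """Yield the leading 32 bit counter of each received UDP packet (sketch)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    for _ in range(npackets):
        data, _ = sock.recvfrom(9000)  # large enough for jumbo frames
        yield struct.unpack(">I", data[:4])[0]

# Offline check of the gap detector:
assert count_missing([1, 2, 3, 5]) == 1
```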
Table 1.
Theoretical data throughput in dependence of the payload of a UDP packet.
UDP payload / Byte Data throughput / (MiB/s) Line rate / (1 GBit/s)
Figure 11. Distribution of the data throughput for 1472 Byte UDP payload, measured over 9 hours (mean 114.112 MiB/s, standard deviation 2196 Byte/s, 3.2k samples).

Measured values above this limit are caused by frequency uncertainties of the transmission clock. The reference values from tab. 1 are calculated at a clock frequency of 125 MHz. A fixed deviation of that frequency and the lack of a precise time measurement on a Linux system can cause a data throughput value above the reference value. These tests also show the importance of an efficient host as receiver: if the host is not configured appropriately, packet losses occur.

Table 2.
Measured data throughput in dependence of the UDP payload. The tests were performed with different hardware platforms. Data throughput is the mean value over more than 100 s.
Physical layer Hardware UDP payload / Byte Data throughput / (MiB/s)
Table 3.
The results of the ICMP layer test with ping requests. The host generated 1000 ping requests and the RTT was measured.
Physical layer Hardware RTT min / ms RTT mean / ms RTT max / ms Std.-dev / ms

4.2 Synchronization
The performance of the clock synchronization is limited by the accuracy of the clock recovery system in the signal chain of the 1000BASE-T slave. Whereas the master's clock can achieve the desired precision by choosing an appropriate clock source, the precision of the slave's clock depends on the components in its signal chain for clock distribution and recovery (see fig. 9). For our evaluation hardware, the accuracy of the signal chain is mainly determined by the PHY, which is responsible for the clock recovery out of the datastream. The second component which influences the absolute precision is the FPGA, where the recovered clock is used for timestamp generation. Usually a PLL inside the FPGA is used to build the clock tree for all clock domains, so we want to evaluate whether the built-in PLL limits the overall system. At first we measured the phase noise of a clock source with very low jitter, which is used as the input signal for the PLL. The phase noise is correlated with the random jitter and therefore determines the precision of the timing system. The measurements of a low jitter clock and the performance of the FPGA's (Spartan 6 LX45T) built-in PLL with that input are shown in fig. 12. These measurements were taken with a HA7062B phase noise analyzer from Holzworth Instrumentation and the signal generator SMA100A from Rohde & Schwarz as low jitter clock source.

Figure 12. A measurement of the phase noise of a 125 MHz low jitter clock source (red) which sources a PLL in a Spartan 6 LX45T. The corresponding output of the PLL (blue) has an integrated phase noise of 6.47 ps in the range of 10 Hz to 1 MHz.

Although various configurations are possible for the PLL, for this measurement we set both the multiplier and the divider value to 8. The integrated phase noise of the Xilinx PLL was found to be 6.47 ps in the range of 10 Hz to 1 MHz. Without any additional hardware, this constitutes a design limit for the precision of synchronous timestamp generation with the FPGA. To find out the random jitter of the clock synchronization over a 1000BASE-T link, we set up a point-to-point link between a master and a slave PHY and measured the clock-to-clock jitter in the time domain. The master's clock triggers the measurement of the time difference between the rising edges of both clocks. The results of the measurement for the PHY DP83865 (master and slave) are shown in fig. 13.

Figure 13. Clock-to-clock jitter between the master and the slave PHY (both DP83865). The mean time difference is 3.901 ns and the standard deviation of the distribution is 55.097 ps (484k samples).

The clock signal is measured at an output pin of the FPGA with a frequency of 125 MHz (see fig. 9). It is buffered with an output register (ODDR2 primitive from Xilinx). During the measurement in fig. 13 the FPGA sends UDP packets with maximum throughput and PTP packets at an interval of 1 s. We also bypassed the PLL and distributed the recovered slave clock with an ordinary built-in clock buffer to the timing logic. With this setup, the jitter increased from 55.097 ps to 64.273 ps. Finally, we repeated the measurement of fig. 13 with the PHY 88E1111 for the master and the slave. With this setup we achieved a clock-to-clock jitter of 70.33 ps. In both setups the master's clock source was a crystal oscillator with approximately 6 ps random jitter (measured with the phase noise analyzer in the range from 10 Hz to 1 MHz). As a result, we can state that the precision is influenced by all components in the signal chain. It depends mainly on the clock recovery system of the PHY and the ability of the FPGA's PLL to reduce random jitter. In addition to the measurement of the clock-to-clock jitter, we have taken measurements to estimate the synchronization of the master's and slave's timestamps. Both devices run on the synchronized clock signal with the same frequency. An absolute synchronization of the timestamps is performed with PTP every second. Each synchronized device generates one pulse per second (PPS) at an output of the FPGA. A measurement of the time difference between the PPS signals of the master and the slave is shown in fig. 14. The measurement ran over 13 hours in the lab and shows that the timestamps are synchronized with a random jitter of 58.932 ps. The digital logic for timestamp generation in the FPGA was sourced by a clock signal with a frequency of 125 MHz from the built-in PLL. For the measurements shown in fig. 14 we used the DP83865. The same measurements were repeated with the 88E1111 PHY and resulted in a random jitter of 71.536 ps for the timestamp synchronization in a short-term measurement (approx. 2 hours). All measurements show a constant offset of up to 8 ns which cannot be reduced with the PTP. Further investigations have to be done with excessive temperature stress for the PHYs and the FPGAs.
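The integrated phase noise values quoted in picoseconds relate to the single-sideband (SSB) phase noise curves of fig. 12 through the common conversion jitter = sqrt(2 * 10^(A/10)) / (2 pi f0), with A the integrated SSB phase noise in dBc and f0 the carrier frequency. A sketch of the conversion, assuming this standard relation:

```python
import math

def rms_jitter(ipn_dbc: float, f0: float) -> float:
    """RMS jitter in seconds from integrated SSB phase noise (dBc) at carrier f0."""
    return math.sqrt(2 * 10 ** (ipn_dbc / 10)) / (2 * math.pi * f0)

def integrated_phase_noise(jitter: float, f0: float) -> float:
    """Inverse: integrated SSB phase noise in dBc for a given RMS jitter."""
    return 10 * math.log10((jitter * 2 * math.pi * f0) ** 2 / 2)

# 6.47 ps at a 125 MHz carrier corresponds to roughly -49 dBc integrated
# phase noise (assuming the standard conversion above).
print(f"{integrated_phase_noise(6.47e-12, 125e6):.1f} dBc")
```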
Figure 14. Oscilloscope measurement of the time difference between the master's and the slave's timestamps. The FPGA outputs a PPS signal (for this measurement every 268.4 ms) which shows the precision of the absolute timestamp synchronization with PTP. The standard deviation of the time difference is 58.932 ps with a constant offset of about 3.94 ns.

Our Ethernet IP core can be configured with an arbitrary FIFO size for the application above the UDP layer. Our basic configuration consists of two channels for the application interface to the UDP layer: one channel is interfaced by the microcontroller and one by a high-throughput application. On each interface there is one FIFO for the payload with a depth of at least two UDP packets. The payload size of one packet is set by generics and is 1472 Byte by default. For the support of jumbo frames this size can easily be adjusted to 8972 Byte. Larger FIFO depths and payload sizes are possible as well. The FIFOs are implemented on dual-port block memory integrated in the Xilinx FPGA; they can also be placed on distributed slice registers. The ICMP, PTP and ARP layers can each store one packet to send, and all receiving datapaths are configured to store one packet as well. Because many different configurations are possible, we refrain from a comparison with other implementations. The resource utilization reported by the Xilinx ISE 14.7 tools for the Ethernet stack with support for ARP, PTP, ICMP and UDP is presented in tab. 4 and tab. 5. The 1000BASE-T implementation with support for PTP and an MTU of 1500 Byte occupies 1792 of the 6822 slices of a Spartan 6 FPGA, corresponding to 26.27 % (16.42 % without PTP) of the total slices.

Table 4. Slice logic utilization for the Gigabit Ethernet stack with a Spartan 6 (LX45T)

Module     Slices  Slice Reg  LUTs  LUTRAM  BRAM
MAC           140        459   345      16     0
Ethernet      246        479   648       0     0
ARP           163        498   492       0     0
IP            274        546   669       0     0
ICMP           60        169   108      24     1
UDP           237        551   573       1     9
PTP           672       1890  2071       0     0
Sum          1792       4592  4906      41    10

The implementation on Kintex 7 is designed for a 1000BASE-KX link on a MicroTCA backplane. It uses a Xilinx IP core with a GTX transceiver as PHY, which consumes additional logic but does not need an external PHY. This implementation aims only at maximum data throughput and is not designed to perform a synchronization over the MicroTCA backplane; thus, the PTP layer is not included. The 1000BASE-KX implementation without support for PTP and with an MTU of 1500 Byte occupies 1359 of the 50959 slices of a Kintex 7 FPGA, corresponding to 2.67 %.

Table 5. Slice logic utilization for the Gigabit Ethernet stack with a Kintex 7 (325T)

Module        Slices  Slice Reg  LUTs  LUTRAM  BRAM
GMII_to_GTX      446        997   826      71     0
MAC               94        299   276      32     0
Ethernet         137        378   419       0     0
ARP              173        498   485       0     0
IP               250        546   684       0     0
ICMP              61        169   114      24     1
UDP              198        580   580       1     5
Sum             1359       3467  3384     128     6

All reports for slice logic utilization also include some logic for internal tests and debug options.
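The utilization percentages quoted in the text can be checked against the table sums. The "without PTP" figure follows if one assumes that removing the PTP module frees exactly its 672 slices from tab. 4:

```python
def occupancy_percent(used, total):
    # total occupied slices as a percentage, rounded as in the text
    return round(100 * used / total, 2)

print(occupancy_percent(1792, 6822))        # Spartan 6 with PTP → 26.27
print(occupancy_percent(1792 - 672, 6822))  # PTP module removed → 16.42
print(occupancy_percent(1359, 50959))       # Kintex 7 stack     → 2.67
```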
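The default payload sizes above and the theoretical UDP throughput limits reported in this paper follow directly from the fixed protocol overheads, assuming standard header sizes (20 Byte IPv4 without options, 8 Byte UDP, 14 Byte MAC header, 4 Byte FCS, 8 Byte preamble, 12 Byte interframe gap) at the 1 Gbit/s line rate:

```python
IP_HDR, UDP_HDR = 20, 8          # IPv4 header (no options), UDP header
ETH_OVERHEAD = 14 + 4 + 8 + 12   # MAC header + FCS + preamble + interframe gap
LINE_RATE = 125_000_000          # 1 Gbit/s expressed in byte/s

def udp_payload(mtu):
    # maximum UDP payload per packet for a given MTU
    return mtu - IP_HDR - UDP_HDR

def throughput_mib(mtu):
    # payload bytes delivered per second at line rate, in MiB/s
    wire_frame = mtu + ETH_OVERHEAD  # bytes on the wire per frame
    return round(LINE_RATE * udp_payload(mtu) / wire_frame / 2**20, 1)

print(udp_payload(1500), throughput_mib(1500))  # → 1472 114.1
print(udp_payload(9000), throughput_mib(9000))  # → 8972 118.3
```

This reproduces both the default generic values (1472 Byte, 8972 Byte) and the measured maximum throughputs of 114.1 MiB/s and 118.3 MiB/s, confirming that the implementation saturates the link.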
5. Summary
Motivated by the need for a high-throughput UDP application, we have presented an entire stack architecture for a Gigabit Ethernet interface on an FPGA. The stack was built for the protocols UDP, ICMP, IP, ARP and PTP and can easily be extended or cut down in functionality. For a straightforward implementation we showed two basic models for the dataflow in a stacked architecture. Our embedded Gigabit Ethernet protocol stack is designed with the "Data-Pull" model to eliminate redundant buffers. A clear modular architecture for each layer, with control and arbiter logic at the interconnections, keeps this implementation versatile. The underlying MAC and physical layers are also replaceable. All modules are written in VHDL and tested on Xilinx Spartan 6 and Kintex 7 devices. We demonstrated the data throughput with a UDP application on a 1000BASE-T and a 1000BASE-KX link. In both cases we achieved the maximum data throughput of 114.1 MiB/s with an MTU of 1500 Byte and 118.3 MiB/s with jumbo frames of 9000 Byte. The overall performance for other use cases is also excellent. Finally, we investigated the performance of a clock synchronization over a 1000BASE-T link. Depending on the PHY, we achieved a precision of 55.1 ps for the clock to clock jitter between the master and the slave. An absolute synchronization of timestamps was performed with PTP. The long-term test showed a standard deviation of 58.9 ps for the synchronized timestamps. Due to the generic data interface, this UDP/IP stack can easily be adapted to detector applications where high data throughput is required. For precise timing applications the relative timing is in the sub-nanosecond range, whereas the absolute accuracy remains within the limits of PTP.