Achieving reliable UDP transmission at 10 Gb/s using BSD socket for data acquisition systems
Prepared for submission to JINST
M.J. Christensen,a T. Richtera

aEuropean Spallation Source, Data Management and Software Centre, Ole Maaløes vej 3, 2200 Copenhagen N, Denmark
E-mail: [email protected]
Abstract: User Datagram Protocol (UDP) is a commonly used protocol for data transmission in small embedded systems. UDP as such is unreliable and packet losses can occur. The achievable data rates can suffer if optimal packet sizes are not used. The alternative, Transmission Control Protocol (TCP), guarantees the ordered delivery of data and automatically adjusts transmission to match the capability of the transmission link. Nevertheless, UDP is often favored over TCP due to its simplicity and small memory and instruction footprints. Both UDP and TCP are implemented in all larger operating systems and commercial embedded frameworks. In addition, UDP is also supported on a variety of small hardware platforms such as Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs); this is not so common for TCP. This paper describes how high speed UDP based data transmission with very low packet error ratios was achieved. The near-reliable communications link is used in a data acquisition (DAQ) system for the next generation, extremely intense neutron source, the European Spallation Source. This paper presents measurements of UDP performance and reliability as achieved by employing several optimizations. The measurements were performed on Xeon E5 based CentOS (Linux) servers. The measured data rates are very close to the 10 Gb/s line rate, and zero packet loss was achieved. The performance was obtained utilizing a single processor core as transmitter and a single core as receiver. The results show that support for transmitting large data packets is a key parameter for good performance. The optimizations for throughput are: MTU, packet sizes, tuning of Linux kernel parameters, thread affinity, core locality and efficient timers.

Keywords: Computing (architecture, farms, GRID for recording, storage, archiving, and distribution of data), Data acquisition concepts, Software architectures (event data models, frameworks and databases)

1 Introduction

The European Spallation Source [1] is a next generation neutron source currently being developed in Lund, Sweden. The facility will initially support about 16 different instruments for neutron scattering. In addition to the instrument infrastructure, the ESS Data Management and Software Centre (DMSC), located in Copenhagen, provides infrastructure and computational support for the acquisition, event formation and long term storage of the experimental data. At the heart of each instrument is a neutron detector and its associated readout system. Both detectors and readout systems are currently in the design phase and various prototypes have already been produced [2–5]. During experiments data is produced at high rates: detector data is read out by custom electronics, and the readings are converted into UDP packets by the readout system and sent to event formation servers over 10 Gb/s optical Ethernet links. The event formation servers are based on general purpose CPUs and it is anticipated that most if not all data reduction at ESS will be done in software. This includes reception of raw readout data, threshold rejection, clustering and event formation. UDP is a simple protocol for connectionless data transmission [6] and packet loss can occur during transmission. Nevertheless, UDP is widely used, for example in the RD51 Scalable Readout System [7] and the CMS trigger readout [8], both using 1 Gb/s Ethernet. The two central components are the readout system and the event formation system.
The readout system is a hybrid of analog and digital electronics. The electronics convert deposited charges into electric signals, which are digitized and timestamped. In the digital domain, simple data reduction such as zero suppression and threshold based rejection can be performed. The event formation system receives these timestamped digital readouts and performs the necessary steps to determine the position of the neutron. These processing steps are different for each detector type. The performance of UDP over 10G Ethernet has been the subject of previous studies [9, 10], which measured TCP and UDP performance and CPU usage on Linux using commodity hardware. Both studies apply a certain set of optimizations but otherwise use standard Linux. In [9] the transmitting process is found to be a bottleneck in terms of CPU usage, whereas a comparison between Ethernet and InfiniBand [10] reinforces the earlier results and concludes that Ethernet is a serious contender for use in a readout system. This study is aimed at characterizing the performance of a prototype data acquisition system based on UDP. The study is not so much concerned with transmitter performance, as we expect to receive data from an FPGA based platform capable of transmitting at wire speed at all packet sizes. Instead, comparisons between the measured and theoretically possible throughput, as well as measurements of packet error ratios, are presented. Finally, this paper presents strategies for optimizing the performance of data transmission between the readout system and the event formation system.
2 UDP versus TCP

Since TCP is reliable and has good performance whereas UDP is unreliable, why not always just use TCP? The pros and cons are discussed in the following. Both TCP and UDP are designed to provide end-to-end communications between hosts connected over a network of packet forwarders. Originally these forwarders were routers, but today the group of forwarders includes firewalls, load balancers, switches, Network Address Translator (NAT) devices, etc. TCP is connection oriented, whereas UDP is connectionless. This means that TCP requires a connection to be set up before data can be transmitted. It also implies that TCP data can only be sent from a single transmitter to a single receiver. In contrast, UDP does not have a connection concept, and UDP data can be transmitted as either Internet Protocol (IP) broadcast or IP multicast. As mentioned earlier, the main argument for UDP is that it is often supported on smaller systems where TCP is not; a notable example is FPGA based systems (see [11] for one such example). Moreover, as explained below, some of the TCP features do not actually improve performance and reliability in the case of special network topologies.
2.1 Congestion

Any forwarder is potentially subject to congestion and can drop packets when unable to cope with the traffic load. TCP was designed to react to this congestion. Firstly, TCP has a slow start algorithm whereby the data rate is ramped up gradually in order not to contribute to the network congestion itself. Secondly, TCP will back off and reduce its transmission rate when congestion is detected. In a readout system such as ours, the network consists only of a data sender and a data receiver with an optional switch connecting them. Thus the only places where congestion occurs are at the sender or the receiver. The readout system will typically produce data at near constant rates during measurements, so congestion at the receiver will result in reduced data rates by the transmitter when using TCP. This first causes buffering at the transmitting application until the buffer is full, and eventually packets are lost. For some detector readouts it is not even evident that guaranteed delivery is necessary. In one detector prototype we discarded around 24% of the data due to threshold suppression, so spending extra time making an occasional retransmission may not be worth the added complexity.
2.2 Connections

Since TCP requires the establishment of a connection, both the receiving and transmitting applications must implement additional state to detect the possible loss of a connection, for example upon reset of the readout system after a software upgrade or a parameter change. With UDP, the receiver will just 'listen' on a specified UDP port whenever it is ready and receive data when it arrives. Correspondingly, the transmitter can send data whenever it is ready. UDP reception supports many-to-one communication, allowing for example two or more readout systems to send to a single receiver. For TCP to support this would require handling multiple TCP connections.
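To illustrate how little state this requires, the following minimal receiver sketch binds a port and reads datagrams as they arrive. This is not the EFU code; the port number is an arbitrary example and error handling is trimmed.

#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int s = socket(AF_INET, SOCK_DGRAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(9000);           // arbitrary example port
  if (bind(s, (sockaddr *)&addr, sizeof(addr)) < 0)
    return 1;
  char buf[9000];                        // room for jumbo frame payloads
  for (;;) {
    ssize_t n = recvfrom(s, buf, sizeof(buf), 0, nullptr, nullptr);
    if (n < 0)
      break;                             // error; a real receiver would handle this
    // process n bytes of readout data here
  }
  close(s);
  return 0;
}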
2.3 Multicast

UDP can be transmitted over IP as multicast. This means that a single sender can reach multiple receivers without any additional programming effort (a sketch of joining a multicast group is shown at the end of this section). This can be used for seamless switchovers, redundancy, load distribution, monitoring, etc. Implementing this in TCP would add complexity to the transmitter.

In summary: for our purposes UDP appears to have more relevant features than TCP. Thus it is preferred, provided we can achieve the desired performance and reliability.
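As a sketch of the multicast case mentioned above, a receiver can join a group with a single setsockopt() call. The group address is an illustrative value, not one used in this work.

#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>

// Join an IP multicast group on an already-bound UDP socket.
int join_group(int sock, const char *group) {
  ip_mreq mreq{};
  mreq.imr_multiaddr.s_addr = inet_addr(group);   // e.g. "239.0.0.1"
  mreq.imr_interface.s_addr = htonl(INADDR_ANY);  // let the kernel pick the interface
  // After this call the socket also receives datagrams addressed to the group.
  return setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
}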
3 Optimizations

This section explains the factors that limit the performance, reproducibility or accuracy of the measurements. Here we also discuss the optimization strategies used to achieve the results.
3.1 Ethernet frames and MTU

An Ethernet frame consists of a fixed 14 byte header, the Ethernet payload, padding and a 4 byte checksum field. Padding is applied to ensure a minimum Ethernet packet size of 64 bytes. There is a minimum gap between Ethernet frames of 20 bytes, called the Inter Frame Gap (IFG). Standard Ethernet supports Ethernet payloads from 1 to 1500 bytes. Ethernet frames with payload sizes above 1500 bytes are called jumbo frames. Some Ethernet hardware supports payload sizes of 9000 bytes, corresponding to Ethernet frame sizes of 9018 bytes when including the header and checksum fields. This is shown in Figure 1 (top). The Ethernet payload consists of IP and UDP headers as well as user data, as illustrated in Figure 1 (bottom). For any data to be transmitted over Ethernet, the factors influencing the packet and data rates are the link speed, the IFG and the payload size. The largest supported Ethernet payload is called the Maximum Transmission Unit (MTU). For further information see [12] and [13].
Figure 1. (top) Ethernet frames are separated by a 20 byte inter frame gap. (bottom) The Ethernet, IP and UDP headers take up 46 bytes. The largest UDP user data size is 1472 bytes on most Ethernet interfaces due to a default MTU of 1500. This can be extended on some equipment to 8972 bytes by the use of jumbo frames.
Sending data larger than the MTU will result in the data being split into chunks of size MTU before transmission. Given a specific link speed and packet size, the packet rate is given by

rate [packets per second] = ls / (8 · (ps + ifg))

where ls is the link speed in b/s, ps the packet size in bytes and ifg the inter frame gap in bytes. Thus for a 10 Gb/s Ethernet link, the packet rate for 64 byte packets is 14.88 M packets per second (pps), as shown in Table 1.

Table 1. Packet rates as a function of packet size for 10 Gb/s Ethernet

User data size [B]    1      18     82    210   466   978   1472  8972
Packet size [B]       64     64     128   256   512   1024  1518  9018
Overhead [%]          98.8   78.6   44.6  23.9  12.4  5.5   4.3   0.7
Frame rate [Mpps]     14.88  14.88  8.45  4.53  2.35  1.20  0.81  0.14
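As a cross-check of the formula, the following minimal program (not part of the paper's software) reproduces the frame rate row of Table 1:

#include <cstdio>

int main() {
  const double link_bps = 10e9;   // ls: link speed in bits per second
  const double ifg = 20.0;        // inter frame gap in bytes
  const int packet_sizes[] = {64, 128, 256, 512, 1024, 1518, 9018};
  for (int ps : packet_sizes) {
    double pps = link_bps / (8.0 * (ps + ifg));   // rate = ls / (8 * (ps + ifg))
    std::printf("%5d B -> %5.2f Mpps\n", ps, pps / 1e6);
  }
  return 0;
}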
Packets arriving at a data acquisition system are subject to a nearly constant per-packet processing overhead. This is due to interrupt handling, context switching, checksum validations and header processing. At almost 15 M packets per second this processing alone can consume most of the available CPU resources. In order to achieve maximum performance, data from the electronics readout should be bundled into jumbo frames if at all possible. Using the maximum Ethernet packet size of 9018 bytes reduces the per-packet overhead by a factor of 100. This does, however, come at the cost of larger latency: the transmission time of 64 bytes + IFG is 67 ns, whereas for 9018 bytes + IFG it is about 7230 ns. For applications sensitive to latency, a tradeoff must be made between low packet rates and low latency.

Not all transmitted data is of interest to the receiver and some of it can be considered overhead; packet headers are such an example. The Ethernet, IP and UDP headers are always present and take up a total of 46 bytes, as shown in Figure 1 (bottom). The utilization of an Ethernet link can be calculated as

U = d / (d + h + ifg + pad)

where U is the link utilization, d the user data size, h the 46 bytes of protocol headers, ifg the inter frame gap and pad the padding mentioned earlier. For user data larger than 18 bytes no padding is applied. This means that for small user payloads the overhead can be significant, making it impossible to achieve high throughput. For example, transmitting a 32 bit counter over UDP will take up 84 bytes on the wire (20 bytes IFG + 64 bytes for a minimum Ethernet frame) and the overhead will account for approx. 95% of the available bandwidth. In contrast, when sending 8972 bytes of user data the overhead is as low as 0.7%.

3.2 Kernel and socket buffers

A UDP packet can be dropped in any part of the communications chain: the sender, the receiver, and intermediate systems such as routers, firewalls, switches, load balancers, etc. This makes it difficult in general to rely on UDP for high speed communications. However, for simple network topologies such as the ones found in detector readout systems it is possible to achieve very reliable UDP communications. When, for example, the system comprises two hosts (sender and receiver) connected via a switch of high quality, packet loss is mainly caused by the Ethernet NIC transmit queue and the socket receive buffer size. Fortunately these can be optimized. The main parameters for controlling socket buffers are rmem_max and wmem_max. The former limits the size of the UDP socket receive buffer, whereas the latter limits the size of the UDP socket transmit buffer. To change these values from an application, use setsockopt(), for example

int buffer = 4000000;
setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buffer, sizeof(buffer));
setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buffer, sizeof(buffer));
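On Linux the kernel doubles the requested buffer size to account for bookkeeping overhead and caps it at the rmem_max/wmem_max limits, so reading the value back is a quick way to verify that the settings in Section B took effect. A minimal sketch:

#include <cstdio>
#include <sys/socket.h>

// Print the effective receive buffer size of socket s. Linux reports the
// doubled, possibly capped value, so a result far below the requested size
// indicates that net.core.rmem_max is still too small.
void print_rcvbuf(int s) {
  int actual = 0;
  socklen_t len = sizeof(actual);
  if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
    std::printf("effective receive buffer: %d bytes\n", actual);
}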
In addition, there is an internal queue for packet reception whose size (in packets) is named netdev_max_backlog, and a network interface parameter, txqueuelen, which were also adjusted. The default values of these parameters on Linux are not optimized for high speed data links such as 10 Gb/s Ethernet, so for this investigation the following settings were used:

net.core.rmem_max=12582912
net.core.wmem_max=12582912
net.core.netdev_max_backlog=5000
txqueuelen 10000
These values have largely been determined by experimentation. We also configured the systems with an MTU of 9000, allowing user payloads of up to 8972 bytes when taking into account that IP and UDP headers are also transmitted.

3.3 Core locality
Modern CPUs rely heavily on cache memories to achieve performance. This holds for both instruction and data access. Xeon E5 processors have three levels of cache; some levels are shared between instructions and data, some are dedicated. The L3 cache is shared across all cores and hyperthreads, whereas the L1 cache is only shared between two hyperthreads. The way to ensure that the transmit and receive applications always use the same caches is to 'lock' the applications to specific cores. For this we use the Linux command taskset and the pthread API function pthread_setaffinity_np(). This prevents the application processes from being moved to other cores, which would interrupt the data processing, but it does not prevent other processes from being scheduled onto the same core.
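A minimal sketch of pinning the calling thread with pthread_setaffinity_np() (Linux-specific; the wrapper name is ours):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core; returns 0 on success.
int pin_to_core(int coreid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(coreid, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

The same effect can be obtained for a whole process from the shell, e.g. taskset -c coreid ./udptx, as shown in Section A.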
3.4 Efficient timers

The transmitter and receiver applications for this investigation periodically print the measured data speed, PER and other parameters. Initially the standard C++ chrono class timer was used (version: libstdc++.so.6), but profiling showed that significant time was spent here, enough to affect the measurements at high loads. Instead we decided to use the CPU's hardware based Time Stamp Counter (TSC). The TSC is a 64 bit counter running at the CPU clock frequency. Since processor speeds are subject to throttling, the TSC cannot be directly relied upon to measure time. In this investigation time checking is a two-step process: first we estimate when it is time to do the periodic update based on the inaccurate TSC value; then we use the more expensive C++ chrono functions to calculate the elapsed time used in the rate calculations. An example of this is shown in the source code, which is publicly available; see Section A for instructions on how to obtain it.
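A minimal sketch of this two-step scheme, assuming an x86 CPU and a GCC or Clang toolchain (names and the calibration constant are illustrative, not taken from the EFU source):

#include <chrono>
#include <cstdint>
#include <x86intrin.h>

// Cheap, approximate tick count; not throttle-safe, so only used as a hint.
static inline uint64_t tsc() { return __rdtsc(); }

// Step 1: decide from the TSC whether a periodic report is (roughly) due.
// tsc_per_period is a rough calibration, e.g. nominal CPU Hz * report period.
bool report_due(uint64_t start, uint64_t tsc_per_period) {
  return tsc() - start >= tsc_per_period;
}

// Step 2: only when a report is due, pay for an accurate clock read.
double elapsed_seconds(std::chrono::steady_clock::time_point t0) {
  using namespace std::chrono;
  return duration<double>(steady_clock::now() - t0).count();
}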
4 Experimental setup

The experimental configuration is shown in Figure 2. It consists of two hosts, one acting as a UDP data generator and the other as a UDP receiver. The hosts are HPE ProLiant DL360 Gen9 servers connected to a 10 Gb/s Ethernet switch using short (2 m) single mode fiber cables. The switch is a HP E5406 equipped with a J9538A 8-port SFP+ module. The server specifications are shown in Table 2. Except for processor internals, the servers are equipped with identical hardware.
Figure 2. Experimental setup.
The data generator is a small C++ program using BSD sockets, specifically the sendto() system call, for transmission of UDP data. The data receiver is based on a DAQ and event formation system developed at ESS as a prototype. The system, named the Event Formation Unit (EFU), supports loadable processing pipelines; a special UDP 'instrument' pipeline was created for the purpose of these tests. Both the generator and the receiver use setsockopt() to adjust transmit and receive buffer sizes. Sequence numbers are embedded in the user payload by the transmitter, allowing the receiver to detect packet loss and hence to calculate packet error ratios; a sketch of this scheme is shown after the procedure list below. Both the transmitting and receiving applications were locked to a specific processor core using the taskset command and the pthread_setaffinity_np() function. The measured user payload data rates were calculated using a combination of fast timestamp counters and microsecond counters from the C++ chrono class. Care was taken not to run other programs that might adversely affect performance while performing the experiments. CPU usage was calculated from the /proc/stat pseudofile, as also done in [9].

Table 2. Hardware components for the testbed

Motherboard: HPE ProLiant DL360 Gen9
Processor type (receiver): two 10-core Intel Xeon E5-2650v3 CPUs @ 2.30 GHz
Processor type (generator): one 6-core Intel Xeon E5-2620v3 CPU @ 2.40 GHz
RAM: 64 GB (DDR4), 4 x 16 GB DIMM, 2133 MHz
NIC: dual port Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet
Hard disk: internal SSD drive (120 GB) for local installation of CentOS 7.1.1503
Linux kernel: 3.10.0-229.7.2.el7.x86_64

A measurement series typically consisted of the following steps:

1. Start receiver
2. Start transmitter with specified packet size
3. Record packet error ratios (PER) and data rates
4. Stop transmitter and receiver after 400 GB

The above steps were then repeated for measurements of CPU usage using /proc/stat averaged over 10 second intervals. A series of measurements of speed, packet error ratios and CPU usage were made as a function of user data size, for the reasons discussed in Section 3.1.
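The sequence number scheme can be sketched as follows; the payload layout and counter logic are illustrative assumptions, not the EFU's actual format.

#include <cstdint>

// Hypothetical payload layout: the transmitter writes a monotonically
// increasing 64 bit sequence number into each packet.
struct LossCounter {
  uint64_t next_expected = 0;
  uint64_t received = 0;
  uint64_t lost = 0;

  void on_packet(uint64_t seq) {
    if (received > 0 && seq > next_expected)
      lost += seq - next_expected;   // a gap in sequence numbers means drops
    next_expected = seq + 1;
    ++received;
  }

  double per() const {               // packet error ratio
    uint64_t total = received + lost;
    return total ? double(lost) / double(total) : 0.0;
  }
};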
4.1 Limitations

The current experiments are subject to some limitations. We do not, however, believe that these pose any significant problems for the evaluation of the results. The main limitations are described below.
Multi user issues:
The servers used for the tests are multi user systems in a shared integration laboratory. Care was taken to ensure that other users were not running applications at the same time, to avoid competition for CPU, memory and network resources. However, a number of standard daemon processes were running in the background, some of which trigger the transmission of data and some of which are triggered by packet reception.
Measuring affects performance:
Several configuration, performance and debugging tools need access to kernel or driver data structures. Examples we encountered are netstat, ethtool and dropwatch. The use of these tools can cause additional packet drops when running at high system loads, so they were not run while measuring packet losses.

Packet reordering:
The test application is unable to detect misordered packets. Packet reordering, however, is highly unlikely in the current setup, but would be falsely reported as packet loss.
Packet checksum errors:
The NICs verify the Ethernet and IP checksums in hardware. Thus packets with wrong checksums will not be delivered to the application and will subsequently be reported as packet loss. For the purpose of this study this is the desired behavior.
5 Results

The performance results cover user data speed, packet error ratios and CPU load. These topics are covered in the following sections.
5.1 User data speed

The result of the measurements of achievable user data speeds is shown in Figure 3 (a). The figure shows both the measured and the theoretical maximum speed. For packets with user data sizes larger than 2000 bytes the achieved rates match the theoretical maximum. At smaller data sizes, however, the performance gap increases rapidly. It is clear that either the transmitter or the receiver is unable to cope with the increasing load. This is mainly due to the higher packet arrival rates occurring at smaller packet sizes: the higher rates increase the per-packet overhead as well as the number of interrupts and system calls. At the maximum data size of 8972 bytes the CPU load on the receiver was 20%.
5.2 Packet error ratios

The achieved packet error ratios in this experiment are shown in Figure 3 (b), which also shows the corresponding values obtained using the default system parameters. The raw measurements for the achieved values are listed in Table 3. It is observed that the packet error ratio depends on the size of the transmitted data. This dependency is mainly caused by the per-packet overhead introduced by increasing packet rates with decreasing size. The onset of packet loss coincides with the onset of the deviation of the observed speed from the theoretical maximum, suggesting a common cause. No packet loss was observed for data larger than 2200 bytes. When packet loss sets in at lower data sizes, the performance degrades rapidly: in the region from 2000 to 1700 bytes the PER increases by more than four orders of magnitude.

Table 3. Packet error ratios as a function of user data size. Measurements were taken at user data sizes of 64, 128, 256, 472, 772, 1000, 1472, 1700, 1800, 1900, 2000, 2200, 2972, 4472, 5972 and 8972 bytes; the PER is nonzero below 2200 bytes and zero at 2200 bytes and above. See Figure 3 (b) for the measured values.

Figure 3. Performance measurements. a) User data speed. b) Packet Error Ratio. c) CPU Load. Note that for the optimized values the PER is zero for user data larger than or equal to 2200 bytes (solid line).

5.3 CPU load

The CPU load as a function of user data size is shown in Figure 3 (c). The observation for both transmitter and receiver is that the CPU load increases with decreasing user data size. When the transmitter reaches 100%, the receiver is slightly less busy at 84%. There is a clear cut-off value, corresponding to packet loss and deviations from the theoretical maximum speed, around user data sizes of 2000 bytes. The measured CPU loads indicate that transmission is the bottleneck at small data sizes (high packet rates), and that most CPU cycles are spent as system load, as also reported by [9]. But the comparisons differ both qualitatively and quantitatively upon closer scrutiny. For example, in this study we find the total CPU load for the receiver (system + user) to be 20% for user data sizes of 8972 bytes, which is much lower than reported earlier. On the other hand, we observe a sharp increase in soft IRQ CPU usage from 0% to 100% over a narrow region, which was not observed previously. We also observe a local minimum in Tx CPU load around 2000 bytes, followed by a rapid increase at lower data sizes.
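For reference, the /proc/stat sampling mentioned in Section 4 can be sketched as below. The parsing is simplified to the aggregate cpu line; a real implementation is more defensive.

#include <cstdio>

// Sample the aggregate "cpu" line of /proc/stat:
//   cpu user nice system idle iowait irq softirq ...
// CPU load over an interval is delta(busy) / delta(total).
bool read_cpu_times(unsigned long long &busy, unsigned long long &total) {
  FILE *f = std::fopen("/proc/stat", "r");
  if (!f)
    return false;
  unsigned long long user, nice, system, idle, iowait, irq, softirq;
  int n = std::fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                      &user, &nice, &system, &idle, &iowait, &irq, &softirq);
  std::fclose(f);
  if (n != 7)
    return false;
  busy = user + nice + system + irq + softirq;
  total = busy + idle + iowait;
  return true;
}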
6 Conclusions

Measurements of data rates and packet error ratios for UDP based communications at 10 Gb/s have been presented. The data rates were achieved using standard hardware and software. No modifications were made to the kernel network stack, but some standard Linux commands were used to optimize the behavior of the system. The main change was increasing the network buffers for UDP communications from a small default value of 212 kB to 12 MB. In addition, packet error ratios were measured. The measurements show that it is possible to achieve zero packet error ratios at 10 Gb/s, but that this requires the use of large Ethernet packets (jumbo frames), preferably as large as 9018 bytes. Thus the experiments have shown that it is feasible to create a reliable UDP based data acquisition system supporting readout data at 10 Gb/s.

This study supplements independent measurements done earlier [9] and reveals differences in performance across different platforms. The observed differences are likely to be caused by differences in CPU generations, Ethernet NIC capabilities and Linux kernel versions. These differences were not the focus of our study and have not been investigated further. But they do indicate that some performance numbers are difficult to compare directly across setups. They also provide a strong hint to DAQ developers: when upgrading hardware or kernel versions in a Linux based DAQ system, performance tests should be done to ensure that specifications are still met.

There are several ways to improve performance to achieve 10 Gb/s with smaller packet sizes, but the complexity increases. For example, it is possible to send and receive multiple messages using a single system call, such as sendmmsg() and recvmmsg(), which reduces the number of system calls and should improve performance (a sketch follows below). It is also possible to use multiple cores for the receiver instead of only one as in this test; this adds the complexity of having to distribute packets across cores in case it cannot be done automatically. One method for automatic load distribution is Receive Side Scaling (RSS). However, this requires the transmitter to use several different source ports in the UDP packets instead of the single port currently used, which may require changes to the readout system. It is also possible to move network processing away from the kernel into user space, avoiding context switches, and to change from interrupt driven reception to polling. These approaches are used in the Intel Data Plane Development Kit (DPDK) software packet processing framework.
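As an illustration of the batching suggestion above, a receive loop based on recvmmsg() might look as follows (Linux-specific; buffer sizes and batch length are arbitrary example values):

#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>

constexpr int BATCH = 32;     // datagrams per system call
constexpr int MAXSZ = 9000;   // fits the largest jumbo frame payload

// Receive up to BATCH datagrams with one system call.
int receive_batch(int sock, char bufs[BATCH][MAXSZ]) {
  mmsghdr msgs[BATCH];
  iovec iovecs[BATCH];
  std::memset(msgs, 0, sizeof(msgs));
  for (int i = 0; i < BATCH; i++) {
    iovecs[i].iov_base = bufs[i];
    iovecs[i].iov_len = MAXSZ;
    msgs[i].msg_hdr.msg_iov = &iovecs[i];
    msgs[i].msg_hdr.msg_iovlen = 1;
  }
  // Returns the number of datagrams received (up to BATCH), or -1 on error.
  return recvmmsg(sock, msgs, BATCH, 0, nullptr);
}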
A Source code
The software for this project is released under a BSD license and is freely available on GitHub at https://github.com/ess-dmsc/event-formation-unit.git. To build the programs used for these experiments, complete the following steps. To build and start the producer:

> git clone https://github.com/ess-dmsc/event-formation-unit
> cd event-formation-unit/udp
> make
> taskset -c coreid ./udptx -i ipaddress

To build and start the receiver:

> git clone https://github.com/ess-dmsc/event-formation-unit
> mkdir build
> cd build
> cmake ..
> make
> ./efu2 -d udp -c coreid
The central source files for this paper are udp/udptx.cpp for the generator and prototype2/udp/udp.cpp for the receiver. The programs have been demonstrated to build and run on Mac OS X, Ubuntu 16 and CentOS 7.1. However, some additional libraries need to be installed, such as librdkafka and Google FlatBuffers.
B System configuration
The following commands were used (performed as superuser) to change the system parameters on CentOS. The examples below modify the network interface eno49; this should be changed to match the name of the interface on the actual system.

> sysctl -w net.core.rmem_max=12582912
> sysctl -w net.core.wmem_max=12582912
> sysctl -w net.core.netdev_max_backlog=5000
> ifconfig eno49 mtu 9000 txqueuelen 10000 up
Acknowledgments
This work is funded by the EU Horizon 2020 framework, BrightnESS project 676548. We thank Sarah Ruepp, associate professor at DTU FOTONIK, and Irina Stefanescu, Detector Scientist at ESS, for comments that greatly improved the manuscript.

References

[1] European Spallation Source ERIC, http://europeanspallationsource.se/.
[2] T. Gahl et al., Hardware Aspects, Modularity and Integration of an Event Mode Data Acquisition and Instrument Control for the European Spallation Source (ESS), arXiv:1507.01838v1.
[3] A. Khaplanov et al., Multi-Grid detector for neutron spectroscopy: results obtained on time-of-flight spectrometer CNCS, JINST (2017) P04030.
[4] I. Stefanescu et al., Neutron detectors for the ESS diffractometers, JINST (2017) P01019.
[5] F. Piscitelli et al., The Multi-Blade Boron-10-based Neutron Detector for high intensity Neutron Reflectometry at ESS, arXiv:1701.07623v1.
[6] J. Postel, User Datagram Protocol, IETF, https://tools.ietf.org/html/rfc768.
[7] S. Martoiu, H. Muller and J. Toledo, Front-end electronics for the Scalable Readout System of RD51, IEEE Nuclear Science Symposium Conference Record (2011) 2036.
[8] R. Frazier, G. Illes, D. Newbold and A. Rose, Software and firmware for controlling CMS trigger and readout hardware via gigabit Ethernet, Physics Procedia (2012) 1892-1899.
[9] M. Bencivenni et al., Performance of 10 Gigabit Ethernet Using Commodity Hardware, IEEE Trans. Nucl. Sci. (2010) 630-641.
[10] D. Bortolotti et al., Comparison of UDP Transmission Performance Between IP-Over-InfiniBand and 10-Gigabit Ethernet, IEEE Trans. Nucl. Sci. (2011) 1606-1612.
[11] P. Födisch, B. Lange, J. Sandmann, A. Büchner, W. Enghardt and P. Kaever, A synchronous Gigabit Ethernet protocol stack for high-throughput UDP/IP applications, JINST (2016).
[12] IEEE 802 LAN/MAN Standards Committee, IEEE.
[13] Request For Comments, IETF.