Fast Entropy Coding for ALICE Run 3
Michael Lettrich,∗ for the ALICE collaboration
CERN, Geneva, Switzerland, and Technische Universität München
E-mail: [email protected]
In LHC Run 3, the upgraded ALICE detector will record Pb-Pb collisions at a rate of 50 kHz using continuous readout. The resulting stream of raw data at 3.5 TB/s has to be processed with a set of lossy and lossless compression and data reduction techniques to a storage data rate of 90 GB/s while preserving relevant data for physics analysis. This contribution presents a custom lossless data compression scheme based on entropy coding as the final component in the data reduction chain, which has to compress the data rate from 300 GB/s to 90 GB/s. A flexible, multi-process architecture for the data compression scheme is proposed that seamlessly interfaces with the data reduction algorithms of earlier stages and allows the use of parallel processing in order to keep the required firm real-time guarantees of the system. The data processed inside the compression process have a structure that allows the use of an rANS entropy coder with more resource-efficient static distribution tables. Extensions to the rANS entropy coder are introduced to work efficiently with these static distribution tables and with large but sparse source alphabets consisting of up to 25 Bit per symbol. Preliminary performance results show compliance with the firm real-time requirements while offering close-to-optimal data compression.

∗Speaker

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
1. Introduction
ALICE (A Large Ion Collider Experiment) [1] is a heavy-ion collision detector at the LHC (Large Hadron Collider) [2] at CERN, built to study the physics of strongly interacting matter. Throughout the Long Shutdown 2 (LS2) of the LHC, the ALICE detector receives a substantial upgrade [3] and will record Pb-Pb collisions at a rate of 50 kHz using continuous readout with improved tracking precision in the upcoming Run 3 and Run 4 of the LHC. The resulting raw data rate of ∼3.5 TB/s has to be compressed to a storage data rate of ∼90 GB/s. This is achieved by the ALICE Online-Offline (O²) software [4] via a sequence of compression and data reduction steps without affecting physics. The final stage in this chain is a data compression scheme that provides a lossless, space-efficient representation of the input data suitable for permanent storage.

General purpose compression schemes such as gzip/deflate [5] and Zstandard [6] are designed to provide good compression without prior knowledge of the processed data. Compression schemes that take the structure of the data into account, however, can be significantly more efficient, as is shown e.g. by the Draco 3D data compression scheme [7] for 3D geometries or by purpose-built compression schemes for data acquisition systems (DAQ) [8]. Therefore, ALICE in LHC Run 2 also used a custom compression scheme based on the Huffman entropy coder [9]. However, with a new approach to data taking and processing during LHC Run 3, and considering technological advances in compression algorithms, a completely new compression scheme has to be developed for Run 3.

The purpose of this contribution is thus to outline the main components of a custom data compression scheme for the ALICE detector in LHC Run 3. It describes the strategy used to compress the data from previous stages using rANS, a state-of-the-art entropy coder, and the adaptations to rANS required to allow fast and close-to-optimal entropy compression of ALICE Run 3 data.
2. Choice of Compression Algorithm
Data taking at 50 kHz continuous readout results in a stream of 3.5 TB/s, evenly split into time frames (TF) that are distributed over the Event Processing Nodes (EPN) such that each node processes one TF at a time under firm real-time requirements. The result of zero suppression and lossy data reduction is a flat structure of integer arrays (SoA) which has to be compressed from ∼300 GB/s to ∼90 GB/s on the same EPN before being written to permanent storage as a compressed time frame (CTF) [4]. Each array inside an SoA has a defined value range of 4–25 Bits per value with additional padding, and its own distribution of values. The length of the individual arrays, however, is variable and depends on the amount of information extracted from a raw time frame.

There are two major classes of widely used general purpose compression algorithms: dictionary compression and entropy compression. Both interpret the source data of a message m as a concatenation of symbols s_i from a finite alphabet A, but rely on different concepts. Dictionary compression replaces reoccurring sequences of symbols by a reference to a dictionary that is constructed by the algorithm on the fly. This principle is e.g. implemented in the LZ77, lzma and lz4 algorithms [10]. Entropy coders, on the other hand, compress data based on the distribution of symbols in a message via a coding function C that transforms source symbols into a representation where less probable symbols use more bits than highly probable symbols [10]. Examples of entropy coders are Huffman coding [11] and Asymmetric Numeral Systems coding (ANS) [12, 13].

General purpose compression schemes such as deflate (gzip) [5] or the newer Zstandard [6] combine both concepts by applying entropy compression on dictionary-compressed data. The compression achieved by these schemes on simulated Run 3 data, however, was not satisfactory. It is highly likely that the probability of reoccurring patterns is small for our large alphabets of up to 2^25 unique symbols, so the dictionary compression is not effective; the entropy compression step in these schemes, on the other hand, cannot be adjusted sufficiently to our input data.
For entropy coders, compression performance does not depend on the size of the source alphabet A or on reoccurring patterns, but rather on a non-uniform distribution of source symbols. Therefore a plain entropy coder is the best choice for compression of ALICE Run 3 data.

The most suitable entropy coding algorithm for ALICE Run 3 data was selected in a study [14] on simulated detector data of the ALICE time projection chamber (TPC). Evaluating compression rate and bandwidth as well as the ability to work with 25 Bit symbol alphabets, the rANS entropy coder, a variant of ANS, showed the best and most consistent results across the input data. Given pre-calculated distribution tables for all arrays, a prototype rANS implementation in C++ managed to compress the contents of a SoA practically down to the information-theoretic entropy bound H [15], achieving a compression factor of 2 at an average bandwidth of 600 MB/s on commodity hardware. rANS was therefore selected for further investigation. Since no universal library implementation of the algorithm exists, however, an ALICE-specific implementation is required.
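The entropy bound H [15] that the prototype approached can be estimated cheaply per array. The following sketch computes it in bits per symbol; the function name is illustrative and not part of the ALICE code base.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// Shannon entropy in bits per symbol: H = -sum_s p(s) * log2(p(s)).
// No lossless coder can compress below H * messageLength bits on average.
double shannonEntropy(const std::vector<uint32_t>& message) {
    std::map<uint32_t, uint64_t> counts;
    for (uint32_t s : message) ++counts[s];
    const double n = static_cast<double>(message.size());
    double h = 0.0;
    for (const auto& [sym, count] : counts) {
        const double p = static_cast<double>(count) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

Comparing the achieved bits per symbol of an encoder against this value is how "close-to-optimal" compression claims can be quantified.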
3. Entropy Coding Strategy
The raw time frame is handled on the EPN by the ALICE O² data processing layer (DPL) [16], a distributed, multi-process framework that allows connecting components via message passing. The SoAs constituting the CTF are produced in parallel by sets of multi-stage processes that compress the raw data of one or multiple sub-detectors. Depending on the algorithms and the amount of data, the latency for each SoA differs. To prevent buffering large amounts of data in shared memory, a distributed approach is chosen in which each SoA passes through its specific entropy coder before all fragments are merged into a final CTF that is sent to permanent storage (see Figure 1). The distributed approach furthermore decouples SoA-specific pre-processing and entropy coding tasks from the final merging of uniformly structured blocks of encoded data.

The compression achievable by an entropy coder highly depends on how closely the distribution table used by the coder matches the underlying distribution of the input data. Compressing each array in the SoA individually, respecting its value range and symbol distribution, yields the best results. Building the exact symbol distribution table for each array in each time frame dynamically is however unfeasible, as it would require a full pass over the input data before encoding can take place in a second pass, which is too expensive in our setting. Additionally, the information about the symbol distributions needs to be stored as metadata for decoding; the resulting increase in file size for source alphabets spanning a 25 Bit value range is not acceptable. However, since a time frame contains data of a large number of collisions, the distribution of the raw signals will not change unless the data-taking conditions change, which only happens over a time span of many time frames.
This allows pre-calculating a distribution table for each individual array in a SoA, respecting the specific value range and symbol distribution of the array, and reusing the distribution table across time frames without a heavy penalty on compression rate, which was verified using simulated detector data. In addition, the tables can be saved centrally to the ALICE Condition and Calibration Data Base (CCDB) [4] and fetched for decompression. This avoids the large storage overhead caused by including distribution tables with each CTF file.

Figure 1: UML Activity diagram of the parallel, distributed processing of TF to CTF. Data from the TF is processed in a multi-stage compression and data reduction chain for each sub-detector, producing a SoA which is entropy coded individually and merged into the CTF.
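To illustrate what building such a pre-calculated table involves, the sketch below normalizes sampled symbol counts so that the frequencies sum to a power of two, a form range-based coders such as rANS require. The function name, the rounding policy, and the drift compensation are assumptions for illustration, not the actual O² implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Normalize sampled symbol counts to a total of 2^scaleBits,
// keeping every observed symbol encodable (frequency >= 1).
std::map<uint32_t, uint32_t> makeDistributionTable(const std::vector<uint32_t>& sample,
                                                   uint32_t scaleBits) {
    std::map<uint32_t, uint64_t> counts;
    for (uint32_t s : sample) ++counts[s];

    const int64_t target = int64_t(1) << scaleBits;
    const int64_t total = static_cast<int64_t>(sample.size());
    std::map<uint32_t, uint32_t> table;
    int64_t assigned = 0;
    for (const auto& [sym, c] : counts) {
        // Scale to the target total; clamp to >= 1 so Pr[s] > 0 holds.
        const int64_t f = std::max<int64_t>(1, static_cast<int64_t>(c) * target / total);
        table[sym] = static_cast<uint32_t>(f);
        assigned += f;
    }
    // Absorb integer rounding drift into the most frequent symbol.
    auto most = std::max_element(table.begin(), table.end(),
                                 [](const auto& a, const auto& b) { return a.second < b.second; });
    most->second = static_cast<uint32_t>(static_cast<int64_t>(most->second) + (target - assigned));
    return table;
}
```

A production table must also decide which near-zero-probability symbols to drop as incompressible (Section 4) rather than clamping them all to frequency 1.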
4. Efficient Custom rANS Entropy Coder Implementation

rANS is part of a family of variable range entropy coders called Asymmetric Numeral Systems (ANS) [12, 13]. Given a message m consisting of symbols s_i from a finite alphabet A and a probability distribution f, an arithmetic coding function C_f : N × A → N encodes all symbols s_i ∈ m into a single integer x ∈ N called the state variable. Starting from an empty initial state x_0, symbol s_i is encoded onto a state x_{i−1} containing the encoded information of all symbols s_1, . . . , s_{i−1}. This leads to a new state x_i = C_f(x_{i−1}, s_i) > x_{i−1} that grows inversely proportional to the probability of the encoded symbol, i.e. x_i ≈ x_{i−1}/Pr[s_i]. Renormalization keeps x constrained within an interval I that can be handled efficiently by a computer: bits are streamed out if the upper limit is surpassed during encoding and read back in when the lower limit is surpassed during decoding. The state variable x behaves like a last-in-first-out (LIFO) stack, which requires the decoder to always exactly invert the encoding step, D_f(C_f(x_{i−1}, s_i)) = (x_{i−1}, s_i), to recover the input. The generalization of this idea is that an arbitrary transformation t can be applied to the state x during encoding as long as it is inverted by t^{−1} during decoding, which also allows nesting, i.e. t_n^{−1}(. . . t_1^{−1}(D_f(C_f(t_1(. . . t_n(x_{i−1}, s_i)))))) = (x_{i−1}, s_i). Efficient implementations on pipelined, SIMD-capable CPUs or GPGPUs [17] rely on these transformations to enable instruction-level parallelism.

The ALICE rANS implementation additionally uses a transformation function t for handling static distribution tables. With larger alphabets, the chance increases of encountering infrequent symbols with a probability close to zero. The pre-calculated distribution table thus can contain Pr[s_r] = 0 for some symbols s_r, which is incompatible with the rANS algorithm, as it strictly requires Pr[s_i] > 0 ∀ s_i ∈ A.
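To make the coding function C_f and its inverse D_f concrete, the following is a deliberately minimal single-state rANS sketch. It omits the renormalization and streaming described above, so the 64-bit state only supports short messages; it is an illustration of the algorithm, not the ALICE implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal rANS over a small alphabet whose frequencies sum to M = 2^scaleBits.
// No renormalization: the state simply grows, so only short messages fit in 64 bits.
struct Rans {
    static constexpr uint32_t scaleBits = 4;  // M = 16
    static constexpr uint64_t M = 1u << scaleBits;
    std::vector<uint32_t> freq;   // f[s], must sum to M
    std::vector<uint32_t> cumul;  // cumul[s] = f[0] + ... + f[s-1]

    explicit Rans(std::vector<uint32_t> f) : freq(std::move(f)), cumul(freq.size() + 1, 0) {
        for (std::size_t s = 0; s < freq.size(); ++s) cumul[s + 1] = cumul[s] + freq[s];
    }
    // C_f(x, s): the state grows by roughly a factor 1/Pr[s] = M/f[s].
    uint64_t encode(uint64_t x, uint32_t s) const {
        return (x / freq[s]) * M + (x % freq[s]) + cumul[s];
    }
    // D_f(x): exactly inverts the last encode step (LIFO order).
    std::pair<uint64_t, uint32_t> decode(uint64_t x) const {
        const uint64_t slot = x % M;
        uint32_t s = 0;
        while (cumul[s + 1] <= slot) ++s;  // find the symbol owning this slot
        return {freq[s] * (x / M) + slot - cumul[s], s};
    }
};
```

Because the state is a LIFO stack, decoding emits the message in reverse; a real coder either reverses beforehand or encodes back to front.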
Incompressible symbols can be encoded by introducing a functional symbol r into A with Pr[r] > 0. If a symbol s_i is marked as incompressible in the distribution table, a transformation replaces s_i with r and passes it to the encoder. The original symbol s_i is pushed onto a stack which is appended as a special block at the end of the encoded data. If the functional symbol r is encountered during decoding, it is replaced with the top element of the stack saved alongside the data. Algorithm 1 and Algorithm 2 formally describe the encoding and decoding of incompressible symbols. Run-length encoding (RLE) [10] is implemented as a transformation in a similar way.

rANS relies on some costly arithmetic operations that depend on the probability of the encoded symbol. A pre-calculated lookup table (LUT) can be used to replace these recurring arithmetics with table lookups. For large alphabets with up to 2^25 symbols, these LUTs no longer fit into the CPU cache, reducing the performance benefits. Fortunately, many of the distribution tables for these large alphabets are sparse, containing over 90% incompressible symbols. Using a LUT with a single indirection instead of a directly indexed lookup allows the implementation of a more efficient data structure: referencing all incompressible symbols to the special functional symbol r shrinks sparse LUTs by up to a factor of 16, preventing cache eviction. The probability of a symbol directly translates to its expected lookup frequency, so sorting symbols in storage by their probability measurably increases the probability of cache hits in higher-level caches. Since the LUTs are reused for many time frames, setup costs occur only during initialization.

Algorithm 1: Encoder with incompressible symbols
  if Pr[s_i] > 0 then
    C(x_i, s_i);
  else
    incompressible.push(s_i);
    C(x_i, r);
  end

Algorithm 2: Decoder with incompressible symbols
  s_i ← D(x_i);
  if s_i == r then
    return incompressible.pop();
  else
    return s_i;
  end
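The transformation pair t / t^{−1} of Algorithms 1 and 2 can be sketched in isolation, with the entropy coder stubbed out so that only the escape handling is visible. The name EscapeTransform and the pass-through coder are hypothetical, chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Symbols absent from the static distribution table are replaced by the
// functional symbol r before entropy coding and stored verbatim on a side
// stack. The entropy coder itself is stubbed out (symbols pass through).
struct EscapeTransform {
    uint32_t escape;                      // the functional symbol r
    std::unordered_set<uint32_t> table;   // symbols with Pr[s] > 0
    std::vector<uint32_t> incompressible; // LIFO side stack

    // Encoder-side transform t: map unknown symbols to r, remember the original.
    uint32_t onEncode(uint32_t s) {
        if (table.count(s)) return s;
        incompressible.push_back(s);
        return escape;
    }
    // Decoder-side inverse t^-1: replace r by the stacked original.
    uint32_t onDecode(uint32_t s) {
        if (s != escape) return s;
        const uint32_t orig = incompressible.back();
        incompressible.pop_back();
        return orig;
    }
};
```

Because rANS decodes in LIFO order, the decoder naturally encounters escapes in reverse encode order, which is exactly the order in which popping the stack returns the originals.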
5. Status of the Implementation and Outlook
The entropy compression scheme for ALICE Run 3 consists of two components: a general purpose, configurable rANS entropy coding library, and an ALICE-specific component that performs compression of the SoAs and final CTF creation inside ALICE O² using the rANS library. At the time of writing, a base implementation of both components exists and most of the sub-detectors are integrated. Preliminary measurements based on simulated detector data show excellent compression of SoAs by the entropy coder, within per mill of the information-theoretic entropy limit H [15], while keeping the firm real-time requirements. For the production code, further performance improvements can be achieved with better use of pipelining, SIMD vectorization and multithreading. Optimizations in the ROOT based CTF data format can additionally decrease the overhead introduced by metadata.
6. Conclusion
The new, purpose-built compression scheme presented in this contribution allows the ALICE O² framework to effectively reduce the amount of data sent to storage. Combining the flexibility of the O² DPL with a custom implementation of an rANS entropy coder that leverages the structure of the data allows fast and quasi-optimal compression while operating within the firm real-time bounds required by the online processing for ALICE in LHC Run 3.
References

[1] K. Aamodt et al., "The ALICE experiment at the CERN LHC", JINST.
[2] L. Evans and P. Bryant, "LHC Machine", JINST.
[3] B. Abelev et al., "Upgrade of the ALICE Experiment: Letter Of Intent", J. Phys. G 41 (2014) 087001.
[4] P. Buncic, M. Krzewicki and P. Vande Vyvre, Technical Design Report for the Upgrade of the Online-Offline Computing System, Tech. rep. CERN-LHCC-2015-006, ALICE-TDR-019, Apr. 2015, https://cds.cern.ch/record/2011297.
[5] L. P. Deutsch, DEFLATE Compressed Data Format Specification version 1.3, RFC 1951, May 1996.
[6] Y. Collet and M. Kucherawy, Zstandard Compression and the application/zstd Media Type, RFC 8478, Oct. 2018.
[7] F. Galligan, Draco - 3D data compression, 2017, https://google.github.io/draco/spec/ (visited on 10/25/2020).
[8] J. Duda and G. Korcyl, Designing dedicated data compression for physics experiments within FPGA already used for data acquisition, 2015, arXiv preprint.
[9] J. Berger et al., "TPC data compression", Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment.
[10] D. Salomon, D. Bryant and G. Motta, Handbook of Data Compression, Springer London, 2010.
[11] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the IRE.
[12] J. Duda, Asymmetric numeral systems, 2009, arXiv preprint.
[13] J. Duda, Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding, 2013, arXiv preprint.
[14] M. Lettrich, "Fast and Efficient Entropy Compression of ALICE Data using ANS Coding", EPJ Web Conf. 245 (2020) 01001.
[15] C. E. Shannon, "A mathematical theory of communication", The Bell System Technical Journal.
[16] G. Eulisse et al., "Evolution of the ALICE software framework for Run 3", EPJ Web Conf. 214 (2019) 05010, ed. by A. Forti et al.
[17] F. Giesen, Interleaved entropy coders, 2014, arXiv:1402.3392 [cs.IT].