Real Time Global Tests of the ALICE High Level Trigger Data Transport Framework
B. Becker, S. Chattopadhyay, C. Cicalò, J. Cleymans, G. de Vaux, R.W. Fearick, V. Lindenstruth, M. Richter, D. Röhrich, F. Staley, T.M. Steinbeck, A. Szostak, H. Tilsner, R. Weis, Z.Z. Vilakazi
Abstract—The High Level Trigger (HLT) system of the ALICE experiment is an online event filter and trigger system designed for input bandwidths of up to 25 GB/s at event rates of up to 1 kHz. The system is designed as a scalable PC cluster comprising several hundred nodes. The transport of data in the system is handled by an object-oriented data flow framework based on the publisher-subscriber principle, designed to be fully pipelined with minimal processing overhead and communication latency in the cluster. In this paper we report the latest measurements, in which this framework was operated at five different sites over a global north-south link extending more than 10,000 km, processing a "real-time" data flow.
Index Terms—Trigger, Event Filter, Online Processing, Publisher-Subscriber, Distributed Computing
I. INTRODUCTION
For the future Large Hadron Collider (LHC) [1], [2] at CERN, four new experiments, ALICE [3], ATLAS [4], CMS [5] and LHCb [6], are currently being built. Their principal tasks are the search for the Higgs boson and physics beyond the standard model (ATLAS and CMS), CP violation (LHCb), and heavy-ion physics with a particular emphasis on quark-gluon plasma research [7]. Each of these experiments will produce large data sets that have to be handled by their trigger and data acquisition systems. In all four experiments large PC farms will handle a significant part of the generated data flow by performing event trigger, filter, and selection tasks. For the ALICE experiment these tasks are the primary function of the High Level Trigger (HLT) system [8], which is designed to comprise up to 1000 multiprocessor 19" commercial PC-type computers. All three functions of the HLT require reconstruction and pattern recognition to be performed online, analysing data streams of 25 GB/s under real-time conditions.
B. Becker and C. Cicalò are with the I.N.F.N. Sezione di Cagliari, Cittadella Universitaria di Monserrato, Casella Postale 170, 09042 Monserrato (Cagliari), Italy.
S. Chattopadhyay is with the Saha Institute, 1/AF Bidhannagar, Kolkata 700 064, India.
J. Cleymans, G. de Vaux, R. W. Fearick, A. Szostak and Z. Z. Vilakazi are with the Department of Physics, University of Cape Town, Private Bag Rondebosch, 7000, South Africa.
V. Lindenstruth, T.M. Steinbeck, H. Tilsner and R. Weis are with the Kirchhoff Institute, Ruprecht-Karls University Heidelberg, D-69120 Heidelberg, Germany.
M. Richter and D. Röhrich are with the Department of Physics, University of Bergen, Allegaten 55, N-5007 Bergen, Norway.
F. Staley is with DAPNIA/SPhN, CEA Saclay, F-91191 Gif-sur-Yvette Cedex, France.
This work is supported by the National Research Foundation (NRF 2357), South Africa; INFN-Cagliari, Italy; Commissariat à l'Energie Atomique, Saclay, France; and BMBF (06HD157I), Germany.
For the transport of the data inside the HLT system, a software framework has been developed based upon the publisher-subscriber principle [9], also known as the producer-consumer paradigm. The publisher-subscriber design is particularly suited to distributed data-driven systems where flexibility is required, for example by the need to specify a configuration only at run time. The implementation of this design in the ALICE HLT framework has already been used in a large number of tests and benchmarks as well as in beam test scenarios of detector components. Through its use of simple components, plugged together via well-defined interfaces, it provides a flexible and easily reconfigurable capability for controlling the data flow. For efficiency reasons, data copying is kept to a minimum inside a node. This is achieved by placing the data into shared memory and exchanging only descriptors to that data via named pipes. Dedicated software exists to connect components on different nodes. After the final selection and compression of the data inside the HLT, it is recorded to permanent storage for later detailed collaborative off-line analysis. The event selection by the HLT serves to reduce the overall rate of data written to permanent storage. One characteristic feature of each of the four mentioned experiments is that they consist of large, globally distributed collaborations, with several tens to more than 150 institutes and more than 1000 scientists taking part. The HLT provides a real-time filter for the data acquisition system by selecting which events to write to permanent storage, thereby achieving a substantial background suppression, depending on the particular signal and background processes (see [10] for a discussion of the data suppression capabilities of the HLT). Nonetheless, the set of data acquired during a year-long run of the experiment will exceed 1 PB.
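The descriptor-passing scheme can be illustrated with a minimal sketch (illustrative Python, not the framework's actual C++ API): event payloads stay in a shared buffer, only small (offset, size) descriptors travel from the publisher to its subscribers, and the buffer region is released once every subscriber has signalled completion.

```python
# Minimal sketch of the descriptor-passing scheme: event payloads stay
# in a shared buffer; only small (offset, size) descriptors travel from
# publisher to subscribers, and the buffer region is released once all
# subscribers are done. All names are illustrative, not the framework API.

class SharedBuffer:
    """Shared memory segment managed as a simple bump allocator
    (no overflow or wrap-around handling in this sketch)."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.free_at = 0

    def put(self, data):
        off = self.free_at
        self.mem[off:off + len(data)] = data
        self.free_at += len(data)
        return (off, len(data))          # descriptor, not a data copy

    def get(self, desc):
        off, n = desc
        return bytes(self.mem[off:off + n])

class Publisher:
    def __init__(self, buf):
        self.buf = buf
        self.subscribers = []
        self.pending = {}                # event id -> outstanding acks

    def subscribe(self, sub):
        self.subscribers.append(sub)

    def announce(self, event_id, data):
        desc = self.buf.put(data)
        self.pending[event_id] = len(self.subscribers)
        for sub in self.subscribers:
            sub.new_event(event_id, desc, self)

    def event_done(self, event_id):
        self.pending[event_id] -= 1
        if self.pending[event_id] == 0:  # all subscribers finished:
            del self.pending[event_id]   # region may be re-used

class Subscriber:
    def __init__(self, buf):
        self.buf = buf
        self.seen = []

    def new_event(self, event_id, desc, publisher):
        self.seen.append((event_id, self.buf.get(desc)))
        publisher.event_done(event_id)   # hand our reference back

buf = SharedBuffer(64)
pub = Publisher(buf)
s1, s2 = Subscriber(buf), Subscriber(buf)
pub.subscribe(s1)
pub.subscribe(s2)
pub.announce(0, b"event-0")
print(s1.seen, pub.pending)  # -> [(0, b'event-0')] {}
```

In the real framework the shared segment is operating-system shared memory and the descriptors travel through named pipes; the sketch only illustrates the zero-copy accounting.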
For the task of analysing these large data sets, preselected by the on-line HLT system, a globally distributed effort is necessary, not least because more than 10,000 processors are required to accomplish it. Each event has a size of 12.5 MB and is independent of the next, implying that parallel computing can be exploited to a very large degree in the experiment's computing model (see reference [11]). Therefore, off-line processing is planned to be performed on a computing grid, mostly at sites participating in the LHC Computing Grid (LCG) [12], exploiting the independence of the individual events. The on-line processing systems, having real-time characteristics, are typically run on a computer farm close to the experiment. The algorithms implemented in the ALICE HLT typically operate on sub-event level data, exploiting parallelism in the data to the largest degree possible in order to increase performance. The real-time, data rate and, most importantly, the reliability constraints of the on-line systems currently make it prohibitive to operate them as a grid application. In particular, it is desirable for such a critical part of the experimental apparatus as the trigger to be as close as possible to the experimental hardware and autonomous, in order to ensure the highest possible availability. However, if the on-line communication architecture is designed carefully, avoiding round-trip latencies in the communication and flow control paths, it may be possible to perform even on-line trigger and filter functionality in a widely distributed fashion, such as in a grid-like system. Besides the trigger processing, the same data transport framework could also be used to perform similar but less critical processing tasks, such as off-line processing.
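Because events are mutually independent, per-event processing parallelises trivially. A minimal sketch (illustrative Python; reconstruct() is a hypothetical stand-in for real per-event analysis, and the workers could equally be grid jobs rather than local threads):

```python
# Independent events can be farmed out to any number of workers with no
# inter-event communication. reconstruct() is a hypothetical stand-in
# for real per-event analysis; all names here are illustrative only.
from concurrent.futures import ThreadPoolExecutor

def reconstruct(event):
    """Dummy per-event processing: here just a checksum of the payload."""
    return (event["id"], sum(event["payload"]) % 256)

def process_run(events, workers=4):
    # Each event is handled in isolation, so this map could equally be
    # distributed over local cores or over grid sites.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(reconstruct, events))

run = [{"id": i, "payload": bytes([i] * 8)} for i in range(16)]
results = process_run(run)
print(len(results))  # -> 16, one result per event
```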
At this point, the presented data-driven, distributed grid processing infrastructure is being studied, amongst others, with respect to its application to the distributed radio astronomy system LOFAR, where multiple distributed sites produce data streams which have to be pre-processed individually and then combined with those of other sites. This process bears many similarities to the ALICE HLT architecture. The following article details the first successful test of the ALICE HLT on-line trigger system over distances exceeding 10,000 km in a proof-of-principle test. We begin with an overview of the software design, giving a description of the framework, its components and how configuration is managed. The actual test configuration and results are given in section III, and we give our conclusions and a summary in section IV.

II. OVERVIEW OF THE SOFTWARE
A. Data Transport Framework Design
The system architecture is based on the publisher-subscriber paradigm because this allows a dynamically determined number of data consumers to connect to the appropriate producer. This design allows dynamic reconfiguration and inherent fault tolerance, as detailed in the following sections. A further consequence of this design choice is the encapsulation of the actual processing code into individual modules, connected by the publisher-subscriber interface. In the design of this interface, particular emphasis was given to three important points: efficiency, flexibility, and fault tolerance. For a full description of this functionality of the framework, we refer the reader to [13]. Efficiency is important for the communication framework because the analysis of the event data is very compute-intensive. It is achieved by placing the data into shared memory segments by the publishing object and by passing descriptors to the subscribers via named pipes. When all subscribers have subsequently informed the publisher that they have finished processing a given event, it is released and the shared memory region can be re-used. Flexibility has to be present in the framework, as the configuration has to adapt to dynamically changing requirements of the experiment and the analysis. The primary mechanism for providing flexibility is the separation of the framework into components which can be dynamically connected in different configurations. Using the data flow components defined below, any processing hierarchy can be constructed. As the publisher-subscriber interface supports dynamic connections and disconnections at runtime, the system configuration can be adapted while it is active. Fault tolerance is also achieved using the dynamic reconfiguration capability of the communication building blocks.
For instance, it allows for the replacement of failed components during runtime, and also for the addition and/or removal of components in the data stream as required in reaction to dynamic events occurring in the system. This feature is particularly important in a distributed grid-like environment, where public networks are being used and many operational conditions are outside the control of the operators. A second major fault tolerance building block is related to the bridge components connecting different nodes, as described in more detail below. These components also have the ability to establish connections dynamically at runtime, not only for re-establishing existing connections but also for establishing new connections between nodes. This mechanism allows faulty nodes in the system to be isolated and replaced with stand-by nodes. In essence, the HLT fault tolerance architecture uses both a bottom-up and a top-down approach at certain levels. The bottom-up aspect ensures that all modules of the system are largely independent and capable of dynamic reconfiguration. The top-down aspects of the framework's functionality implement the intelligence to discover and react to any issues in the system, in a semi-automated way, and issue the appropriate commands to the fundamental HLT communication framework. The actual analysis code is not affected and remains completely independent.
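The replacement of a failed consumer can be sketched as follows (illustrative Python with hypothetical names, not the framework's API): subscribers may detach and attach at runtime, so a stand-by component can take over from a failed one without stopping the publisher.

```python
# Sketch of runtime reconfiguration: a failed subscriber is detached and
# a stand-by attached while the publisher keeps running. All class and
# method names are illustrative, not the real framework API.

class FaultTolerantPublisher:
    def __init__(self):
        self.subscribers = []

    def attach(self, sub):
        self.subscribers.append(sub)

    def detach(self, sub):
        self.subscribers.remove(sub)

    def announce(self, event):
        delivered = []
        for sub in list(self.subscribers):
            try:
                sub.process(event)
                delivered.append(sub.name)
            except RuntimeError:
                # Fault detected: isolate the failing component. In the
                # real system a supervisor would attach a stand-by node.
                self.detach(sub)
        return delivered

class Worker:
    def __init__(self, name, broken=False):
        self.name, self.broken = name, broken

    def process(self, event):
        if self.broken:
            raise RuntimeError("component failure")

pub = FaultTolerantPublisher()
pub.attach(Worker("primary", broken=True))
pub.announce("event-0")            # failure detected, primary detached
pub.attach(Worker("stand-by"))     # stand-by attached at runtime
print(pub.announce("event-1"))     # -> ['stand-by']
```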
B. Framework Components
The HLT communication framework components can be categorized into three groups:
1) Dataflow components are utility programs designed to shape the dataflow in a framework system.
2) Application components perform the actual on-line data processing and encapsulate the analysis code between a subscriber and a publisher object. There is also a number of maintenance application programs, such as data integrity and performance checkers, dummy routines, etc.
3) Application component templates provide a base from which application components may be constructed for a specific system.
Application component templates and application components exist in three variations:
• Data input source components are the points where data is inserted into a dataflow chain constructed with the framework. They access entities external to the framework and make their data available to other framework components.
• Data processing components perform the work inside a framework system. They accept data from other components, process it to produce new output data, and make this new data available again to other framework components. By chaining multiple processing components together, complex analysis processes can be performed.
• Data sink components act as the output of a dataflow chain. They accept data from other framework components and can transmit this data to entities outside of the framework.
There are five primary dataflow components contained in the framework which influence the dataflow with distinct characteristics:
• EventScatterer components accept a single stream of events as input and fan it out into multiple event streams at the output. Each event is left as is; therefore each output stream consists of only a subset of the events, corresponding to the fan-out level.
• EventGatherer components are the inverse of EventScatterers. They fan in multiple input event streams and forward each received event unchanged into their single output event stream. The EventScatterer/EventGatherer pair is used for load balancing, by fanning a data stream out to as many processing streams as are required in order to maintain the required event processing rate.
• EventMerger components also have multiple input streams and a single output stream. Unlike gatherers, they expect one specific part of an event to arrive on every input stream. The descriptors for the input data blocks of these received sub-events are then merged into a combined event with a single event descriptor, which is subsequently sent to the output stream. EventMergers also implement fault tolerance functionality in order to prevent the data flow from blocking in case one sub-event is lost or delayed due to the reconfiguration of part of the system. The EventMerger maintains lists of incomplete events, which are set aside while the processing of other, complete events continues.
• Subscriber- and PublisherBridgeHead components act together in pairs to create a transparent bridge between components on different nodes. The purpose of these so-called BridgeHead components is to provide a common interface to publishers and subscribers for processing components, without having to deal explicitly with networking code.
In the Subscriber- and PublisherBridgeHead components, network communication is handled by an abstract class library that provides all required interfaces for the communication. Implementations of these interfaces currently exist for the TCP/IP network protocol and for the SISCI API for Dolphin Scalable Coherent Interconnect (SCI) adapters, thereby supporting both socket-based streaming data and the Remote Direct Memory Access (RDMA) paradigm. The bridge components are thus independent of the underlying network functionality, so that only an implementation of the abstract API has to be provided in order to support a new network. (SCI, the Scalable Coherent Interconnect, is an IEEE standard that defines the architecture and protocols to support shared-address-space computing over a collection of processors.)
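The EventScatterer/EventGatherer load-balancing pattern can be sketched as follows (illustrative Python; the real components are separate processes exchanging descriptors, which this sketch collapses into plain function calls):

```python
# Sketch of scatterer -> parallel workers -> gatherer load balancing.
# Events are distributed round-robin, processed independently, and
# fanned back into a single output stream. Illustrative names only.

def scatter(events, n_streams):
    """Fan a single event stream out into n_streams, round-robin."""
    streams = [[] for _ in range(n_streams)]
    for i, ev in enumerate(events):
        streams[i % n_streams].append(ev)
    return streams

def gather(streams):
    """Fan multiple streams back into one; events stay unchanged.
    Order may differ from the input order, as in the real system."""
    return [ev for stream in streams for ev in stream]

events = list(range(10))
streams = scatter(events, 3)        # e.g. [[0,3,6,9], [1,4,7], [2,5,8]]
processed = [[ev * 2 for ev in s] for s in streams]   # mock processing
out = gather(processed)

assert sorted(out) == [ev * 2 for ev in events]  # nothing lost or changed
```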
C. The TaskManager
One of the important challenges in the ALICE HLT is the management of the large number of processes distributed in the cluster. It has to be ensured that all processes are started and connected in the correct order. For this purpose the TaskManager [13] has been developed to control and supervise the HLT framework. The design of the framework supports hierarchical operation with multiple levels of TaskManagers, each controlling subordinate TaskManagers, possibly running at different sites. At the lowest level, the slave TaskManagers actually control the framework components. This hierarchical configuration scheme provides several advantages over a flat hierarchy with a single process controlling all components. The TaskManager allows the system to be split up into a hierarchy of separate parts, which are easier to handle than a single large configuration. Faults occurring in the system are thus isolated where they occur, ensuring that they do not influence the system as a whole. The hierarchical approach also facilitates building a fault-tolerant system, in particular by avoiding single points of failure in the control infrastructure. Configurations for the TaskManager are stored in XML files. Amongst other items, the configuration files contain multiple sections of Python code. These are executed when a respective event occurs in the TaskManager, e.g. a state change in one of the supervised components. The code thus specifies the actions that are to be taken upon such an event, making the system very flexible. The communication with controlled sub-components is handled via a shared library, specified in the configuration files and loaded at runtime.

III. THE GLOBAL TEST
A. Global Test Motivation
The ALICE HLT data transport framework has been designed with quite a specific environment in mind: that of low-latency clustered computing centres, with dedicated bandwidth and processing hardware available locally. The functionality of the framework has indeed been tested in this environment and has been shown to satisfy the constraints imposed on it by the triggering and data taking scenarios of ALICE. Real-time on-line systems, such as the HLT, usually have high input data rates and stringent processing requirements. The data rates typically exceed the available wide area network bandwidths by up to two orders of magnitude. A further constraint of all real-time systems is the maximum available processing latency per event. However, due to its architecture, the ALICE High Level Trigger does not have any fixed latency requirements. Since the HLT is built by a collaboration with members in Bergen (Norway), Cape Town (South Africa), and Heidelberg (Germany), a globally distributed test of an HLT-like system was considered. This test served many purposes. First, it required a very high degree of flexibility in the framework in order to allow its operation at various different sites with many different installations. It had to work across various firewalls. The control of this system, while running at many independent sites behind firewalls, was very complex. On the other hand, any unknown cyclic dependency would show up as blocking during the global test. Another feature of this setup is the demonstration of a radically different way of distributed processing, based purely on global, re-routable data streams. The resources at the various centres were under the direct control of the collaboration, and were centrally controlled. The configuration of the test is described below; it was done in Heidelberg, working remotely from Cape Town.
This very strict control over the computing resources is somewhat of a deviation from the traditional idea of a federated grid of computing resources. It does not represent a requirement in principle, but was rather chosen in order to simplify the test itself. Limitations and problems of this approach should be examined in further work once a working system has been obtained. As a first step in this investigation, the test under discussion here did not implement grid authentication or resource management mechanisms. However, these features could be added in the future, using for example Virtual Private Networks (VPNs) or standard lightweight grid middleware. In order to create a full global north-south axis across the globe, as well as some east-west expansion, two further sites in Tromsø (Norway) and Dubna (Russia) were included in the setup, in addition to the listed HLT collaboration institutes.
B. Global Test Configuration
For the global test a configuration was chosen that mimics a part of the ALICE HLT processing. The setup was configured similarly to the processing for ALICE's Time Projection Chamber (TPC) [14] and DiMuon Spectrometer [15] (see Fig. 2). In order to avoid the bottleneck posed by the available network bandwidth to Cape Town, as detailed below, no actual simulated detector data was transmitted. Instead, only mock-up data objects with the same size as that expected during experimental running were sent between the different locations, and consequently no real analysis components were used. Therefore, mock-up components not performing any processing were running in the setup. At three of the sites (Bergen, Tromsø, and Dubna) the components were set up to mimic the cluster-finding on data of the TPC sectors. As an example, the setup at one of the sites is shown in Fig. 1; each patch was processed on a separate node. The output data produced at these three sites was then sent to Heidelberg, where it was merged. Further mock-up processing steps, which correspond to the tracking in the simulated TPC sector, were performed on two twin-processor nodes. On each CPU one mock-up process was active, as shown in Fig. 2. The output produced by these four components was then sent on to Cape Town. In Cape Town the TPC data stream was merged with the DiMuon data stream. The latter was generated by another processing chain, running fully in Cape Town as well, which simulated processing raw data from the DiMuon detector [15] from the cluster-finding up to the tracking. The tracked mock-up data was then merged with the received TPC data as the last step in the processing chain. During LHC data taking this component will be the location where the trigger decision is made and/or where the completely reconstructed event data could be written to permanent storage. The data flow at the different sites is shown in Fig. 3.
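The final merging step, where sub-events from the TPC and DiMuon chains are combined per event, can be sketched as follows (illustrative Python; the real EventMerger operates on block descriptors rather than on the data itself):

```python
# Sketch of EventMerger behaviour: one part of each event is expected
# on every input stream; incomplete events are set aside so the data
# flow does not block, and complete ones are forwarded immediately.
# Names are illustrative, not the real framework API.

class EventMerger:
    def __init__(self, inputs):
        self.inputs = set(inputs)     # e.g. {"TPC", "DiMuon"}
        self.incomplete = {}          # event id -> {stream: sub-event}
        self.output = []

    def receive(self, event_id, stream, sub_event):
        parts = self.incomplete.setdefault(event_id, {})
        parts[stream] = sub_event
        if set(parts) == self.inputs:             # all parts arrived:
            self.output.append((event_id, parts)) # forward merged event
            del self.incomplete[event_id]

merger = EventMerger(["TPC", "DiMuon"])
merger.receive(7, "TPC", "tpc-tracks")
# event 7 is incomplete and set aside; event 8 can still complete first
merger.receive(8, "TPC", "tpc-tracks")
merger.receive(8, "DiMuon", "dimuon-tracks")
merger.receive(7, "DiMuon", "dimuon-tracks")
print([eid for eid, _ in merger.output])  # -> [8, 7]
```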
As can be seen, one characteristic of this test is the relay nodes required by the data flow in order to traverse the firewalls and reach the actual processing nodes. For instance, in Heidelberg the cluster access node was not directly accessible from outside the institute, so that another relay step was necessary. One advantage of this setup is that inside a cluster all nodes are treated equally and are trusted; therefore only little security functionality is required inside the cluster. On these access nodes, privileged relay components were running, which do not touch their input data but only forward it to their output. Control of the system was provided by a three-level set of TaskManager processes, with the top master TaskManager running on the firewall in Heidelberg. Second-level TaskManagers were set up on the access nodes, communicating both with the top master TaskManager and with the third-level TaskManagers on the cluster nodes themselves, thereby performing a bridging function. No second-level TaskManager was required on the nodes in Dubna, these being directly accessible. On each node, the local third-level slave TaskManager was used to control the HLT framework processes and ensure their proper operation. The TaskManager hierarchy is shown in Fig. 4. A second test was run with a very similar setup, the difference being that instead of mock-up data and mock-up processing components, simulated raw data for the ALICE TPC detector and real analysis components for that data were used. The analysis components used here will also run in the operational HLT. In this setup the cluster in Cape Town was excluded, as the size of the simulated event data is significantly larger than the mock-up data. Transferring this amount of data over the comparatively slow link to Cape Town was therefore not considered, so that only the clusters in Tromsø, Bergen, Dubna, and Heidelberg were used.
C. Global Test Results
All tests discussed below demonstrate the successful operation of the HLT on-line framework in a grid-like environment. For the first setup, an initial test was started with the event rate limited explicitly to 10 Hz, in order not to over-stress the network link to Cape Town. This test ran overnight, unattended, for more than 15 hours. During this time more than 500,000 events were passed through the processing chain before the test was stopped by the operators. In a second test with the same configuration, the 10 Hz limit was deactivated and the chain was allowed to run at the maximum achievable rate. During the test's runtime of about two hours a maximum rate of 15 Hz was reached. Note that the speed of light in vacuum corresponds to 40 ms one way over the given distance of 12,000 km; a single round-trip delay per event would thus impose a theoretical limit of about 12 Hz. This test therefore demonstrates that there is no global flow control or round-trip delay anywhere in the system. All communications are performed pipelined and point to point. The number of events being processed simultaneously, and the corresponding latency and memory requirements in the system, scale with the system size. This system effectively implements a physical pipeline more than 10,000 km long, operating at the limits of the slow network link to Africa.

Fig. 1. Setup of the cluster-finding processing steps at the Bergen site.
Fig. 2. Setup of the tracking processing steps at the Heidelberg site.

A third test was then run using the second configuration, excluding the Cape Town site but using realistic simulated ALICE TPC detector data and real analysis components at the four remaining sites. This configuration ran at 3.3 Hz for about 100 min., thereby processing approximately 20,000 events. In order to start the overall system, all tasks on all nodes had to be started in a co-ordinated way. Any error in this process would require the entire procedure to be restarted. The built-in fault recovery mechanisms, however, allowed many errors to be remedied locally, without completely restarting the chain. Without this feature the tests would not have been successful during the short time available for them. In the long run, the repair mechanisms (restarting processes and reconnecting them to the framework) will be performed automatically by appropriate intelligent daemons, currently being developed. In all test cases the bottleneck was clearly the network bandwidth between the sites, for the first two tests in particular the link to Cape Town. This was mainly due to the fact that the tests were run on normal working days using the sites' normal internet connections as links. The links were therefore busy with the sites' regular traffic, leaving only a limited amount of bandwidth available for the test.

Fig. 3. The global setup with all involved sites and nodes. Components on each node are not shown.

Using dedicated links, the achievable rates should be correspondingly higher with the increase in bandwidth. The ALICE HLT is targeted to run at 200 Hz for TPC events and at up to 1 kHz for DiMuon-only events. The rates achieved here are about a factor of 30 too low. However, the goal was not to prove the feasibility of running the HLT application as a grid application under current conditions, as this would require a 25 GB/s bandwidth into public networks, which would be prohibitively expensive. The goal was to demonstrate, firstly, that such a system is possible at all and, secondly, that it works. It can be adapted to any application with a somewhat lower networking requirement.

IV. SUMMARY AND CONCLUSIONS
Fig. 4. The hierarchy of the TaskManager control processes. Each L3 TaskManager slave is controlled by the closest L2 intermediate TaskManager or, in the case of directly accessible nodes, by the L1 master TaskManager.

As an overall result, it can be stated that the approach of using the ALICE High Level Trigger data transport framework for a grid-like system was successful. The framework, which was designed for use in a single-cluster configuration, functioned as desired in a globally distributed system. Performance of the system was restricted only by the limited amount of bandwidth available for the tests on the normal Internet connections of the involved sites. Therefore, the use of distributed grid-like online systems has been shown to be feasible in principle. The fundamental requirements are primarily defined by the bandwidth requirements of the system involved. The latencies incurred due to the large distances do not matter in this application and only require some additional on-line storage space in order to maintain the deeper pipelines. The HLT on-line system will implement queues supporting thousands of events, or several seconds of running time. Therefore the latency incurred even in a system of the presented scale will not make any difference. We have demonstrated the successful operation of a data-driven, distributed processing paradigm operating on a global scale. There are no particular latency constraints. The transaction rate of the system depends only on the available network bandwidth and the data sizes, while the depth of the pipeline is irrelevant. Therefore the ALICE HLT software framework can in principle be operated over arbitrarily large distances, given a TCP/IP connection. However, making a complex, distributed system operate at all at the presented scale is an achievement, documenting how powerful the communication framework is. Depending on the data rate requirements and the available networks, which could be dedicated, this software infrastructure is suitable for any data-driven, distributed processing, in particular for applications where the data is acquired at various distributed sites.

REFERENCES
[1] T. S. Pettersson and P. Lefèvre, "The Large Hadron Collider conceptual design," Tech. Rep.
[2] R. Schmidt, "Status of the LHC at CERN," prepared for the 39th ICFA Advanced Beam Dynamics Workshop on High Intensity High Brightness Hadron Beams 2006 (HB2006), Tsukuba, Japan, 29 May - 2 Jun 2006.
[3] B. Alessandro et al., "ALICE: Physics performance report, volume II," J. Phys., vol. G32, pp. 1295-2040, 2006.
[4] ATLAS Collaboration, "ATLAS: Detector and physics performance technical design report, volume 1," CERN-LHCC-99-14.
[5] N. Neumeister, "The CMS experiment at the LHC: Status and physics potential," Czech. J. Phys., vol. 50S1, pp. 59-68, 2000.
[6] T. Nakada, "The LHCb experiment," Nucl. Phys., vol. A675, pp. 285c-290c, 2000.
[7] F. Carminati et al., "ALICE: Physics performance report, volume I," J. Phys., vol. G30, pp. 1517-1763, 2004.
[8] T. Alt et al., "The ALICE High Level Trigger," J. Phys., vol. G30, pp. S1097-S1100, 2004.
[9] T. M. Steinbeck, V. Lindenstruth, and M. W. Schulz, "An object-oriented network-transparent data transportation framework," IEEE Trans. Nucl. Sci., vol. 49, pp. 455-459, 2002.
[10] A. S. Vestbø, "Pattern recognition and data compression for the ALICE high level trigger," 2004.
[11] F. Carminati, C. W. Fabjan, L. Riccati, and H. de Groot, ALICE Computing Technical Design Report, ser. Technical Design Report ALICE. Geneva: CERN, 2005, submitted on 15 Jun 2005.
[12] M. Lamanna, "The LHC computing grid project at CERN," Nucl. Instrum. Meth., vol. A534, pp. 1-6, 2004.
[13] T. M. Steinbeck, "A modular and fault-tolerant data transport framework," Ph.D. thesis, Ruprecht-Karls-University Heidelberg. [Online]. Available: arXiv:cs/0404014
[14] C. Garabatos, "The ALICE TPC," Nucl. Instrum. Meth., vol. A535, pp. 197-200, 2004.
[15] A. Baldisseri, "The ALICE dimuon spectrometer,"