The H.E.S.S. central data acquisition system
A. Balzer a,b, M. Füßling b, M. Gajdus c, D. Göring d, A. Lopatin b, M. de Naurois e, S. Schlenker f, U. Schwanke c, C. Stegmann b,a

a DESY, D-15738 Zeuthen, Germany
b Institut für Physik und Astronomie, Universität Potsdam, Karl-Liebknecht-Strasse 24/25, D-14476 Potsdam, Germany
c Institut für Physik, Humboldt-Universität zu Berlin, Newtonstr. 15, D-12489 Berlin, Germany
d Physikalisches Institut, Universität Erlangen-Nürnberg, Erwin-Rommel-Str. 1, D-91058 Erlangen, Germany
e Laboratoire Leprince-Ringuet, École Polytechnique, CNRS/IN2P3, F-91128 Palaiseau, France
f CERN, CH-1211 Geneva 23, Switzerland
Abstract
The High Energy Stereoscopic System (H.E.S.S.) is a system of Imaging Atmospheric Cherenkov Telescopes (IACTs) located in the Khomas Highland in Namibia. It measures cosmic gamma rays of very high energies (VHE; > 100 GeV) using the Earth's atmosphere as a calorimeter. The H.E.S.S. Array entered Phase II in September 2012 with the inauguration of a fifth telescope that is larger and more complex than the other four. This paper gives an overview of the current H.E.S.S. central data acquisition (DAQ) system, with particular emphasis on the upgrades made to integrate the fifth telescope into the array. At first, the various requirements for the central DAQ are discussed, then the general design principles employed to fulfil these requirements are described. Finally, the performance, stability and reliability of the H.E.S.S. central DAQ are presented. One of the major accomplishments is that only a very small fraction of the available observation time is lost due to the central DAQ.

Keywords:
DAQ, Data acquisition, VHE, Gamma ray astronomy, H.E.S.S.
Email addresses: [email protected] (A. Balzer), [email protected] (A. Lopatin)
Preprint submitted to Elsevier, September 5, 2018

The High Energy Stereoscopic System (H.E.S.S.) is an array of Imaging Air Cherenkov Telescopes (IACTs) located in the Khomas Highland of Namibia. It is dedicated to the observation of very-high-energy (VHE) gamma rays [1]. IACTs detect Cherenkov light emitted by charged particles in cosmic-ray-initiated air showers. The measured light distributions are then used to reconstruct the properties of the primary particle. More details about the IACT technique, and the state of gamma-ray astronomy in the TeV regime, can be found in [2].

Currently, the H.E.S.S. experiment has entered the so-called "Phase II" with the inauguration of a fifth telescope [3, 4] on September 28th, 2012. This upgrade of the H.E.S.S. Array will lower the energy threshold of the experiment from about 100 GeV [1] to about 30 GeV [5, 6] and improve the overall sensitivity.
1. Requirements
The requirements of a DAQ system for an experiment like H.E.S.S. can be divided into three groups. First, there are design goals that need to be met in order to achieve optimal scientific output. Moreover, there are technical requirements that need to be fulfilled by the DAQ. The last group of requirements is about ensuring that the system is as user-friendly as possible.
The expected flux of VHE gamma rays observable by ground-based IACTs like H.E.S.S. is very low compared to the expected background rate. IACT arrays use stereoscopic observation of air showers to help reduce the high background rate of hadron-induced air showers as well as to improve the direction reconstruction of photon-induced air showers [1]. Therefore, the dead time of the whole array must not be increased by the DAQ. As a result, the number of telescopes that detect an air shower and are able to record the event is not to be reduced by the DAQ. Moreover, variations in the trigger rate, such as fluctuations induced by night sky background or by transient phenomena (i.e. bursts), should not cause dead time due to DAQ-related processing back-pressure.

The maximum hardware event readout rate is about 900 Hz which, combined with the typical event size, corresponds to a few MB/s for the primary scientific data during routine operation. To account for other hardware devices on-site as well as some additional capacity for tests, the DAQ system is required to process at least 80 MB/s, thereby having sufficient throughput for network connections and storage facilities. The same data format, as decided by the H.E.S.S. Collaboration, is used for storage on-site as for high-level tasks like event reconstruction and data analysis. This approach allows high-level analysis algorithms and visualisation tools to be used on the DAQ level. Therefore, the raw data that are sent from the Cherenkov detectors have to be converted into a common data format at a very early stage, which requires additional computing power. Moreover, the common data format allows data that are identical to those recorded by the DAQ to be easily simulated.

On top of the mere acquisition of data, their integrity and quality need to be verified in real time. Furthermore, the data should be checked and analysed by a preliminary analysis during observation. Due to the common data format, it is possible to use the standard offline analysis software for this purpose.
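The bandwidth requirement above follows from a simple product of trigger rate and event size. The following sketch illustrates the calculation; the 900 Hz readout rate and the 80 MB/s requirement are from the text, while the per-event size is an assumed placeholder, since the exact event sizes depend on the camera type:

```python
# Illustrative calculation only; the 4 kB event size is an assumed value.
def required_bandwidth_mb_s(trigger_rate_hz: float, event_size_kb: float) -> float:
    """Sustained data rate in MB/s implied by a trigger rate and event size."""
    return trigger_rate_hz * event_size_kb / 1024.0

# e.g. 900 Hz at an assumed 4 kB per event: a few MB/s, well below the
# 80 MB/s design requirement that also covers other on-site devices.
print(required_bandwidth_mb_s(900.0, 4.0))
```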
The array needs to be able to operate in different observation modes with different sets of telescopes. Several independent operation modes with disjoint sets of telescopes, so-called SubArrays, should be possible simultaneously, for example, observation runs as well as calibration and maintenance runs.

The DAQ system needs to be able to respond to target of opportunity (ToO) alerts, like alerts from the Gamma-Ray Burst Coordinates Network, in real time. (The Gamma-Ray Burst Coordinates Network (GCN) distributes information about the location of a Gamma-Ray Burst detected by various spacecraft, http://gcn.gsfc.nasa.gov/.) In order to allow for prompt follow-up observations, the DAQ should respond to such alerts as fast as possible (including a slewing of the telescopes to a new observation target) and should thus be highly automated. On top of that, the DAQ must be able to handle continuous data streams of other hardware components (for example monitoring data from weather stations) as well as to merge these streams with the data taken during observation.

The H.E.S.S. Array only takes data during the moonless part of a night when the weather conditions are good, i.e. no clouds are above the detector. To maximise the physics output of these time periods, the scheduling of the observation targets should be automated and configurable remotely (i.e. from Europe, where most host institutes are).

To facilitate the communication between the different devices, subsystems and computers, a common network infrastructure has to be established. This network should be based on robust, proven and commonly available technologies. This, in general, also holds for the hardware of the DAQ itself. Each hardware device is represented in the DAQ by at least one software process (its Controller). More complex hardware devices may correspond to multiple
Controllers. Furthermore, the configuration of the different pieces of hardware, as well as of the DAQ itself, should be as flexible as possible. New hardware should be usable without the need to change major parts of the code (e.g. data transport and device monitoring). As a side effect of a flexible configuration, the array is more tolerant to missing or malfunctioning hardware.

To prevent damage to the telescopes, the Cherenkov cameras and other critical systems, error handling needs to be redundant and decentralised. Therefore, each subsystem (e.g. the camera or the drive system) is responsible for its own safety, and fatal problems have to be handled directly in firmware. Only after immediate danger is averted do the devices inform the DAQ system of the error condition, which in turn takes appropriate actions, such as propagating the error to other subsystems or bringing the corresponding
SubArray to a safe state. In order to avoid hazardous conditions for the hardware, slow control information from all the subsystems (e.g. device temperatures, positions or voltages) has to be monitored by the DAQ.

The H.E.S.S. Array is situated in a remote location where no cheap, reliable and fast internet connection is available. Therefore, the data cannot be streamed to the different institutes of the H.E.S.S. Collaboration. Instead, the data have to be shipped to the institutes at regular intervals via magnetic tapes. The DAQ provides the resources to store the data on-site for up to three months, until the shipped data have been verified in Europe.

In general, the system should be as self-sufficient as possible. There must be no dependencies of DAQ components on remote network connections whatsoever, meaning data taking must not be interrupted by a failing internet connection. Furthermore, all relevant documentation and manuals need to be available on-site.
The H.E.S.S. DAQ should facilitate integration of new hardware and allow a quick response to hardware malfunctions without interfering with the rest of the array. The software should, therefore, be modular and support different hardware configurations.

The H.E.S.S. Array is operated by collaboration members (PhD students, etc.) during "Shifts" of one moon period. As a result, the array is operated by monthly changing non-expert personnel on-site (Shifters), with remote support by a team of subsystem experts working at the respective member institutes, as well as a local shift expert responsible for the training of the Shifters at the beginning of each Shift. As a security precaution, the access of Shifters to critical parts of the array has to be limited (for example configuration databases or low-level telescope movement), while subsystem experts must be able to perform remote maintenance. Moreover, the DAQ software should use stable releases that may not be altered by Shifters for observations, but should still allow subsystem experts on-site to develop and debug their software modules.

As subsystem experts are working at their home institutes most of the time and not on the H.E.S.S. site, remote access to the various subsystems of the array must be available. Due to limitations in internet connectivity, these remote connections need to be simple. To further ease the maintenance of the array, a dedicated logging system with detailed information for Shifters and subsystem experts has to be part of the DAQ software. For testing purposes, and in case of malfunctioning hardware, the ability to manually override automated procedures must be given at all times.

Due to the non-expert personnel operating the array during a Shift, manuals and documentation have to be available on-site at all times.
This also includes detailed instructions about safety regulations and error recovery. The DAQ should also be operable in a semi-automatic way by a single Shifter (for security reasons, at least two Shifters have to be present at all times). Moreover, the Shifters must be able to get fast feedback about the current status of the array during data taking, and especially in the case of system errors. Easy-to-understand error reports and clearly described error recovery procedures are key to high-efficiency data taking. By design, the Shifters have the final say in deciding whether data taking continues or not after an error has occurred. This also holds true in H.E.S.S. for weather and atmospheric conditions: there is no automated response from the DAQ and the Shifters' decision is necessary.
2. Central DAQ hardware
The primary data flow starts in the Cherenkov cameras of the five telescopes. If an air shower event is seen by a telescope, i.e. the camera trigger has decided to launch the readout of an event, the camera sends a trigger signal via optical fibre to the Central Trigger and begins to digitise the recorded image. The Central Trigger checks for coincidences in at least two telescopes; if no coincidence is found, it sends a “fast-clear” signal to the telescope, which will then drop the event. In case at least two telescopes trigger within ≈ 80 ns, the camera image is sent to the DAQ. A more detailed discussion of the H.E.S.S. II Central Trigger and camera readout can be found in [11, 12]. All the data recorded by the different cameras that belong to one event are sent via network to one node (see Figure 1) in the H.E.S.S. DAQ cluster. Furthermore, the data from the Central Trigger are sent to the same node, including a unique event number and a GPS time stamp for each event. The received raw data are buffered in memory at the receiving node, converted to a common data format by that node and stored on file servers.

Figure 1: Scheme of the network layout on the H.E.S.S. site in Namibia. In the upper part, the server room with the five storage servers as well as the ten worker nodes is depicted. Furthermore, the Central Trigger as well as the main switches are shown. The Cherenkov Telescopes (CT) are labeled from one to five, with the last one being the H.E.S.S. II telescope. Black lines with arrows indicate gigabit data network ethernet connections, while the green lines with circles represent a physically separated gigabit ethernet network for the mounting of the NFS and GlusterFS servers. The purple lines with the diamond-shaped ends represent the direct optical fibre connections between the Cherenkov camera trigger systems and the Central Trigger.
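The event-building step described above, joining the camera data with the Central Trigger record via the unique event number, can be sketched as follows. All names are illustrative rather than the actual H.E.S.S. DAQ code, and the stereo requirement is shown in software here purely for illustration (in reality it is enforced by the Central Trigger before data are sent):

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    event_number: int
    telescope: int
    payload: bytes

@dataclass
class TriggerRecord:
    event_number: int
    gps_time: float  # GPS time stamp assigned by the Central Trigger

def build_events(fragments, trigger_records):
    """Group camera fragments by event number and attach trigger info."""
    by_event = {}
    for frag in fragments:
        by_event.setdefault(frag.event_number, []).append(frag)
    events = []
    for rec in trigger_records:
        frags = by_event.get(rec.event_number, [])
        if len(frags) >= 2:  # stereo condition: at least two telescopes
            events.append((rec.event_number, rec.gps_time, frags))
    return events
```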
The communication between the different components of the DAQ is done via a Gbit Ethernet network only. If a piece of equipment is not able to use Ethernet, a gateway server is used, e.g. a serial COM server. Several Gbit switches connect all hardware with each other on the H.E.S.S. site, with two 48-port switches as the central communication hub inside the server room. Long distances are covered using Gbit media converters and optical fibres. The connection between the camera trigger systems and the Central Trigger is a direct optical fibre connection and not part of the normal data network. This is done to achieve minimal signal latency and thereby to reduce the overall dead time of the array during normal observation.

The cluster of the H.E.S.S. II Array consists of ten worker nodes and five storage servers. It is located in a climate-controlled server room inside the control building next to the telescopes. Each storage server is equipped with a 12 TB RAID6 [13] with an additional hot-spare disk, using an XFS [14] file system. This ensures redundancy as well as enough IO bandwidth and disk space for data taking. The NFS [15] and GlusterFS [16] protocols are used for distributed file access through the worker nodes. Due to the low-bandwidth internet connection on-site, data have to be transported using magnetic tapes at the end of each Shift; currently LTO-4 tapes [17] are used. The storage capacity on-site is sufficient to hold all data until they have been received and verified in Europe, which does not take longer than 3 months. Every month, H.E.S.S. Phase I takes 420 GB of data, and for H.E.S.S. Phase II this value is around 11 TB, while the total available disk space on-site is about 60 TB.

Different hardware subsystems require special-purpose machines, like e.g. custom boot servers for the camera and Central Trigger board electronics. These custom machines have been developed together with their respective hardware components in the lab.
On-site, they have been turned into virtual machines which are hosted by one of the servers. These virtual machines are fully customised to their respective task, e.g. using a 32 bit CPU architecture. Furthermore, it is possible to back up the machines and migrate them to another physical host in case of a hardware failure.
The maintenance of the H.E.S.S. central DAQ computing cluster and of the DAQ software itself is crucial for the operation of the experiment. To facilitate the maintenance and to reduce the complexity of the overall system, the cluster was designed to be as homogeneous as possible. This means that the different computing nodes and storage servers are based on the same hardware and run the same operating system, allowing the same spare parts and the same software to be used for all machines. To cope with power interruptions, which are quite common at the H.E.S.S. site (several times a week), a diesel generator in combination with online/double-conversion uninterruptible power supplies (UPSs) [18] is used to ensure a continuous power supply to the cluster. The diesel generators provide enough power for the telescopes and all electronic equipment on-site. However, since they need several seconds to kick in, the UPSs are needed to ensure a constant power output to the computing cluster. If necessary, the DAQ cluster can be sustained by the UPSs for up to 25 min. The status of the cluster is monitored constantly with several open source tools like
Ganglia [19] and
Collectd [20]. On top of that, email notifications as well as
SNMP traps are used to get error notifications from the different hardware components.
A uniform administration of all machines in the cluster is possible, using the same software packages. The
SystemImager tool [21] is used to create daily backups of the current operating system, including all relevant software to run the H.E.S.S. central DAQ. If the operating system is flawed after a software change, SystemImager allows the system to be rolled back to a previous, stable image with a single command. Dedicated backups of important files are performed on a daily basis. With increasing age of the backups, the frequency is reduced from daily, to weekly, to monthly snapshots. The databases (see Section 3.6) that are used throughout the DAQ are also backed up on a daily basis.
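The thinning of old backups described above (daily snapshots aging into weekly and then monthly ones) amounts to a simple retention rule. The function below is a sketch of such a rule under assumed cut-offs (one week, one month); it is illustrative and not the actual backup configuration:

```python
from datetime import date

def keep_snapshot(snapshot_day: date, today: date) -> bool:
    """Illustrative retention rule: keep every snapshot from the last week,
    weekly snapshots (Mondays) up to about a month, and monthly snapshots
    (first of the month) beyond that."""
    age = (today - snapshot_day).days
    if age <= 7:
        return True                          # daily for one week
    if age <= 31:
        return snapshot_day.weekday() == 0   # weekly afterwards
    return snapshot_day.day == 1             # monthly beyond a month
```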
3. DAQ software concept and implementation
The transport and storage of objects need a serialisation mechanism, which is provided by the ROOT Data Analysis Framework [22], on which the H.E.S.S. raw data format is based.
Figure 2: Data flow within the H.E.S.S. DAQ. Each piece of hardware is read out and managed by at least one DAQ
Controller process. The data obtained from the hardware are sent using the
Push mode to the corresponding receiver process. In
Push mode, data integrity is guaranteed. The received data are then stored in ROOT files using the common H.E.S.S. data format. For fast feedback to the Shift Crew, the data processed by a receiver can be pulled by a
Displayer process and displayed on one of the available screens in the control room. The data can be sent at an arbitrary rate. Error handling and process synchronisation are taken care of by the
Manager process.
To ensure a common interface and easy access to all the different raw data formats used in H.E.S.S., a common raw data base class is implemented in the software. This base class also takes care of correct time stamps for the different events that are recorded and saved to disk. This enforces a correct order in the loading and processing of events. Both the ROOT-based H.E.S.S. data format and the ROOT graphics and histogram classes are used online and off-line, allowing a seamless integration of the high-level data analysis code into the DAQ software, see Section 4.2.
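A minimal sketch of such a common base class is shown below. The names are hypothetical (the real implementation consists of C++ classes with ROOT serialisation); the point is that a shared event number and time stamp allow events from different sources to be loaded and processed in a well-defined order:

```python
# Illustrative sketch only, not the actual H.E.S.S. classes.
class RawDataBase:
    """Common base: every raw-data type carries an event number and a
    GPS time stamp, enforcing a correct processing order."""
    def __init__(self, event_number: int, timestamp: float):
        self.event_number = event_number
        self.timestamp = timestamp

class CameraEvent(RawDataBase):          # one possible derived format
    def __init__(self, event_number: int, timestamp: float, image: list):
        super().__init__(event_number, timestamp)
        self.image = image

def load_in_order(events):
    """Sort mixed raw-data objects so processing follows event time."""
    return sorted(events, key=lambda e: (e.timestamp, e.event_number))
```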
The communication between the different processes in the DAQ is built on the CORBA [23] distributed, object-oriented inter-process communication standard. A client process can call remote procedures on a server object using CORBA messages. The result of the function call (with possible read, read-write and write parameters) is propagated back to the client after the execution of the code on the server side. Due to the combined usage of CORBA and ROOT, several multi-threading issues in the ROOT Framework were discovered. To achieve the required thread safety, several patches have been contributed to ROOT, and are now part of all ROOT releases [24, 25].

CORBA allows the use of different programming languages and different operating systems on the server and the client side. The H.E.S.S. DAQ uses the free (LGPL) omniORB [26] implementation, which comes with C++ and Python support. A central directory naming service provides object registration (using a hierarchical naming tree with processes being grouped by CORBA contexts) and allows any process to easily obtain connections to other remote processes.

The H.E.S.S. DAQ implements two different data transport mechanisms: pushing and pulling of data, see Figure 2. In
Push mode, the data are sent from one process to another, ensuring data reception and integrity. The
Push mode is used to store all measured data, including the scientific data from the Cherenkov cameras. In
Pull mode, a client polls the server process for new data at periodic intervals. In this way, data can be dropped or requested at a lower rate, but the integrity of each delivered item is still ensured. This transport mechanism is used by displaying processes, where a sub-sample of the data is sufficient.

To store data on disks, multiple instances of a generic receiver process are used throughout the DAQ. These receiver processes are implemented in a generic way which allows the introduction of new data without the need to change existing code. The generic receiver makes use of the common base class of the H.E.S.S. raw data formats and can save any data using the ROOT serialisation mechanism. Moreover, basic data quality checks can be performed while the data are being received, e.g. monitoring whether the time that passes between two events, or the size of an event, is in a certain range and has a certain mean value.
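The semantics of the two transport modes can be summarised in a small stand-alone sketch. The real implementation uses CORBA remote calls between processes; the class below is purely illustrative:

```python
import collections

class DataServer:
    def __init__(self):
        self._store = []                            # Push mode: every item kept
        self._latest = collections.deque(maxlen=1)  # Pull mode: latest only

    def push(self, item):
        """Push mode: the sender delivers data; nothing may be dropped."""
        self._store.append(item)
        self._latest.append(item)

    def pull(self):
        """Pull mode: a display client polls at its own rate and may skip
        intermediate items, but always sees a complete (intact) item."""
        return self._latest[-1] if self._latest else None
```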
To be able to deal with high data rates, so-called
NodeSwitching is used in the H.E.S.S. DAQ; this is a round-robin load-balancing scheme [27]. All the telescopes in a given run send their data to one node for a given amount of time (usually 4 s). After that, the Central Trigger announces the next node that is free to receive data from all the cameras. In case of insufficient computing power, the number of nodes can be adjusted manually. Using this scheme, the DAQ can receive and process data at higher rates than required by the H.E.S.S. telescope array.

A specialisation of the generic receiver mentioned before is the
CameraReader, which is the process responsible for receiving, buffering and processing the Cherenkov camera events. The received events are unpacked, converted into the H.E.S.S. raw data format, joined with the Central Trigger information and finally written to the storage servers. For very high data rates, the storage servers may not be able to store the events fast enough. In that case, the nodes can be configured to use their local disks for the duration of the observation, after which the data are merged and copied to the storage servers.
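The round-robin NodeSwitching scheme can be sketched as below. The node names are illustrative, and the sketch assumes queries with non-decreasing times since run start; in the real system the switch is announced by the Central Trigger rather than computed locally:

```python
from itertools import cycle

class NodeSwitcher:
    """Round-robin assignment of the receiving node, switching every
    `interval_s` seconds (4 s in the text)."""
    def __init__(self, nodes, interval_s=4.0):
        self._cycle = cycle(nodes)
        self._interval = interval_s
        self._current = next(self._cycle)
        self._switch_at = interval_s

    def node_for(self, t: float) -> str:
        """Node receiving data at time t (seconds since run start);
        t must be non-decreasing across calls."""
        while t >= self._switch_at:
            self._current = next(self._cycle)
            self._switch_at += self._interval
        return self._current
```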
All hardware used by the DAQ is represented by a Controller, which is a software process running on a computing node of the DAQ cluster, acting as a uniform interface between the given hardware component and the DAQ itself. Simple hardware can be represented by a single
Controller, whereas more complex hardware like the Cherenkov cameras uses multiple
Controllers, e.g.
Camera HV Controller, Camera Trigger Controller, Camera Lid Controller, etc. A state machine maps the state of the hardware to the state of the corresponding
Controllers. The mapping of
Controller states to hardware states for some processes is shown in Table 1. If a
Controller is in the
Safe state, the corresponding hardware is either turned off or in a state of minimal activity. In preparation for data taking and in between runs, all
Controllers are in the
Ready state, i.e. the hardware is turned on and slow control information is being recorded. As an intermediate step shortly before data taking, the
Configured state indicates that the hardware has received all the necessary configuration parameters from its
Controller. Finally, if a
Controller is Running, the corresponding hardware is read out and the data are processed by the DAQ and stored on disk.

State transitions are used to change from one state to another. A scheme of the state machine used by the H.E.S.S. DAQ can be seen in Figure 3. If a
Controller cannot perform its state transition successfully (for example due to hardware problems), it will fall back to its previous state. A detailed description of the error handling is given in Section 4.4.

There is only one transition from one state to an adjacent state for a given direction, which simplifies the handling of the state machine because loops or unreachable states cannot occur. Furthermore, a flat state machine allows processes to be sent to any given target state that need not be adjacent to the current state, i.e. from
Safe to Running or vice versa. (The
Controller will perform all necessary transitions to get from the
Safe to the
Running state automatically, i.e. the order of executed transitions would be
GettingReady, Configuring and
Starting.) Transitions towards the
Running state are called “upwards”, and those in the other direction are called “downwards”. To be able to monitor the performance of the DAQ processes, time stamps are written into a MySQL database at the beginning of each transition, once all of the dependencies (see Figure 4) of a
Controller have finished their transition, and again at the end of each transition (see Section 4.3).
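The linear state machine and the automatic chaining of transitions can be sketched as follows. The state order and transition names are those given above; the class itself is only an illustration, not the H.E.S.S. implementation:

```python
# States in "upwards" order, and the named upward transitions between them.
STATES = ["Safe", "Ready", "Configured", "Running"]
UP_TRANSITIONS = {"Safe": "GettingReady",
                  "Ready": "Configuring",
                  "Configured": "Starting"}

class Controller:
    def __init__(self):
        self.state = "Safe"
        self.log = []          # upward transitions actually executed

    def goto(self, target: str):
        """Walk through all intermediate states towards the target,
        upwards or downwards; the linear layout guarantees there are
        no loops or unreachable states."""
        cur, tgt = STATES.index(self.state), STATES.index(target)
        step = 1 if tgt > cur else -1
        while cur != tgt:
            if step == 1:
                self.log.append(UP_TRANSITIONS[STATES[cur]])
            cur += step
            self.state = STATES[cur]
```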
To synchronise the different processes during data taking,
Managers are used throughout the DAQ. Each
Manager is also a
Controller, i.e. a software process running on a computing node of the DAQ cluster, providing an extended interface to deal with collections of
Controllers. The Managers are responsible for distributing the run configuration to their subordinate processes as well as managing their states. The state of a Manager is determined by the minimum state of all managed processes, e.g. if CT5/Tracking is
Running (i.e. CT5 is tracking the observation target) but some camera process in CT5 is still
Starting
(e.g. the high voltage is turned on), the state of the CT5/Manager is still Configured. Furthermore, the Managers are used for error handling, which is discussed in detail in Section 4.4.

Table 1: Mapping of state machine states to real hardware states for some important software processes in the H.E.S.S. DAQ.

Figure 3: Visualisation of the state machine used in the H.E.S.S. DAQ. The boxes represent the states a Controller (and its respective device) can be in, while the arrows show the available transitions. This state machine is linear by design, which simplifies the synchronisation of multiple Controllers. As all the hardware Controllers are mapped to the same state machine, it becomes quite easy to determine the state of the whole array or any subset, e.g. when all processes of telescope CT1 are Ready, the whole telescope is Ready for observation.

Figure 4: An example interaction of various important H.E.S.S. DAQ processes during the starting of an observation run. The change of the state of the corresponding SubArray is indicated by the dashed lines and ranges from Ready, over Configured, to Running. The hexagonal boxes represent state transitions of the corresponding DAQ Controllers. The solid and dotted black lines with a filled circle at the end indicate dependencies between processes within the H.E.S.S. DAQ, i.e. the HV of the PMTs in the camera must not be turned on while the telescope is still moving, and the camera trigger should not be activated before the camera HV is turned on. As a result, a process only starts with its transition if all of its dependencies have successfully finished their own corresponding transition. The solid lines indicate the slowest dependency of a given process, while the dotted lines represent dependencies that have already reached the required state. The global state of the SubArray is determined by the slowest process, i.e. once the last process has finished its transition from Ready to Configured, the whole SubArray is considered to be Configured.

The DAQ uses a hierarchical structure to control all the processes of a given run, with the
SubArray Manager on top and
Managers for every Context (see Section 3.2), in the order of the hierarchy of the contexts. The hierarchy of
Managers and
Controllers in the H.E.S.S. DAQ is shown in Figure 5. The
Manager of a given context gets the list of the processes needed for a given run from the database. There are three distinct types of processes that can be specified within the database.
Disturbing processes are sent to a safe state at the beginning of the run, e.g. the camera lid is disturbing for a park-in run. Therefore, it is closed before the telescopes are parked in (i.e. moving the camera into the camera shelter and closing its roof).
Required processes are necessary for a given run, and a run will be stopped or will fail to start if a required process is not in the correct state. Finally,
Optional processes are part of a given run, but an error or a misoperation of one of these processes does not influence the data taking, e.g. a crash of the process displaying monitoring data to the Shifters does not stop data taking. Once the problem with an optional process is fixed, it is allowed to rejoin data taking.

Figure 5: An example overview of the process hierarchy within the H.E.S.S. DAQ. The Run Manager distributes the scheduled data taking runs over an arbitrary number of SubArrays. Each SubArray consists of a SubArray Manager process and several other subcontexts. These contexts normally are various Node and CT contexts representing the Cherenkov telescopes and the CameraReaders. Each SubArray can operate independently from the others, and Node and CT contexts can be assigned in any combination to any SubArray. The so-called Slow Control context is responsible for atmospheric monitoring tasks as well as all processes that display data to the Shift Crew, and is constantly in a Running state.

In certain cases, a
Controller has to be in a certain state before another
Controller can begin to perform its transition to its target state. These dependencies are entered into the database, and
Controllers will get a list of all their dependencies for a given run type from their corresponding
Manager. Controllers will wait for their dependent processes to finish their transitions before starting with their own, e.g. a camera trigger will wait for the camera HV to be turned on before configuring itself. Some example dependencies of processes within the DAQ can be seen in Figure 4. Dependencies are applied in reverse order for downwards transitions; this ensures that processes are shut down in the right order.
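The dependency handling just described amounts to ordering transitions topologically and reversing that order for downwards transitions. A sketch with an illustrative dependency graph (the process names are examples taken from the text, not the actual database contents):

```python
from graphlib import TopologicalSorter

# "camera_hv" must wait for "tracking"; "camera_trigger" for "camera_hv".
DEPENDENCIES = {
    "camera_hv": {"tracking"},
    "camera_trigger": {"camera_hv"},
    "tracking": set(),
}

def transition_order(upwards: bool):
    """Upwards: dependencies transition first. Downwards: the order is
    reversed, so processes shut down before the processes they depend on."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    return order if upwards else list(reversed(order))
```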
In general, the whole configuration of the H.E.S.S. DAQ is contained in MySQL databases, including the processes and machines in use, as well as common environment variables and the configuration of the various hardware components. This allows a flexible configuration on-site and enables the setup of test and development environments in Europe, where only certain pieces of hardware are available. In addition to the on-site database, there are duplicates in four different locations in Europe. These replicated databases are also used for off-line quality checks and analysis.

The H.E.S.S. DAQ uses a database abstraction layer, called simpletable, to access the database. It was designed and implemented as part of the H.E.S.S. DBTools [28] (in C/C++) and additionally implemented in Python for the central DAQ. simpletable facilitates the grouping of multiple records in so-called sets. Such a set is, for example, a collection of calibration parameters for all pixels of a telescope. The library takes care of table locking and transactions, so that either a complete set is written/modified or nothing at all.

Another important aspect is that a permanent history of configurations, calibration parameters etc. is provided. It is possible to store and access multiple versions of a configuration set. Under normal circumstances, only new sets of data are added, even when these are just minor modifications of older sets still residing in the database. This also makes it easy to create temporary configurations, e.g. deactivating a malfunctioning piece of hardware during the night, and to roll back to an earlier version of the configuration when the error is resolved.
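The append-only versioning of whole sets can be sketched as follows; the class is hypothetical, not the actual simpletable API, and the in-memory store stands in for the MySQL tables:

```python
class ConfigStore:
    """Illustrative versioned store: complete sets are appended, never
    overwritten, so any earlier version remains available for roll-back."""
    def __init__(self):
        self._versions = {}   # set name -> list of complete sets

    def write_set(self, name: str, records: dict):
        """Append a new complete version of a set (all-or-nothing)."""
        self._versions.setdefault(name, []).append(dict(records))

    def read_set(self, name: str, version: int = -1) -> dict:
        """Read the latest version by default, or any earlier one."""
        return dict(self._versions[name][version])
```

For example, deactivating a malfunctioning pixel during the night would append a new set, and reading `version=0` afterwards recovers the original calibration.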
A dedicated logging framework has been implemented in the DAQ. Log files for every DAQ process are created on a daily basis, and timestamps with microsecond precision are written with each message to file. These logs are easily accessible using the central DAQ GUI, are stored for several years on-site and are occasionally copied to Europe.

There are six different message types available in the logging framework, as shown in Figure 6. Possible actions on message reception can range from print-outs to the log files, to pop-up messages with or without sound for the Shifters, to an automated security shutdown of the whole telescope array by the DAQ.

The log files are primarily used by the Shift Crew during error recovery to find the source of the underlying problem. Furthermore, the corresponding subsystem experts can use the log files at any time to search for problems.

If a severe error is detected by a hardware component, it has to be able to prevent any damage to itself and inform its
Manager via its
Controller about the error state afterimmediate danger is averted. Once the DAQ is informedabout the malfunction a fully automated procedure takesover which, depending on the severity of the error, canbring the whole array to a safe and consistent state. Thisautomatic procedure is designed to prevent possible damageto any other hardware equipment but especially to takecare of human safety, i.e. to stop the telescope movementimmediately, to shut down the Cherenkov Camera HV,etc. If an
Error does occur the corresponding
SubArray is brought to the
Ready state as quickly as possible usingso called “immediate transitions”. They work exactly likenormal transitions with the exception that all dependenciesare ignored ensuring the arrival into a safe state as fast aspossible. In case of a
Fatal error message the
SubArray is sent to the
Safe state and all hardware belonging tothe run is shut down. Once the DAQ is finished with itsautomatic response, the Shift Crew has to take over andidentify and solve the problem to be able to continue withnormal operation.
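The automatic reaction described above amounts to a mapping from message severity to a target state reached via immediate transitions. The sketch below is a simplified illustration, not the actual DAQ code: the six severity levels and the state names follow the text, while the SubArray class and its transition logic are assumptions.

```python
from enum import IntEnum

class Severity(IntEnum):
    PRINT = 0
    INFO = 1
    CAUTION = 2
    WARNING = 3
    ERROR = 4
    FATAL = 5

class SubArray:
    """Minimal stand-in for a SubArray state machine."""
    def __init__(self):
        self.state = "Running"

    def immediate_transition(self, target):
        # "Immediate transitions" ignore all inter-process
        # dependencies and jump straight to the target state.
        self.state = target

def handle_message(subarray, severity):
    """Automatic DAQ response, per the text: Error -> Ready
    (via immediate transitions), Fatal -> Safe (hardware off)."""
    if severity == Severity.FATAL:
        subarray.immediate_transition("Safe")
    elif severity == Severity.ERROR:
        subarray.immediate_transition("Ready")
    # Print/Info/Caution/Warning are logged and displayed only.

sa = SubArray()
handle_message(sa, Severity.WARNING)
assert sa.state == "Running"   # no automatic state change
handle_message(sa, Severity.ERROR)
assert sa.state == "Ready"
handle_message(sa, Severity.FATAL)
assert sa.state == "Safe"
```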
Another highly automated procedure is the start-up and shutdown of the DAQ. Starting the DAQ can be done in less than 1 min and stopping in less than 0.… . The procedure is driven by the Resource Handler, which gets the list and configuration of all the processes that belong to the DAQ from the MySQL database and starts Host Handlers on every machine that has been configured in the database to be part of the DAQ. These handlers are responsible for starting, monitoring and, if necessary, restarting all the processes that run on their machine.
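A Host Handler's supervision duty (start the configured processes, watch them, restart any that die) can be sketched as below. This is an illustrative model under the assumption that the real handlers act on actual OS processes rather than the toy Process objects used here.

```python
class Process:
    """Toy stand-in for a managed DAQ process."""
    def __init__(self, name):
        self.name = name
        self.alive = False
        self.starts = 0

    def start(self):
        self.alive = True
        self.starts += 1

class HostHandler:
    """Starts, monitors and restarts the processes configured
    (in the database) to run on this machine."""
    def __init__(self, configured_names):
        self.procs = [Process(n) for n in configured_names]

    def start_all(self):
        for p in self.procs:
            p.start()

    def check(self):
        # One monitoring pass: restart anything that has died.
        for p in self.procs:
            if not p.alive:
                p.start()

h = HostHandler(["CameraReader", "Analyser"])
h.start_all()
h.procs[0].alive = False        # simulate a crash
h.check()
assert all(p.alive for p in h.procs)
assert h.procs[0].starts == 2   # restarted exactly once
```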
4. Array operation
The H.E.S.S. Array takes data in periods of 28 min, so called runs, during astronomical darkness only. There are observation runs, which make up the bulk of the available dark time. Furthermore, dedicated calibration runs are taken with the detector at regular intervals, as well as other special purpose runs, e.g. system tests at the beginning of the night. (Calibration runs are needed to determine the conversion of the recorded electrical signal into a number of Cherenkov photons; a more detailed description of the H.E.S.S. calibration is given in [29].) All the different run types are specified in a database, and run types can be added, removed or modified without the need to change any code. This includes any combination of hardware that has to take part in a run of a given type. This is also true for the detailed configuration parameters (Run Parameters) of the hardware used in a given run (which can be different for different run types).

Furthermore, the targets scheduled for observations during a given shift are contained in the same database, and a dedicated tool, called the "AutoScheduler" [30], schedules all observation runs for a given night. The AutoScheduler takes into account various predetermined conditions, e.g. target priority, zenith angle, number of runs already taken on that target and available telescopes, and uses an optimisation algorithm to prepare the schedule. This schedule is then written to the database and processed by the DAQ. The Shift Crew can adjust the schedule, e.g. by adding calibration runs manually, but is not allowed to change the observation schedule unless there are exceptional circumstances, e.g. a ToO alert.

For further flexibility the DAQ can schedule runs for any combination of available telescopes. This includes multiple runs with different sets of telescopes, for example an observation run using CT1, CT2 and CT4, a calibration run with CT3 and another observation run with CT5 on a different target. For that purpose,
SubArrays are used to manage the participating hardware in a given run. In the example above, three SubArrays would have been used to take data with all five telescopes in three different runs at the same time.

The execution of the different scheduled runs is done by the Run Manager. It constantly monitors the available resources (e.g. telescopes and nodes) and is aware of the scheduled runs, their type and their requested hardware. It sequentially parses all scheduled runs and checks for free resources. As soon as all requirements are fulfilled (mainly there being enough free resources) the Run Manager allocates a SubArray and sends the run configuration to the corresponding SubArray Manager (see Section 3.5), which will then proceed on its own, i.e. configure all the necessary hardware, start the data taking and, once the run is finished, un-configure the hardware again.
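The resource check performed by the Run Manager can be pictured as a simple set-based allocation pass over the schedule. The sketch below is a deliberate simplification (real allocation also covers nodes and other hardware, and run records are richer than these dictionaries); the run names are invented.

```python
def allocate_runs(scheduled, free):
    """Sequentially scan the scheduled runs, as the Run Manager
    does, and start every run whose requested telescopes are all
    free; each started run removes its telescopes from the pool."""
    started = []
    free = set(free)
    for run in scheduled:
        need = set(run["telescopes"])
        if need <= free:          # all requirements fulfilled
            free -= need          # allocate a SubArray for this run
            started.append(run["name"])
    return started, free

schedule = [
    {"name": "obs-A",   "telescopes": {"CT1", "CT2", "CT4"}},
    {"name": "calib-B", "telescopes": {"CT3"}},
    {"name": "obs-C",   "telescopes": {"CT5"}},
    {"name": "obs-D",   "telescopes": {"CT1", "CT5"}},  # must wait
]
started, free = allocate_runs(
    schedule, {"CT1", "CT2", "CT3", "CT4", "CT5"})
assert started == ["obs-A", "calib-B", "obs-C"]
assert free == set()   # obs-D waits until CT1 and CT5 are free again
```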
All of the interaction between the Shift Crew, as well as other subsystem experts, and the DAQ is done in the central control room on-site. Several dedicated display machines are used to show monitoring information to the Shifters as well as to give fast feedback about the current status of the array. This includes information about the weather conditions outside (temperature, air pressure, wind speed and humidity), the camera monitoring (temperature, high voltage and currents), telescope pointing and motion, as well as real-time calibration and analysis data (e.g. shower images in the cameras and sky maps containing the source direction of all reconstructed gamma-like events).

Figure 6: There are six distinct message types used in the H.E.S.S. DAQ system for inter-process communication and logging. A Print message is piped into the daily log file of the corresponding process and is not distributed any further. An Info message is sent upwards in the hierarchy of the DAQ processes and also printed in the DAQ log message window, which is displayed on one of the screens of the main DAQ control PC in the control room. Caution and Warning messages as well as Error and Fatal messages are also distributed upwards in the DAQ process hierarchy. Furthermore, all four of these message types also generate different sound notifications in the control room to get immediate Shifter attention. Error and Fatal messages can also generate a pop-up window on the central screen of the main DAQ control PC. Moreover, the Run Manager is notified if an Error or Fatal message was issued, and no further runs are scheduled, i.e. the Run Manager performs a transition to the Safe state. Also, the corresponding SubArray is sent to the Ready state in case of an Error message and to the Safe state in case of a Fatal message as a safety precaution.

A central graphical user interface (GUI) is used for interaction with the DAQ; a screenshot of the main part of the GUI is shown in Figure 7. It serves as a single point of contact between Shifters and the DAQ where all essential settings and configuration options can be changed. This allows Shifters to take over manual control in an easy way in case of error recovery and special operations. The central GUI, and most other GUIs in the H.E.S.S. DAQ, are implemented in the Python programming language using PyGTK [31]. Furthermore, all GUI processes implement the state
Controller interface and are managed like all other processes in the DAQ, e.g. they receive Run Parameters and perform transitions like Starting and Stopping.

Building upon the common raw data format based on ROOT, a generic data Displayer has been developed (using C++). It uses the object introspection capabilities of the ROOT data analysis framework to gain access to any data member of any H.E.S.S. data storage format. Therefore it is able to plot any data that are recorded in the H.E.S.S. DAQ and can be configured solely using the MySQL database. Different specialisations of this Displayer can plot time lines, bar charts, wind roses, camera images, etc. An example of available displays that are shown in the control room can be seen in Figure 8. Some basic data quality checks can be performed with the Displayer as well, e.g. range checks with warning sounds and pop-up messages. Normally, displays are updated at a rate of a few Hz, providing fast feedback to the Shifters.

For complex hardware components, e.g. the Cherenkov cameras, expert GUI modes are also available. These allow detailed control of the various hardware components and can be used by subsystem experts as well as experienced Shifters on-site for development, error handling and debugging.
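The Displayer itself is written in C++ against ROOT's introspection facilities. The Python sketch below only illustrates the underlying idea, namely resolving a database-configured member path on arbitrary record objects at run time, using getattr instead of ROOT; the record classes and path strings are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Pixel:
    intensity: float
    hv: float

@dataclass
class CameraEvent:
    telescope: str
    pixels: list

def resolve(obj, path):
    """Follow a dotted member path, e.g. 'pixels.intensity'.
    A display configured from the database only needs such a
    string; no compiled-in accessor per data member is required."""
    for part in path.split("."):
        if isinstance(obj, list):
            obj = [getattr(o, part) for o in obj]
        else:
            obj = getattr(obj, part)
    return obj

event = CameraEvent("CT5", [Pixel(10.0, 1200.0), Pixel(3.5, 1180.0)])
assert resolve(event, "telescope") == "CT5"
assert resolve(event, "pixels.intensity") == [10.0, 3.5]
```

A range check such as the ones mentioned above then reduces to comparing `resolve(event, path)` against configured limits, independent of the record type being displayed.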
A dedicated collection of Displayers is also used to show the plots of the "real-time pipeline" to the Shifters. It is a full analysis of the data being taken, based on the HAP TMVA analysis [32], running in real time. The only limitation is that a default camera calibration has to be used, e.g. the high-gain to low-gain ratio, flatfield coefficients, single photo-electron values as well as muon coefficients, with the exception of the calculation of the pedestal position of the two gain channels of the photomultiplier [29, 33, 34]. For background determination either the Ring Background (the default; using several speed-ups concerning coordinate transformations), the Reflected Region or the Template Background method can be used [35]. Run results are stored on disk and used to perform a real-time analysis with data from consecutive runs on the same target. The output of the real-time pipeline are calibrated camera images with intensities in photo-electrons as well as significance maps of the region of the sky that is currently being observed, and θ² plots around the target source position. If a significant detection of a source is made, the Shift Crew is alerted using pop-up messages and advised to call for expert input to allow swift follow-up observations.

To cope with the high data rates, the real-time pipeline is split into several different processes. Each CameraReader has a corresponding Analyser process which subscribes to the data stream that is processed by the CameraReader. The Analyser processes all events that are generated by the CameraReader. This includes event calibration and event reconstruction as well as gamma-hadron separation. The processed data is collected by AnalysisServer processes, one for each SubArray in use. The input maps for the significance maps are filled at the Analyser and the final significance maps are created at the AnalysisServer. The latter is a time-consuming process (several minutes) and therefore happens in parallel to the input maps being received.

Figure 7: Main part of the central DAQ GUI used by the Shifters to interact with the H.E.S.S. II DAQ. It is separated in three distinct parts. The upper row consists of several groups of buttons that are needed during normal operation. The lower left part of the GUI is dedicated to giving detailed information about, and control over, the available SubArrays. This includes information about run duration, observation target, used telescopes, etc. The main part of the GUI is used to show all running DAQ Controllers to the Shifters. They are grouped in contexts and can be expanded to show all processes within this context. The number of processes in a context as well as their current state are also shown. Detailed control over every single process (even multiple ones if more than one are selected) is possible using the right-click menu. In case of errors, the Controllers at fault are marked with an error flag (not shown).

Figure 8: Example selection of slow-control displays for immediate feedback to the Shift Crew. The five largest camera displays show pixel intensities in photo-electrons for all telescopes. Right next to the intensity plots, the high-gain ADC count camera displays as well as the average camera drawer temperature over time histograms can be seen. On the right side of the screenshot the different rows of camera displays from top to bottom correspond to current pixel HVs, pixel currents, pixel scaler values and drawer temperatures (the pixel scaler value being the number of pixel triggers in a certain time period for a given pixel).

4.3. Monitoring and shift logs
To calculate the data-taking efficiency of the H.E.S.S. telescope array, the so called Transition Time Tools are used. This is a Python framework that analyses the gaps between the various runs and the timestamps that were written to a database by the different DAQ Controllers. The DAQ monitors whether there are any gaps in data taking during dark time. If a gap is detected, the Shifters are asked to give a reason for each gap between data-taking runs. The answers the Shifters provide are stored in the same database. In combination, the information about the transitions, the gaps between runs and their reasons can be used to calculate the data-taking efficiency of the H.E.S.S. Array as well as the percentage of observation time lost due to DAQ problems. Furthermore, it is possible to benchmark different processes with microsecond resolution and identify bottlenecks in the data-taking procedure. A further source of information is the detailed shift log that is sent to a mailing list and stored on disk after every night of a shift, to keep the collaboration up to date about the ongoing activities on-site.
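The core of such a gap analysis can be sketched in a few lines. This is an illustrative simplification of what the Transition Time Tools do; the threshold, timestamps and data layout are invented for the example.

```python
from datetime import datetime, timedelta

def dark_time_gaps(runs, min_gap=timedelta(seconds=60)):
    """Find gaps between consecutive runs that exceed min_gap;
    such gaps would be flagged for the Shifters to justify.
    `runs` is a list of (start, stop) timestamp pairs."""
    runs = sorted(runs)
    gaps = []
    for (_, stop), (start, _) in zip(runs, runs[1:]):
        if start - stop > min_gap:
            gaps.append((stop, start))
    return gaps

runs = [
    (datetime(2012, 10, 1, 19, 0), datetime(2012, 10, 1, 19, 28)),
    (datetime(2012, 10, 1, 19, 29), datetime(2012, 10, 1, 19, 57)),
    (datetime(2012, 10, 1, 20, 10), datetime(2012, 10, 1, 20, 38)),
]
gaps = dark_time_gaps(runs)
assert len(gaps) == 1   # only the 13-min gap needs a reason
assert gaps[0][1] - gaps[0][0] == timedelta(minutes=13)
```

The efficiency figure then follows from the total gap duration divided by the available dark time, with the Shifter-supplied reasons splitting the losses into categories such as weather, hardware and DAQ.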
In case of a hardware failure, the DAQ automatically performs a safety shutdown, as described in Section 3.7. Apart from this automatic procedure, manual overrides are available for every automatic configuration and automatic DAQ action. If a hardware device is not available or malfunctioning, its corresponding Controller can be replaced with a noopController, which just implements the common interface of the state machine while not doing anything else. This procedure is wrapped in a Python script that takes the name of the Controller to be replaced as a single command line parameter, allowing the Shift Crew to easily continue observations without the failed hardware.

Along with the manual overrides, extensive documentation of all hardware components and the DAQ is available on-site. While the hardware manuals are mostly present in analogue form, the DAQ manual and the DAQ troubleshooting guide are located within a MediaWiki [36] on-site. Shifters are encouraged to contribute to the Wiki while they are on shift, with an emphasis on the Shifter's Notes, a summary of the current shift for the next Shift Crew.

At the beginning of each shift the current Shift Crew is introduced to the handling of the array and to the emergency procedures by a local Shift Expert, who spends the first ten days with the Shifters. After that time the Shifters are on their own. Should they be unable to solve a problem, subsystem experts on call are available to help resolve the problem.
The H.E.S.S. site is located in a remote region with very limited internet connectivity. To make remote maintenance easier and to minimise the number of maintenance trips from Europe, several remote maintenance tools are used. For instance, IPMI cards [37] are installed on every machine in the DAQ cluster. This allows the machines to be power-cycled remotely, as well as giving access to the BIOS and other configuration menus during the boot-up of the machines. Furthermore, VNC [38] servers are used to forward the graphical displays once the operating system has started. They can be started on every machine of the DAQ cluster and are running constantly on the machines in the control room. The VNC connections have proven to be an invaluable tool when it comes to remote assistance in case of problems during data taking, as well as for remote maintenance of the cluster and related machines. Access to these VNC servers is restricted to DAQ experts to prevent tampering with the system by third parties.

Remote access to the network is possible with an OpenVPN [39] server running on the gateway machine of the DAQ cluster. With this it is possible to access all of the different networks on-site without compromising security or the separation between the data network and the users network. To ensure a stable production environment, the Shift Crew as well as other members of the collaboration cannot change the software used for data taking. Only DAQ experts can make software changes and can, therefore, guarantee a properly working DAQ system. The Shifters are given restricted access to the cluster to minimise the possibility of human error.
The DAQ system can receive target of opportunity (ToO) alerts from other experiments via the Gamma-Ray Burst (GRB) Coordinates Network (GCN). A dedicated process, the GCNAlerter, has been developed by the H.E.S.S. Collaboration. It listens for messages from the GCN network, checks whether the coordinates are visible and takes further action. For H.E.S.S. Phase I the GCNAlerter informed the Shift Crew and prepared a script to alter the observation schedule, which had to be confirmed and executed by the human operators. The results of the GRB observations with H.E.S.S. I are described in [40]. For the start of H.E.S.S. Phase II a revised target of opportunity alert scheme has been developed by the H.E.S.S. Collaboration [41] and is currently being implemented into the DAQ. If the GCNAlerter decides that an alarm justifies a prompt observation, the DAQ will react fully automatically and start data taking on the new target without the need for human intervention. For safety reasons the Shift Crew is not allowed to enter the array if the ToO alert system is active.

For these prompt observations the time span between receiving the alert and the beginning of the observation has to be minimised; transient events occur on time scales of a few seconds to several minutes. The bulk of the transition time between two runs is due to the slewing time of the telescopes to the new target. In normal operation, some additional time is used to switch off the high voltage while the telescopes are moving, to prevent damage to the photomultipliers of the Cherenkov cameras due to bright stars in the field of view.

In case of a GCN alert, the telescopes immediately start moving to the new target while the ongoing runs are stopped and the reconfiguration of all Controllers is done. Moreover, the high voltage of the cameras is not turned off; the cameras are just sent to an internal paused state where they stop taking data but remain fully configured. The effect of these actions is that the transition time is just the slewing time of the telescopes, which cannot be sped up any more (see [42]), plus a short overhead for the unpausing of the cameras and the reactivation of the camera pixels. Moreover, dependencies which are optional processes are not waited upon, so that they do not increase the duration of the transition. On top of that, only CT5, due to its higher movement speed, is required for the start of the data taking; CT1 to CT4 are configured to be optional processes and will join data taking once they are on target. The Cherenkov camera trigger and Central Trigger configuration are the same as during normal observations (CT5 monoscopic triggers are allowed all the time). The drive system of CT5 also has the option to use reverse tracking of ToO targets, i.e. driving beyond zenith, and the fine positioning of the telescope is done during data taking because the errors on the target position of a ToO alert are large.
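The prompt-ToO behaviour amounts to reordering the normal transition sequence: slew immediately, pause rather than power down the cameras, and gate the run start on the required telescope (CT5) alone. A hedged sketch of that decision logic follows; the function name, action labels and slew times are all illustrative, not taken from the real system.

```python
def too_transition_plan(telescopes, slew_times):
    """Return the actions for a prompt ToO repointing.
    Only CT5 is required to start the run; the others are
    optional and join once on target, so the start is gated
    on CT5's slew time alone."""
    plan = []
    for t in telescopes:
        plan.append((t, "slew_immediately"))  # start moving first
        plan.append((t, "pause_camera"))      # HV stays switched on
    required = ["CT5"]                        # the fast telescope
    start_after = max(slew_times[t] for t in required)
    plan.append(("array", f"start_run_after_{start_after}s"))
    return plan, start_after

# Illustrative slew times in seconds, not measured values.
plan, delay = too_transition_plan(
    ["CT1", "CT2", "CT3", "CT4", "CT5"],
    {"CT1": 90, "CT2": 95, "CT3": 88, "CT4": 92, "CT5": 40})
assert delay == 40   # gated on CT5, not on the slower CT1-4
assert ("CT5", "pause_camera") in plan
```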
5. Software management
To test the DAQ software without real hardware, a full DAQ simulation can be set up, i.e. raw camera data are sent from a camera emulation process to the node receivers, and a Central Trigger emulation process sends trigger blocks to the corresponding receivers as well. This can be used to test the receivers, the node switching as well as the real-time analysis with real data as input. However, to process a full run of real data in an acceptable time frame, a computer cluster similar to the one in Namibia has to be used. The H.E.S.S. Test-DAQ-Cluster, a scaled-down version of the DAQ cluster on-site (five nodes instead of ten, two storage servers instead of five, one switch instead of four), provides enough computing power for that purpose. Moreover, the same operating system and software are running on the Test-DAQ, making it an ideal test bench for software development. Its location in Europe also allows easy access, in contrast to the cluster on-site.
During the development of the H.E.S.S. II DAQ software, the Make-based build system [43] of the software was replaced by one using SCons [44]. Apart from a code cleanup during this transition, it is now possible to build the software using multiple jobs in parallel. The legacy build system was not able to build the software correctly using multiple jobs; it was simpler to re-implement the build system than to modify the Make-based one. Another benefit of SCons is the use of Python for its configuration script files, which allows a quick start for beginners and facilitates maintenance of the build system. To further aid development, a Bugzilla bug tracker [45] is used by the software developers of the H.E.S.S. Collaboration.

For development purposes and benchmarking, so-called Dummy Controllers are available in C++ and Python. They provide the basic interface of the state machine and are used to mock missing hardware Controllers in a testing environment. The Dummy Controllers are also used to test the dependency and time-out handling within the DAQ, as well as to test the automatic shutdown mechanisms in case of errors. They mock long-running transitions or can throw exceptions during specified transitions, depending on the options that are passed as command line arguments.

To aid other subsystem experts in the development of their own DAQ Controllers for their hardware, a dedicated virtual machine, the DAQ-VM, is available. Most of the time the subsystem hardware cannot be shipped to the Test-DAQ-Cluster, and a software test environment has to be set up at the location of the hardware (for instance a properly configured operating system, database and CORBA omniNames server). The DAQ-VM was created using VMware Fusion 3 [46]; the hardware requirements are a single-core 2 GHz processor, 1 GB memory and 20 GB disk space. Taking these hardware requirements into account, the DAQ-VM can be run on almost all currently available laptops. This allows non-DAQ-experts to test new DAQ Controllers with their corresponding hardware under conditions which are as close as possible to those on-site, without detailed knowledge about how to set up such a test environment.

Another helpful tool for non-DAQ-experts is the detailed documentation available in an internal Wiki of the H.E.S.S. Collaboration. Together with basic example Controllers and the extensive how-to guides in the Wiki, non-DAQ-experts can quickly start to write and test new DAQ Controllers.
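The behaviour described for the Dummy Controllers, i.e. selectable slow transitions and injected failures via command line options, can be sketched as below. The option names and the minimal interface are assumptions for illustration, not the real implementation.

```python
import argparse
import time

class DummyController:
    """Mocks a hardware Controller for DAQ tests: optionally
    sleeps during a transition or raises to simulate a fault."""
    def __init__(self, slow=None, delay=0.0, fail=None):
        self.slow, self.delay, self.fail = slow, delay, fail
        self.state = "Ready"

    def transition(self, name):
        if name == self.fail:
            raise RuntimeError(f"injected failure in {name}")
        if name == self.slow:
            time.sleep(self.delay)  # exercises time-out handling
        self.state = name + "-done"

def from_cli(argv):
    """Build a DummyController from command line arguments."""
    p = argparse.ArgumentParser()
    p.add_argument("--slow-transition")
    p.add_argument("--delay", type=float, default=0.0)
    p.add_argument("--fail-transition")
    a = p.parse_args(argv)
    return DummyController(a.slow_transition, a.delay,
                           a.fail_transition)

ctl = from_cli(["--fail-transition", "Configuring"])
try:
    ctl.transition("Configuring")
    failed = False
except RuntimeError:
    failed = True
assert failed and ctl.state == "Ready"  # fault left state untouched
```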
6. Performance
The H.E.S.S. DAQ system has been in use since the commissioning of the first H.E.S.S. telescope in 2003. Since then the DAQ system has evolved continuously to its current state as described in the previous sections. Over the 10-year period of operation, corruption of data due to central DAQ malfunctions was extremely rare and is negligible. In preparation for H.E.S.S. Phase II the central DAQ software has been overhauled, including being ported from 32 bit to 64 bit, to make full use of the architecture of recent server hardware.

(The SCons built-in dependency management is more sophisticated and can be extended more easily than Make's built-in dependency rules.)
The H.E.S.S. DAQ has been in operation for almost ten years and performed well throughout this time. In spite of frequently failing hard disks and the harsh environment for sensitive electronic equipment, the amount of data lost over time is negligible. The redundancy of the available hardware and software has played a critical role in this achievement, i.e. broken hard disks are compensated immediately by the RAID setup and other broken components can be replaced with spare parts from a common pool. At the same time it is possible to redistribute processes to other machines, because they have identical hardware and software environments. This is especially true for all services needed by the central DAQ and all DAQ Controllers, which can be started on any machine of the cluster (storage server or farm node). This multi-level redundancy design minimises both the probability of losing data and the recovery time after a computer hardware failure.

The central DAQ overhead contributing to the transition time between two consecutive observation runs is negligible, i.e. the time needed by the hardware to be ready for the next run is several orders of magnitude bigger than the central DAQ overhead of the corresponding controllers. Moreover, the central DAQ does not increase the system deadtime. In the time period from 01.01.2009 to 31.12.2012 the H.E.S.S. Array was not operational due to central DAQ problems for 0.… of the time. The recorded data amount to … MB/s on average. This includes the 1% of slow-control data and of log files which are stored every night for debugging purposes. The average CPU load of the cluster during data taking does not exceed 10% (of five 2.… GHz machines).

New hardware components can be added easily to the H.E.S.S. Array. Only the software for the Controller responsible for the new hardware has to be written and added to the DAQ. The rest of the system, i.e. data transport, storage and visualisation, does not need to be modified. Furthermore, only the inter-dependences of the new Controller with already existing ones have to be added to the database.

The central DAQ software used for Phase II of the H.E.S.S. telescope array evolved from the central DAQ software designed for Phase I of the experiment. One of the main differences between the new central DAQ software and the earlier implementation is compatibility with the 64 bit architecture of recent CPUs, which required many minor patches. However, the principal design ideas of the H.E.S.S. central DAQ software remained the same. This includes the configuration of the DAQ itself and the controlled hardware from a central database, as well as the common data format for all monitoring and scientific data. For the latter, only additional information had to be included, i.e. the pixel-wise timing information that became available with the newer electronics [11] of the CT5 camera. The fact that no redesign of the central DAQ software was necessary is due to the uniform interfaces for hardware and software: the Ethernet standard for communication to all hardware components and the common interface of the software
Controllers for all devices.

Another example of the capability of the H.E.S.S. DAQ to quickly adapt to new situations is the fact that, during the whole commissioning of the fifth telescope, the live system was used to test and develop the Controllers relevant for the new telescope while the array was taking data using the other four. This includes parallel and joint observations of the old and new telescopes. Due to this ability to take data with different SubArrays, the CT5 commissioning could be done mostly in parallel to data taking with CT1-4, which had only minor downtime.

The commissioning of the central DAQ for H.E.S.S. Phase II went smoothly. This was possible due to the extensive documentation and guides about DAQ Controller development, and the DAQ Virtual Machine (VM), which allowed non-DAQ-experts to easily develop and test the Controllers for their own custom hardware. For example, the development of the Tracking Controller for CT5 and the Controller for the calibration device of the new CT5 camera, as well as the adjustment of the Central Trigger Controller, have benefited from this. Moreover, the DAQ-VM was used to test the central DAQ software with the camera test benches (this includes old and new camera hardware, i.e. test benches mimicking CT1-4 as well as CT5) during the initial development phase of the DAQ for H.E.S.S. Phase II.

The main platform for the development of the H.E.S.S. Phase II central DAQ was the TestDAQ, including a full simulation of the data-taking process. In a simulated observation run, real raw data from previous runs taken with the H.E.S.S. DAQ are fed into the system, which allows the central DAQ to be tested under near-real conditions, the main difference being that there is no actual hardware connected to the TestDAQ cluster and Dummy Controllers take the place of real ones. These Dummy Controllers can also be used to fake slow hardware and to test the error handling of the DAQ. The simulation runs can be used to test the real-time analysis, to test data quality checks and to do benchmarks with the hardware of the cluster as well as the DAQ software running on it, to identify bottlenecks. Moreover, using the TestDAQ to test firmware and driver updates, as well as other hardware or software modifications, before applying them to the live system helps to avoid complications on-site and contributes to the stability of the DAQ.
7. Conclusion
The H.E.S.S. central DAQ has been in operation foralmost 10 years without any major problems since the inau-guration of the first telescope in 2003. The central DAQ didonly contribute to a loss of 0 . Controllers in the H.E.S.S. Collabo-ration proved to be difficult due to the many differentenvironments at the different member institutes and labo-ratories. To avoid some of the problems that usually ariseduring updates on site (i.e. introducing new functionalityfor
Controllers , installing updates for the operating sys-tem or cluster hardware drivers) it proved very usefullyto integrate the software and test the installation on theTestDAQ and the DAQ Virtual Machine, which helped toresolve many problems beforehand. While technically the central DAQ software is organizedin separate modules, there are a lot of historic interdepen-dencies between the different modules and classes. As aresult, it is very difficult to effectively test the differentcomponents of the DAQ software without integrating thefull software distribution (both DAQ software, and off-lineanalysis packages). The DAQ Virtual Machine helped totest
Controllers in a “standardized” environment. It pro-vides a pre-installed reduced collection of H.E.S.S. DAQsoftware modules and their dependencies for easy develop-ment and testing. Thus it alleviated some of the mentionedproblems. For future projects each software module shouldprovide automated tests that can be run without the needto integrate other modules. This helps to ensure a mod-ules functionality and also reduces the coupling betweenmodules, making maintenance a lot easier.The abstraction of internal hardware states to the sim-plified linear state machine proved to be very useful interms of controlling the whole telescope array or varioussubsets. This is also true for the design decision to delegatethe safety of each piece of hardware to its firmware, andonly having one
Error flag in the
Controller , which tells thecentral DAQ whether a piece of hardware is working or not.With these simplifications it is easier to keep the wholeDAQ in a consistent state and to use automatic procedureslike the controlled emergency stopping of a run in case ofhardware failures.The virtualization of the legacy boot-servers (describedin subsection 2.2) for the Cherenkov cameras and CentralTrigger devices has proven to be very useful. It allowed toreplace the outdated special purpose machines with new off-the-shelf hardware while keeping the specialized operatingsystem with all its customizations without any additionalreconfiguration or development. For future projects it is agood idea to decouple the configuration of the DAQ systemfrom the physical set-up of the cluster computing hardware.This could be achieved by creating virtual machines for allcritical tasks (e.g. database servers, boot servers, comput-ing nodes, etc.) and distribute them dynamically on theavailable physical machines. This would also reduce thetime to reconfigure the system after a physical server orcomputing node failure during data-taking.Another lesson learned concerns the graphical user in-terfaces in the control room. The control and monitoringinterfaces are very valuable to the Shifters and enablethem to take charge of observations after a few nights oftraining. On the technical side it would have been bet-ter to decouple the control functionality from the actualdisplay code. Quite some issues had to be resolved involv-ing multi-threaded interprocess communication mixed withmulti-threaded graphics display code. Our suggestion forfuture projects is to provide user interfaces using standardweb interfaces. The main advantage is that the system be-comes almost independent of the actual display machines.The web interfaces can be displayed on any current orfuture operating system or device as long it provides aweb-browser. 
Moreover, it would simplify remote operation and monitoring by providing remote access to the web interfaces, allowing the same interactions with the system that are possible on-site.

Due to the high flexibility and scalability of the design of the central DAQ, the inevitable changes necessitated by upgrades of the array could all be made in an evolutionary fashion, in contrast to a costly redesign. This is especially true in the light of the commissioning of the fifth telescope of H.E.S.S. Phase II.
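The controller abstraction summarized above can be illustrated with a minimal sketch (the class and function names here are hypothetical, chosen for illustration; the actual H.E.S.S. controller interface differs): each controller exposes a simplified linear state machine plus a single error flag, so the central DAQ only has to check a uniform readiness condition rather than device-specific details.

```python
from enum import Enum


class State(Enum):
    """Simplified linear run states, as seen by the central DAQ."""
    IDLE = 0
    CONFIGURED = 1
    RUNNING = 2


class Controller:
    """Illustrative controller: a linear state machine with one error flag.

    Device-specific safety logic is assumed to live in the hardware
    firmware; the DAQ only sees the linear state and the Error flag.
    """

    # Each state has exactly one successor, hence "linear".
    TRANSITIONS = {
        State.IDLE: State.CONFIGURED,
        State.CONFIGURED: State.RUNNING,
        State.RUNNING: State.IDLE,
    }

    def __init__(self):
        self.state = State.IDLE
        self.error = False  # the single Error flag exposed to the DAQ

    def advance(self):
        """Move to the next state; refuse to advance on a hardware error."""
        if self.error:
            raise RuntimeError("hardware error: cannot advance state")
        self.state = self.TRANSITIONS[self.state]
        return self.state


def array_ready(controllers):
    """The DAQ may start (or continue) a run only if no controller
    flags an error; a set error flag triggers an emergency stop."""
    return not any(c.error for c in controllers)
```

A controlled emergency stop then reduces to polling `array_ready` during data-taking and driving every controller back to `IDLE` as soon as it returns `False`, which is exactly the kind of simplification the single error flag buys.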
Acknowledgements
The authors would like to acknowledge the support of their host institutions. We want to thank the whole H.E.S.S. Collaboration for their support. The support of the Namibian authorities and of the University of Namibia in facilitating the construction and operation of H.E.S.S. is gratefully acknowledged, as is the support by the German Ministry for Education and Research (BMBF), the Max Planck Society, the German Research Foundation (DFG), the French Ministry for Research, the CNRS-IN2P3 and the Astroparticle Interdisciplinary Programme of the CNRS, the U.K. Science and Technology Facilities Council (STFC), the IPNP of the Charles University, the Czech Science Foundation, the Polish Ministry of Science and Higher Education, the South African Department of Science and Technology and National Research Foundation, and by the University of Namibia. We appreciate the excellent work of the technical support staff in Berlin, Durham, Hamburg, Heidelberg, Palaiseau, Paris, Saclay, and in Namibia in the construction and operation of the equipment.
References

[1] F. Aharonian, et al., Observations of the Crab nebula with HESS, Astronomy and Astrophysics 457 (2006) 899–915.
[2] J. Hinton, W. Hofmann, Teraelectronvolt astronomy, Annual Review of Astronomy and Astrophysics 47 (2009) 523–565.
[3] P. Vincent for the H.E.S.S. Collaboration, H.E.S.S. Phase II, in: Proc. 29th Int. Cosmic Ray Conference, Vol. 5, 2005, pp. 163–166.
[4] M. Punch for the H.E.S.S. Collaboration, H.E.S.S. II: Expansion of H.E.S.S. for higher sensitivity and lower energy, in: Towards a Network of Atmospheric Cherenkov Detectors VII, 2005, pp. 379–391.
[5] Y. Becherini, M. Punch, H.E.S.S. Collaboration, Performance of HESS-II in multi-telescope mode with a multi-variate analysis, in: F. A. Aharonian, W. Hofmann, F. M. Rieger (Eds.), American Institute of Physics Conference Series, volume 1505, 2012, pp. 741–744.
[6] M. Holler, A. Balzer, Y. Becherini, S. Klepser, T. Murach, M. de Naurois, R. Parsons, for the H.E.S.S. Collaboration, Status of the Monoscopic Analysis Chains for H.E.S.S. II, ArXiv e-prints (2013).
[7] C. Borgmeier for the H.E.S.S. Collaboration, The central data acquisition system of the H.E.S.S. telescope system, in: Proc. 28th Int. Cosmic Ray Conference, 2003, p. 2891.
[8] E. Hays, VERITAS Data Acquisition, in: International Cosmic Ray Conference, volume 3, 2008, pp. 1543–1546.
[9] D. Tescaro, J. Aleksic, M. Barcelo, M. Bitossi, J. Cortina, M. Fras, D. Hadasch, J. M. Illa, M. Martinez, D. Mazin, R. Paoletti, R. Pegna, for the MAGIC Collaboration, The readout system of the MAGIC-II Cherenkov Telescope, ArXiv e-prints (2009).
[10] M. Actis, G. Agnetta, F. Aharonian, A. Akhperjanian, J. Aleksić, E. Aliu, D. Allan, I. Allekotte, F. Antico, L. A. Antonelli, et al., Design concepts for the Cherenkov Telescope Array CTA: an advanced facility for ground-based high-energy gamma-ray astronomy, Experimental Astronomy 32 (2011) 193–316.
[11] J. Bolmont, et al., The camera of the fifth H.E.S.S. telescope. Part I: System description, 2013.
[12] S. Funk, G. Hermann, J. Hinton, D. Berge, K. Bernlöhr, W. Hofmann, P. Nayman, F. Toussenel, P. Vincent, The trigger system of the H.E.S.S. telescope array, Astroparticle Physics 22 (2004) 285–296.
[13] Wikipedia, Standard RAID levels — Wikipedia, The Free Encyclopedia, 2013. URL: http://en.wikipedia.org/w/index.php?title=Standard_RAID_levels&oldid=538005252, [Online; accessed 13-February-2013].
[14] R. Y. Wang, T. E. Anderson, xFS: A Wide Area Mass Storage File System, 1993.
[15] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, B. Lyon, Design and Implementation of the Sun Network Filesystem, 1985.
[16] GlusterFS, Clustered File Storage that can scale to petabytes, 2013. [Online; accessed 26-June-2013].
[17] Wikipedia, Linear Tape-Open — Wikipedia, The Free Encyclopedia, 2013. URL: http://en.wikipedia.org/w/index.php?title=Linear_Tape-Open&oldid=534100440, [Online; accessed 24-January-2013].
[18] Wikipedia, Uninterruptible power supply — Wikipedia, The Free Encyclopedia, 2012. URL: http://en.wikipedia.org/w/index.php?title=Uninterruptible_power_supply&oldid=528806006, [Online; accessed 16-January-2013].
[19] M. L. Massie, B. N. Chun, D. E. Culler, The Ganglia Distributed Monitoring System: Design, Implementation And Experience, Parallel Computing 30 (2003) 1.
[20] Collectd, collectd – The system statistics collection daemon, 2013. URL: http://collectd.org/, [Online; accessed 26-June-2013].
[21] B. E. Finley, VA SystemImager, in: Proceedings of the 4th Annual Linux Showcase & Conference - Volume 4, ALS'00, USENIX Association, Berkeley, CA, USA, 2000, pp. 11–11.
[22] R. Brun, F. Rademakers, ROOT - An Object Oriented Data Analysis Framework, in: AIHENP'96 Workshop, Lausanne, volume 389, 1996, pp. 81–86.
[23] OMG, The Common Object Request Broker: Architecture and Specification, Technical Report 2.0, Object Management Group, 1995.
[24] The ROOT Team, ROOT version 3.02/06 Release Notes, 2001.
[25] C. Borgmeier, K. Mauritz, C. Stegmann, M. de Naurois, DAQ and Analysis at H.E.S.S., 2001.
[26] S.-L. Lo, S. Pope, The Implementation of a High Performance ORB over Multiple Network Transports, in: Middleware '98: IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998, pp. 157–172.
[27] Wikipedia, Round-robin scheduling — Wikipedia, The Free Encyclopedia, 2013. URL: http://en.wikipedia.org/w/index.php?title=Round-robin_scheduling&oldid=557149038, [Online; accessed 4-June-2013].
[28] K. Bernlöhr, DBTools - Database tools for H.E.S.S. v1.3.3, H.E.S.S., 2009. Internal manual.
[29] F. Aharonian, et al., Calibration of cameras of the H.E.S.S. detector, Astroparticle Physics 22 (2004) 109–125.
[30] K. Bernlöhr, The AutoScheduler - An Automated Target Scheduling Tool For H.E.S.S. v1.05, H.E.S.S., 2006. Internal manual.
[31] PyGTK, Python wrappers for the GTK+ graphical user interface library, 2013. [Online; accessed 26-June-2013].
[32] S. Ohm, C. van Eldik, K. Egberts, γ/hadron separation in very-high-energy γ-ray astronomy using a multivariate analysis method, Astroparticle Physics 31 (2009) 383–391.
[33] S. Funk, Online Analysis of Gamma-ray Sources with H.E.S.S., Diploma thesis, Humboldt University Berlin, 2005.
[34] A. Balzer, Systematic studies of the H.E.S.S. camera calibration, Diploma thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2010.
[35] D. Berge, S. Funk, J. Hinton, Background modelling in very-high-energy γ-ray astronomy, Astronomy and Astrophysics 466 (2007) 1219–1229.
[36] MediaWiki, MediaWiki — The Free Wiki Engine, 2012. [Online; accessed 24-January-2013].
[37] IPMI, The Intelligent Platform Management Interface (IPMI) is a standardised interface used to monitor the operation of computer systems, 2013. [Online; accessed 26-June-2013].
[38] Wikipedia, Virtual Network Computing — Wikipedia, The Free Encyclopedia, 2013. URL: http://en.wikipedia.org/w/index.php?title=Virtual_Network_Computing&oldid=532570491, [Online; accessed 24-January-2013].
[39] OpenVPN, OpenVPN is an open source solution to establish virtual private networks (VPN), 2013. URL: http://openvpn.net/index.php/open-source.html, [Online; accessed 26-June-2013].
[40] F. Aharonian, et al., HESS observations of γ-ray bursts in 2003–2007, Astronomy and Astrophysics 495 (2009) 505–512.
[41] D. Lennarz, P. Chadwick, W. Domainko, R. D. Parsons, G. Rowell, P. H. T. Tam, for the H.E.S.S. Collaboration, Searching for TeV emission from GRBs: the status of the H.E.S.S. GRB programme, in: Proceedings of the 7th Huntsville Gamma Ray Burst Symposium, Nashville, 2013.
[42] P. Hofverberg, R. Kankanyan, M. Panter, G. Hermann, W. Hofmann, C. Deil, F. A. Benkhali, H.E.S.S. Collaboration, Commissioning and initial performance of the H.E.S.S. II drive system, in: Proceedings of the 33rd International Cosmic Ray Conference (ICRC2013), Rio de Janeiro (Brazil), 2013. arXiv:1307.4550.
[43] Make, Make is a tool which controls the generation of executables and other non-source files of a program from the program's source files, 2013. [Online; accessed 26-June-2013].
[44] SCons, SCons is an Open Source software construction tool, that is, a next-generation build tool, 2013. [Online; accessed 26-June-2013].
[45] Bugzilla, Bugzilla is server software designed to help you manage software development, 2013. [Online; accessed 26-June-2013].
[46] VMware Fusion, VMware Fusion is a software hypervisor developed by VMware for Macintosh computers with Intel processors, 2013. URL: https://my.vmware.com/web/vmware/info/slug/desktop_end_user_computing/vmware_fusion/3_0, [Online; accessed 26-June-2013].