Optimised access to user analysis data using the gLite DPM
Sam Skipsey, Greig Cowan, Mike Kenyon, Stuart Purdie, Graeme Stewart
arXiv [cs.DC] preprint, GLAS-PPE/2009-08, May 2009
Department of Physics and Astronomy, Experimental Particle Physics Group
Kelvin Building, University of Glasgow, Glasgow, G12 8QQ, Scotland
Telephone: +44 (0)141 330 2000 Fax: +44 (0)141 330 5881
Sam Skipsey, Greig Cowan, Mike Kenyon, Stuart Purdie, Graeme Stewart
University of Glasgow, Glasgow, G12 8QQ, Scotland
Department of Physics, University of Edinburgh, Edinburgh, EH9 3JZ
Abstract
The ScotGrid distributed Tier-2 now provides more than 4MSI2K and 500TB for LHC computing, spread across three sites at Durham, Edinburgh and Glasgow. Tier-2 sites have a dual role to play in the computing models of the LHC VOs. Firstly, their CPU resources are used for the generation of Monte Carlo event data. Secondly, end-user analysis data is distributed across the grid to the sites' storage systems and held on disk, ready for processing by physicists' analysis jobs. In this paper we show how we have designed the ScotGrid storage and data management resources in order to optimise access by physicists to LHC data. Within ScotGrid, all sites use the gLite DPM storage manager middleware. Using the EGEE grid to submit real ATLAS analysis code to process VO data stored on the ScotGrid sites, we present an analysis of the performance of the architecture at one site, and procedures that may be undertaken to improve it. The results are presented both from the point of view of the end user (in terms of number of events processed per second) and from the point of view of the site, which wishes to minimise load and the impact that analysis activity has on other users of the system.

Presented at the International Conference on Computing in High Energy and Nuclear Physics, Prague, Czech Republic
Introduction
In general, the WLCG VOs (ATLAS, CMS, LHCb and ALICE) have two main uses for the sites that support them: Monte Carlo production, in which simulation data is produced for use in event selection and analysis, and user analysis, in which real detector data from the LHC is analysed. Production jobs have run on the Grid for several years, building up stocks of simulation data in preparation for LHC turn-on. As a result, sites and experiments have optimised their existing infrastructure and best practice for production work. Sites supporting ATLAS production, for example, have well-understood requirements on the data storage and transfer needed for a given total computational power, and procure (within constraints from other VOs) to provide infrastructure matching those requirements. User analysis is, by comparison, poorly understood, mainly because "true" user analysis is contingent on the generation of actual data at the LHC. It is widely accepted that a better understanding of the behaviour of such jobs is essential for providing best-practice and infrastructure recommendations to sites, and for experiments to properly advise their members.

The ATLAS VO has begun test simulations of user analysis patterns against Tier-2 sites in order to probe their performance under this kind of workload. The "HammerCloud"[1] framework, which utilises the Ganga[2] Python-based grid user interface to automate job submission and statistics calculation, is capable of submitting hundreds of analysis jobs (based on a single real analysis workflow) against a site or group of sites. UKI-SCOTGRID-GLASGOW, the Glasgow Tier-2 site, has used regular HammerCloud tests, consisting of approximately 300 jobs per run, to test the performance of its storage infrastructure under user analysis load, and to monitor the effect of improvements made.

Glasgow's storage infrastructure, before optimisation, consisted of:
• A DPM[3] head node (svr018.gla.scotgrid.ac.uk) running on a dual-core 2.5GHz AMD Opteron with 8GB of RAM. The MySQL database backend for the DPM services was hosted on 2 Hitachi 10,000RPM Ultrastar hard disks in a RAID 1 configuration, in order to provide fast, low-latency access with some data security.
• 18 disk pool servers in a mix of configurations. The majority (newer) servers have 20 SATA disks with a PCI-X hardware controller, configured in a single RAID 6 array and partitioned across 5 volumes.
• Gigabit ethernet links between all services.

This configuration was more than sufficient to support ATLAS production work, never placing high load on the head node or the disk servers, even with a full cluster of 1920 jobs.
HammerCloud test 38 (henceforth HC38) is a typical HammerCloud test, run against Glasgow before any optimisations were performed. Figure 2 shows the CPU load on the DPM head node approaching 100%, clearly bottlenecking the performance of the rest of the infrastructure; by comparison, the disk pool nodes are barely loaded. The HammerCloud statistics (figure 3) show that performance of the DPM storage was noticeably lacking, barely achieving an event rate of 10Hz, with a concomitant limitation on the mean and maximum job efficiency. The load on the DPM head node itself implies that it is clearly the effective bottleneck in this case.
As the bottleneck appeared to be the available computational power on the head node, we considered moving the DPM services to newer hardware. However, on investigating the load pattern on the head node, we determined that all of the CPU load was due to the "srmv2.2" and "dpm" processes, whilst the MySQL backend was engaging in large amounts of IO activity. Considering the limitations on available hardware, we decided to adopt a split head node configuration: the MySQL database would remain on the old hardware, to take advantage of the fast disks there, whilst the DPM services would be moved onto a repurposed worker node. In the process of this move we renamed the machines to keep the DPM services associated with the same DNS name. Hence, "svr015.gla.scotgrid.ac.uk" is the new name for the original hardware, and the worker node "node310" was renamed "svr018.gla.scotgrid.ac.uk". The partitioning process, including suitable reimaging of the worker node and reconfiguration of the services, took less than 2 hours, with the significant advantage of Glasgow's cfengine-based cluster configuration management system.

(We note here that the LHCb computing model foresees analysis only at Tier-1 centres and reserves Tier-2 CPU capacity for simulation.)

[Figure 1: DPM statistics for HammerCloud test "HC38": a) successful and b) failed transfers (600 second intervals) by DN.]
[Figure 2: Load on services during HammerCloud test "HC38": a) CPU load on the DPM head node (percentage of full load); b) average CPU load on the DPM pool (percentage of full load); c) total network load on the DPM pool (bytes/sec).]
[Figure 3: Statistics generated by HammerCloud test "HC38" for the Glasgow site: a) CPU efficiency (cputime/walltime); b) event rate (events per second per job); c) simultaneous running jobs.]
[Figure 4: DPM statistics for HammerCloud test "HC135": a) successful and b) failed transfers (600 second intervals) by DN.]
[Figure 5: Load on services during HammerCloud test "HC135": a) CPU load on the MySQL server; b) CPU load on the DPM services node; c) average CPU load and d) total network load on the DPM pool.]

The new configuration of the DPM head was:
• MySQL server on the old head node: dual-core Opteron, 8GB memory, with a 2-disk 10,000RPM RAID 1 array.
• DPM services on a dual quad-core 2.5GHz Intel Xeon with 16GB memory and a small 7200RPM "energy efficient" hard disk.
The next HammerCloud test, HC135, ran against the reconfigured hardware, producing the results in figures 4 to 6. As can be seen, the CPU load on the DPM head node is roughly comparable to the previous load, considering the quadrupling of the head node's compute power. As before, the disk pool servers are barely loaded; however, the MySQL server component now shows considerable IOWait. With this reconfiguration, our event rate increased by 40%, to around 14Hz, with a concomitant increase in job efficiency (as most of the inefficiency of an analysis job is caused by waiting for data to arrive). This is also reflected in the doubling of the successful transfers logged by DPM. However, as the pool servers were still relatively unloaded, it seemed reasonable to assume that further efficiency gains could be achieved with optimisations on the MySQL server.

[Figure 6: Statistics generated by HammerCloud test "HC135" for the Glasgow site: a) CPU efficiency (cputime/walltime); b) event rate (events per second per job); c) simultaneous running jobs.]
Analysis of the points of stress, by enabling slow query logging in MySQL, showed that the most common queries which increased load were on unindexed, or non-optimally indexed, quantities in tables, mainly in the dpm_db database used by DPM to manage requests. The SRM[4] protocol requires that a storage system manage all requests in a resilient and stateful manner; thus each transfer request to the head node causes writes to an appropriate table (dpm_get_filereq for get requests, for example, which are the most common requests against a Tier-2 storage system in general), with additional writes on each completion. Whilst the tables used by DPM generally have good indexing, we identified a few cases which benefited from additional index creation. We added the following index to the database:

create index pfn_lifetime on dpm_get_filereq (pfn(255), lifetime);

This actually modifies an existing index (on pfn only) into a composite index on (pfn, lifetime). The associated queries are extremely common on DPMs servicing data requests, and so the small gain in efficiency from the compound index adds up over the large number of such queries produced by HammerCloud-style user analysis.

We also discovered that monitoring tools deployed on our DPM produced spikes of high IO load when querying the request tables. The slow query log again provided hints as to the guilty queries, allowing us to remove most of this load by adding the indices:

create index status_idx on dpm_put_filereq(status);
create index stime_idx on dpm_req(stime);
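For reference, the slow query logging used above is enabled in the MySQL configuration file. A minimal sketch follows; the option spellings are those of the MySQL 5.1 series (5.0 uses log-slow-queries = <file> instead), and the one-second threshold and log path are our illustrative choices rather than values from the original setup:

```ini
[mysqld]
# Record statements that take longer than 1 second to execute
long_query_time               = 1
# Enable the slow query log and direct it to a file
slow_query_log                = 1
slow_query_log_file           = /var/log/mysql/mysql-slow.log
# Also record queries that perform full table scans (no usable index)
log-queries-not-using-indexes
```

The resulting log lists each offending statement with its execution time, which is what pointed us at the unindexed request-table queries.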
Finally, and related to the above, we determined that the MonAMI[5] DPM plugin was performing fairly frequent queries against the cns_db database (which stores the DPM namespace, and thus information about all the files stored) in order to obtain filesystem information. Whilst the cns_db is extremely well indexed, we determined that in this case the addition of

create index usage_by_group on Cns_file_metadata(gid, filesize);

removed almost all of the load produced by MonAMI's frequent global queries. Most of the "noise" visible in the load plot for the MySQL server in figure 5 appears to have been due to queries of this type, and effectively vanishes once these indices are in place.
Because DPM needs to write to one of the dpm_db tables for every request it receives, one cannot build indices on these tables in the most straightforward way: MySQL locks InnoDB tables while building indices on them, and this would cause the relevant kind of request on the DPM to fail. We avoided this with the following procedure (in these examples, FOO is the row id of the last "historical" record):

• Examine the table to determine the highest-numbered row which refers to an "historical" request: one which is no longer undergoing any changes, as it is complete and the requestor has finished with it.
• Clone the historical part of the table to a new table (called, say, dpm_put_filereq_copy), with the new indices:

CREATE TABLE dpm_put_filereq_copy LIKE dpm_put_filereq;
CREATE INDEX status_idx ON dpm_put_filereq_copy(status);
INSERT INTO dpm_put_filereq_copy SELECT * FROM dpm_put_filereq WHERE rowid < FOO;

This step may take hours: on a DPM which has been in production for a long time, there are vastly more historical entries than current ones.
• Once completed, stop the dpm and dpnsdaemon processes on the DPM head node:

service dpm stop
service dpnsdaemon stop

This is necessary because MySQL cannot rename locked tables, so one has to stop the tables being altered by stopping the services which can write to them.
• Copy the rest of the rows to the copy table (they will be indexed automatically). In general, the number of "non-historical" entries in the original table is small, so this step is fast:

INSERT INTO dpm_put_filereq_copy SELECT * FROM dpm_put_filereq WHERE rowid > FOO-1;

• Switch the table names, so that the "copy" is now the "real" table:

RENAME TABLE dpm_put_filereq TO dpm_put_filereq_old;
RENAME TABLE dpm_put_filereq_copy TO dpm_put_filereq;

• Start the dpm and dpnsdaemon processes again:

service dpm start
service dpnsdaemon start
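The clone-and-swap procedure above can be sketched end to end. The following uses SQLite (via Python's sqlite3 module) purely as a stand-in for MySQL, with a toy two-column version of dpm_put_filereq and an illustrative cut-off FOO; on a real DPM the equivalent statements are issued to MySQL, with the services stopped at the point indicated (SQLite has no CREATE TABLE ... LIKE, so the copy's columns are declared explicitly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-in for dpm_put_filereq; "rowid_" stands in for the request
# row id (plain "rowid" is special in SQLite). 100 fake requests.
cur.execute("CREATE TABLE dpm_put_filereq (rowid_ INTEGER, status INTEGER)")
cur.executemany("INSERT INTO dpm_put_filereq VALUES (?, ?)",
                [(i, i % 3) for i in range(1, 101)])

FOO = 90  # illustrative: last "historical" request id

# 1. Clone the historical part of the table into an indexed copy
#    (slow on a long-lived DPM, but the live table stays unlocked).
cur.execute("CREATE TABLE dpm_put_filereq_copy (rowid_ INTEGER, status INTEGER)")
cur.execute("CREATE INDEX status_idx ON dpm_put_filereq_copy(status)")
cur.execute("INSERT INTO dpm_put_filereq_copy "
            "SELECT * FROM dpm_put_filereq WHERE rowid_ < ?", (FOO,))

# 2. (On the real system: stop dpm and dpnsdaemon here.)

# 3. Copy the few remaining recent rows; they are indexed automatically.
cur.execute("INSERT INTO dpm_put_filereq_copy "
            "SELECT * FROM dpm_put_filereq WHERE rowid_ > ?", (FOO - 1,))

# 4. Swap names so the indexed copy becomes the live table.
cur.execute("ALTER TABLE dpm_put_filereq RENAME TO dpm_put_filereq_old")
cur.execute("ALTER TABLE dpm_put_filereq_copy RENAME TO dpm_put_filereq")
conn.commit()

# 5. (On the real system: restart dpm and dpnsdaemon.)
```

After the swap, the live dpm_put_filereq carries the new index and the full set of rows, while dpm_put_filereq_old can be dropped once the new table is confirmed good.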
We discovered, after performing these optimisations, that the default MySQL configuration for DPM barely assigns any memory to the InnoDB buffer pool, which is used by the InnoDB engine to cache and buffer common table reads and writes. As a result, the hit rate on the buffer was only around 97% under some load (thus, 3% of reads hit the disk). We increased the buffer pool size to half the physical RAM available on the server, which raised the hit rate to 99.9%, reducing the effective disk load on reads to 1/30th of its previous value. Writes are less easy to optimise in this way, due to the need for fsync to be called on write transactions for data integrity.
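The effect of that hit-rate change on disk traffic can be made concrete with a small sketch. The counter semantics follow MySQL's Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that missed the pool and went to disk) status variables; the absolute numbers are invented for illustration:

```python
def buffer_pool_hit_rate(read_requests, disk_reads):
    """Fraction of logical reads served from the InnoDB buffer pool."""
    return 1.0 - disk_reads / read_requests

# Before tuning: ~97% hit rate, so 3% of logical reads reach the disk.
before = buffer_pool_hit_rate(1_000_000, 30_000)

# After raising innodb_buffer_pool_size to half of RAM: ~99.9% hit rate.
after = buffer_pool_hit_rate(1_000_000, 1_000)

# Disk reads fall by (1 - before) / (1 - after) = 0.03 / 0.001 = 30x,
# i.e. to 1/30th of their previous level, as quoted in the text.
reduction = (1 - before) / (1 - after)
```

The point of the sketch is that a seemingly small change in hit rate (97% to 99.9%) translates into an order-of-magnitude change in physical read load, because it is the miss rate, not the hit rate, that drives the disks.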
Figures 7 to 9 show the performance of the DPM after MySQL optimisation. The load on the MySQL server is clearly somewhat reduced, and significantly less "noisy" than it was previously, and the CPU/IO load on the pool nodes still has not increased. There is a small increase in the successful transfer rate, from 600 to 700 per minute; however, this is not reflected in the event rate, which appears unexpectedly low (although still better than in the unoptimised case). HC135 occurred at a time when the cluster did not have enough free job slots for all analysis jobs to start simultaneously, whilst HC193 arrived on an almost empty cluster; the event rate therefore appears lower for HC193 because the higher transfer rate is spread over significantly more simultaneous jobs. Of concern is the significant increase in transfer failures after optimisation. It appears that this is due to saturation of the network connections of the pool servers.
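The saturation explanation is easy to make plausible with back-of-envelope numbers; all figures below are our illustrative assumptions, not measurements from the tests:

```python
# A single pool server's Gigabit link delivers at most ~125 MB/s
# (1 Gbit/s divided by 8 bits/byte, ignoring protocol overhead).
GIGABIT_MB_PER_S = 125.0

def per_job_bandwidth(concurrent_jobs):
    """MB/s available to each job streaming from one pool server."""
    return GIGABIT_MB_PER_S / concurrent_jobs

# If a popular dataset lives on one server and 50 analysis jobs read
# from it at once, each job sees at most 2.5 MB/s; as the job count
# grows further, transfers slow to the point of timing out, consistent
# with the rise in failed transfers once the head node stopped being
# the bottleneck.
per_job = per_job_bandwidth(50)
```

This is why removing the head-node bottleneck exposes the pool servers' network links as the next limit: the aggregate transfer rate rises until the Gigabit links, rather than the database, set the ceiling.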
By a combination of partitioning of hardware and the application of a small number of tweaks to the MySQL configuration, we have significantly improved the performance of the DPM storage infrastructure at UKI-SCOTGRID-GLASGOW with respect to "typical" user analysis jobs as represented by HammerCloud tests. Further improvements would require upgrades to the physical hardware, moving from Gigabit ethernet to 10GigE (or equivalent-bandwidth solutions, such as SDR Infiniband), at significant expense (although it is certain that such upgrades will happen before the LHC is turned on). User analysis stresses storage in a way that production use does not, requiring transfers of multiple small (AOD) files for each job, with very few pauses between transfers. Thus, for a cluster with many fast worker nodes, like UKI-SCOTGRID-GLASGOW, it is important to have as fast a DPM head node as possible. IOWait on the database backend is a significant limiter of total performance, and reducing it significantly improves total performance. Ultimately, however, one becomes limited by the capacity of the networking and the seek rates of the disk servers themselves (as well as the overhead of GSI authentication per request).

[Figure 7: DPM statistics for HammerCloud test "HC193": a) successful and b) failed transfers (600 second intervals) by DN.]
[Figure 8: Load on services during HammerCloud test "HC193": a) CPU load on the MySQL server; b) CPU load on the DPM services node; c) average CPU load and d) total network load on the DPM pool.]
[Figure 9: Statistics generated by HammerCloud test "HC193" for the Glasgow site: a) CPU efficiency (cputime/walltime); b) event rate (events per second per job); c) simultaneous running jobs.]

Acknowledgements
The authors would like to acknowledge the prompt help of Johannes Elmsheuser and Dan van der Ster in scheduling the HammerCloud tests needed to perform this optimisation process. This work was supported by the GridPP project, funded by the UK Science and Technology Facilities Council. Stuart Purdie was also supported by the Enabling Grids for E-sciencE (EGEE) project, funded by the EU.
References

[1] van der Ster D et al, J. Phys.: Conf. Series (this journal)
[2] Brochu F et al, 2009, Ganga: a tool for computational-task management and easy access to Grid resources, Preprint arXiv:0902.2685v1 [cs.DC]
[3] DPM: LCG Disk Pool Manager, http://twiki.cern.ch/twiki/bin/view/LCG/DpmAdminGuide, r80, 21 August 2008
[4] The Storage Resource Manager Interface Specification, Version 2.2, http://sdm.lbl.gov/srm-wg/doc/SRM.v2.2.html, 24 May 2008
[5] MonAMI website, http://monami.sourceforge.net/index.html, May 2009

(ATLAS are now distributing merged AOD files to Tier-2 sites, which will reduce the number of open() calls the head node needs to support, and should allow us, with the optimisations in place, to support even more user analysis jobs on the cluster.)