The IceProd Framework: Distributed Data Processing for the IceCube Neutrino Observatory
M. G. Aartsen^b, R. Abbasi^ac, M. Ackermann^at, J. Adams^o, J. A. Aguilar^w, M. Ahlers^ac, D. Altmann^v, C. Arguelles^ac, J. Auffenberg^ac, X. Bai^ah,1, M. Baker^ac, S. W. Barwick^y, V. Baum^ae, R. Bay^g, J. J. Beatty^q,r, J. Becker Tjus^j, K.-H. Becker^as, S. BenZvi^ac, P. Berghaus^at, D. Berley^p, E. Bernardini^at, A. Bernhard^ag, D. Z. Besson^aa, G. Binder^h,g, D. Bindig^as, M. Bissok^a, E. Blaufuss^p, J. Blumenthal^a, D. J. Boersma^ar, C. Bohm^ak, D. Bose^am, S. Böser^k, O. Botner^ar, L. Brayeur^m, H.-P. Bretz^at, A. M. Brown^o, R. Bruijn^z, J. Casey^e, M. Casier^m, D. Chirkin^ac, A. Christov^w, B. Christy^p, K. Clark^an, L. Classen^v, F. Clevermann^t, S. Coenders^a, S. Cohen^z, D. F. Cowen^aq,ap, A. H. Cruz Silva^at, M. Danninger^ak, J. Daughhetee^e, J. C. Davis^q, M. Day^ac, C. De Clercq^m, S. De Ridder^x, P. Desiati^ac,∗, K. D. de Vries^m, M. de With^i, T. DeYoung^aq, J. C. Díaz-Vélez^ac,∗∗, M. Dunkman^aq, R. Eagan^aq, B. Eberhardt^ae, B. Eichmann^j, J. Eisch^ac, S. Euler^a, P. A. Evenson^ah, O. Fadiran^ac,∗, A. R. Fazely^f, A. Fedynitch^j, J. Feintzeig^ac, T. Feusels^x, K. Filimonov^g, C. Finley^ak, T. Fischer-Wasels^as, S. Flis^ak, A. Franckowiak^k, K. Frantzen^t, T. Fuchs^t, T. K. Gaisser^ah, J. Gallagher^ab, L. Gerhardt^h,g, L. Gladstone^ac, T. Glüsenkamp^at, A. Goldschmidt^h, G. Golup^m, J. G. Gonzalez^ah, J. A. Goodman^p, D. Góra^v, D. T. Grandmont^u, D. Grant^u, P. Gretskov^a, J. C. Groh^aq, A. Groß^ag, C. Ha^h,g, A. Haj Ismail^x, P. Hallen^a, A. Hallgren^ar, F. Halzen^ac, K. Hanson^l, D. Hebecker^k, D. Heereman^l, D. Heinen^a, K. Helbing^as, R. Hellauer^p, S. Hickford^o, G. C. Hill^b, K. D. Hoffman^p, R. Hoffmann^as, A. Homeier^k, K. Hoshina^ac, F. Huang^aq, W. Huelsnitz^p, P. O. Hulth^ak, K. Hultqvist^ak, S. Hussain^ah, A. Ishihara^n, E. Jacobi^at, J. Jacobsen^ac, K. Jagielski^a, G. S. Japaridze^d, K. Jero^ac, O. Jlelati^x, B. Kaminsky^at, A. Kappes^v, T. Karg^at, A. Karle^ac, M. Kauer^ac, J. L. Kelley^ac, J. Kiryluk^al, J. Kläs^as, S. R. Klein^h,g, J.-H. Köhne^t, G. Kohnen^af, H. Kolanoski^i, L. Köpke^ae, C. Kopper^ac, S. Kopper^as, D. J. Koskinen^s, M. Kowalski^k, M. Krasberg^ac, A. Kriesten^a, K. Krings^a, G. Kroll^ae, J. Kunnen^m, N. Kurahashi^ac, T. Kuwabara^ah, M. Labare^x, H. Landsman^ac, M. J. Larson^ao, M. Lesiak-Bzdak^al, M. Leuermann^a, J. Leute^ag, J. Lünemann^ae, O. Macías^o, J. Madsen^aj, G. Maggi^m, R. Maruyama^ac, K. Mase^n, H. S. Matis^h, F. McNally^ac, K. Meagher^p, M. Merck^ac, G. Merino^ac, T. Meures^l, S. Miarecki^h,g, E. Middell^at, N. Milke^t, J. Miller^m, L. Mohrmann^at, T. Montaruli^w,2, R. Morse^ac, R. Nahnhauer^at, U. Naumann^as, H. Niederhausen^al, S. C. Nowicki^u, D. R. Nygren^h, A. Obertacke^as, S. Odrowski^u, A. Olivas^p, A. Omairat^as, A. O'Murchadha^l, L. Paul^a, J. A. Pepper^ao, C. Pérez de los Heros^ar, C. Pfendner^q, D. Pieloth^t, E. Pinat^l, J. Posselt^as, P. B. Price^g, G. T. Przybylski^h, M. Quinnan^aq, L. Rädel^a, I. Rae^ad,∗, M. Rameez^w, K. Rawlins^c, P. Redl^p, R. Reimann^a, E. Resconi^ag, W. Rhode^t, M. Ribordy^z, M. Richman^p, B. Riedel^ac, J. P. Rodrigues^ac, C. Rott^am, T. Ruhe^t, B. Ruzybayev^ah, D. Ryckbosch^x, S. M. Saba^j, H.-G. Sander^ae, M. Santander^ac, S. Sarkar^s,ai, K. Schatto^ae, F. Scheriau^t, T. Schmidt^p, M. Schmitz^t, S. Schoenen^a, S. Schöneberg^j, A. Schönwald^at, A. Schukraft^a, L. Schulte^k, D. Schultz^ac,∗, O. Schulz^ag, D. Seckel^ah, Y. Sestayo^ag, S. Seunarine^aj, R. Shanidze^at, C. Sheremata^u, M. W. E. Smith^aq, D. Soldin^as, G. M. Spiczak^aj, C. Spiering^at, M. Stamatikos^q,3, T. Stanev^ah, N. A. Stanisha^aq, A. Stasik^k, T. Stezelberger^h, R. G. Stokstad^h, A. Stößl^at, E. A. Strahler^m, R. Ström^ar, N. L. Strotjohann^k, G. W. Sullivan^p, H. Taavola^ar, I. Taboada^e, A. Tamburro^ah, A. Tepe^as, S. Ter-Antonyan^f, G. Tešić^aq, S. Tilav^ah, P. A. Toale^ao, M. N. Tobin^ac, S. Toscano^ac, M. Tselengidou^v, E. Unger^j, M. Usner^k, S. Vallecorsa^w, N. van Eijndhoven^m, A. Van Overloop^x, J. van Santen^ac, M. Vehring^a, M. Voge^k, M. Vraeghe^x, C. Walck^ak, T. Waldenmaier^i, M. Wallraff^a, Ch. Weaver^ac, M. Wellons^ac, C. Wendt^ac, S. Westerhoff^ac, N. Whitehorn^ac, K. Wiebe^ae, C. H. Wiebusch^a, D. R. Williams^ao, H. Wissing^p, M. Wolf^ak, T. R. Wood^u, K. Woschnagg^g, D. L. Xu^ao, X. W. Xu^f, J. P. Yanez^at, G. Yodh^y, S. Yoshida^n, P. Zarzhitsky^ao, J. Ziemann^t, S. Zierke^a, M. Zoll^ak

∗ Corresponding author
∗∗ Principal corresponding author
Email addresses: [email protected] (P. Desiati), [email protected] (J. C. Díaz-Vélez), [email protected] (O. Fadiran), [email protected] (I. Rae), [email protected] (D. Schultz)

1 Physics Department, South Dakota School of Mines and Technology, Rapid City, SD 57701, USA
2 also Sezione INFN, Dipartimento di Fisica, I-70126, Bari, Italy
3 NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA

a III. Physikalisches Institut, RWTH Aachen University, D-52056 Aachen, Germany
b School of Chemistry & Physics, University of Adelaide, Adelaide SA, 5005 Australia
c Dept. of Physics and Astronomy, University of Alaska Anchorage, 3211 Providence Dr., Anchorage, AK 99508, USA
d CTSPS, Clark-Atlanta University, Atlanta, GA 30314, USA
e School of Physics and Center for Relativistic Astrophysics, Georgia Institute of Technology, Atlanta, GA 30332, USA
f Dept. of Physics, Southern University, Baton Rouge, LA 70813, USA
g Dept. of Physics, University of California, Berkeley, CA 94720, USA
h Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
i Institut für Physik, Humboldt-Universität zu Berlin, D-12489 Berlin, Germany
j Fakultät für Physik & Astronomie, Ruhr-Universität Bochum, D-44780 Bochum, Germany
k Physikalisches Institut, Universität Bonn, Nussallee 12, D-53115 Bonn, Germany
l Université Libre de Bruxelles, Science Faculty CP230, B-1050 Brussels, Belgium
m Vrije Universiteit Brussel, Dienst ELEM, B-1050 Brussels, Belgium
n Dept. of Physics, Chiba University, Chiba 263-8522, Japan
o Dept. of Physics and Astronomy, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
p Dept. of Physics, University of Maryland, College Park, MD 20742, USA
q Dept. of Physics and Center for Cosmology and Astro-Particle Physics, Ohio State University, Columbus, OH 43210, USA
r Dept. of Astronomy, Ohio State University, Columbus, OH 43210, USA
s Niels Bohr Institute, University of Copenhagen, DK-2100 Copenhagen, Denmark
t Dept. of Physics, TU Dortmund University, D-44221 Dortmund, Germany
u Dept. of Physics, University of Alberta, Edmonton, Alberta, Canada T6G 2E1
v Erlangen Centre for Astroparticle Physics, Friedrich-Alexander-Universität Erlangen-Nürnberg, D-91058 Erlangen, Germany
w Département de physique nucléaire et corpusculaire, Université de Genève, CH-1211 Genève, Switzerland
x Dept. of Physics and Astronomy, University of Gent, B-9000 Gent, Belgium
y Dept. of Physics and Astronomy, University of California, Irvine, CA 92697, USA
z Laboratory for High Energy Physics, École Polytechnique Fédérale, CH-1015 Lausanne, Switzerland
aa Dept. of Physics and Astronomy, University of Kansas, Lawrence, KS 66045, USA
ab Dept. of Astronomy, University of Wisconsin, Madison, WI 53706, USA
ac Dept. of Physics and Wisconsin IceCube Particle Astrophysics Center, University of Wisconsin, Madison, WI 53706, USA
ad Dept. of Computer Science, University of Wisconsin, Madison, WI 53706, USA
ae Institute of Physics, University of Mainz, Staudinger Weg 7, D-55099 Mainz, Germany
af Université de Mons, 7000 Mons, Belgium
ag T.U. Munich, D-85748 Garching, Germany
ah Bartol Research Institute and Dept. of Physics and Astronomy, University of Delaware, Newark, DE 19716, USA
ai Dept. of Physics, University of Oxford, 1 Keble Road, Oxford OX1 3NP, UK
aj Dept. of Physics, University of Wisconsin, River Falls, WI 54022, USA
ak Oskar Klein Centre and Dept. of Physics, Stockholm University, SE-10691 Stockholm, Sweden
al Dept. of Physics and Astronomy, Stony Brook University, Stony Brook, NY 11794-3800, USA
am Dept. of Physics, Sungkyunkwan University, Suwon 440-746, Korea
an Dept. of Physics, University of Toronto, Toronto, Ontario, Canada, M5S 1A7
ao Dept. of Physics and Astronomy, University of Alabama, Tuscaloosa, AL 35487, USA
ap Dept. of Astronomy and Astrophysics, Pennsylvania State University, University Park, PA 16802, USA
aq Dept. of Physics, Pennsylvania State University, University Park, PA 16802, USA
ar Dept. of Physics and Astronomy, Uppsala University, Box 516, S-75120 Uppsala, Sweden
as Dept. of Physics, University of Wuppertal, D-42119 Wuppertal, Germany
at DESY, D-15735 Zeuthen, Germany
Abstract
IceCube is a one-gigaton instrument located at the geographic South Pole, designed to detect cosmic neutrinos, identify the particle nature of dark matter, and study high-energy neutrinos themselves. Simulation of the IceCube detector and processing of data require a significant amount of computational resources. This paper presents the first detailed description of IceProd, a lightweight distributed management system designed to meet these requirements. It is driven by a central database in order to manage mass production of simulations and analysis of data produced by the IceCube detector. IceProd runs as a separate layer on top of other middleware and can take advantage of a variety of computing resources, including grids and batch systems such as CREAM, HTCondor, and PBS. This is accomplished by a set of dedicated daemons that process job submission in a coordinated fashion through the use of middleware plugins that serve to abstract the details of job submission and job management from the framework.
Keywords:
Data Management, Grid Computing, Monitoring, Distributed Computing

1. Introduction
Large experimental collaborations often need to produce extensive volumes of computationally intensive Monte Carlo simulations and process vast amounts of data. These tasks are usually farmed out to large computing clusters or grids. For such large datasets, it is important to be able to document details associated with each task, such as software versions and parameters like the pseudo-random number generator seeds used for each dataset. Individual members of such collaborations might have access to modest computational resources that need to be coordinated for production. Such computational resources could also potentially be pooled in order to provide a single, more powerful, and more productive system that can be used by the entire collaboration. This article describes the design of a software package meant to address all of these concerns. It provides a simple way to coordinate processing and storage of large datasets by integrating grids and small clusters.
The IceCube detector, shown in Figure 1, is located at the geographic South Pole and was completed at the end of 2010 [1, 2]. It consists of 5160 optical sensors buried between 1450 and 2450 meters below the surface of the South Pole ice sheet and is designed to detect interactions of neutrinos of astrophysical origin [1]. However, it is also sensitive to downward-going highly energetic muons and neutrinos produced in cosmic-ray-induced air showers. IceCube records ∼… cosmic-ray events per year. The cosmic-ray-induced muons outnumber neutrino-induced events (including ones of atmospheric origin) by about 500,000:1. They represent a background for most IceCube analyses and are filtered prior to transfer to the data processing center in the Northern Hemisphere. Filtering at the data collection source is required because of bandwidth limitations on the satellite connection between the detector and the processing location [3]. About 100 GB of data from the IceCube detector is transferred to the main data storage facility daily. In order to facilitate record keeping, the data is divided into runs, and each run is further subdivided into multiple files. The size of each file is dictated by what is considered optimal for storage and access. Each run typically consists of hundreds of files.

Table 1: Off-line data processing runtimes, per run and per year.

    Level    t/run     Total per year
    Level1   2400 h    2.…×10… h
    Level2   9500 h    1.…×10… h
    Level3   15 h      1.…×10… h

Table 2: Runtime of various Monte Carlo simulations of background cosmic-ray shower events and neutrino signal with different energy distributions. The median energy is based on the distribution of events that trigger the detector. The number of events reflects the typical per-year requirements for IceCube analyses.

    Simulation    Med. energy    t/event   Events
    Air showers   1.…×10… GeV    5 ms      ∼…
    Neutrinos     3.…×10… GeV    316 ms    ∼…
    Neutrinos     8.…×10… GeV    53 ms     ∼…

The IceCube Collaboration is comprised of 43 research institutions from Europe, North America, Japan, Australia, and New Zealand. Members of the collaboration have access to 25 different computing clusters and grids in Europe, Japan, Canada, and the U.S. These range from small computer farms of 30 nodes to large grids, such as the European Grid Infrastructure (EGI), the Swedish Grid Initiative (SweGrid), Canada's WestGrid, and the Open Science Grid (OSG), that may each have thousands of computing nodes. The total number of nodes available to IceCube member institutions varies with time, since much of our use is opportunistic and availability depends on the usage by other projects and experiments. In total, IceCube simulation has run on more than 11,000 distinct multicore computing nodes. On average, IceCube simulation production has run concurrently on ∼…,000 cores at any given time since deployment, and it is anticipated to run on ∼…,000 cores simultaneously during upcoming productions.
2. IceProd
The IceProd framework is a software package developed for IceCube with the goal of managing productions across distributed systems and pooling together isolated computing resources that are scattered across member institutions of the Collaboration and beyond. It consists of a central database and a set of daemons that are responsible for the management of grid jobs and data handling through the use of existing grid technology and network protocols.

IceProd makes job scripting easier and sharing productions more efficient. In many ways it is similar to PANDA Grid, the analysis framework for the PANDA experiment [4], in that both tools are distributed systems based on a central database and an interface to local batch systems. Unlike PANDA Grid, which depends heavily on AliEn, the grid middleware for the ALICE experiment [5], and on the ROOT analysis framework [6], IceProd was built in-house with minimal software requirements and is not dependent on any particular middleware or analysis framework. It is designed to run completely in user space with no administrative access, allowing greater flexibility in installation. IceProd also includes a built-in monitoring system with no dependencies on any external tools for this purpose. These properties make IceProd a very lightweight yet powerful tool and give it a greater scope beyond IceCube-specific applications.

The software package includes a set of libraries, executables, and daemons that communicate with the central database and coordinate to share responsibility for the completion of tasks. The details of job submission and management in different grid environments are abstracted through the use of plugin modules that will be discussed in Section 3.2.1.

IceProd can be used to integrate an arbitrary number of sites, including clusters and grids. It is, however, not a replacement for other cluster and grid management tools or any other middleware. Instead, it runs on top of these as a separate layer providing additional functionality. IceProd fills a gap between the user or production manager and the powerful middleware and batch system tools available on computing clusters and grids.

Many of the existing middleware tools, including Condor-C, Globus, and CREAM, make it possible to interface any number of computing clusters into a larger pool. However, most of these tools need to be installed and configured by system administrators, and, in some cases, customization for general purpose applications is not feasible. In contrast to most of these applications, IceProd runs at the user level and does not require administrator privileges. This makes it possible for individual users to build large production systems by pooling small computational resources together.

Security and data integrity are concerns in any software architecture that depends heavily on communication through the Internet. IceProd includes features aimed at minimizing security and data corruption risks. Security and data integrity are addressed in Section 3.8.

The IceProd client provides a graphical user interface (GUI) for configuring simulations and submitting jobs through a "production server." It provides a method for recording all the software versions, physics parameters, system settings, and other steering parameters associated with a job in a central production database. IceProd also includes a web interface for visualization and live monitoring of datasets. Details about the GUI client and a text-based client are discussed in Section 3.5.
3. Design Elements of IceProd
The IceProd software package can be logically divided into the following components or software libraries:

• iceprod-core: a set of modules and libraries of common use throughout IceProd.
• iceprod-server: a collection of daemons and libraries to manage and schedule job submission and monitoring.
• iceprod-modules: a collection of predefined classes that provide an interface between IceProd and an arbitrary task to be performed on a computing node, as defined in Section 3.3.
• iceprod-client: a client (both graphical and text) that can download, edit, and submit dataset steering files to be processed.
• A database that stores configured parameters, libraries (including version information), job information, and performance statistics.
• A web application for monitoring and controlling dataset processing.

These components are described in further detail in the following sections.
The iceprod-core package contains modules and libraries common to all other IceProd packages. These include classes and methods for writing and parsing XML files and transporting data. The classes that define job execution on a host are contained in this package. The iceprod-core package also includes an interpreter (Section 3.1.3) for a simple scripting language that provides some flexibility for parsing XML steering files.
One of the complications of operating on heterogeneous systems is the diversity of architectures, operating systems, and compilers. IceProd uses HTCondor's NMI-Metronome build and test system [7] for building the IceCube software on a variety of platforms and storing the built packages on a server. As part of the management of each job, IceProd submits a Job Execution Pilot (JEP) to the cluster/grid queue. This script determines what platform a job is running on and, after contacting the monitoring server, which software package to download and execute. During runtime, the JEP performs status updates through the monitoring server via remote procedure calls using XML-RPC [8]. This information is updated on the database and is displayed on the monitoring web interface. Upon completion, the JEP removes temporary files and directories created for the job. Depending on the configuration, it will also cache a copy of the software used, making it available for future JEPs. When caching is enabled, an MD5 checksum is performed on the cached software and compared to what is stored on the server in order to avoid using corrupted or outdated software.

Jobs can fail under many circumstances. These failures include failed submissions due to transient system problems and execution failures due to problems with the execution host. At a higher level, errors specific to IceProd include communication problems with the monitoring daemon or the data repository. In order to account for possible transient errors, the design of IceProd includes a set of states through which a job will transition in order to guarantee successful completion of a well-configured job. The state diagram for an IceProd job is depicted in Figure 2.
Figure 2: State diagram for the JEP. Each of the non-error states through which a job passes includes a configurable timeout. The purpose of this timeout is to account for any communication errors that may have prevented a job from setting its status correctly.
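The status updates described above are plain XML-RPC calls from the pilot to the monitoring server. A minimal sketch follows; the server URL and the update_status method name are assumptions for illustration, not IceProd's actual API:

    # Minimal sketch of a JEP-style status update over XML-RPC.
    # The URL and method name are illustrative, not IceProd's real API.
    import xmlrpc.client

    def report_status(server_url, job_id, passkey, status, stats):
        """Send a job status update to the monitoring server."""
        proxy = xmlrpc.client.ServerProxy(server_url)
        # The temporary passkey authenticates this job for its lifetime.
        return proxy.update_status(job_id, passkey, status, stats)

    # Example: report_status("https://monitor.example.org/soapmon", 12345,
    #                        "temporary-key", "PROCESSING", {"cpu_time": 42.0})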
In the context of this document, a dataset is defined as a collection of jobs that share a basic set of scripts and software but whose input parameters depend on the ID of each individual job. A configuration or steering file describes the tasks to be executed for an entire dataset. IceProd steering files are XML documents with a defined schema. These steering files include information about the specific software versions used for each of the sections, known as trays (a term borrowed from IceTray, the C++ software framework used by the IceCube Collaboration [9]). An IceProd tray represents an instance of an environment corresponding to a set of libraries and executables and a chain of configurable modules with corresponding parameters and input files needed for the job. In addition, there is a header section for user-defined parameters and expressions that are globally accessible by different modules.

A limited programming language was developed in order to allow more scripting flexibility that depends on runtime parameters such as job ID, run ID, and dataset ID. This lightweight, embedded, domain-specific language (DSL) allows for a single XML job description to be applied to an entire dataset following an SPMD (single process, multiple data) paradigm. It is powerful enough to give some flexibility but sufficiently restrictive to limit abuse (a sketch of such expression expansion appears after the list of daemons below). Examples of valid expressions include the following:

• $args(): a command line argument passed to the job (such as job ID or dataset ID).
• $steering(): a user-defined variable.
• $system(): a system-specific parameter defined by the server.
• $eval(): string formatting.
• $choice(): random choice of an element from the list.

The evaluation of such expressions is recursive and allows for some complexity. However, there are limitations in place that prevent abuse of this feature. As an example, $eval() statements prohibit such things as loops and import statements that would allow the user to write an entire program within an expression. There is also a limit on the number of recursions in order to prevent closed loops in recursive statements.

The iceprod-server package is comprised of four daemons and their respective libraries:

1. soaptray: an HTTP server that receives client XML-RPC requests for scheduling jobs and steering information, which are then uploaded to the database.
2. soapqueue: a daemon that queries the database for available tasks to be submitted to a particular cluster or grid. This daemon is also responsible for submitting jobs to the cluster or grid through a set of plugin classes.
3. soapmon: a monitoring HTTP server that receives XML-RPC updates from jobs during execution and performs status updates to the database.
4. soapdh: a data handling/garbage collection daemon that removes temporary files and performs any postprocessing tasks.

The prefix soap is used for historical reasons: the original implementation of IceProd relied on SOAP for remote procedure calls. This was replaced by XML-RPC, which has better support in Python.
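Returning to the steering-language expressions above, here is a minimal sketch of how such recursive expansion with a recursion cap might work. The grammar, names, and limit are illustrative assumptions, not IceProd's implementation:

    # Minimal sketch of recursive steering-expression expansion with a
    # recursion cap, in the spirit of the DSL described above.
    import re

    MAX_DEPTH = 10  # guards against closed loops in recursive definitions

    def expand(text, context, depth=0):
        """Expand $steering(name) and $args(name) references in a string."""
        if depth > MAX_DEPTH:
            raise RuntimeError("recursion limit exceeded in steering expression")
        def substitute(match):
            kind, name = match.group(1), match.group(2)
            # Substituted values may themselves contain expressions.
            return expand(str(context[kind][name]), context, depth + 1)
        return re.sub(r"\$(steering|args)\(([^)]*)\)", substitute, text)

    # One XML job description applied across a dataset (SPMD): only the
    # job ID differs between jobs.
    ctx = {"args": {"jobid": "17"}, "steering": {"outdir": "run_$args(jobid)"}}
    print(expand("$steering(outdir)/output.i3", ctx))  # -> run_17/output.i3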
There are two modes of operation. The first is an unmonitored mode in which jobs are simply sent to the queue of a particular system. This mode provides a tool for scheduling jobs that don't need to be recorded and does not require a database. In the second mode, all parameters are stored in a database that also tracks the progress of each job. The soapqueue daemon running at each of the participating sites periodically queries the database to check if any tasks have been assigned to it. It then downloads the steering configuration and submits a given number of jobs to the cluster or grid where it is running. The number of jobs that IceProd maintains in the queue at each site can be configured individually according to the specifics of each cluster, including the size of the cluster and local queuing policies. Figure 3 is a graphical representation that describes the interrelation of these daemons. The state diagram in Figure 4 illustrates the role of the daemons in dataset submission, while Figure 5 illustrates the flow of information through the various protocols.
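In outline, the monitored mode reduces to a periodic polling loop of the kind sketched below; the database accessor and plugin objects are placeholders, not IceProd internals:

    # Conceptual sketch of the monitored-mode polling loop described above.
    import time

    def queue_loop(db, plugin, max_queued=100, poll_interval=300):
        while True:
            # Top up the local queue to the site's configured limit.
            slots = max_queued - db.count_queued_jobs()
            for job in db.fetch_pending(limit=slots):
                plugin.Submit(job)     # hand off to the local batch system
                db.mark_queued(job)
            time.sleep(poll_interval)  # periodic database polling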
Figure 4: State diagram of the queuing algorithm. The iceprod-client sends requests to the soaptray server, which then loads the information to the database (in production mode) or directly submits jobs to the cluster (in unmonitored mode). The soapqueue daemons periodically query the database for pending requests and handle job submission in the local cluster.
In order to abstract the process of job submission from the framework for the various types of systems, IceProd defines a Grid base class that provides an interface for queuing jobs. The Grid base class interface includes a set of methods for queuing and removing jobs, performing status checks, and setting attributes such as job priority, maximum allowed wall time, and job requirements such as disk space and memory usage.

Figure 3: Network diagram of the IceProd system. The IceProd clients and JEPs communicate with iceprod-server modules via XML-RPC. Database calls are restricted to iceprod-server modules. Queueing daemons called soapqueue are installed at each site and periodically query the database for pending job requests. The soapmon server receives monitoring updates from the jobs. An instance of soapdh handles garbage collection and any postprocessing tasks after job completion.
Figure 5: Data flow for job submission, monitoring, and removal. Communication between server instances (labeled "soap*") is handled through a database. Client/server communication and monitoring updates are handled via XML-RPC. Interaction with the grid or cluster is handled through a set of plugin modules and depends on the specifics of the system.

The set of methods defined by this base class includes, but is not limited to:

• WriteConfig: write protocol-specific submission scripts (i.e., a JDL job description file in the case of CREAM or gLite, or a shell script with the appropriate PBS/SGE headers).
• Submit: submit jobs and record the job ID in the local queue.
• CheckJobStatus: query job status from the queue.
• Remove: cancel/abort a job.
• CleanQ: remove any orphan jobs that might be left in the queue.

The actual implementation of these methods is done by a set of plugin subclasses that launch the corresponding commands or library calls, as the case may be. In the case of PBS and SGE, most of these methods result in the appropriate system calls to qsub, qstat, qdel, etc. For other systems, these can be direct library calls through a Python API. IceProd contains a growing library of plugins, including classes for interfacing with batch systems such as HTCondor, PBS, and SGE as well as grid systems like Globus, gLite, EDG, CREAM, and ARC. In addition, one can easily implement user-defined plugins for any new type of system that is not included in this list.
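As an illustration of the plugin pattern, the following sketch shows what a PBS-style subclass might look like. The base-class layout and method signatures are paraphrased from the description above, not taken from the real package:

    # Illustrative sketch of a grid plugin for a PBS-like batch system.
    import subprocess

    class Grid:
        """Stand-in for IceProd's Grid base class."""
        def WriteConfig(self, job, path): raise NotImplementedError
        def Submit(self, job): raise NotImplementedError
        def CheckJobStatus(self, job): raise NotImplementedError
        def Remove(self, job): raise NotImplementedError

    class Pbs(Grid):
        def WriteConfig(self, job, path):
            # Write a shell script with the appropriate PBS headers.
            with open(path, "w") as f:
                f.write("#!/bin/sh\n#PBS -l walltime=%d\n%s\n"
                        % (job.walltime, job.command))

        def Submit(self, job):
            # qsub prints the new queue ID on stdout; record it locally.
            out = subprocess.run(["qsub", job.script], capture_output=True,
                                 text=True, check=True)
            job.queue_id = out.stdout.strip()

        def CheckJobStatus(self, job):
            # qstat exits non-zero if the job is no longer queued.
            return subprocess.run(["qstat", job.queue_id]).returncode == 0

        def Remove(self, job):
            subprocess.run(["qdel", job.queue_id])  # cancel/abort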
The iceprod-modules package is a collection of configurable modules with a common interface. These represent the atomic tasks to be performed as part of the job. They are derived from a base class, IPModule, and provide a standard interface that allows for an arbitrary set of parameters to be configured in the XML document and passed from the IceProd framework. In turn, the module returns a set of statistics in the form of a string-to-float dictionary back to the framework so that it can be recorded in the database and displayed on the monitoring web interface. By default, the base class will report the module's CPU usage, but the user can define any set of values to be reported, such as the number of events that pass a given processing filter. IceProd also includes a library of predefined modules for performing common tasks such as file transfers through GridFTP, tarball manipulation, etc.
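A hedged sketch of a user-defined module under this interface follows; the IPModule base class shown is a stand-in reconstructed from the description above, and the event filter is purely hypothetical:

    # Sketch of a user module in the spirit of iceprod-modules.
    class IPModule:
        def __init__(self):
            self.parameters = {}

        def SetParameter(self, name, value):
            self.parameters[name] = value

        def Execute(self, stats):
            return 0  # subclasses do the real work

    class CountPassingEvents(IPModule):
        """Counts events above a threshold and reports the count back to
        the framework as a string-to-float statistic."""
        def Execute(self, stats):
            threshold = self.parameters.get("threshold", 0.0)
            events = self.parameters.get("events", [])
            # Reported values appear on the monitoring web interface.
            stats["events_passed"] = float(sum(1 for e in events
                                               if e > threshold))
            return 0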
Included in the library of predefined modules is a special module that has two parameters: class and URL. The first is a string that defines the name of an external IceProd module, and the second specifies a URL for a (preferably version-controlled) repository where the external module code can be found. Any other parameters passed to this module are assumed to belong to the referred external module and will be ignored. This allows for the use of user-defined modules without the need to install them at each IceProd site. External modules share the same interface as any other IceProd module. External modules are retrieved and cached by the server at the time of submission. These modules are then included as file dependencies for the jobs, thus preventing the need for jobs to directly access the file code repository. Additional precautions, such as enforcing the use of secure protocols for URLs, must be taken to avoid security risks.
The iceprod-client package contains two applications for interacting with the server and submitting datasets. One is a PyGTK-based GUI (see Figure 6) and the other is a text-based application that can run as a command-line executable or as an interactive shell. Both of these applications allow the user to download, edit, and submit steering configuration files as well as control datasets running on the IceProd-controlled grid. The graphical interface includes drag-and-drop features for moving modules around and provides the user with a list of valid parameters for known modules. Information about parameters for external modules is not included, since these are not known a priori. The interactive shell also allows the user to perform grid management tasks such as starting and stopping a remote server and adding and removing production sites participating in the processing of a dataset. The user can also perform job-specific actions such as suspension and resetting of jobs.
At the time of this writing, the current implementation of IceProd works exclusively with a MySQL database, but all database calls are handled by a database module that abstracts queries from the framework and could easily be replaced by a different relational database. This section describes the relational structure of the IceProd database.

Each dataset is defined by a set of modules and parameters that operate on separate data (single process, multiple data). At the top level of the database structure is the dataset table. The dataset ID is the unique identifier for each dataset, though it is possible to assign a mnemonic string alias. The tables in the IceProd database are logically divided into two distinct classes that could in principle be entirely different databases. The first describes a steering file or dataset configuration (items 1-6 and 9 in the list below) and the second is a job-monitoring database (items 7 and 8). The most important tables are described below.

1. dataset: contains a unique identifier as well as attributes to describe and categorize the dataset, including a textual description.
2. steering-parameter: describes general global variables that can be referenced from any module.
3. meta-project: describes a software environment including libraries and executables.
4. tray: describes a grouping of modules that will execute given the same software environment or metaproject.
5. module: specifies an instance of an IceProd Module class.
6. cparameter: contains all the configured parameters associated with a module.
7. job: describes each job in the queue related to a dataset, including the state and host where the job is executed.
8. task: keeps track of the state of a task in a way similar to what is done in the jobs table. A task represents a subprocess for a job in a process workflow. More details on this will be provided in Section 4.
9. task-rel: describes the hierarchical relationship between tasks.

Figure 6: The iceprod-client uses PyGTK and provides a graphical user interface to IceProd. It is both a graphical editor of XML steering files and an XML-RPC client for dataset submission.

The status updates and statistics are reported by the JEP via XML-RPC to soapmon and stored in the database, and they provide useful information for monitoring the progress of processing datasets and for detecting errors. The updates include status changes and information about the execution host as well as job statistics. soapmon is a multi-threaded server that can run as a stand-alone daemon or as a CGI script within a more robust web server. The data collected from each job are made available for analysis, and patterns can be detected with the aid of visualization tools, as described in the following section.
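For orientation, the following minimal sketch mimics the two table families described above, using sqlite3 for brevity (IceProd itself uses MySQL); the column names are simplified guesses, not the production schema:

    # Illustrative sketch of the configuration and monitoring table families.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dataset (
        dataset_id  INTEGER PRIMARY KEY,
        alias       TEXT,      -- optional mnemonic string alias
        description TEXT
    );
    CREATE TABLE job (
        job_id     INTEGER,
        dataset_id INTEGER REFERENCES dataset(dataset_id),
        status     TEXT,       -- e.g. WAITING, PROCESSING, OK, ERROR
        host       TEXT,       -- execution host, for monitoring
        PRIMARY KEY (job_id, dataset_id)
    );
    """)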
The current web interface for IceProd was designed to work independently of the IceProd framework but utilizes the same database. It is written in PHP and makes use of the CodeIgniter framework [10]. Each of the simulation and data-processing web-monitoring tools provides different views, which include, from top level downward:

• general view: displays all datasets filtered by status, type, grid, etc.
• grid view: shows all datasets running on a particular site.
• dataset view: displays all jobs and accompanying statistics for a given dataset, including every site that it is running on.
• job view: shows each individual job, including the status, job statistics, execution host, and possible errors.

There are some additional views that are applicable only to the processing of real IceCube detector data:

• calendar view: displays a calendar with a color coding that indicates the status of jobs associated with data taken on a particular date.
• day view: shows the status of jobs associated with a given calendar day of data taking.
• run view: displays the status of jobs associated with a particular detector run.

The web interface also provides authenticated users with the functionality to control jobs and datasets. This is done by sending commands to the soaptray daemon using the XML-RPC protocol. Other features of the interface include graphs displaying completion rates, errors, and the number of jobs in various states. Figure 7 shows a screen capture of one of a number of views from the web interface.

Figure 7: A screen capture of the web interface that allows the monitoring of ongoing jobs and datasets. The monitoring web interface has a number of views with different levels of detail. The view shown displays the job progress for active jobs within a dataset. The web interface provides authenticated users with buttons to control datasets and individual jobs.

One aspect of IceProd that is not found in most grid middleware is the built-in collection of user-defined statistical data. Each IPModule instance is passed a string-to-float dictionary to which the JEP can add entries or increment a given value. IceProd collects these data in the central database and displays them on the monitoring page. Statistics are reported individually for each job and collectively for the whole dataset as a sum, average, and standard deviation. The typical types of information collected on IceCube jobs include CPU usage, the number of events meeting predefined physics criteria, and the number of calls to a particular module.

3.8. Security and Data Integrity
When dealing with network applications, one must always be concerned with security and data integrity in order to avoid compromising privacy and the validity of scientific results. Some effort has been made to minimize security risks in the design and implementation of IceProd. This section summarizes the most significant of these. Figure 3 shows the various types of network communication between the client, server, and worker node.

Authentication in IceProd can be handled in two ways: IceProd can authenticate dataset submission against an LDAP server or, if one is not available, authentication is handled by means of direct database authentication. LDAP authentication allows the IceProd administrator to restrict usage to individual users that are responsible for job submissions and are accountable for improper use, so direct database authentication should be disabled whenever LDAP is available. This setup also precludes the need to distribute database passwords and thus prevents users from being able to directly query the database via a MySQL client.

When dealing with databases, one also needs to be concerned about allowing direct access to the database and passing login credentials to jobs running on remote sites. For this reason, all monitoring calls are done via XML-RPC, and the only direct queries are performed by the server, which typically operates behind a firewall on a trusted system. The current web interface does make direct queries to the database; a dedicated read-only account is used for this purpose.
Both soaptray and soapmon can be configured to use SSL certificates in order to encrypt all data communication between client and server. The encryption is done by the HTTPS server with either a self-signed certificate or, preferably, with a certificate signed by a trusted Certificate Authority (CA). This is recommended for client-server communication with soaptray but is generally not considered necessary for monitoring information sent to soapmon by the JEP, as this information is not considered sensitive enough to justify the additional system CPU resources required for encryption.
In order to guarantee data integrity, an MD5 checksum or digest is generated for each file that is transmitted. This information is stored in the database and is checked against the file after transfer. IceProd data transfers support several protocols, but the preference is to rely primarily on GridFTP, which makes use of GSI authentication [11, 12].

An additional security measure is the use of a temporary passkey that is assigned to each job at the time of submission. This passkey is used for authenticating communication between the job and the monitoring server and is only valid for the duration of the job. If the job is reset, this passkey will be changed before a new job is submitted. This prevents stale jobs that might be left running from making monitoring updates after the job has been reassigned.
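A minimal sketch of the digest check described above, using Python's standard hashlib (the helper names are illustrative):

    # Compute an MD5 digest before transfer, store it, compare after.
    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_transfer(local_path, recorded_digest):
        # True if the file arrived uncorrupted.
        return md5sum(local_path) == recorded_digest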
4. Intrajob Parallelism
As described in Section 3.1.2, a single IceProd job consists of a number of trays and modules that execute different parts of the job, for example, a simulation chain. These trays and modules describe a workflow with a set of interdependencies, where the output from some modules and trays is used as input to others. Initial versions of IceProd ran jobs solely as monolithic scripts that executed these modules serially on a single machine. This approach was not very efficient because it did not take advantage of the workflow structure implicit in the job description.

To address this issue, IceProd includes a representation of a job as a directed acyclic graph (DAG) of tasks. Jobs are recharacterized as groups of arbitrary tasks and modules that are defined by users in a job's XML steering file, and each task can depend on any number of other tasks in the job. This workflow is encoded in a DAG, where each vertex represents a single instance of a task to be executed on a computing node, and edges in the graph indicate dependencies between tasks (see Figures 8 and 9). DAG jobs on the cluster are executed by means of HTCondor DAGMan, a workflow manager developed by the HTCondor group at the University of Wisconsin-Madison and included with the HTCondor batch system [13].

For IceCube simulation production, IceProd has utilized the DAG support in two specific cases: improving task-level parallelism and running jobs that utilize graphics processing units (GPUs) for portions of their processing. In addition to problems caused by coarse-grained requirements specifications, monolithic jobs also underutilize cluster resources.

Figure 9: A more complicated DAG in IceProd, with multiple inputs and multiple outputs that are eventually merged into a single output. The vertices in the second level run on computing nodes equipped with GPUs.
Figure 8: A simple DAG in IceProd. This DAG corresponds to a typical IceCube simulation. The two root vertices require standard computing hardware and produce different types of signal. Their output is then combined and processed on GPUs. The output is then used as input for two different detector simulations.

As shown in Figure 8, portions of the workflow within a job are independent; however, if a job is monolithic, these portions will be run serially instead of in parallel. Therefore, although the entire simulation can be parallelized by submitting multiple jobs to different machines, this opportunity for additional parallelism is not exploited by monolithic jobs. Support for breaking a job into discrete tasks is now included in the HTCondor IceProd plugin as described above, and similar features have been developed for the PBS and Sun Grid Engine plugins. This enables faster execution of individual jobs by utilizing more computing nodes; however, one limitation of this implementation is that DAG jobs are restricted to a specific type of cluster, and DAG jobs cannot distribute tasks across multiple sites.

Individual parts of a job may have different system hardware and software requirements. Breaking these up into tasks that run on separate nodes allows for better utilization of resources. The IceCube detector simulation chain is a good example of this scenario, in which tasks are distributed across computing nodes with different hardware resources. Light propagation in the instrumented volume of ice at the South Pole is difficult to model, but recent developments in IceCube's simulation include a much faster approach for simulating direct propagation of photons in the optically complex Antarctic ice [14, 15] by using general-purpose GPUs. This new simulation module is much faster than a CPU-based implementation and more accurate than using parametrization tables [16], but the rest of the simulation requires standard CPUs. When executing an IceProd job monolithically, only one set of cluster requirements can be applied when it is submitted to the cluster. Accordingly, if any part of the job requires use of a GPU, the entire monolithic job must be scheduled on a cluster machine with the appropriate hardware.
As of this writing, IceCube has the potential to access ∼…,000 CPU cores distributed throughout the world, but only a small number of these nodes are equipped with GPU cards. Because the simulation is primarily CPU bound, the pool of GPU-equipped nodes is not sufficient to run all simulation jobs in an acceptable amount of time. Additionally, this would be an inefficient use of resources, since executing the CPU-oriented portions of monolithic jobs would leave the GPU idle for periods of time. In order to solve this problem, the modular design of the IceCube simulation is used to divide the CPU- and GPU-oriented portions of jobs into separate tasks in a DAG. Since each task in a DAG is submitted separately to the cluster, their requirements can be specified independently, and CPU-oriented tasks can be executed on general-purpose grid nodes while photon propagation tasks can be executed on GPU-enabled machines, as depicted in Figure 9.
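Concretely, with HTCondor this split can be expressed as a DAGMan input file whose vertices carry their own submit descriptions, as in the sketch below; the submit-file names are hypothetical:

    # Emit a DAGMan file splitting one job into CPU and GPU tasks,
    # mirroring the DAG of Figure 8.
    lines = [
        "JOB background  background.submit",
        "JOB signal      signal.submit",
        "JOB photons     photons_gpu.submit",
        "JOB detector_a  detector_a.submit",
        "JOB detector_b  detector_b.submit",
        "PARENT background signal CHILD photons",
        "PARENT photons CHILD detector_a detector_b",
    ]
    with open("simulation.dag", "w") as f:
        f.write("\n".join(lines) + "\n")
    # Run with: condor_submit_dag simulation.dag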
5. Applications
IceProd's highly configurable nature lets it serve the needs of many different applications, both inside and beyond the IceCube Collaboration.

5.1. IceCube Monte Carlo Simulation Production

The IceCube simulations are based on a modular software framework called IceTray, in which modules are executed in sequential order. Data is passed between modules in the form of a "frame" object. IceCube simulation modules represent different steps in the generation and propagation of particles, in-ice light propagation, signal detection, and simulation of the electronics and data acquisition hardware. These modules are "chained" together in a single IceTray instance but can also be broken into separate instances configured to write intermediate data files. This allows for breaking up the simulation chain into multiple IceProd tasks in order to optimize the use of resources, as described in Section 4.

For IceCube, Monte Carlo simulations are the most computationally intensive task, which is dominated by the production of background cosmic-ray showers (see Table 2). A typical Monte Carlo simulation lasts on the order of 8 hours but corresponds to only four seconds of detector livetime. In order to generate sufficient statistics, IceCube simulation production needs to make use of available computing resources which are distributed across the world. Table 3 lists all of the sites that have participated in Monte Carlo production.

Table 3: Sites participating in IceCube Monte Carlo production by country.

    Country   Queue type   No. of sites
    Sweden    ARC          2
    Canada    PBS          2
    Germany   SGE          1
    Germany   PBS          3
    Germany   CREAM        4
    Belgium   PBS          2
    USA       HTCondor     4
    USA       PBS          3
    USA       SGE          4
    Japan     HTCondor     1

5.2. Off-line Processing of the IceCube Detector Data

IceProd was designed primarily for managing the production of Monte Carlo simulations for IceCube,
but it has also been successfully adopted for managing the processing and reconstruction of experimental data collected by the detector. The data collected by IceCube, previously described in Section 1.1, must undergo multiple steps of processing, including calibration, multiple-event track reconstructions, and sorting into various analysis channels based on predefined criteria. IceProd has proved to be an ideal framework for processing this large volume of data.

For off-line data processing, the existing features in IceProd are used for job submission, monitoring, data transfer, verification, and error handling. However, in contrast to a Monte Carlo production dataset, where the number of jobs is defined a priori, a configuration for off-line processing of experimental data initiates with an empty dataset of zero jobs. A separate script is then run over the data in order to map a job to a particular file (or group of files) and to generate MD5 checksums for each input file (a sketch of such a script follows at the end of this subsection).

Additional minor modifications were needed in order to support the desired features in off-line processing. In addition to the tables described in Section 3.6, a run table was created to keep records of runs and dates associated with each file and unique to the data storage structure. All data collected during a season (or a one-year cycle) are processed as a single IceProd dataset. This is because, for each IceCube season, all the data collected is processed with the same set of scripts, thus following the SPMD model. A job for such a dataset consists of all the tasks needed to complete the processing of a single data file.

Off-line processing takes advantage of IceProd's built-in system for collecting statistics in order to provide information through the web interface about the number of events that pass different quality selection criteria from completed jobs. Troubleshooting and error correction of jobs during processing is also facilitated by IceProd's real-time feedback system, accessible through the web interface. The data integrity checks discussed in Section 3.8.3 also provide a convenient way to validate data written to storage and to check for errors during the file transfer task.
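As a concrete illustration of the job-to-file mapping step referenced above, a hedged sketch (the dataset API is illustrative, not IceProd's real interface):

    # Start from an empty dataset and add one job per input file, recording
    # an MD5 checksum for later verification (md5sum as sketched in
    # Section 3.8).
    import os

    def populate_dataset(dataset, data_dir, md5sum):
        for name in sorted(os.listdir(data_dir)):
            path = os.path.join(data_dir, name)
            dataset.add_job(input_file=path, md5=md5sum(path))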
5.3. Off-line Event Reconstruction for the HAWC Gamma-Ray Observatory

IceProd's scope is not limited to IceCube. Its design is general enough to be used for other applications. The High-Altitude Water Cherenkov (HAWC) Observatory [17] has recently begun using IceProd for its own off-line event reconstruction and data transfer [18]. HAWC has two main computing centers, one located at the University of Maryland and one at UNAM in Mexico City. Data is collected from the detector in Mexico and then replicated to UMD. The event reconstruction for HAWC is similar in nature to IceCube's data processing. Unlike IceCube's Monte Carlo production, it is I/O bound and better suited for a local cluster rather than a distributed grid environment. The HAWC Collaboration has made important contributions to the development of IceProd and maintained active collaboration with the development team.

Deployment of an IceProd instance is relatively easy. Installation of the software packages is handled through Python's built-in Module Distribution Utilities package. If the intent is to create a stand-alone instance or to start a new grid, the software distribution also includes scripts that define the MySQL tables required for IceProd.

After the software is installed, the server needs to be configured through an INI-style file (a sample configuration is sketched at the end of this section). This configuration file contains three main sections: general queueing options, site-specific system parameters, and job environment. The queueing options are used by the server plugin to help configure submission (e.g., selecting a queue or passing custom directives to the queueing system). System parameters can be used to define the location of a download directory on a shared filesystem or a scratch directory to write temporary files. The job environment can be modified by the server configuration to modify paths appropriately or set other environment variables. If the type of grid/batch system for the new site is already supported, the IceProd instance can be configured to use an existing server plugin, with the appropriate local queuing options. Otherwise, the server plugin must be written, as described in Section 3.2.1.

The ease of adaptation of the framework for the applications discussed in Sections 5.2 and 5.3 illustrates how IceProd can be ported to other projects with minimal customization, which is facilitated by its Python code base. There are a couple of simple ways in which functionality can be extended: one is through the implementation of additional IceProd modules as described in Section 3.3; another is by adding XML-RPC methods to the soapmon module in order to provide a way for jobs to communicate with the server. There are, of course, more intrusive ways of extending functionality, but those require a greater familiarity with the framework.
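A sample of the kind of INI-style configuration described above; the section and option names are illustrative, not the real iceprod-server options:

    # Hedged sketch of a three-section server configuration.
    import configparser, textwrap

    SAMPLE = textwrap.dedent("""\
        [queue]
        batchsys = pbs
        queue = long
        maxjobs = 200

        [system]
        scratchdir = /scratch/iceprod
        downloaddir = /shared/iceprod/cache

        [environment]
        PYTHONPATH = /opt/iceprod/lib
        """)

    cfg = configparser.ConfigParser()
    cfg.read_string(SAMPLE)
    print(cfg["queue"]["maxjobs"])  # -> 200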
6. Performance
Since its initial deployment in 2006, the IceProd framework has been instrumental in generating Monte Carlo simulations for the IceCube Collaboration. The IceCube Monte Carlo production has utilized more than three thousand CPU-core years distributed between collaborating institutions at an increasing rate and has produced nearly two petabytes of data distributed between the two principal storage sites in the U.S. and Germany. Figure 10 shows the relative share of CPU resources contributed towards simulation production. The IceCube IceProd grid has grown from 8 sites to 25 over the years and incorporated new computing resources. Incorporating new sites is trivial, since each set of daemons acts as a volunteer that operates opportunistically on a set of jobs/tasks independent of other sites. There is no central manager that needs to scale with the number of computing sites. The central database is the one component that does need to scale up and can also be a single point of failure. Plans to address this weakness will be discussed in Section 7.

The IceProd framework has also been successfully used for the off-line processing of data collected from the IceCube detector over a 4-year period beginning in the spring of 2010. This corresponds to 500 terabytes of data and over 3×10… event reconstructions. Table 4 summarizes the resources utilized by IceProd for simulation production and off-line processing.

Figure 10: Share of CPU resources contributed by members of the IceCube Collaboration towards simulation production. The relative contributions are integrated over the lifetime of the experiment. The size of the sector reflects both the size of the pool and how long a site has participated in simulation production.

Table 4: IceCube simulation production and off-line processing resource utilization. The production rate has steadily increased since initial deployment. The numbers reflect utilization of owned computing resources and opportunistic ones.

                         Simulation   Off-line
    Computing centers    25           1
    CPU-core time        ∼…           ∼160 yr
    CPU-cores            ∼…×10…       ∼…×10…
    No. of tasks         2.…×10…      …×10…
    Data volume          1.…          …
7. Future Work
Development of IceProd is an ongoing effort. One important area of current development is the implementation of workflow management capabilities like HTCondor's DAGMan, but in a way that is independent of any particular batch system, so that different job subtasks can run on different nodes.

Work is also ongoing on a second generation of IceProd designed to be more robust and flexible. The database will be partially distributed to prevent it from being a single point of failure and to better handle higher loads. Caching of files will be more prevalent and easier to implement to optimize bandwidth usage. The JEP will be made more versatile by executing ordinary scripts in addition to modules. Tasks will become a fundamental part of the design rather than an added feature and will therefore be fully supported throughout the framework. Improvements in the new design are based on lessons learned from the first-generation IceProd and provide a better foundation on which to continue development.
8. Conclusions
IceProd has proven to be very successful for managing IceCube simulation production and data processing across a heterogeneous collection of individual grid sites and batch computing clusters.

With few software dependencies, IceProd can be deployed and administered with little effort. It makes use of existing trusted grid technology and network protocols, which help to minimize security and data integrity concerns that are common to any software that depends heavily on communication through the Internet.

Two important features in the design of this framework are the iceprod-modules and iceprod-server plugins, which allow users to easily extend the functionality of the code. The former provide an interface between the IceProd framework and user scripts and applications. The latter provide an interface that abstracts the details of job submission and management in different grid environments from the framework. IceProd contains a growing library of plugins that support most major grid and batch system protocols.

Though it was originally developed for managing IceCube simulation production, IceProd is general enough for many types of grid applications, and there are plans to make it generally available to the scientific community in the near future.

Acknowledgements
We acknowledge the support from the following agencies: U.S. National Science Foundation-Office of Polar Programs, U.S. National Science Foundation-Physics Division, University of Wisconsin Alumni Research Foundation, the Grid Laboratory Of Wisconsin (GLOW) grid infrastructure at the University of Wisconsin-Madison, the Open Science Grid (OSG) grid infrastructure; U.S. Department of Energy, and National Energy Research Scientific Computing Center, the Louisiana Optical Network Initiative (LONI) grid computing resources; Natural Sciences and Engineering Research Council of Canada, WestGrid and Compute/Calcul Canada; Swedish Research Council, Swedish Polar Research Secretariat, Swedish National Infrastructure for Computing (SNIC), and Knut and Alice Wallenberg Foundation, Sweden; German Ministry for Education and Research (BMBF), Deutsche Forschungsgemeinschaft (DFG), Helmholtz Alliance for Astroparticle Physics (HAP), Research Department of Plasmas with Complex Interactions (Bochum), Germany; Fund for Scientific Research (FNRS-FWO), FWO Odysseus programme, Flanders Institute to encourage scientific and technological research in industry (IWT), Belgian Federal Science Policy Office (Belspo); University of Oxford, United Kingdom; Marsden Fund, New Zealand; Australian Research Council; Japan Society for Promotion of Science (JSPS); the Swiss National Science Foundation (SNSF), Switzerland; National Research Foundation of Korea (NRF); Danish National Research Foundation, Denmark (DNRF). The authors would like to also thank T. Weisgarber from the HAWC collaboration for his contributions to IceProd development.

Appendix
The following is a comprehensive list of sites participating in IceCube Monte Carlo production: Uppsala University (SweGrid), Stockholm University (SweGrid), University of Alberta (WestGrid), TU Dortmund (PHiDO, LIDO), Ruhr-Uni Bochum (LiDO), University of Mainz, Université Libre de Bruxelles/Vrije Universiteit Brussel, Universiteit Gent (Trillian), Southern University (LONI), Pennsylvania State University (LIONX), University of Wisconsin (CHTC, GLOW, NPX4), Open Science Grid, RWTH Aachen University (EGI), Universität Dortmund (EGI), Deutsches Elektronen-Synchrotron (EGI, DESY), Universität Wuppertal (EGI), University of Delaware, Lawrence Berkeley National Laboratory (PDSF, Dirac, Carver), University of Maryland.
References

[1] F. Halzen, IceCube: A Kilometer-Scale Neutrino Observatory at the South Pole, IAU XXV General Assembly, ASP Conference Series 13 (2003) 13-16.
[2] M. G. Aartsen, et al., Search for Galactic PeV gamma rays with the IceCube Neutrino Observatory, Phys. Rev. D 87 (2013) 062002.
[3] F. Halzen, S. R. Klein, IceCube: An Instrument for Neutrino Astronomy, Rev. Sci. Inst. 81 (2010) 081101.
[4] D. Protopopescu, K. Schwarz, PANDA Grid - a Tool for Physics, J. Phys.: Conf. Ser. 331 (2011) 072028.
[5] P. Buncic, A. Peters, P. Saiz, The AliEn system, status and perspectives, eConf C0303241 (2003) MOAT004. arXiv:cs/0306067.
[6] R. Brun, F. Rademakers, ROOT - An Object Oriented Data Analysis Framework, Nucl. Inst. and Meth. in Phys. Res. A 389 (1997) 81-86.
[7] A. Pavlo, P. Couvares, R. Gietzel, A. Karp, I. D. Alderman, M. Livny, The NMI build and test laboratory: Continuous integration framework for distributed computing software, Proc. USENIX/SAGE Large Installation System Administration Conference (2006) 263-273.
[8] D. Winer, XML/RPC Specification (1999).
[9] T. DeYoung, IceTray: A software framework for IceCube, Int. Conf. on Comp. in High-Energy Phys. and Nucl. Phys. (CHEP2004) (2005) 463-466.
[10] R. Ellis and the ExpressionEngine Development Team, CodeIgniter User Guide, http://codeigniter.com (online manual).
[11] W. Allcock, et al., GridFTP: Protocol extensions to FTP for the Grid (April 2003).
[12] The Globus Security Team, Globus Toolkit Version 4 Grid Security Infrastructure: A Standards Perspective (2005).
[13] P. Couvares, T. Kosar, A. Roy, J. Weber, K. Wenger, Workflow Management in Condor, in: Workflows for e-Science, Part III (2007) 357-375.
[14] M. G. Aartsen, et al., Measurement of South Pole ice transparency with the IceCube LED calibration system, Nucl. Inst. and Meth. in Phys. Res. A 711 (2013) 73-89.
[15] D. Chirkin, Study of South Pole ice transparency with IceCube flashers, Proc. International Cosmic Ray Conference 4 (2011) 161.
[16] D. Chirkin, Photon tracking with GPUs in IceCube, Nucl. Inst. and Meth. in Phys. Res. A 725 (2013) 141-143.
[17] A. U. Abeysekara, et al., On the sensitivity of the HAWC observatory to gamma-ray bursts, Astropart. Phys. 35 (2012) 641-650.
[18] T. Weisgarber, Production Reconstruction, HAWC Collaboration Meeting, May 2013 (unpublished).