Benchmarking the computing resources at the Instituto de Astrofísica de Canarias
Nicola Caon*, Antonio J. Dorta, Juan Carlos Trelles Arjona

Abstract
The aim of this study is to characterize the computing resources used by researchers at the Instituto de Astrofísica de Canarias (IAC). Since there is a huge demand for computing time, and we use tools such as HTCondor to implement High Throughput Computing (HTC) across all available PCs, it is essential for us to assess quantitatively, using objective parameters, the performance of our computing nodes. To achieve this, we ran a set of benchmark tests on a number of different desktop and laptop PC models among those used at our institution. In particular, we ran the "Polyhedron Fortran Benchmarks" suite using three different compilers: the GNU Fortran Compiler, the Intel Fortran Compiler and the PGI Fortran Compiler; execution times were then normalized to the reference values published by Polyhedron. The same tests were run multiple times on the same PC, and on 3 to 5 PCs of the same model (whenever possible), to check the repeatability and consistency of the results. We found that, in general, execution times for a given PC model are consistent to within an uncertainty of about 10%, and show a gain in CPU speed of a factor of about 3 between the oldest PCs used at the IAC (7-8 years old) and the newest ones.
Keywords
Benchmarks — Supercomputing

Instituto de Astrofísica de Canarias, E-38205 La Laguna, Tenerife, Spain
Universidad de La Laguna, Dpto. Astrofísica, E-38206 La Laguna, Tenerife, Spain
*Corresponding author: [email protected]
1. The IAC computing resources
At the Instituto de Astrofísica de Canarias there are about 250 desktop PCs with Linux installed, used by scientists and engineers. These PCs cover a wide span of models and ages, from 8-year-old Dell Optiplex machines to recently bought "NausicaA" models. There are also several more powerful computers, mainly rack-mounted but also a few desktop models, dedicated to large, demanding jobs that exceed the capabilities of a "regular" (consumer) PC, such as massive data reduction and analysis, simulations, and other CPU-intensive jobs.

While it is clear that newer PCs are faster and more efficient than older models, so far this was more a perception than solid evidence supported by data. A user may observe that her office-mate's latest-model PC is more responsive, or faster when executing some tasks, but cannot say by how much, nor can she estimate the gain in time obtained by running her applications on the office-mate's PC instead of her own older one. This could be a key factor when preparing the remote execution of a program using the available HTC tools, since it is possible to specify a list of preferences or ranks that will be used to choose the target machines on which the code will be executed.

For these reasons we decided to run a set of benchmarking tests on all the different available desktop and rack models, as well as a few laptops, as part of a month-long "Proyecto Práctica de Empresa" (Student Internship) carried out at the IAC by a 4th-year student of Astronomy (JCTA).
2. Running the benchmark tools
After considering a number of possible benchmarks, we finally selected the "Polyhedron Fortran Benchmarks" suite [1], since it is one of the most comprehensive sets of benchmarks that matches our requirements: it provides tools to automatically run the tests, compute the CPU time used by each executable, and validate and save the results in tabular form in a simple text file. The tests were run in November/December 2014, with some additional runs in January-March 2015.
This benchmark consists of 17 independent Fortran programs. While the suite was devised to compare the performance of 10 different Fortran compilers on the same machine, it can equally well be used to compare the performance of the same compiler on a variety of hardware.

The way it operates is controlled by a couple of parameter files: a general one listing the tests to be run, the desired accuracy to be achieved, the minimum and maximum number of runs for each test, and the maximum execution time permitted (a detailed explanation of how the test suite works is provided in [2]); and one tailored to each compiler, with the specific command and flags to be used.

We made only minor changes to the parameter files provided by Polyhedron: we increased the tolerance on the execution time from 0.1 to 0.2, set the maximum number of runs for each test to 20, and limited the maximum execution time to 4000 seconds. These changes do not affect the reliability of our results, but shorten the overall time needed to run the whole suite of tests (typically from two to seven hours). The limits on the maximum number of runs and the maximum execution time (per test) prevented jobs from getting out of control and using up the CPU for hours or even days (which happened a few times, especially with the Intel compiler).

To check the consistency of the results, we: a) ran the benchmarks multiple times on the same machine, and b) ran the tests on 3 to 5 different PCs of the same desktop model. However, we typically have only a single unit of the more powerful machines dedicated to CPU-intensive jobs, so for those we could only perform consistency check a). The same limitation applies to laptops.

Table 1 lists the hardware on which the Polyhedron test suite was run, together with the main data about their CPU and RAM. Table 2 lists the compilers installed at the IAC and used for the benchmark tests. All the computers on which the tests were run had Linux Fedora 19 installed, and all had exactly the same versions of the three compilers.
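As a concrete illustration, a single Polyhedron test could be compiled and timed by hand with the gfortran flags of Table 2. The following is a minimal sketch only: the file name aermod.f90 is illustrative, and in practice the suite's own scripts automate these steps.

    # Compile one Polyhedron test with the gfortran flags from Table 2
    gfortran -march=native -ffast-math -funroll-loops -O3 -o aermod aermod.f90

    # Time the run; GNU time reports wall-clock and CPU time
    /usr/bin/time -v ./aermod > aermod.out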
Ideally, the benchmark tests should be run on a dedicated machine, with no other processes running, in order to minimize the CPU load and guarantee that the results reflect the best performance the hardware can deliver. However, we could not afford to take PCs away from their users, so the tests were run on production PCs, i.e. PCs used (generally during the day) by their owners. We therefore had a twofold problem: on one side, we did not want our tests to interfere with the normal usage of the PCs; on the other side, we did not want to run the tests on a PC with a high CPU load, which would obviously affect the results.

HTCondor provides a nice and efficient solution to this problem. HTCondor is a distributed job scheduler developed by the University of Wisconsin-Madison, which allows users to run their applications on other users' machines when those are not being used (for details about HTCondor, see [4, 5]).

We first made an initial selection of machines on which to run the tests, choosing whenever possible, among all the available desktop models, those we knew were less heavily used. This information was gathered using ConGUSTo [6], a tool that provides real-time and historical usage data about the machines forming the HTCondor pool. Based on these data, the final list comprised about 60 PCs.

In order not to run two or more benchmark instances on the same PC (HTCondor tries to use all the available "slots", that is, CPU cores), we restricted our jobs to run only on "slot1". The list of target machines was included in the requirements of our HTCondor submit files.

HTCondor only runs its jobs on those PCs that are not being used and that have a CPU load below a certain threshold (so with no CPU- or memory-heavy background jobs). If the CPU load rises, or the user goes back to working interactively, the HTCondor job is killed and rescheduled for the next available opportunity (on any of the target machines).

Thus, the first thing we did was to submit via HTCondor a batch of benchmark jobs (each job being the complete suite of tests for a specific compiler) to all the target machines. Once the required number of benchmark executions was obtained for a specific PC, it was removed from the target list and a new batch of HTCondor jobs was submitted. A few iterations were generally sufficient to complete the benchmark runs on most PCs, while for a few of them it was necessary to prepare and submit HTCondor jobs restricting the target list to that specific PC.

On all PCs used for the benchmarks, Hyper-Threading was disabled. Moreover, as all the benchmarks run sequentially on a single core of the CPU, the results do not depend on how many cores the CPU has.

Listing 1 is an example of the HTCondor submit files we used, with comments about the various settings and commands.
Listing 1. HTCondor submit file
    # Number of identical jobs to queue
    N = 5
    # Unique identifier built from the HTCondor cluster and process numbers
    ID = $(Cluster).$(Process)
    FNAME = condor_pb11

    # Per-job standard output, error and log files
    output = $(FNAME).$(ID).out
    error  = $(FNAME).$(ID).err
    log    = $(FNAME).$(Cluster).log

    universe = vanilla
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    getenv = True

    # Location of the Polyhedron benchmark suite
    BASE = .../pb11/lin
    transfer_input_files = $(BASE)
    transfer_output_files = \
        $(BASE)/source/bldtimes.txt, \
        $(BASE)/source/exesizes.txt, \
        $(BASE)/source/runtimes.txt, \
        $(BASE)/source/pgi118linSB.sum
    transfer_output_remaps = \
        "runtimes.txt = runtimes--$$(NAME)--$(ID).txt ; ..."

    # Run only on the first slot of each of the selected target machines
    slot = substr(toLower(Target.Name), 0, 6)
    machines = "mach1, mach2, mach3, ..., machN"
    requirements = ($(slot) == "slot1@") && \
                   stringListMember(UtsnameNodename, $(machines))

    executable = $(BASE)/Condor/bnchmrk-Polyhedron-pgi.bash
    arguments = ""
    queue $(N)
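Submitting and monitoring such a batch uses the standard HTCondor commands; the following is a minimal sketch, with a hypothetical submit-file name:

    # Submit one batch of benchmark jobs to the pool
    condor_submit benchmark-pgi.sub

    # Check which jobs are still idle or running
    condor_q

    # List jobs that finished (JobStatus 4 = completed)
    condor_history -constraint 'JobStatus == 4'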
PC model | Type | Date | Processor Type | Cache | RAM | Number
Dell Precision WS T7400 | Desktop | late 2007 | Intel(R) Xeon(R) CPU X5472 @ 3.00GHz | 6144 KB | 32 GB | 2
Dell Optiplex 740 (a) | Desktop | late 2007 | AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ | 1024 KB | 4 GB | 5
Dell Optiplex 740 (a) | Desktop | early 2008 | AMD Athlon(tm) 64 X2 Dual Core Processor 5600+ | 512 KB | 4 GB | 4
Dell Optiplex 740 (a) | Desktop | mid 2009 | AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ | 1024 KB | 4 GB | 5
Dell Optiplex 780 (b) | Desktop | late 2009 | Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz | 3072 KB | 8 GB | 8
Dell Optiplex 780 (b) | Desktop | late 2009 | Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz | 3072 KB | 8 GB | 4
Dell Precision WS T3500 | Desktop | early 2011 | Intel(R) Xeon(R) CPU W3565 @ 3.20GHz | 8192 KB | 12 GB | 1
Dell Precision WS T3600 | Desktop | mid 2012 | Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz | 12288 KB | 32 GB | 1
Dell Precision WS T5600 | Desktop | mid 2012 | Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz | 20480 KB | 128 GB | 1
Dell Optiplex 7010 | Desktop | mid 2012 | Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz | 6144 KB | 8 GB | 5
ALDA+ | Desktop | mid 2014 | Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz | 8192 KB | 8 GB | 5
NausicaA | Desktop | mid 2014 | Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz | 8192 KB | 8 GB | 4
Dell Precision WS T5400 | Rack | late 2007 | Intel(R) Xeon(R) CPU X5450 @ 3.00GHz | 6144 KB | 8 GB | 1
Dell Precision WS-690 | Rack | early 2008 | Intel(R) Xeon(R) CPU X5355 @ 2.66GHz | 4096 KB | 32 GB | 2
Dell PowerEdge-2970 | Rack | late 2008 | Quad-Core AMD Opteron(tm) Processor 2358 SE | 512 KB | 64 GB | 1
Dell PowerEdge R410 | Rack | late 2009 | Intel(R) Xeon(R) CPU X5570 @ 2.93GHz | 8192 KB | 16 GB | 1
Tecal RH5885 V3 | Rack | mid 2014 | Intel(R) Xeon(R) CPU E7-4820 v2 @ 2.00GHz | 16384 KB | 256 GB | 1
Dell Latitude E6500 | Laptop | mid 2008 | Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz | 6144 KB | 4 GB | 1
Dell Latitude E4200 | Laptop | late 2008 | Intel(R) Core(TM)2 Duo CPU U9600 @ 1.60GHz | 3072 KB | 3 GB | 1
Dell Latitude E4300 | Laptop | late 2008 | Intel(R) Core(TM)2 Duo CPU P9300 @ 2.26GHz | 6144 KB | 4 GB | 1
Dell Latitude E6320 | Laptop | early 2011 | Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz | 6144 KB | 4 GB | 1
Dell Latitude E6520 | Laptop | early 2011 | Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz | 4096 KB | 8 GB | 2
Lenovo L440 | Laptop | late 2014 | Intel(R) Core(TM) i5-4200M CPU @ 2.50GHz | 3072 KB | 8 GB | 1
Table 1. Column "Date" shows the production date of that specific model, as retrieved from the corresponding Dell page [3] after supplying the service tag. For some models the Dell website does not provide useful information, so we list the approximate time the PC model was bought (this also applies to non-Dell models). Within each "Type" group, the list is ordered chronologically. (a): the Dell Optiplex 740 model actually came with three CPU variants; (b): the Dell Optiplex 780 model actually came with two CPU variants. Column "Number" shows the number of PCs of that model on which the benchmark tests were run.

compiler | version | compilation flags
gfortran | GNU Fortran (GCC) 4.8.3 20140911 | -march=native -ffast-math -funroll-loops -O3
Intel | ifort (IFORT) 14.0.2 20140120 | -O3 -fast -parallel -ipo -no-prec-div
PGI | pgf90 14.10-0 64-bit target on x86-64 Linux | -tp penryn -V -fastsse -Munroll=n:4 -Mipa=fast,inline

Table 2. Compiler versions and flags used for the benchmark tests. Our compiler versions are slightly different from those used in the Polyhedron suite (listed as gfortran 4.9, Intel 15.0, PGI 14.9).
A minimum of 3, and up to 9, runs per compiler and per PC were obtained in order to check the consistency and repeatability of the results. We found that, for the same machine, the execution times varied within a few percentage points. We then took the minimum value for each test as our final result for each PC and compiler. Figure 1 illustrates a few examples of how the execution times vary across the various runs on the same PC.

In a few cases some runs produced anomalous results, with execution times much higher than expected (often only for some specific tests). For some reason, this happened more frequently with the Intel compiler. Those runs were excluded, and new runs were submitted where necessary to meet the minimum number of runs we had set.

The next step was to compare these results across all the PCs of the same model. Figure 2 shows six examples where the benchmark run-times are compared with the best result, that is, the minimum run-time, for all PCs of the same model. With a few exceptions, the run-times agree to better than 20%. Again, we took the minimum values as representative of that PC model, which should be a good approximation to the theoretical limit that can be achieved on that kind of hardware.
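As an illustration of this selection step, a minimal sketch in bash, assuming hypothetical per-run files runtimes--*.txt that contain one "test run-time" pair per line:

    # Group lines by test name, sort times numerically within each group,
    # then keep only the first (i.e. fastest) entry per test
    cat runtimes--*.txt | sort -k1,1 -k2,2g | awk '!seen[$1]++' > best-times.txt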
At this stage, for each PC model we have the best (that is, shortest across all PCs of that model) benchmark times for each test and for each compiler. As already mentioned, for most of the powerful machines and for the laptops we have only one instance of each specific model available, so no comparisons with other machines of the same type were possible; the benchmark data for these machines therefore carry a larger uncertainty.

To provide a homogeneous set of comparisons, we took as reference the benchmark times published on [7], which were measured on a "machine with a Core i5 2500k 3.30GHz processor, running at stock speed, with 16 GBytes memory, and running 64-bit Scientific Linux 6 (a near-clone of Red Hat Enterprise Linux 6)". The set of Figures 3 shows, for each PC model, the benchmark run-times normalized to the values listed on the above website.
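A minimal sketch of this normalization, and of the Polyhedron-style geometric mean used in the next step, assuming hypothetical two-column "test run-time" files sorted by test name:

    # Divide our best times by the Polyhedron reference values
    # (join matches the two files on the test name in column 1)
    join best-times.txt polyhedron-ref.txt \
        | awk '{ printf "%-12s %.2f\n", $1, $2 / $3 }' > normalized.txt

    # Geometric mean of the times in column 2: exp of the mean of the logs
    awk '{ s += log($2); n++ } END { printf "%.2f\n", exp(s / n) }' best-times.txt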
Following the scheme implemented by Polyhedron, we computed, for each PC model and each compiler, the geometric mean of the 17 execution times. The geometric means are then compared with those published by Polyhedron, as shown in Figure 5. As test No. 2 (aermod) failed with the PGI Fortran compiler in our benchmarks, we computed the geometric mean excluding this test, and the Polyhedron geometric mean for PGI was recomputed excluding test No. 2 as well.

The graph clearly shows that CPUs in recent models have become about three times faster than those of 7-8 years ago. On the other hand, laptops are in general about as fast as desktop PCs of the same age, with the fastest laptop only slightly slower than the fastest desktop PC. Overall, there are no significant speed differences between the three compilers we tested, except in the "Dell Optiplex 740" desktop family (with AMD processors), where the Intel compiler was about 20% slower than the PGI and gfortran compilers.

The results of this study will be especially useful to HTCondor users, as they make it possible to restrict the list of target machines to those with the shortest execution times, which maximizes the probability that submitted jobs are completed and not evicted, for instance, by a user logging in on the machine. Furthermore, the information gathered here will help plan the upgrade of our computing nodes. Finally, the benchmark results will allow users to quickly assess the performance of laptops compared to desktop or rack PCs, and to determine whether their laptops can satisfy their computing needs.
Acknowledgments
We thank our colleagues in the IT Department who helped us with many small problems related to HTCondor, especially issues with firewalls, HTCondor start-up files, etc. A big thank-you is due to Ángel de Vicente, who was the first to install and manage HTCondor at our Institute, and who is largely responsible for the great popularity it enjoys, in terms of usage, among our researchers. We also thank Ubay Dorta, Justo Luna and Cristina Zurita, who kindly ran the benchmark suite on their laptops.
References

[1] "Polyhedron Fortran Benchmarks", Polyhedron Software.
[2] Polyhedron Software, documentation of the Polyhedron benchmark suite.
[3] Dell support website (product information retrieved by service tag).
[4] HTCondor, http://research.cs.wisc.edu/htcondor/
[5] Jim Basney, Miron Livny, and Todd Tannenbaum, "High Throughput Computing with Condor", HPCU news, Volume 1(2), June 1997.
[6] Antonio Dorta, Nicola Caon, and Jorge A. Pérez Prieto, 2014, https://arxiv.org/abs/1412.5847
[7] Polyhedron Fortran compiler comparisons, benchmark results page.

Figure 1.
This panel shows, for six different individual PCs, the scatter in run-times after normalizing them to the best (minimum) value for each benchmark test and compiler. Green points refer to the gfortran compiler, red points to ifort, and blue points to pgf90. Benchmark test No. 2 (aermod) fails when compiled with pgf90 and is thus omitted.

Figure 2.
This panel shows, for six different PC models, the scatter in run-times after normalizing them to the best (minimum) value achieved for each model and benchmark test. The color scheme is the same as in Figure 1.

Figure 3.
The best (minimum) run-times obtained in our tests for desktop and rack PCs, normalized to the values published by Polyhedron for a "Sandy Bridge Intel Core i5 2500k" CPU. Again, green, red and blue indicate the gfortran, Intel Fortran and PGI Fortran compilers, respectively.

Figure 4.
The best (minimum) run-times obtained in our tests for laptops, normalized to the value published by Polyhedron for a "Sandy Bridge Intel Core i5 2500k" CPU. Due to license issues, only the gfortran tests were run on laptops.

Figure 5. Geometric means of the benchmark execution times for each PC model and compiler, compared with the values published by Polyhedron.