An Efficient OpenMP Runtime System for Hierarchical Architectures
Samuel Thibault, François Broquedis, Brice Goglin, Raymond Namyst, Pierre-André Wacrenier
INRIA Futurs - LaBRI
351 cours de la Libération
33405 Talence cedex, France
{thibault,goglin,namyst,wacrenier}@labri.fr, francois.broquedis@etu.u-bordeaux1.fr

Abstract. Exploiting the full computational power of ever deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. The resulting preliminary performance evaluation shows an important improvement of the speedup on a typical NAS OpenMP benchmark application.

Keywords: OpenMP, Nested Parallelism, Hierarchical Thread Scheduling, Bubbles, Multi-Core, NUMA, SMP.

1 Introduction

The emergence of deeply hierarchical architectures based on multithreaded multi-core chips and NUMA machines raises the need for a careful distribution of threads and data. Indeed, cache misses and NUMA penalties become more and more important with the complexity of the machine, making these constraints as important as the parallelization itself. They require new programming models and new tools to make the most out of the underlying architectures.

As quoted by Gao et al. [GSS+
00] or explicitly binding thread groups to processors [Zha06].

Nevertheless, there exist some very good implementations of OpenMP nested parallelism, such as Omni/ST [TTSY00]. Such implementations are typically based on a fine-grain thread management system that uses a fixed number of threads to execute an arbitrary number of filaments, as done in the Cilk multithreaded system [FLR98]. The performance obtained on symmetrical multiprocessors is often very good, mostly because many tasks can be executed sequentially with almost no overhead when all processors are busy. However, since these systems provide no support for attaching high-level information such as memory affinity to the generated tasks, many applications will actually achieve poor performance on hierarchical, NUMA multiprocessors.

One could probably enhance these OpenMP implementations to use affinity information extracted by the compiler so as to better distribute tasks or threads over the underlying processors. However, since only the underlying thread scheduler has complete control over scheduling events such as processor idleness, blocking system calls or even thread preemption, this information could only be used to influence task allocation at the beginning of each parallel section.

We believe that a better solution is to transmit the information extracted by the compiler to the underlying thread scheduler in a persistent manner, and that only a tight integration of application-provided meta-data and architecture description can let the underlying scheduler take appropriate decisions during the whole application run time. In other words, one can see this configurable scheduler framework as a domain-specific language enabling scientists to transfer their knowledge to the runtime system [GSS+

Fig. 2.
Scheduling of bubbles and threads on the runqueues of a hierarchical machine.

late threads with a high level of abstraction by deciding the placement of bubbles on runqueues, or even by temporarily putting some bubbles aside (by defining their own runqueues that the basic Self-Scheduler will not look at).

3.2 Generating Bubbles Out of OpenMP Parallel Sections

The GNU OpenMP compiler [gom], GOMP, is based on an extension of the GCC 4.2 compiler that converts OpenMP pragmas into threading calls. The creation of threads and teams is actually delegated to a shared library, libgomp, which contains an abstraction layer to map OpenMP threads onto various thread implementations. This way, any application previously compiled by GOMP may be relinked against an implementation of libgomp on another thread type and transparently work the same.

We used this flexible design to develop MaGOMP, a port of GOMP on top of the Marcel threading library in which BubbleSched is implemented. To do so, a Marcel adaptation of libgomp threads has been added to the existing abstraction layer. We rely on Marcel's fully POSIX-compatible interface to guarantee that MaGOMP will behave as well as GOMP on pthreads. It then becomes possible to run any existing OpenMP application on top of BubbleSched by simply relinking it.

Once Marcel threads are created, they basically behave by default as native pthreads, without any notion of team or memory affinity. BubbleSched hooks have therefore been added in the libgomp code to provide information about thread teams by creating bubbles accordingly. When a thread encounters a nested parallel region and becomes the master of a new team, it creates a bubble within the bubble currently holding it. Then, it moves itself into this new bubble and creates the team's slave threads inside it. Finally, the master dispatches the workload across the team. Once their work is completed, the slave threads die while the master destroys the bubble and returns to its original team.
As shown in Figure 3, only a few lines of code are needed to associate a nested team hierarchy with a bubble hierarchy.

void gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
                      struct gomp_work_share *work_share)
{
  struct gomp_team *team;
  team = new_team (nthreads, work_share);
  ... /* Pack 'fn' and 'data' into the 'start_data' structure */
  if (nthreads > 1 && team->prev_ts.team != NULL) {
    /* Nested parallelism: insert a Marcel bubble. */
    marcel_bubble_t *holder = marcel_bubble_holding_task (thr->tid);
    marcel_bubble_init (&team->bubble);
    marcel_bubble_insertbubble (holder, &team->bubble);
    marcel_bubble_inserttask (&team->bubble, thr->tid);
    marcel_attr_setinitbubble (&gomp_thread_attr, &team->bubble);
  }
  for (int i = 1; i < nthreads; i++) {
    pthread_create (NULL, &gomp_thread_attr, gomp_thread_start, start_data);
    ...
  }
}

Fig. 3. One-to-one correspondence between Marcel's bubble and GOMP's team hierarchies.

3.3 A Scheduling Strategy Suited to OpenMP Nested Parallelism

The challenge of scheduling OpenMP nested parallelism resides in how to distribute the threads over the machine. This must be done in a way that favors both a good balancing of the computation and, in the case of multi-core and NUMA machines, a good affinity of threads, for better cache effects and for avoiding the remote memory access penalty.

To achieve this, we wrote a bubble spread scheduler consisting of a mere recursive function that uses the API described in Section 3.1 to greedily distribute the hierarchy of bubbles and threads over the hierarchy of runqueues. This function takes an array of "current entities" and an array of "current runqueues". It first sorts the list of current entities according to their computation load (either explicitly specified by the programmer, or inferred from the number of threads).
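The sort-and-place step can be sketched as follows. This is our own simplified illustration: the function name, types, and fixed-size arrays are invented for the sketch and do not reflect BubbleSched's real API, which recurses over bubbles and runqueue levels rather than flat load values.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTITIES 64  /* arbitrary cap, for the sketch only */

/* Assign each of the n entity loads to one of nrq runqueues:
 * take the biggest remaining entity and put it on the currently
 * least-loaded runqueue (the greedy bi-partition heuristic).
 * assign[i] receives the runqueue chosen for entity i,
 * rq_load[j] the resulting total load of runqueue j. */
void greedy_spread(const long *load, size_t n,
                   long *rq_load, size_t nrq, size_t *assign) {
    int placed[MAX_ENTITIES] = {0};
    assert(n <= MAX_ENTITIES && nrq > 0);
    for (size_t j = 0; j < nrq; j++)
        rq_load[j] = 0;
    for (size_t step = 0; step < n; step++) {
        /* biggest entity not placed yet */
        size_t big = n;
        for (size_t i = 0; i < n; i++)
            if (!placed[i] && (big == n || load[i] > load[big]))
                big = i;
        /* least-loaded runqueue so far */
        size_t target = 0;
        for (size_t j = 1; j < nrq; j++)
            if (rq_load[j] < rq_load[target])
                target = j;
        placed[big] = 1;
        assign[big] = target;
        rq_load[target] += load[big];
    }
}
```

On BT-MZ-like irregular loads such as {25, 1, 2, 3, 4, 5} spread over two runqueues, the big entity monopolizes one queue (load 25) while the five small ones share the other (load 15); the real scheduler then recurses into each runqueue's children, possibly "exploding" an oversized bubble first, as described below.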
It then greedily distributes them over the current runqueues by repeatedly assigning the biggest entity to the least-loaded runqueue, and recurses separately into the sub-runqueues of each current runqueue. (This comes from the greedy algorithm typically used for solving the bi-partition problem.)

It often happens that an entity is much more loaded than the others (because it is a very deep hierarchical bubble, for instance). In such a case, a recursive call is made with this bubble "exploded": the bubble is removed from the current entities and replaced by its content (bubbles and threads). How big a bubble needs to be before being exploded is a parameter that has to be tuned. It may depend on the application itself, since it is a trade-off between respecting affinities (by pulling intact bubbles as low as possible) and balancing the computation load (by exploding bubbles into smaller entities for a better distribution).

This way, affinities between threads are taken into account: since they are by construction in the same bubble hierarchy, the threads of the same external loop iterations are spread together on the same NUMA node or the same multi-core chip, for instance, thus reducing the NUMA penalty and enhancing cache effects. Other distribution algorithms are of course possible; we are currently working on an even more affinity-oriented algorithm that avoids bubble explosions as much as possible.

4 Performance Evaluation

We validated our approach by experimenting with the BT-MZ application, one of the 3D fluid dynamics simulations of the Multi-Zone version of the NAS Parallel Benchmarks 3.2 [dWJ03]. In this version, the mesh is split in the x and y directions into zones.
Parallelization is then performed twice: the simulation can be performed rather independently on the different zones, with periodic face data exchange (coarse-grain outer parallelization), and the simulation itself can be parallelized along the z axis (fine-grain inner parallelization). As opposed to the other Multi-Zone NAS Parallel Benchmarks, the BT-MZ case is interesting because the zones have very irregular sizes: the biggest zone can be as much as 25 times bigger than the smallest one. In the original SMP source code, the outer parallelization is achieved by using Unix processes, while the inner parallelization is achieved through a static OpenMP parallel section. Similarly to Ayguade et al. [AGMJ04], we modified this to use two nested static OpenMP parallel sections instead, using n_o * n_i threads.

The target machine holds 8 dual-core AMD Opteron 1.8 GHz NUMA chips (hence a total of 16 cores) and 64 GB of memory. The measured NUMA factor between chips (i.e., the ratio between remote and local memory access times) varies from 1.06 (for neighbor chips) to 1.4 (for the most distant chips). We used the class A problem, composed of 16 zones. We tested both the Native POSIX Thread Library of Linux 2.6 (NPTL) and the Marcel library, before trying the Marcel library with our bubble spread scheduler.

We first tried non-nested approaches by enabling only the outer or only the inner parallelism, as shown in Figure 4:

Outer parallelism (n_o * 1): The zones themselves are distributed among the processors. Due to the irregular sizes of the zones and the fact that there are only a few of them, the computation is not well balanced, and hence the achieved speedup is limited by the biggest zones.

Inner parallelism (1 * n_i): The simulations of the zones are performed sequentially, but each simulation itself is parallelized along the z axis. The computation
Fig. 4. Outer parallelism (n_o * 1) and inner parallelism (1 * n_i).

balance is excellent, but the nature of the simulation introduces a lot of inter-processor data exchange. Particularly because of the NUMA nature of the machine, the speedup is hence limited to 7.

So as to get the benefits of both approaches (locality and balance), we then tried the nested approach by enabling both parallelisms. As discussed by Duran et al. [DGC05], the achieved speedup depends on the relative numbers of threads created by the inner and the outer parallelisms, so we tried up to 16 threads for the outer parallelism (i.e. the maximum, since there are 16 zones) and up to 8 threads for the inner parallelism. The results are shown in Figure 5. The nested speedup achieved by NPTL is very limited (up to 6.28), and is actually worse than what pure inner parallelism can achieve (almost 7, not represented here because the "Inner" axis was truncated to 8 for better readability). Marcel behaves better (probably because its user threads are more lightweight), but it still cannot achieve a speedup better than 8.16. This is due to the fact that neither NPTL nor Marcel takes thread affinities into account, leading to very frequent remote memory accesses, cache invalidations, etc. We hence used our bubble strategy to distribute the bubble hierarchy corresponding to the nested OpenMP parallelism over the whole machine, and could then achieve better results (a speedup of up to 10.2). This improvement is due to the fact that the bubble strategy carefully distributes the computation over the machine (on the runqueues) in an affinity-aware way (following the bubble hierarchy).

It must be noted that to achieve the latter result, the only addition we had to make to the BT-MZ source code is the following line:

    call marcel_set_load(int(proc_zone_size(myid+1)))
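This load hint matters because, at the outer level, a zone is an indivisible unit of work: whatever the processor count, the elapsed time is at least the biggest zone's work, which bounds the outer-only speedup by the total work divided by the biggest zone's work. A toy calculation of that bound (the zone sizes here are invented for illustration, keeping only the paper's up-to-25:1 imbalance):

```c
#include <stddef.h>

/* Upper bound on the speedup of outer-only parallelism:
 * total work over the work of the single biggest zone. */
double outer_speedup_bound(const long *zone_work, size_t nzones) {
    long total = 0, biggest = 0;
    for (size_t i = 0; i < nzones; i++) {
        total += zone_work[i];
        if (zone_work[i] > biggest)
            biggest = zone_work[i];
    }
    return biggest ? (double)total / (double)biggest : 0.0;
}
```

For six zones of work {25, 15, 10, 5, 3, 2}, the bound is 60/25 = 2.4 regardless of the number of processors, which is why per-zone load information (or nested inner parallelism) is needed to go further.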
(a) NPTL (best speedup: 6.28)
(b) Marcel (best speedup: 8.16)
(c) Bubbles (best speedup: 10.2)

Fig. 5. Nested parallelism.

that explicitly tells the bubble spread scheduler the load of each zone, so that the zones can be properly distributed over the machine. Such a clue (which could even be dynamic) is very precious for permitting the runtime environment to make appropriate decisions, and should probably be added as an extension to the OpenMP standard. Another way to achieve load balancing would be to create more or fewer threads according to the zone size [AGMJ04]. This is, however, a bit more difficult to implement than the mere function call above.

5 Conclusion

In this paper, we discussed the importance of establishing a persistent cooperation between an OpenMP compiler and the underlying runtime system for achieving high performance on today's multi-core NUMA machines. We showed how we extended the GNU OpenMP implementation, GOMP, to make use of the flexible Marcel thread library and its high-level bubble abstraction. This permitted us to implement a scheduling strategy suited to OpenMP nested parallelism. The preliminary results show that it improves the achieved speedup a lot.

At this point, we are enhancing our implementation so as to introduce just-in-time allocation of Marcel threads, bringing in the notion of "ghost" threads that would only be allocated when first run by a processor. In the short term, we will keep validating the obtained results on several other OpenMP applications, such as Ondes3D (French Atomic Energy Commission). We will compare the resulting performance with that of other OpenMP compilers and runtimes.
We also intend to develop an extension to the OpenMP standard that will provide programmers with the ability to specify load information in their applications, which the runtime will be able to use to efficiently distribute threads.

In the longer run, we plan to extract memory affinity properties at the compiler level, and to express them by injecting the gathered information into more accurate attributes within the bubble abstraction. These properties may be obtained either thanks to new directives à la UPC [CDC+
99] or be computed automatically via static analysis [SGDA05]. (The UPC forall statement adds to the traditional for statement a fourth field that describes the affinity under which to execute the loop.) For instance, this kind of information is helpful for a bubble-spreading scheduler, as we want to determine which bubbles to explode, or to decide whether or not it is interesting to apply a migrate-on-next-touch mechanism [NLRH06] upon a scheduler decision. All these extensions will rely on a memory management library that attaches information to bubbles according to memory affinity, so that, when migrating bubbles, the runtime system can migrate not only the threads but also the corresponding data.

Software Availability

Marcel and BubbleSched are available for download within the PM2 distribution at http://runtime.futurs.inria.fr/Runtime/logiciels.html under the GPL license. The MaGOMP port of libgomp will be available soon and may be obtained on demand in the meantime.

References

AGMJ04. Eduard Ayguade, Marc Gonzalez, Xavier Martorell, and Gabriele Jost. Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications. In 18th International Parallel and Distributed Processing Symposium (IPDPS), 2004.

BS05. R. Blikberg and T. Sørevik. Load balancing and OpenMP implementation of nested parallelism. Parallel Computing, 31(10-12):984-998, October 2005.

CDC+
99. W. Carlson, J.M. Draper, D.E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and Language Specification. Technical Report CCS-TR-99-157, George Mason University, May 1999.

DGC05. Alejandro Duran, Marc Gonzàles, and Julita Corbalán. Automatic Thread Distribution for Nested Parallelism in OpenMP. In 19th ACM International Conference on Supercomputing, pages 121-130, Cambridge, MA, USA, June 2005.

dWJ03. Rob F. Van der Wijngaart and Haoqiang Jin. NAS Parallel Benchmarks, Multi-Zone Versions. Technical Report NAS-03-010, NASA Advanced Supercomputing (NAS) Division, 2003.

FLR98. Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998. http://theory.lcs.mit.edu/pub/cilk/cilk5.ps.gz.

gom. GOMP - An OpenMP implementation for GCC. http://gcc.gnu.org/projects/gomp/.

GOM+
00. Marc Gonzalez, Jose Oliver, Xavier Martorell, Eduard Ayguade, Jesus Labarta, and Nacho Navarro. OpenMP Extensions for Thread Groups and Their Run-Time Support. In Languages and Compilers for Parallel Computing. Springer Verlag, 2000.

GSS+