A Proof of Concept for Optimizing Task Parallelism by Locality Queues
Markus Wittmann and Georg Hager
Erlangen Regional Computing Center (RRZE), University of Erlangen-Nuremberg, Martensstraße 1, 91058 Erlangen, Germany
[email protected]
Abstract.
Task parallelism as employed by the OpenMP task construct, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance bottlenecks if locality of data access is important, which is typically the case for memory-bound code on ccNUMA systems. We present a programming technique which ameliorates the adverse effects of dynamic task distribution by sorting tasks into locality queues, each of which is preferably processed by threads that belong to the same locality domain. Dynamic scheduling is fully preserved inside each domain, and is preferred over possible load imbalance even if non-local access is required. The effectiveness of the approach is demonstrated using a blocked six-point stencil solver as a toy model.
Dynamic scheduling is the preferred method for solving load imbalance problems with shared-memory parallelization. The OpenMP standard provides the dynamic and guided schedules for worksharing loops, and the task construct for task-based parallelism. If the additional overhead for dynamic scheduling is negligible for the application at hand, these approaches are ideal on UMA (Uniform Memory Access) systems like the now outdated single-core multi-socket SMP nodes, or multi-core chips with "isotropic" caches, i.e. where each cache level is either exclusive to one core or shared among all cores on a chip.

If, however, data access locality is important for performance and scalability, dynamic scheduling of any kind is usually ruled out. The most prominent example is memory-bound applications on ccNUMA-type systems: Even if memory pages are carefully placed into the NUMA domains by parallel first-touch initialization, peak memory bandwidth cannot be reached if cores access the NUMA domains in a random manner, although this is still far better than serial initialization if there are no other choices.

As a simple benchmark we choose a 3D six-point Jacobi solver with constant coefficients, as recently studied extensively by Datta et al. [1]. The site update function,
\[
F^{t+1}(i,j,k) = c_0\, F^{t}(i,j,k) + c_1 \left[ F^{t}(i-1,j,k) + F^{t}(i+1,j,k) + F^{t}(i,j-1,k) + F^{t}(i,j+1,k) + F^{t}(i,j,k-1) + F^{t}(i,j,k+1) \right],
\]
is evaluated for each lattice site in a 3D loop nest, and the memory layout is chosen so that i is the fast index. Each site update (in the following called "LUP") incurs seven loads and one store, of which, at large problem sizes, one load and one store cause main memory traffic if suitable spatial blocking is applied. This leads to a code balance of 3 bytes per flop (assuming that non-temporal stores are not used, so that a store miss causes a cache line read for ownership), so the code is clearly memory-bound on all current cache-based architectures. In what follows we use a problem size of 600³ sites and a block size of 600 × 10 × 10 (d_i × d_j × d_k) sites, unless otherwise noted. The update loop nest iterates over all blocks in turn, and standard worksharing loop parallelization is done over the outer (k-blocking) loop; initialization is performed via the identical scheme.
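For concreteness, a compilable sketch of the blocked sweep and its worksharing parallelization is given below. It is a reconstruction under assumptions rather than the authors' original listing: the array names F_old/F_new, the coefficient values, and the reduced grid extent are hypothetical; the block-coordinate signature of jacobi_sweep_block() follows the listing further below, the parallelization over the outermost (k-blocking) loop follows the text, and schedule(runtime) stands in for the static/dynamic choice discussed there.

    /* Hypothetical sizes: the measurements use a 600^3 grid with 600 x 10 x 10
     * blocks; a smaller grid is used here only to keep the sketch compact.   */
    enum { NI = 60, NJ = 60, NK = 60, DI = NI, DJ = 10, DK = 10 };

    static double F_old[NK+2][NJ+2][NI+2], F_new[NK+2][NJ+2][NI+2]; /* two time levels, ghost layer */
    static const double c0 = 0.4, c1 = 0.1;   /* arbitrary constant coefficients */

    /* One Jacobi sweep over block (ib, jb, kb): update every site in it. */
    void jacobi_sweep_block(int ib, int jb, int kb)
    {
        for (int k = kb*DK + 1; k <= (kb+1)*DK; ++k)
            for (int j = jb*DJ + 1; j <= (jb+1)*DJ; ++j)
                for (int i = ib*DI + 1; i <= (ib+1)*DI; ++i)   /* i is the fast index */
                    F_new[k][j][i] = c0 * F_old[k][j][i]
                        + c1 * (F_old[k][j][i-1] + F_old[k][j][i+1]
                              + F_old[k][j-1][i] + F_old[k][j+1][i]
                              + F_old[k-1][j][i] + F_old[k+1][j][i]);
    }

    /* Worksharing parallelization over the outer (k-blocking) loop only;
     * the schedule (static vs. dynamic) is selected at run time.          */
    void jacobi_sweep(void)
    {
        #pragma omp parallel for schedule(runtime)
        for (int kb = 0; kb < NK/DK; ++kb)
            for (int jb = 0; jb < NJ/DJ; ++jb)
                for (int ib = 0; ib < NI/DI; ++ib)   /* number_of_i_blocks == 1 */
                    jacobi_sweep_block(ib, jb, kb);
    }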
Note that with the standard i block size equal to the extent of the lattice in that direction (which is required to make best use of the hardware prefetching capabilities on the processors used), number_of_i_blocks is equal to one. The jacobi_sweep_block() function performs one Jacobi sweep, i.e. one update per lattice site, over all sites in the block determined by its parameters. In case of dynamic loop scheduling, parallel first-touch initialization is done via a static,1 (round-robin) loop schedule, whereas plain static scheduling is used otherwise.

Fig. 1 illustrates the impact of dynamic scheduling on the solver's scalability for two benchmark systems:

– "Dunnington" is an EA Intel UMA server system ("Caneland" chipset) with four six-core Intel Xeon 7460 processor chips at 2.66 GHz. Data for this system is included for illustrative purposes.

– "Opteron" is an HP DL585 G5 ccNUMA server with four locality domains (LDs), one per socket, and four dual-core AMD Opteron 8220 SE processor chips at 2.8 GHz. The processors are connected via HyperTransport 1.0 GHz links (4 GB/s per direction).

Both systems ran current Linux kernels, and the Intel C++ compiler, version 11.0.074, was used for all benchmarks. As we are mostly interested in scalability data, detailed performance characteristics for the platforms under consideration are omitted. One should note, however, that there is significant optimization potential in stencil codes like the one we use here. The block size we have chosen is close to optimal from a data transfer perspective [1], and the performance data obtained is in line with STREAM COPY scalability on the same systems.
Fig. 1.
Performance in million lattice site updates per second (MLUP/s) versus number of sockets for an OpenMP-parallel 6-point stencil solver on a UMA (solid bars) and a ccNUMA (hatched bars) system, using standard worksharing loop parallelization (see text for details). "parInit" denotes parallel first-touch data initialization. The "LD0" data set was obtained by forcing all memory pages to reside in locality domain 0.

In all cases, the number of threads per socket was chosen so that the local memory bus could be saturated, which happens to be the case for two threads on both platforms. Core-thread affinity was enforced by overloading the pthread_create() call and using sched_setaffinity() in turn for each newly created thread, skipping the OpenMP shepherd thread(s).

The performance results and parallel efficiency numbers in Figs. 1 and 2 show that dynamic scheduling has negligible impact on the UMA system for the chosen problem and block sizes, although one may of course expect a noticeable performance hit if OpenMP startup and scheduling overhead become dominant with small data sets and block sizes. If static scheduling and proper parallel initialization are employed, the ccNUMA system shows similar characteristics as the UMA node (bandwidth scalability with four sockets is not ideal for current Opteron-based systems because of protocol overhead). Dynamic scheduling, however, has a catastrophic effect on parallel efficiency, as remote accesses and contention on the HyperTransport network dominate performance. Moreover, there is a noticeable statistical performance variation because access patterns vary from sweep to sweep. Nevertheless, due to the round-robin page placement as described above, there is still some parallelism available. If we force all memory pages to be mapped in the first locality domain (LD0), all parallelism is lost and performance is limited by the single-domain memory bandwidth, which is already saturated with the two local threads. Execution is hence serialized.
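For reference, the parallel efficiency ε plotted in Fig. 2 is presumably the usual performance-based definition; the normalization to single-socket performance is an assumption, as it is not spelled out in the text:
\[
\varepsilon(N) = \frac{P(N)}{N \, P(1)},
\]
where P(N) denotes the aggregate performance in MLUP/s obtained with N sockets.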
Fig. 2.
Parallel efficiency ε versus number of sockets for the same data sets as in Fig. 1.

In contrast to standard worksharing loop parallelization, tasking in OpenMP requires splitting the problem into a number of work "packages", called tasks, each of which must be submitted to an internal pool by the omp task directive. For the Jacobi solver we define one task to be a single block of the size specified above. This is in contrast to loop worksharing, where one parallelized outer loop iteration consisted of all blocks with the same kb coordinate. Using the collapse clause on the parallel loop nest would correct this discrepancy, but there is no further insight gained. The tasks are produced (submitted) by a single thread in a 3D loop nest and consumed by all threads.
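A minimal sketch of this single-producer submission is given below, under stated assumptions: the block counts and the plain kji loop order follow the naming used in the text, while using jacobi_sweep_block() directly as the task body and spelling out the firstprivate clause are choices made here for clarity, not necessarily the authors' exact code.

    #pragma omp parallel
    {
        #pragma omp single      /* a single thread produces (submits) the tasks */
        {
            for (int kb = 0; kb < number_of_k_blocks; ++kb)     /* kji order */
                for (int jb = 0; jb < number_of_j_blocks; ++jb)
                    for (int ib = 0; ib < number_of_i_blocks; ++ib) {
                        #pragma omp task firstprivate(ib, jb, kb)  /* one task per block */
                        jacobi_sweep_block(ib, jb, kb);
                    }
        }
    }   /* all tasks have completed at the implicit barrier here */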
Table 1.
Performance of the Jacobi solver in MLUP/s on 8 threads of the Opteron (ccNUMA) platform with two different schedulings for the block initialization loop (rows), comparing standard tasking and tasking with locality queues, and the two possible choices for the submit loop nest order.

                         tasking                 tasking + queues
    submit order       kji        jki          kji        jki
    static init       . ± .      . ± .        . ± .      . ± .
    static,1 init     . ± .      . ± .        . ± .      . ± .

Submitting the tasks in parallel is possible but did not make any difference in the parameter ranges considered here (but see below for the impact of submission order). There is still a choice as to how first-touch initialization should be performed, so we compare static and static,1 scheduling for loop initialization. Table 1 shows performance results (columns labeled "tasking") on the ccNUMA platform, using eight threads and two different loop orderings for the submission loop. With static initialization and the standard kji loop order as shown above, performance is roughly equal to the results obtained in the previous section with LD0 enforcement ("Opteron static LD0" data in Figs. 1 and 2), i.e. execution is serialized. Performance is slightly improved to roughly the 4-thread dynamic scheduling level ("Opteron dynamic parInit" data for two sockets in Fig. 1) by choosing the jki loop order for submission. Going to static,1 initialization, the 8-thread dynamic scheduling performance can be matched.

The large impact of submit and initialization orders can be explained by assuming that there is only a limited number of "queued", i.e. unprocessed, tasks allowed at any time. In the course of executing the submission loop, this limit is reached very quickly and the submitting thread is used for processing tasks for some time. From our measurements, the limit is equal to 257 tasks with the compiler used. One ib-jb layer of our grid comprises 60 tasks (with the chosen problem and block sizes), and 60 layers are available, which amounts to 3600 tasks in total. With static scheduling, one block of 257 consecutive tasks is usually associated with a single locality domain (rarely two), hence the serialization of memory access. Choosing static,1 scheduling for initialization, each consecutive layer is placed into a different locality domain, but 257 tasks comprise only slightly more than four layers. Assuming that the order of execution for tasks resembles static,1 loop workshare scheduling because each thread is served a task in turn, the number of LDs to be accessed in parallel is limited (although it is hard to predict the actual level of parallelism). Finally, by choosing the jki submission loop order, consecutive tasks cycle through locality domains, and parallelism is as expected from dynamic loop scheduling. The statistical variation is surprisingly small, however.

These observations document that it is nontrivial to employ tasking on ccNUMA systems and reach at least the performance level of standard dynamic loop scheduling. In the next section we will demonstrate how task scheduling under locality constraints can be optimized by substituting part of the OpenMP scheduler by user program logic. Each task, which equals one lattice block (or tile) in our case, is associated with a C++ object (of type
BlockObject) and equipped with an integer locality variable which denotes the locality domain it was placed in upon initialization. The submission loop now takes the following form.
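The sketch below is a reconstruction under stated assumptions rather than the authors' exact listing: the BlockObject member names (ib, jb, kb, locality) and the blocks container holding the pre-initialized block objects are hypothetical, while queues, Enqueue(), process_block_from_queue(), and the single-producer structure follow the surrounding text.

    /* Assumed layout of a block descriptor (member names are hypothetical). */
    struct BlockObject {
        int ib, jb, kb;   /* block coordinates                                  */
        int locality;     /* LD the block's pages were placed in at first touch */
    };

    #pragma omp parallel
    {
        #pragma omp single      /* one producer thread submits all tasks */
        {
            for (int kb = 0; kb < number_of_k_blocks; ++kb)
                for (int jb = 0; jb < number_of_j_blocks; ++jb)
                    for (int ib = 0; ib < number_of_i_blocks; ++ib) {
                        BlockObject &b = blocks[kb][jb][ib];   /* set up at init time    */
                        queues[b.locality].Enqueue(b);         /* sort into its LD queue */
                        #pragma omp task                       /* generic consumer task  */
                        process_block_from_queue(queues);
                    }
        }
    }   /* all tasks are completed at the barrier ending the parallel region */

Note that the task itself carries no reference to a particular block: any thread that executes a task simply pulls the next waiting block from a locality queue, preferably the one of its own domain.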
The queues object is basically a std::vector<> of std::queue<> objects, each associated with one locality domain, and each protected from concurrent access with an OpenMP lock. Calling the Enqueue() method of a queue appends a block object to it. As shown above, blocks are sorted into those locality queues according to their respective locality variables. One OpenMP task, executed by the process_block_from_queue() function, now consists of two parts:

1. Figuring out which LD the executing thread belongs to
2. Dequeuing the oldest waiting block in the locality queue belonging to this domain and calling jacobi_sweep_block() for it

If the local queue of a thread is empty, other queues are tried in a spin loop until a block is found:

    void process_block_from_queue(LocalityQueues &queues) {
        // ...
        bool found = false;
        BlockObject p;
        int ld = ld_ID[omp_get_thread_num()];   /* LD of the executing thread        */
        while (!found) {
            found = queues[ld].Dequeue(p);      /* try the (local) queue first       */
            if (!found) {
                ld = (ld + 1) % num_of_lds;     /* queue empty: try the next LD      */
            }
        }
        jacobi_sweep_block(p.ib, p.jb, p.kb);
    }
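The lock-protected queue type behind this interface is not shown; the following is a minimal sketch assuming a std::queue guarded by an OpenMP lock, with the Enqueue()/Dequeue() signatures used above. The class name LockedQueue and the internal layout are assumptions.

    #include <queue>
    #include <omp.h>

    class LockedQueue {
        std::queue<BlockObject> q_;   /* FIFO of waiting blocks             */
        omp_lock_t lock_;             /* serializes concurrent queue access */
    public:
        LockedQueue()  { omp_init_lock(&lock_); }
        ~LockedQueue() { omp_destroy_lock(&lock_); }

        void Enqueue(const BlockObject &b) {
            omp_set_lock(&lock_);
            q_.push(b);
            omp_unset_lock(&lock_);
        }

        /* Returns false (without blocking) if no block is currently waiting. */
        bool Dequeue(BlockObject &b) {
            omp_set_lock(&lock_);
            bool found = !q_.empty();
            if (found) { b = q_.front(); q_.pop(); }
            omp_unset_lock(&lock_);
            return found;
        }
    };

The LocalityQueues container then holds one such queue per locality domain; since OpenMP lock objects should not be copied, the queues are best constructed in place rather than copied into a std::vector<>.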
The global ld_ID vector must be preset with a correct thread-to-LD mapping. Enqueuing and dequeuing blocks from a queue is made thread-safe by protecting the queue with an OpenMP lock.

Note that scanning other queues if a thread's local queue is empty gives load balancing priority over strict access locality, which may or may not be desirable depending on the application. The team of threads in one locality domain shares one queue, so scheduling is purely dynamic in that case.

There is actually a "race condition" with the described scheme, because it is possible that some task executes a block just queued before the corresponding task is actually submitted. This is, however, not a problem, because the number of submitted tasks is always equal to the number of queued blocks, and no task will ever be left waiting for new blocks forever.

Table 1 shows performance results for eight threads with four locality queues under the columns labeled "tasking + queues". For static initialization and the kji submission order, the limited overall number of waiting tasks has the same consequences as with plain tasking (see Sect. 2.1). In this case, although the queuing mechanism is in effect, a single queue holds most of the tasks at any point in time. All threads are served from this queue and thus execute in a single LD. However, using the alternate jki submission order or static,1 initialization, all queues are fed in parallel and threads can be served tasks from their local queue. Performance then comes close to static scheduling within a 10 % margin (see Fig. 1).

One should note that a similar effect could have been achieved with nested parallelism, using one thread per LD in the outer parallel region and several threads (one per core) in the nested region. However, we believe our approach to be more powerful and easier to apply if properly wrapped into C++ logic that takes care of affinity and work distribution. Moreover, the thread pooling strategies employed by many current compilers inhibit sensible affinity mechanisms when using nested OpenMP constructs.
We have demonstrated how locality queues can be employed to optimize parallel memory access on ccNUMA systems when OpenMP tasking is used. Locality queues substitute the uncontrolled, dynamic task scheduling by a static and a dynamic part. The latter is mostly restricted to the cores in one NUMA domain, providing full dynamic load balancing on the LD level. Scheduling between domains is static, but load balancing can be given priority over strictly local access. The larger the number of threads per LD, the more dynamic the task distribution, so our scheme will get more interesting in view of future many-core processors.

Using locality queues to optimize real applications on ccNUMA systems is under investigation, as well as a study of the possible additional overhead of the method for "small" tasks. The same methodology may also be applied for optimizing parallel_while constructs in Intel Threading Building Blocks (TBB, [3]). Further potential, not restricted to ccNUMA architectures, lies in the possibility of implementing temporal blocking (doing more than one time step on a block to reduce pressure on the memory subsystem [4]) by associating one locality queue with a number of cores that share a cache level. As an advantage over static temporal blocking, no frequent global barriers would be required.
Acknowledgments
Fruitful discussions with Gerhard Wellein and Thomas Zeiser are gratefully acknowledged.
References
1. K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick: Stencil Computation Optimization and Autotuning on State-of-the-art Multicore Architectures. Proceedings of SC08, Austin, TX, Nov. 15–21, 2008.
2. H. Stengel: C++ programming techniques for High Performance Computing on systems with non-uniform memory access using OpenMP. Diploma thesis, University of Applied Sciences Nuremberg, 2007 (in German).