TEMANEJO - a debugger for task based parallel programming models
Steffen Brinkmann, José Gracia, Christoph Niethammer, Rainer Keller
High Performance Computing Centre Stuttgart (HLRS), University of Stuttgart, Germany
Abstract.
We present the program Temanejo, a debugger for task based parallelisation models such as StarSs. The challenge in debugging StarSs applications lies in the fact that tasks are scheduled at runtime, i.e. dynamically in accordance with the data dependencies between them. Our tool assists the programmer in the debugging process by visualising the task dependency graph and allowing the programmer to control the scheduling of tasks. The toolset consists of the library Ayudame, which communicates with the StarSs runtime on one side, and the debugger Temanejo, which communicates with Ayudame on the other side. Temanejo provides a graphical user interface with which the application can be analysed and controlled.
Keywords.
Debugging, HPC, task parallelisation, StarSs, Temanejo
Introduction
Task based programming models have broadened the landscape of parallelisation paradigms. In these models, parallelism is not dictated on the level of operations. Rather, the programmer indicates data dependencies between parts of the code, which become tasks and are then scheduled by an execution framework at runtime. This way the program can “react” to internal irregularities, e.g. differences in the execution time of each task, and to external conditions, such as hybrid architectures.
One family of these programming models are the StarSs models [1,2]. The “Ss” stands for superscalar and indicates that these models aim at scaling on modern supercomputers consisting of millions of processing units. The “Star” in the name is a wildcard which stands for the different implementations of this model, such as CellSs, GPUSs, GridSs, OMPSs and SMPSs [2], among others. They differ in the supported programming languages, in whether functions (respectively subroutines) or other blocks of code are parallelised, and in whether the target platform is, for instance, a grid or a node of CPUs or GPUs.
Debugging a task parallel program differs from debugging an otherwise parallelised program, as the parallelisation is determined by data dependencies instead of explicit scheduling of tasks and synchronisation between them. Therefore, the development of new debugging tools capable of assisting the programmer with debugging task parallel programs is a crucial condition for effectively exploiting the possibilities of these models.
We present a debugging toolset for task parallel programs consisting of two components: a library called Ayudame, acting as a thin communication layer and enabling the debugger to receive information about the program and send commands back to it, and the actual debugger Temanejo, which enables the user to view the extracted information and issue commands via a highly interactive graphical user interface.
For this work, we used the StarSs implementations SMPSs and OMPSs. Nevertheless, the basic concepts of how to debug a task parallel program are the same regardless of the specific framework.
In section 1, we elaborate on the idea and further implications of task based parallelism. In section 2, we present different strategies and actions a debugger should provide and discuss how these features were implemented in the presented debugger Temanejo. Finally, in section 3, we summarise the results and give an outlook on future enhancements to the debugger.
1. Task based parallelism
In contrast to other task based parallel programming models, the focus in the StarSs family lies on the data dependencies between parts of the program which are defined as tasks. In SMPSs, for example, the programmer marks functions or subroutines as “potentially parallel” using a special pragma line placed before the function declaration in C. In this line, #pragma css is the identifier of the pragma (css stands for cell superscalar and is a historical remnant of the first versions of SMPSs), the keyword task tells the precompiler that the following function declaration is to be treated as an SMPSs task, and the keywords input, inout and output specify the data dependencies for this task.
These pragmas are read by a compiler wrapper (smpss-cc) which embeds the program code into a runtime framework. This framework initialises a previously defined number of threads and starts the application. One of the threads is the master thread, which creates all tasks and takes care of proper initialisation and finalisation of the program. Moreover, the master thread executes all code outside of tasks. The tasks are executed by the other threads, called worker threads.
During runtime the SMPSs framework assigns all calls to functions marked by pragmas to a thread for execution depending on the specified dependencies. To achieve this, a task must be in one of four states (see figure 1): “not queued”, “queued”, “running” or “finished”. When tasks are created they are not queued in case they have pending dependencies, i.e. they read (input) from a memory address which has been declared output (or inout) by another task at an earlier time. Otherwise, the task is queued directly after creation.
When all tasks on which a task depends are finished, the task is queued. A queued task can be run at any time by an idling thread. When a task is executed by a thread, its state changes to running. After finishing the task, its state is finished.
Ayudame is Spanish for “help me” and Temanejo for “I handle you”.
Figure 1. State transition diagram of a task. The task is always in one of the states “not queued”, “queued”, “running” or “finished”. The processing of the successors is not counted as a state of its own. Each transition to a state triggers an event, indicated by the numbers.
The transitions between the states are useful information for the programmer (see section 2.2). Furthermore, the state transition diagram (figure 1) indicates that there is more than one way to block or prioritise a task. In the case of blocking, one can prevent a task from being queued or prevent it from ever being dequeued for execution by a thread. While a task is blocked, one can let the other threads continue executing tasks or interrupt the program execution entirely. Other possibilities for debugging include stopping only a certain function or adding artificial dependencies in order to force a certain order of subtree execution.
Some of these possibilities are exploited by the debugger Temanejo. How this was done is explained in the following. An outlook on future features is given in section 3.
2. The debugger Temanejo
In this section we investigate the different requirements regarding a debugging tool for task based problems and how the debugger Temanejo fulfils them. A rather technical introduction to Temanejo is given in the next section.
For the programmer it is necessary to see (section 2.2) and be able to manipulate (sections 2.3.1, 2.3.2 and 2.3.3) the dependencies which lead to the conditions for executing tasks, and thus to alter the way the runtime steps through the tasks. As additional information, the duration of each task can be displayed (section 2.4).
In contrast to other parallelisation strategies, when using task based models the programmer need not (and cannot) know a priori when a certain task is executed. At compile time only the conditions for running a task are known. Therefore, classical debuggers which step through lines of code sequentially are not well suited for debugging task parallel programs. They form a part of the debugging process (see section 2.3.4) but need to be complemented by other tools.
Temanejo is written in Python and requires Python version 2.6 or newer. Furthermore, it imports several modules, the most important being Networkx [3] for the graph data structure and pygtk [4] for the visualisation and the graphical user interface.
The debugger Temanejo is tightly linked to the small library Ayudame. This library serves as a thin communication layer between the runtime and Temanejo and has to be preloaded into the application. The SMPSs and OMPSs frameworks (and thus indirectly the application) are instrumented with calls to an event handler which passes the information to the debugger as a set of 8 integer values containing an event id, the task id and, depending on the event, one or more of the following: the task id of a dependency, the memory address of a dependency, the id of the function assigned to the task, a thread id and/or a timestamp. At certain crucial points in the program execution, for instance initialisation, task generation and task execution, the event function is called and reacts accordingly. In most cases, it simply passes the data (task id, thread id, dependencies and so forth) to the attached socket client (in this case Temanejo).
Temanejo uses this information to build an internal data structure representing the dependency graph. Each task is a node of this graph and each dependency an edge. Nodes and edges have properties such as the task status and the memory address of the dependency, respectively.
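The way such events could be turned into a dependency graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event ids, the order of the eight fields and the class name DependencyGraph are hypothetical and not taken from Ayudame or Temanejo, and plain dictionaries stand in for the Networkx graph so that the sketch is self-contained.

```python
# Hypothetical event ids; the real values are defined by the Ayudame library.
EVENT_ADD_TASK = 1
EVENT_ADD_DEPENDENCY = 2


class DependencyGraph:
    """Tasks are nodes, dependencies are directed edges."""

    def __init__(self):
        self.nodes = {}  # task id -> node properties
        self.edges = {}  # (producer task id, dependent task id) -> properties

    def handle_event(self, event):
        """Update the graph from one event given as a tuple of 8 integers.

        The field order below is an assumption made for this sketch.
        """
        (event_id, task_id, dep_task_id, address,
         function_id, thread_id, timestamp, _spare) = event
        if event_id == EVENT_ADD_TASK:
            self.nodes[task_id] = {"function": function_id,
                                   "status": "not queued"}
        elif event_id == EVENT_ADD_DEPENDENCY:
            # The edge points from the producing task to the dependent task
            # and carries the memory address of the dependency.
            self.edges[(dep_task_id, task_id)] = {"address": address}


graph = DependencyGraph()
graph.handle_event((EVENT_ADD_TASK, 1, 0, 0, 7, 0, 0, 0))
graph.handle_event((EVENT_ADD_TASK, 2, 0, 0, 7, 0, 0, 0))
graph.handle_event((EVENT_ADD_DEPENDENCY, 2, 1, 0x1000, 0, 0, 0, 0))
```

After these three events the graph holds two nodes and one edge from task 1 to task 2 labelled with the address 0x1000.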
Temanejo at the time it is received andthe user can access the debugging features instantly by pressing buttons or right-clicking on nodes. There are four distinct properties available for displaying in-formation: the node colour, the node-margin colour, the node shape and the edgecolour. This way one can produce different representations of the same graphdepending on what information is selected for visualisation at a given debuggingsession. igure 2. A simple dependency graph with two independent subtrees consisting of ten tasks eachvisualised by
Temanejo . Eight independent tasks are drawn in blue, two of them scheduled(green margin) and six of them queued (yellow margin). Three tasks with input and outputdependencies are shown in yellow, four reduction tasks in red and two tasks with only one inputdependency in green. The red margin indicates that a task has unfulfilled dependencies and cantherefore not be queued yet. The node shapes (triangle and box) denote two different workerthreads, the circle shape indicates that the task has not been assigned to a thread yet. The textlabels and colours of the edges indicate the memory addresses of the dependencies.
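The idea of assigning task properties to visual channels can be illustrated with a small Python sketch. The palettes, the channel names and the function node_style are assumptions chosen for illustration; the colours and shapes Temanejo actually uses are defined by the tool itself. Only the three node channels are shown; the edge colour would work analogously.

```python
# Hypothetical palettes, loosely following the example in figure 2.
STATUS_COLOURS = {"not queued": "red", "queued": "yellow",
                  "running": "green", "finished": "grey"}
THREAD_SHAPES = {None: "circle", 1: "triangle", 2: "box"}
FUNCTION_COLOURS = {7: "blue", 8: "yellow", 9: "red"}

# Which task property drives which visual channel (selectable by the user).
channel_sources = {
    "node_colour": ("function", FUNCTION_COLOURS),
    "margin_colour": ("status", STATUS_COLOURS),
    "node_shape": ("thread", THREAD_SHAPES),
}


def node_style(task, sources=channel_sources):
    """Return the visual attributes of one node for the given mapping."""
    return {channel: palette[task[prop]]
            for channel, (prop, palette) in sources.items()}


task = {"function": 7, "status": "queued", "thread": None}
style = node_style(task)

# A different mapping yields another representation of the same task.
by_status = node_style(task, {"node_colour": ("status", STATUS_COLOURS)})
```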
In the current version (0.5) Temanejo relies on connecting to the Ayudame library as a socket client when starting up. This means that either the parallel application has to be running when Temanejo is started, or Temanejo starts the application at the beginning of the debugging session.
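A socket client that reads fixed-size event records might look roughly as follows. This is a sketch under assumptions: the wire format (eight little-endian 64-bit integers per record) and the helper names are hypothetical and not the actual Ayudame protocol, and a local socket pair stands in for the connection between the application and the debugger.

```python
import socket
import struct

RECORD_FORMAT = "<8q"  # assumed layout: 8 signed 64-bit integers
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def receive_exactly(sock, size):
    """Read exactly `size` bytes, or return None if the peer closed."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            return None
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_events(sock):
    """Yield decoded event tuples until the connection is closed."""
    while True:
        record = receive_exactly(sock, RECORD_SIZE)
        if record is None:
            return
        yield struct.unpack(RECORD_FORMAT, record)


# Stand-in for the real connection: a connected socket pair in one process.
runtime_side, debugger_side = socket.socketpair()
runtime_side.sendall(struct.pack(RECORD_FORMAT, 1, 42, 0, 0, 7, 0, 0, 0))
runtime_side.sendall(struct.pack(RECORD_FORMAT, 2, 43, 42, 4096, 7, 0, 0, 0))
runtime_side.close()

events = list(read_events(debugger_side))
debugger_side.close()
```

Reading in a loop until a full record has arrived matters because a stream socket may deliver a record in several pieces.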
The first step of data-flow oriented debugging is to visualise the dependency graph. Often this already helps the programmer to spot errors or other shortcomings of the program. An example of such a dependency graph as displayed by Temanejo is shown in figure 2.
It is important to generate and show the dependency graph while the program is running (“online”) rather than just collecting the dependency information for later review (“offline”). Firstly, the tasks are generated dynamically during runtime, hence sometimes the solution to a problem lies in the order in which tasks are generated. Secondly, viewing the graph only after running the parallel program makes the graph simply the result of a program rather than a tool to debug it. Therefore, passing information during runtime from the program to the debugger is essential. For SMPSs and OMPSs, this is achieved by linking to the library Ayudame.
2.3. Controlling task execution
The second requirement for debugging is for the programmer to be able to interact with the running application. This means that the programmer can block a task, stop the program execution at some point, prioritise a task and attach a gdb session to the application. How these features are achieved by Temanejo is described in the following.
In order to block a specific task, the threads have to be kept from scheduling it for execution. At the same time it must be possible to unblock a task again. We achieve this by building a “to-be-blocked” list in the runtime from the user input to Temanejo. Whenever a thread wants to execute a task, it checks whether the task is in that list and, if so, skips it and searches the queues for another one.
Stopping the program execution at a certain point means preventing the dequeueing of tasks by any thread. This is implemented by a global flag which is checked by an idling thread before it looks for a new task in the queues. If the flag is set, the thread is sent back to idle.
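The two mechanisms described above, the “to-be-blocked” list and the global stop flag, can be sketched with a toy scheduler in Python. The class ToyScheduler and its single queue are simplifications assumed for this illustration; the actual SMPSs runtime is not written in Python and manages several queues.

```python
from collections import deque


class ToyScheduler:
    """One task queue plus the two controls described above."""

    def __init__(self):
        self.queue = deque()  # ids of queued tasks, oldest first
        self.blocked = set()  # the "to-be-blocked" list
        self.paused = False   # global stop flag

    def next_task(self):
        """Called by an idle worker; returns a task id or None (stay idle)."""
        if self.paused:
            return None  # execution is stopped: the worker goes back to idle
        for task_id in list(self.queue):
            if task_id not in self.blocked:
                self.queue.remove(task_id)
                return task_id  # first runnable task; blocked ones are skipped
        return None  # nothing runnable at the moment


sched = ToyScheduler()
sched.queue.extend([1, 2, 3])
sched.blocked.add(1)              # the user blocks task 1 in the debugger

first = sched.next_task()         # task 1 is skipped, task 2 is taken
sched.paused = True
while_paused = sched.next_task()  # nothing is dequeued while stopped
sched.paused = False
sched.blocked.discard(1)          # the user unblocks task 1
after_unblock = sched.next_task()
```

A blocked task stays in the queue, so unblocking it only requires removing its id from the set.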
In order to prioritise a task, we make use of SMPSs’ “critical” queue, which is also accessible by adding the keyword highpriority to the pragma line before the function definition. With a right-click on the node in Temanejo the user can prioritise and deprioritise any task before it is queued. At the present stage it is not possible to “requeue” a task, i.e. to take a task from one queue and put it into another. In future versions this will be possible, which will enable the user to prioritise a task which is already queued in order to execute a specific task as early as possible.
Temanejo is equipped with the possibility of launching gdb at any point of the debugging session. By default the terminal user interface of gdb starts and lets the user debug any part of the application in the usual line based way.
Another way to launch gdb is by specifying a function in Temanejo at which gdb should set a breakpoint. Again, a gdb process is attached to the application process, a breakpoint is set at the specified function and the program execution is continued, i.e. control is given back to Temanejo. When the program execution reaches the function, gdb takes control and the user can debug the function with gdb.
In order to enable the StarSs framework to balance each thread’s load, the tasks must be neither too small nor too big. For SMPSs the optimal duration of a task, in order to keep all threads busy while avoiding too much overhead wasted in generating, queueing, scheduling and cleaning up tasks, was shown to be of the order of hundreds of milliseconds. An application with tasks which run less than ≈ µs will suffer a huge overhead in the sense that the master thread will be busy managing the tasks while the worker threads will spend most of their time waiting for tasks. On the other hand, when some tasks are too large, the application runs into a load balancing problem because the worker thread with the large task will keep other worker threads waiting.
Therefore, it is important for the programmer to know how long the tasks last. In Temanejo the task durations can be shown in colour code for finished tasks. As the communication with the debugger alters the collection of time information, this can only be a rough guide and does not replace a profiling tool. Nevertheless, the ability to display timing data in the dependency graph effortlessly proved to be a powerful tool for programmers.
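How task durations could be derived from timestamped events and mapped to a colour code is sketched below in Python. Everything specific here is an assumption made for illustration: the event tuples, the three-colour palette and in particular the two thresholds, which are placeholders and not values from the SMPSs measurements.

```python
def task_durations(events):
    """Compute durations from (task_id, kind, timestamp_us) tuples.

    `kind` is "start" or "finish"; unfinished tasks are left out.
    """
    started = {}
    durations = {}
    for task_id, kind, timestamp in events:
        if kind == "start":
            started[task_id] = timestamp
        elif kind == "finish" and task_id in started:
            durations[task_id] = timestamp - started.pop(task_id)
    return durations


def duration_colour(duration_us, too_short_us, too_long_us):
    """Classify a duration; both thresholds are chosen by the caller."""
    if duration_us < too_short_us:
        return "blue"   # scheduling overhead likely dominates
    if duration_us > too_long_us:
        return "red"    # may cause load imbalance
    return "green"


events = [
    (1, "start", 0), (2, "start", 5), (2, "finish", 25),
    (1, "finish", 400000), (3, "start", 30), (3, "finish", 900030),
]
durations = task_durations(events)

# Placeholder thresholds in microseconds, for illustration only.
colours = {task_id: duration_colour(d, 100, 500000)
           for task_id, d in durations.items()}
```

As noted above, timestamps collected while a debugger is attached are only a rough guide, so such a classification is best read as a hint rather than a measurement.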
3. Summary and outlook
We presented Temanejo, a debugger for task parallel applications. The StarSs implementations SMPSs and OMPSs, in conjunction with the library Ayudame, enable the programmer to write task parallel codes and debug them with Temanejo.
Temanejo is under constant development but already provides powerful features which help to debug and enhance task parallel applications. Foremost, it shows the dependency graph while it is built up and processed during the runtime of the application. The node colour, the node margin colour, the node shape and the edge colour are four properties which can be assigned to display one of the following pieces of information: the function which the task executes, the thread which executes the task, the status of the task, the duration of each task (all of the aforementioned for node properties) and the dependency address (for the edge colour).
The presented version of Temanejo furthermore lets the user block and prioritise tasks, stop the program execution and attach gdb.
Finally, we point out that Temanejo is, in spite of its stable operation and richness of features, still in the process of active development. We encourage the reader to use it and send feedback of any kind in order to enhance this powerful and unique debugger in the most useful way.
Acknowledgements
This work was supported by the European Community’s Seventh Framework Programme [FP7-INFRASTRUCTURES-2010-2] project TEXT under grant agreement number 261580. The authors would like to thank Dr. C. Glass for his valuable help in improving this article.
References
[1] Judit Planas, Rosa M. Badia, Eduard Ayguadé, and Jesús Labarta. Hierarchical task based programming with StarSs. Int. J. of High Performance Computing Applications, 23(3):284–299, 2009.
[2] Vladimir Marjanović, Jesús Labarta, Eduard Ayguadé, and Mateo Valero. Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In