A Type-Oriented Graph500 Benchmark
Nick Brown
EPCC, Edinburgh University [email protected]
Abstract.
Data intensive workloads have become a popular use of HPC in recent years and the question of how data scientists, who might not be HPC experts, can effectively program these machines is important to address. Whilst using models such as Partitioned Global Address Space (PGAS) is attractive from a simplicity point of view, the abstractions that these impose upon the programmer can impact performance. We propose an approach, type-oriented programming, where all aspects of parallelism are encoded via types and the type system, which allows the programmer to write simple PGAS data intensive HPC codes and then, if they so wish, tune the fundamental aspects by modifying type information. This paper considers the suitability of using type-oriented programming, with the PGAS memory model, in data intensive workloads. We compare a type-oriented implementation of the Graph500 benchmark against MPI reference implementations both in terms of programmability and performance, and evaluate how orienting parallel codes around types can assist in the data intensive HPC field.
Keywords:
Graph500, Mesham, type-oriented programming, data intensive workload, PGAS
The HPC community has traditionally concentrated on solving computation based problems but in recent years data intensive workloads have also become a popular use of these resources. Data intensive workloads often involve huge amounts of data, with each element requiring only a small number of calculations, and because of this the communication aspects of a system are critically important. This is in contrast with the more traditional computation based workloads, where data sizes tend to be smaller but much more computation per element is required. One of the challenges associated with HPC is programming models and the data processing field is no exception. There is often a trade off between programmability and performance; those models which promote simplicity can impose choices and restrictions upon the programmer in the name of abstraction which can harm performance. Whilst efficient communication is important to both computation and data intensive workloads, the fact that data intensive workloads place so much emphasis on communication makes it critically important that the programming models used do not sacrifice communication efficiency.

Using the PGAS memory model for solving data intensive problems, where data is equally accessible whether it is held in local or remote memory and the programmer need not worry about the underlying implementation detail, is an attractive proposition from a programmability point of view. However, this higher level of abstraction can typically result in a performance cost and existing PGAS languages either disallow or limit the options that the programmer has to tune their code at the communication level. Type-oriented programming addresses this PGAS trade off by providing the options for the end programmer to choose between explicit and implicit parallelism using types, which can be combined to form the semantics of data and govern parallelism.
A programmer may choose to use these types or not, and in the absence of type information the compiler will use a well-documented set of default behaviours. Additional type information can be used by the programmer to tune or specialise many aspects of their code, which guides the compiler to optimise and generate the required parallelism code. In short, these types for parallelisation are issued by the programmer to instruct the compiler to perform the expected actions during compilation and code generation.

The Graph500[1] benchmark is a popular, objective way of determining hardware's suitability to data processing but can also be used to benchmark other aspects such as the languages, libraries and runtimes that are used in data intensive work. We compare an implementation of the Graph500 benchmark in Mesham, our research type-oriented programming language, against the MPI reference versions both in terms of programmability and performance.
Type-oriented programming[2] allows the programmer to encode all variable information via the type system by combining different types together to form the overall meaning of variables. This is contrasted against a more traditional approach where the programmer uses a type to govern the data that a variable will hold, but additional information, such as whether a variable is read only or not, is applied via type qualifiers. Using the C programming language as an example, in order to declare a variable m to be a read only character where memory is allocated externally, the programmer writes extern const char m, where char is the type and both extern and const are inbuilt language type qualifiers. Whilst this approach works well for sequential languages, in the parallel programming domain there are potentially many more attributes which might need to be associated, such as where the data is located, how it is communicated and any restrictions placed upon this. Representing such a rich amount of information via multiple qualifiers would not only bloat the language, it might also introduce inconsistencies when qualifiers were used together with potentially conflicting behaviours.

Instead our approach is to allow the programmer to combine different types together to form the overall meaning. For instance, extern const char m becomes var m:Char::const::extern, where var m declares the variable, the operator : specifies the type and the operator :: combines two types together. In this case, a type chain is formed by combining the types Char, const and extern. Precedence is from right to left where, for example, the read only properties of the const type override the default read & write properties of Char. It should be noted that some type coercions, such as
Int::Char, are meaningless and so rules exist within each type to govern which combinations are allowed.

Within type-oriented programming the majority of the language complexity is removed from the core language and instead resides within the type system. The types themselves contain specific behaviour for different usages and situations. The programmer, by using and combining types, has a high degree of control which is relatively simple to express and modify. Additionally, writing code in this high level way means that there is a rich amount of information which the compiler can use to optimise the code. In the absence of detailed type information the compiler can apply sensible, well documented, default behaviour and the programmer can further specialise this using additional types if required at a later date. The result is that programmers can get their code running and then further tune it if needed by using additional types.

Mesham[3] is a programming language that we have developed to research and evaluate the type-oriented paradigm. It is a simple imperative language with extensions to support the type-oriented paradigm and a Partitioned Global Address Space memory model. Using the PGAS memory model, the entire global memory, which is accessible from every process, is partitioned and each block has an affinity with a distinct process. Reading from and writing to memory (either local or another process's chunk) is achieved via normal variable access and assignment. The benefit of PGAS is its conceptual simplicity, where the programmer need not worry about the lower level and often tricky details of communication. This makes it an ideal memory model for data scientists, who are experts in using their own data but not necessarily in HPC. Type-oriented programming provides the best of both worlds.
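The right-to-left precedence rule for type chains described above can be modelled in a few lines of Python. This is our own illustration, not Mesham's implementation: each type is represented here as a dictionary of attributes, and combining a chain lets types further to the right override those to their left.

```python
def combine(*chain):
    """Model a Mesham-style type chain: each type contributes attributes,
    and types further right in the chain override those to their left."""
    attrs = {}
    for t in chain:  # later (right-hand) updates win, mirroring the precedence rule
        attrs.update(t)
    return attrs

# Illustrative stand-ins for the Char, const and extern types
Char = {"base": "char", "access": "read-write", "storage": "internal"}
const = {"access": "read-only"}
extern = {"storage": "external"}

# var m : Char :: const :: extern
m = combine(Char, const, extern)
print(m)  # {'base': 'char', 'access': 'read-only', 'storage': 'external'}
```

In the same model, a rule rejecting meaningless coercions such as Int::Char could be expressed by checking, for instance, that only one entry in the chain supplies a base type.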
In Mesham all communication is by default one sided; however, this can be overridden using additional type information which further tunes and specialises the communication behaviour of specific variables. The programmer can get their code working and then tune, by using types, fundamental aspects such as communication to improve performance and/or scalability. This is contrasted against traditional HPC programming models, where fundamental aspects are often integral to the code and changing them can require widespread changes or even a code rewrite.
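The practical consequence can be sketched as follows: if the communication mode travels with a variable's type information, the kernel code itself never changes when the mode does. This is a deliberately simplified model of ours, not Mesham's runtime; the class and function names are hypothetical.

```python
class Var:
    """A variable whose 'type chain' carries its communication behaviour."""
    def __init__(self, name, comm="one-sided"):  # one-sided is the default
        self.name = name
        self.comm = comm
        self.log = []

def remote_assign(var, value):
    # The 'kernel' code is identical whichever mode the type selected;
    # behaviour is dispatched on the type information alone.
    if var.comm == "one-sided":
        var.log.append(f"put {value}")    # direct remote memory access
    else:
        var.log.append(f"isend {value}")  # buffered asynchronous message

queue_default = Var("vertexQueueNext")              # default behaviour
queue_tuned = Var("vertexQueueNext", comm="async")  # tuned via its 'type'

remote_assign(queue_default, 42)
remote_assign(queue_tuned, 42)
print(queue_default.log, queue_tuned.log)  # ['put 42'] ['isend 42']
```

The point of the sketch is that switching from one-sided to asynchronous point to point is a change to a declaration, not a rewrite of the code that uses the variable.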
As data intensive workloads are fundamentally limited by communication, existing computation based benchmarks such as LINPACK are of limited use. Instead, the Graph500 benchmark[1] has been developed to stress the communication aspects of a system. It consists of three phases: graph construction, which builds the graph in Compressed Sparse Row (CSR) format; a Breadth First Search (BFS) kernel; and lastly a validation of the BFS traversal. The
Graph500 problem size is represented using scale and edge factors. The scale is the logarithm base two of the number of vertices and the edge factor is the ratio of the graph's edge count to its vertex count. Therefore a graph has 2^scale vertices and 2^scale × edge-factor edges. Performance of the BFS is measured in Traversed Edges Per Second (TEPS). A number of reference implementations are provided and of these there are four MPI based codes: a one-sided implementation where vertex communications are done using MPI-2 one-sided communications, an MPI simple version which uses asynchronous point to point communications, an MPI replicated compressed sparse row implementation and an MPI replicated compressed sparse column code. Whilst all of these implementations use the level synchronized BFS traversal algorithm [4], they all require separate codes and to go from one to another has required substantial rewriting of the BFS kernels. In this paper we concentrate on the one-sided and the simple (asynchronous point to point) versions which, at an algorithm level, differ only in their form of communication but require entirely separate kernel implementations. In reality data scientists do not want to be working at this lower level; they want to be able to write a simple code and then easily modify fundamental aspects, such as the form of communication, to experiment with parallelism.

High Performance Fortran (HPF)[5] is a parallel extension to Fortran 90. The programmer specifies just the data partitioning and allocation, with the compiler responsible for the placement of computation and communication. The type-oriented approach differs because the programmer can, via types, control far more aspects of parallelism. Alternatively, if such control is not provided, the type system allows a number of defaults to be used instead.
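As an aside, the Graph500 problem-size definitions given earlier reduce to simple arithmetic. A quick sketch, where the edge factor of 16 is the benchmark's customary default and the scale matches the one used in the later results discussion:

```python
def graph500_size(scale, edge_factor=16):
    """Vertex and edge counts for a Graph500 problem instance:
    2^scale vertices and 2^scale * edge_factor edges."""
    vertices = 2 ** scale
    edges = vertices * edge_factor
    return vertices, edges

def teps(edges_traversed, bfs_seconds):
    """Traversed Edges Per Second, the benchmark's performance metric."""
    return edges_traversed / bfs_seconds

v, e = graph500_size(29)
print(v, e)  # 536870912 8589934592
```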
Co-array Fortran (CAF)[6] provides the programmer with a greater degree of control than HPF, but the method of communication is still implicit and determined by the compiler whilst synchronisations are explicit. CAF uses syntactic shorthand communication commands and synchronisation statements; hard wiring these into a language is less flexible than our use of types. Chapel[8] is another PGAS language and supports the programmer controlling aspects of parallelism by providing higher and lower levels of abstraction. Many of the higher level constructs in Chapel, such as reduction, are implemented via inbuilt operators and keywords, contrasted with Mesham where they would be types in an independent library.

Co-array C++[7] integrates co-arrays into C++ using template libraries. The C++ programmer adds additional information to their source code through these template libraries, which determine parallelism. Whilst the type library of Mesham has a far wider scope than the current co-array template library, it would be possible to encode our types as a C++ template library. This illustrates how the core language itself is actually irrelevant and our approach could be applied to existing languages, such as C++. The benefit of this is that programmers could orient their parallelism around types within a familiar language. The downside is that the C++ language itself is fixed; whilst template libraries are well integrated, the flexibility of our use of types would be limited to the current C++ approach and compile time optimisation might be limited.

Parallelizing ARRAYs (PARRAY)[9] extends the C and C++ languages with new typed arrays that contain additional information such as the memory type, layout of data and the distribution over multiple memory devices. The central idea is that a programmer need only learn one unified style of programming and this applies equally to all major parallel architectures.
The compiler will generate code according to the typing information contained in the source. Whilst this approach is similar to the types used in Mesham, there are some important differences. As a bolt on to existing languages, PARRAY uses its own syntax, similar to pre-processor directives, to declare arrays with types; for instance pinned, paged or dmem are used to denote which device holds the array. In dealing with the arrays PARRAY still requires a number of inbuilt commands to handle its data structures. Mesham takes this a stage further and types are fully integrated into the language, which means that, instead of requiring commands to copy or transpose data, language operators such as assignment will automatically handle the operation according to the type information. This integration is central to the data intensive example considered here, where parallelism is entirely integrated in the language; for instance, references to data may be local or global, but in the simplest case the programmer need not worry about distinguishing these aspects to get their code working. In our approach there are many types which the programmer can use to tune their data structures, but equally omitting these is fine and will result in safe default behaviour being applied.

The approach that PARRAY follows, bolting on parallelism using syntax that differentiates it from the existing language, is familiar. Solutions such as OpenMP[10] allow the programmer to direct parallelism through pre-processor directives which guide the compiler in how to handle parallelism. Importantly, in our approach types are first class citizens in the language and so integrate fully with the existing language semantics, which means that the programmer has the flexibility to support aspects such as creating new types in their code and reasoning about type information using existing language constructs.
Through constructing type chains we provide a mechanism for building up complex type information in a structured manner and it is this type chain that provides the semantics of operations performed on the variable.

  ...
  10  if (vertices[rootVertexIndex].on == myPid) { vertexQueue := root; childrenParents[root.id] := root.id; };
  11  while (globalNextVerticies > 0) {
  12    while (!vertexQueue.empty) {
  13      var singleVertex : GraphVertex;
  14      singleVertex := vertexQueue;
  15      if (searchTree[singleVertex.id] == -1) {
  16        searchTree[singleVertex.id] := childrenParents[singleVertex.id];
  17        for i from 0 to singleVertex.numChildren - 1 {
  18          var childVertex := singleVertex.children[i];
  19          childrenParents[childVertex.id] := singleVertex.id;
  20          vertexQueueNext.on[childVertex.on] := childVertex;
  21    }; }; };
  22    vertexQueue := vertexQueueNext;
  23    vertexQueueNext.clear;
  24    globalNextVerticies::allreduce["sum"] := vertexQueue.size; };

Listing 1.1. Default one sided communication
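For readers unfamiliar with Mesham's notation, the level synchronised BFS that Listing 1.1 expresses can be sketched serially in Python. The adjacency-list graph, the function name and the single shared pair of queues are our simplifications of the distributed structures in the listing, not the actual Mesham or MPI code.

```python
from collections import deque

def level_synchronised_bfs(children, root):
    """Level-synchronised BFS: drain the current level's queue in full,
    accumulating the next level, until no vertices remain (the serial
    analogue of the allreduce termination test). Returns the search tree
    as a parent id per vertex, with -1 marking unvisited vertices."""
    search_tree = [-1] * len(children)
    children_parents = {root: root}
    queue, queue_next = deque([root]), deque()

    while queue:                              # globalNextVerticies > 0
        while queue:                          # drain the current level
            v = queue.popleft()
            if search_tree[v] == -1:          # not yet visited
                search_tree[v] = children_parents[v]
                for child in children[v]:
                    # as in Listing 1.1, a later sibling may overwrite
                    # an earlier recorded parent for the same child
                    children_parents[child] = v
                    queue_next.append(child)  # the remote queue addition
        queue, queue_next = queue_next, deque()
    return search_tree

graph = [[1, 2], [3], [3], []]                # tiny illustrative graph
print(level_synchronised_bfs(graph, 0))       # [0, 0, 0, 2]
```

In the distributed version the inner queue additions become one-sided (or, later, asynchronous point to point) communications to the owning process, and the outer loop condition becomes the global all-reduce at line 24.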
Listing 1.1 illustrates the Mesham source code for the BFS kernel and associated variable declarations. For clarity we concentrate on the core BFS kernel, hence supporting functions such as building the Kronecker graph of vertices, constructing the edge list, finding search keys and validating the resulting search tree have been omitted from this paper. The typevar keyword at line 1 creates a new type called a
GraphVertex, which is basically an alias for the referencerecord type: a record, similar to a struct in C, whose members may reference other records. This record is used to represent a graph vertex. An example of such a reference is the array at line 1, children, which contains references to the other vertex records that are the children of this vertex. The references themselves can either be to local data or point to global data which is held on another process. In terms of correctness, the programmer need not distinguish between local and global data references as the type library takes care of the underlying communications required, although they might want to differentiate for performance reasons.

Lines 2, 4 and 5 set up the arrays vertices, the actual graph vertices which have already been built up; searchTree, the search tree to return from this kernel, holding the resulting vertex parent ids; and childrenParents, which holds the parent ids of children vertices for the next level. Using the allocated type the programmer has provided some additional information to guide the compiler in how to allocate this data. Combined with the single type it means that a single copy of each array exists globally, which is split up into numProcs partitions which are then evenly distributed amongst the distinct process memories via the evendist type. Because these three arrays are the same size they are distributed in the same manner, with the same indexes (and hence vertices) on each process. Lines 7 and 8 create two queues, one for the current BFS level and one for the next BFS level, with each queue holding data of type
GraphVertex. The use of the multiple type as an argument to the allocated type informs the compiler that each process will hold a distinct version of these variables in its own memory. Each type determines the behaviour whenever a variable is used; an example of this is at line 10, where the referencerecord type implements the on operator, which returns the process that holds the actual data the reference is pointing to. In this case the effect is to add the root search vertex to the current level queue on that specific owning process. The effect of the assignment at line 14 is to pop the top most vertex from the local queue and place it into the singleVertex variable; that this is a pop follows from the types of the variables involved in the assignment, and for the same reason a queue addition is performed at lines 20 and 10, whilst line 22 copies the entire vertexQueueNext into vertexQueue. For each level, a process will iterate through its vertex queue. For each vertex, if it has not already been processed (line 15), the process will iterate through each child. Each child is added to the next level vertex queue on the appropriate holding process, and that child's parent id is added to the childrenParents array. In the absence of further type information, how these communications occur is entirely abstracted away; if the child is on a different process to the parent then the default behaviour is one sided communication. Part of a global reference is which process's memory actually holds the data. This means that a referencerecord's .on operation is a local operation on the reference itself, and the vertex communication required in this BFS implementation is to place remote vertices on their owning processes' next level queues; much of this remote placing can be implemented with minimal communication.
At line 24 there is a global all-reduce to determine the number of vertices to be processed at the next level and the algorithm will terminate if, at the global level, there are none. The globalNextVerticies::allreduce["sum"] expression illustrates an additional aspect of type-oriented programming, where the type behaviour of a variable can be overridden by the addition of extra types for a specific expression; for this assignment only, a blocking all-reduce is issued.

It can be seen that this is a simple, high level algorithm with the underlying types taking care of much of the lower level and tricky implementation detail. Compared to the existing one sided MPI reference code this implementation is far shorter, 28 lines of code compared to 205 in the reference implementation, and allows the programmer to concentrate on the algorithm and data structures of their code. It is true that there are underlying Mesham types and runtime libraries to support the code, which amount to 500 lines of code; however, these are very general and can be reused in multiple codes. The partition and distribution types we used for this benchmark were originally written for a Mesham asynchronous Jacobi implementation [11]. This illustrates how an HPC expert can construct these types once and they can then be used time and time again in different contexts. By simplifying the code, a data scientist, who might not be an HPC expert, will be able to get their code working and to a reasonable performance level. Whilst the default one sided communication is simple and safe behaviour, it is often not particularly efficient. As already mentioned, one of the MPI reference implementations is a point to point code, which replaces the one sided communication with asynchronous point to point and greatly improves the efficiency.
However, to achieve this the core BFS code has had to be rewritten, and additional low level issues, such as matching asynchronous communications and buffer sizes, which are complex and error prone, have had to be considered. By orienting all aspects of parallelism around types the programmer can get their code working and then further tune for performance; listing 1.2 sketches the modifications to the Mesham code of listing 1.1 required to use asynchronous point to point communication rather than the default one-sided.

  var childrenParents : array[Long, nvtxscale] :: allocated[partitioned[numProcs] :: single[evendist]] :: async;
  var vertexQueue : queue[GraphVertex] :: allocated[multiple] :: async;
  var vertexQueueNext : queue[GraphVertex] :: allocated[multiple] :: async;
  ...
  while (globalNextVerticies > 0) {
    while (!vertexQueue.empty) { ... };
    sync;
    vertexQueue := vertexQueueNext;
    ...
  };

Listing 1.2. Asynchronous p2p communications
The code structure has remained the same and minimal changes, mainly oriented around the types, have been made to modify the underlying communication method. The addition of the async type to the childrenParents variable and the queues guides the compiler to use asynchronous point to point communication for all communications involving these variables. The assignment at line 19 (listing 1.1) now issues an asynchronous send of the parent vertex id, and line 20 an asynchronous send as part of the remote queue addition. The sync keyword has been added at the end of processing the current level in listing 1.2 and waits for all outstanding asynchronous communications to complete, ready to proceed with the next search level. Effectively the addition of these async types sets up the same asynchronous message listening and coalescing buffers that are present in the reference MPI implementation, but these low level details are abstracted away from the programmer. In the absence of further type information the size of the coalescing buffer is set to be 256 GraphVertex elements, although this can be further tuned by providing an argument to the async type, such as async[128]. This is an example of where the meaning of the argument provided to a type depends entirely on the type chain; it is within the context of, in this case, the queue and async types to interpret the arguments accordingly.
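The coalescing behaviour that async[n] selects can be modelled as a per-destination buffer which sends a batch whenever it fills and drains the remainder at the sync point. This is our illustrative model, with names of our choosing, not the actual Mesham runtime:

```python
class CoalescingBuffer:
    """Batch outgoing items per destination process, 'sending' only when
    a destination's buffer reaches capacity or on an explicit flush."""
    def __init__(self, capacity=256):   # default mirrors the text above
        self.capacity = capacity
        self.pending = {}               # dest -> list of buffered items
        self.sent = []                  # (dest, batch) pairs sent so far

    def add(self, dest, item):
        batch = self.pending.setdefault(dest, [])
        batch.append(item)
        if len(batch) == self.capacity:
            self._send(dest)            # full batch: one coalesced message

    def _send(self, dest):
        self.sent.append((dest, self.pending.pop(dest)))

    def flush(self):                    # the 'sync' point
        for dest in list(self.pending):
            self._send(dest)

buf = CoalescingBuffer(capacity=2)      # think async[2], for illustration
for dest, vertex in [(1, "a"), (1, "b"), (2, "c")]:
    buf.add(dest, vertex)
buf.flush()
print(buf.sent)  # [(1, ['a', 'b']), (2, ['c'])]
```

A larger capacity trades fewer, bigger messages against memory use and latency before the receiver sees the data, which is why the argument to async is worth exposing as a tuning knob.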
Fig. 1. Performance of Mesham vs MPI benchmarks
Figure 1 illustrates the strong scaling characteristics of our Mesham and MPI BFS implementations on a Cray XE6 using a vertex scale of 29. The upper plots, labelled p2p, are the asynchronous Mesham point to point BFS implementation and the reference MPI simple implementation. It can be seen that the performance of the Mesham and MPI versions is comparable, with the Mesham version slightly under performing the MPI implementation, but the difference is small. Initially in both versions the TEPS increases as the number of cores is increased but, as is common with strong scaling experiments, a point is reached where the cost of communication outweighs the benefits gained from additional parallelism and performance starts to degrade. In the results we can see that the MPI implementation actually performs worse on 8192 cores than on 4096 cores, and the Mesham version's TEPS at 8192 cores is only a slight improvement over the 4096 core run.

Figure 1 also depicts the strong scaling performance of the default, one sided, Mesham implementation and the one sided MPI benchmark, which are the lower two plots, labelled onesided. This illustrates that the safe and simple behaviour incurs a performance hit, which can then be tuned using additional types up to the performance of the p2p case. The Mesham one sided implementation outperforms the MPI one-sided implementation due to the compiler optimising communications and one sided epoch windows, which is possible because of the rich amount of type information available. This illustrates, in itself, a performance benefit of writing high level data intensive codes using Mesham default communications compared to a lower level implementation.

In this paper we have considered how type-oriented programming may be applied to the data intensive HPC field.
Aspects of this paradigm could, in the future, be used as part of existing languages to achieve the best of both worlds: the advantages discussed in this paper along with the familiarity of existing models and languages. We have shown that, by using types, the programmer can write conceptually simple PGAS style data processing codes with no significant hit in performance compared to traditional implementations. Types provide the additional benefit that the programmer can initially concentrate on the correctness of their code and then, once a simple working version exists, use types to tune for performance, as illustrated by the Mesham one sided and point to point BFS implementations. There is further work to be done in understanding the reasons behind the slight performance gap between the Mesham and MPI implementations, and building on this work we are now looking at examples of data intensive problems, rather than benchmarks, to understand how this paradigm can help in solving real world data intensive workloads.