Applying Type Oriented Programming to the PGAS Memory Model
Nick Brown ∗
Edinburgh Parallel Computing Centre, James Clerk Maxwell Building, King's Buildings, Edinburgh

∗ Corresponding author: +44 (0) 131 650 6420, [email protected]
Abstract
The Partitioned Global Address Space memory model has been popularised by a number of languages and applications. However, this abstraction can often result in the programmer having to rely on in-built choices, and with this implicit parallelism, where little guidance is provided by the programmer, the scalability and performance of the code depend heavily on the compiler and the application itself. We propose an approach, type oriented programming, where all aspects of parallelism are encoded via types and the type system. The type information associated by the programmer will determine, for instance, how an array is allocated, partitioned and distributed. With this rich, high level of information the compiler can generate an efficient target executable. If the programmer wishes to omit detailed type information then the compiler will rely on well-documented and safe default behaviour which can be tuned at a later date with the addition of types. The type oriented parallel programming language Mesham, which follows the PGAS memory model, is presented. We illustrate how, if so wished, one can use types to tune all parameters and options associated with this PGAS model in a clean and consistent manner without rewriting large portions of code. An FFT case study is presented and considered both in terms of programmability and performance; the latter we demonstrate by a comparison with an existing FFT solver.
Introduction

As the problems that the HPC community looks to solve become more ambitious, the challenge will be to provide programmers, who might not be HPC experts, with usable and consistent abstractions which still allow for scalability and performance. Partitioned Global Address Space is a memory model providing one such abstraction, and allows the programmer to consider the entire system as one global memory space which is partitioned, with each block local to some process. Numerous languages and frameworks exist to support this model but all, operating at this higher level, impose some choices and restrictions upon the programmer in the name of abstraction.

This paper proposes a trade-off between explicit parallelism, which can yield good performance and scalability if used correctly, and implicit parallelism, which promotes simplicity and maintainability. Type oriented programming addresses the issue by giving the end programmer the option to choose between explicit and implicit parallelism. The approach is to design new types which can be combined to form the semantics of data governing parallelism. A programmer may choose to use these types or not, and in the absence of type information the compiler will use a well-documented set of default behaviours. Additional type information can be used by the programmer to tune or specialise many aspects of their code, guiding the compiler to optimise and generate the required parallelism code. In short, these types for parallelisation are issued by the programmer to instruct the compiler to perform the expected actions during compilation and code generation. They are predefined by expert HPC programmers in a type library and used by the application programmer, who may not have specialist HPC knowledge.

Programmer imposed information about parallelism only appears in types at variable declaration and in type coercions in expressions and assignments. A change of data partition or communication pattern only requires a change of data types, while traditional approaches may require rewriting the entire structure of the code. A parallel programming language, Mesham, which follows the PGAS memory model, has been developed around this paradigm, and we study a Fast Fourier Transformation (FFT) case study written in Mesham to evaluate the proposed approach. The pursuit of performance and scalability is a major objective of HPC, and we compare the Mesham FFT version with that of an existing, mature solving framework, and also consider issues of programmability.

Background

The difficulty of programming has been a challenge to parallel computing over the past several decades[8]. Whilst numerous languages and models have been proposed, they mostly suffer from the same fundamental trade-off between simplicity and expressivity. Those languages which abstract the programmer sufficiently to allow for conceptual simplicity often far remove the programmer from the real world execution and impose upon them predefined choices, such as the method of communication.
The parallel programming solutions which provide the programmer with full control over their code often result in a great amount of complexity which can be difficult for even expert HPC programmers to master for non-trivial problems, let alone the non-expert scientific programmers who often require HPC. PGAS languages, which provide the programming memory model abstraction of a global address space which is partitioned with each portion local to a process, also suffer from this trade-off. For instance, to achieve this memory model the programmer operates at a higher level, far removed from the actual hardware, and often key aspects, such as the form of data communication, are abstracted away with the programmer having no control over these key attributes. Operating in a high level environment, without control of lower level decisions, can greatly affect the performance and scalability of codes, with the programmer reliant on the compiler “making the right choice” when it comes to some critical aspects of parallelism. Whilst the PGAS memory abstraction is a powerful one, on its own it still leaves complexity to the end programmer in many cases. For example, changing the distribution of data amongst the processes can still require the programmer to change numerous aspects of their code.
Type oriented programming

The concept of a type will be familiar to many programmers. A large subset of languages follow the syntax Type Variablename, such as int a or float b, which is used to declare a variable. Such statements affect both the static and dynamic semantics: the compiler can perform analysis and optimisation (such as type checking) and at runtime the variable has a specific size and format. It can be thought that the programmer provides information to the compiler via the type. However, there is only so much that one single type can reveal, and so languages often include numerous keywords to allow the programmer to specify additional information. Using the C programming language as an example, in order to declare a variable m to be a read only character where memory is allocated externally, the programmer writes extern const char m, where char is the type and both extern and const are in-built language keywords. Whilst this approach works well for sequential languages, in the parallel programming domain there are potentially many more attributes which might need to be associated, such as where the data is located, how it is communicated and any restrictions placed upon this. Representing such a rich amount of information via multiple keywords would not only bloat the language, it might also introduce inconsistencies when keywords were used together with potentially conflicting behaviours.

Instead our approach is to allow the programmer to encode all variable information via the type system, by combining different types together to form the overall meaning. For instance, extern const char m becomes var m:Char::const::extern, where var m declares the variable, the operator : specifies the type and the operator :: combines two types together. In this case a type chain is formed by combining the types Char, const and extern. Precedence is from right to left where, for example, the read only properties of the const type override the default read & write properties of Char. It should be noted that some type coercions, such as Int::Char, are meaningless and so rules exist within each type to govern which combinations are allowed. A short declaration sketch follows the list below.

Within type oriented programming the majority of the language complexity is removed from the core language and instead resides within the type system. The types themselves contain their specific behaviour for different usages and situations. The programmer, by using and combining types, has a high degree of control which is relatively simple to express and modify. Not only this, the high level of type information provides a rich amount of information which the compiler can use to optimise the code. In the absence of detailed type information the compiler can apply sensible, well-documented, default behaviour and the programmer can further specialise this using additional types if required at a later date. The result is that programmers can get their code running and then further tune if needed by using additional types.

Benefits of writing type oriented parallel codes are as follows:
1. Simplicity - by providing a clean, well documented type library the programmer can easily control all aspects of parallelism via types or rely on the well-documented default behaviour.

2. Efficiency - due to the rich amount of high level information provided by the programmer, the compiler can perform much optimisation upon the code. The behaviour of types can control the tricky, low level details which are essential to performance; these can be implemented by domain experts and then used by non-expert parallel programmers.

3. Flexibility - often initial choices made, such as the method of data decomposition, can retrospectively turn out to be inappropriate. However, if one is not careful these choices can be difficult to change once the code has matured. By using types the programmer can easily change fundamental aspects by modifying the type, with the compiler taking care of the rest. At a language level, containing the majority of the language complexity in a loosely coupled type library means that adding, removing or modifying the behaviour of types has no language wide side effect and the “core” language is kept very simple.

4. Maintainability - the maintainability of parallel code is essential. Current production parallel programs are often very complex and difficult to maintain. By providing for simplicity and flexibility it is relatively simple for the code to be modified at a later stage.
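To make the notation concrete, a minimal declaration sketch using only the types introduced above:

   var m:Char::const::extern;

Here the type chain is read right to left: extern marks the memory as externally allocated, const then removes the write property, and Char supplies the size and format of the variable; omitting const and extern, as in var c:Char, leaves the default read & write behaviour in place.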
Mesham

A parallel programming language, Mesham[1], has been created based around an imperative programming language with extensions to support the type oriented concept. By default the language follows the Partitioned Global Address Space memory model where the entire global memory, which is accessible from every process, is partitioned and each block has an affinity with a distinct process. Reading from and writing to memory (either local or another process's chunk) is achieved via normal variable access and assignment. By default, in the absence of further types, communication is one sided, but this can be overridden using optional additional type information.

The language itself has fifty types in the external type library. Around half of these are similar in scope to the types introduced in the previous section, and the other types are more complex, allowing one to control aspects such as explicit communication, data composition and data partitioning & distribution. In listing 1 the programmer is allocating two integers, a and b, on lines one and two respectively. They exist as a single copy in global memory: variable a is held in the memory of process zero and b in the memory associated with process two. At line three the assignment (using operator := in Mesham) will copy the value held in b at process two into variable a, which resides in the memory of process zero. In the absence of any further type information the communication associated with such an assignment is one-sided, which is guaranteed to be safe and consistent but might not be particularly performant.

The code in listing 2 looks very similar to that of listing 1 with one important modification: at line one the type channel has been added into the type chain of variable a. This type will create an explicit point to point communication link between processes two and zero, which means that any assignments involving variable a between these processes will use the point to point link rather than one-sided communication. By default the channel type is blocking and control flow will pause until the data has been received by the target process; the programmer could further specialise this to use asynchronous (non-blocking) communication by appending the async type into variable a's type chain. In such asynchronous cases the semantics of the language are such that the programmer issues explicit synchronisation points, either targeted at a specific variable or at all variables, where it is guaranteed that outstanding asynchronous communications will be completed. It can be seen that in the tuning discussed here the programmer, using additional type information, guides the compiler to override the default behaviour. This can be done retrospectively once their parallel code is working and allows one to tune certain aspects which might be crucial to performance or scalability.
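Sketches of the two listings follow; the allocated, single and on types mirror the forms used in the FFT case study later in the paper, and the argument form of channel (here linking processes two and zero) is an assumption for illustration.

   1 var a:Int::allocated[single[on[0]]];
   2 var b:Int::allocated[single[on[2]]];
   3 a:=b;

Listing 1: Default one sided communication

   1 var a:Int::allocated[single[on[0]]]::channel[2,0];
   2 var b:Int::allocated[single[on[2]]];
   3 a:=b;

Listing 2: Override communication to blocking point to point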
The code examples considered in this section demonstrate that, following the traditional PGAS memory model, using types one can either rely on the simple, safe and well documented default behaviour, or associate additional information and override the defaults as required. Types used to specialise the behaviour are themselves responsible for their specific actions. The benefit of this is that keeping the majority of the language complexity in types, contained within a loosely coupled type library, not only results in a much simpler “core” language but also means experts can architect types which simply plug into the language.

Related work

Unified Parallel C (UPC)[2] is an extension to C designed for parallelism which follows the PGAS memory model. It does this with the addition of language keywords, such as shared to mark shared variables, and functions. Due to the limited nature of associating attributes to data using keywords there are still decisions which the UPC programmer is stuck with, such as one-sided communication, and the programmer is reliant upon the compiler to do the best job it can of optimisation in this regard. Additionally, whilst the memory model is global and communication abstracted, the programmer must still work with low level concepts such as pointers. As discussed, in the type oriented programming model many additional attributes can be associated with variables by the programmer if the defaults are not suitable. All this type information supports a higher level view of the code because the types control the behaviour of variables and allow for the elimination of many function calls which are common in more traditional approaches.

High Performance Fortran (HPF)[4] is a parallel extension of Fortran 90. The programmer specifies just the data partitioning and allocation, with the compiler responsible for the placement of computation and communication. The type oriented approach differs because the programmer can, via types, control far more aspects of parallelism; alternatively, if these are not provided, the type system allows for a number of defaults to be used instead. Co-array Fortran (CAF)[6] provides the programmer with a greater degree of control than HPF, but still the method of communication is implicit and determined by the compiler whilst synchronisations are explicit. CAF uses syntactically shorthanded communication commands like Y[:]=X and synchronisation statements. Having these commands hard wired into the language is popular, not just with CAF but with many other parallel languages too; the result is less flexible and more difficult to implement.

Titanium[3] is a PGAS extension to the Java programming language. The PGAS memory model is followed as the implicit model, but the programmer may also use explicit message passing constructs via additional language facilities. In this respect, providing for both a higher level implicit memory model and a more detailed explicit message passing model, Titanium has some similarities to Mesham. However, explicit control in Titanium relies on the programmer issuing in-built language keywords such as broadcast E from p and/or object methods, which results in language bloat. In Titanium moving from the default PGAS memory model to the more explicit message passing requires rewriting portions of the code, whereas with our approach the programmer just needs to modify the type, which directs the compiler as to the appropriate way of handling communication. The Mesham type system is designed such that it allows the compiler to generate all possible communication options just by using additional types.

Chapel[7] has been designed, similarly to Mesham and Titanium, to allow the programmer to express different abstractions of parallelism. It does this by providing higher and lower levels of abstraction which support automating the common forms of parallel programming via the former and the optimisation and tuning of specific factors using the latter. There are some critical differences between Mesham and Chapel.
Firstly, many of the higher level constructs in Chapel, such as reductions, are implemented via in-built operators, whereas in Mesham these would be types in an independent library. In Chapel, if one declares a single data variable and then writes to it from multiple parallel processes at the same time then this can result in a race condition. The solution is to use a synchronisation variable, via the sync keyword in the variable's declaration. In the type based approach the Mesham programmer would use a sync type instead of an in-built language keyword. One benefit of this is that if multiple synchronisation constructs were being used (such as Chapel's sync, single and atomic keywords) then the behaviour in a type chain, where precedence is from right to left, is well defined. Whilst languages such as Chapel might disallow combinations of these keywords, supporting them in a type chain allows the programmer to mix the behaviours of different synchronisations in a predictable manner, which might be desirable.

FFT case study

FFTs are of critical importance to a wide variety of scientific applications, ranging from digital signal processing to solving partial differential equations. Parallelised 2D Fast Fourier Transformation (FFT) code is far more complicated than the equivalent sequential code. Direct message passing programming requires the end programmer to handle every detail of parallelisation, including writing the appropriate communication commands, synchronisations, and the correct index expressions that delimit the range of every partitioned array slice. Whilst using the PGAS memory model can help abstract some of these details, the programmer is reliant upon assumptions imposed in the name of abstraction, which can be costly in terms of scalability and/or performance, with other aspects such as the details of data transposition still needing to be considered. A small change in how the data is partitioned or distributed may result in code rewriting. Orienting parallelism around types, however, can relieve the end programmer from writing the low level details of parallelisation where these can be derived from the type information in the code. (In the listing below, line 5 is a sketch reconstructed from the description which follows the listing; the on type, placing the single copy on process zero, is an assumption.)

5  var S:array[complex,n,n]::allocated[row[]::single[on[0]]];
6  var A:array[complex,n,n]::allocated[row[]::horizontal[p]::single[evendist[]]];
7  var B:array[complex,n,n]::allocated[col[]::horizontal[p]::single[evendist[]]];
8  var C:array[complex,n,n]::allocated[row[]::vertical[p]::single[evendist[]]]::share[B];
9
10 var sins:array[complex,n/2]::allocated[multiple[]];
11 computeSin(sins);
12 proc 0 { readfile(S, "image.dat") };
13
14 A:=S;
15
16 for j from 0 to A.localblocks - 1 {
17    var bid:=A.localblockid[j];
18    for i from A[bid].low to A[bid].high FFT(A[bid][i - A[bid].low], sins);
19 };
20
21 B:=A;
22
23 for j from 0 to C.localblocks - 1 {
24    var bid:=C.localblockid[j];
25    for i from C[bid].low to C[bid].high FFT(C[bid][i - C[bid].low], sins);
26 };
27
28 S:=C;
29 proc 0 { writefile(S, "image.dat") };

Listing 3: 2D parallel FFT Mesham code

Listing 3 gives the parallel aspects of the 2D FFT case study implemented in Mesham. For brevity the actual FFT computation algorithm, a Cooley-Tukey implementation, and other miscellaneous functions have been omitted. At line 5 the two dimensional array S is declared to comprise complex numbers of size n in each dimension, allocated in row major fashion, with a single copy residing upon process zero. This array is used to hold the initial data, an image which is read in at line 12 by process zero; the results of the transform are then placed into it and written back out at line 29. Line 6 declares variable A, again n by n complex numbers, but this time it is partitioned via the horizontal type into p distinct partitions which are evenly distributed amongst the processes using the evendist type. This even distribution follows a cyclical approach where partitioned blocks are allocated to process after process and can cycle around if there are more blocks than processes. Line 7 declares the 2D array B to be sized, partitioned and distributed in a similar manner to A, but this array is indexed column major. The last partitioned array to be declared, C, which uses vertical rather than horizontal partitioning, shares the underlying memory with B; in effect it is a different view or abstraction of some existing memory.

Line 10 declares the sinusoid array. Using the multiple type without further information results in allocation to the memory of all processes, and this is used to compute the pre-calculated constant sinusoid parameters needed by the FFT kernel. Note that in this case no explicit array ordering is provided; in the absence of further information arrays default to row major ordering. In fact we could have omitted all row types in the code if we had wished, but these are provided to make explicit to the reader how the partitioned data is allocated and viewed.

The assignment A:=S at line 14 will result in a scattering of the data held in S, which is located on process zero, amongst the processes into each partitioned block of A. In the loop at lines 16 to 19, each process will iterate through the blocks allocated to it and for each block perform the 1D FFT on individual rows. The assignment from A to B at line 21 essentially transposes A and shuffles the blocks of array A across processes. This allows each process to perform linear FFT on the other dimension locally. Because C uses vertical partitioning and is a row major view of the data, performing row-wise FFT on C at lines 23 to 26 is the same as performing column-wise FFT on B. The last assignment, S:=C, gathers the data distributed amongst the processes into array S held on process zero.

From the code listing it can be seen that the number of partitioned data blocks is two times the number of processes. Uneven partition sizes, for instance when the number of partitions does not divide evenly into the data size, are transparent to the programmer.
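As a concrete illustration of the cyclical distribution (the process count here is assumed for the sake of example): if the code is run on four processes then eight blocks are created, and evendist places block b on process b mod 4, so process zero holds blocks 0 and 4, process one holds blocks 1 and 5, and so on.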
The types also abstract how and where the data is decomposed, and processes can hold any number of blocks, with the allocation, communication and transposition all taken care of by the type library. In conventional languages and frameworks it can add considerable complexity when blocks of data are of uneven sizes and unevenly distributed, but using the type oriented approach this is all handled automatically. The programmer need not worry about these low level and tricky aspects, unless they want to, where additional type information can be used to override the default behaviour.

It is often the case that programmers wish to get their parallel codes working in the first instance and then further tune and specialise if required. Often decisions made early on, such as the method of data decomposition, might retrospectively turn out not to be correct but can be very difficult to change without rewriting large portions of the code. Conversely, when orienting the code around types, changing the method of data decomposition is as simple as modifying a type. This abstracts exactly what data is where and allows the programmer not only to tune but also to experiment with different distribution options and how these affect their code's performance and scalability.

In listing 3 the evendist type has been used to perform an even cyclical distribution of the data. Instead, the programmer can change one or more of the distribution mechanisms to another distribution type such as array distribution. The arraydist type allows the programmer to explicitly specify which blocks reside in the memory of which processes using an integer array. The index of each element in the array corresponds to the block Id and the value held there to the process it resides upon. Listing 4 illustrates using array distribution and is a snippet of the Mesham FFT code declaring the distributed arrays. At line 1 the array d is declared to be an array of p integers and, in the absence of further information, a copy of this is by default allocated on all processes. At lines 3 to 5 every even numbered block Id is allocated to process one and every uneven block Id to process two (the declaration of d and the loop body are sketched here from this description). The arrays A, B and C are then declared to use the arraydist type with the array d controlling which blocks belong where. Apart from modifying the type and the code for the distribution, all other aspects of the FFT code in listing 3 remain unchanged, and the programmer can explicitly change which blocks belong where by modifying the values of the distribution array d.

1 var d:array[Int,p];
2
3 for i from 0 to p - 1 {
4    d[i]:=(i % 2) + 1;
5 };
6
7 var A:array[complex,n,n]::allocated[row[]::horizontal[p]::single[arraydist[d]]];
8 var B:array[complex,n,n]::allocated[col[]::horizontal[p]::single[arraydist[d]]];
9 var C:array[complex,n,n]::allocated[row[]::vertical[p]::single[arraydist[d]]]::share[B];

Listing 4: Mesham FFT example using array based data distribution
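As a worked illustration (the value of p is assumed for the example): with p = 8 the loop above sets d to {1, 2, 1, 2, 1, 2, 1, 2}, so blocks 0, 2, 4 and 6 reside in the memory of process one and blocks 1, 3, 5 and 7 in that of process two. Declaring d[6] to be 2 instead would place block 6 on process two with no other change to the code.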
Performance and scalability

Whilst the programmability benefits of orienting parallel codes around types have been argued, it is equally important to consider the performance and scalability characteristics of this programming model. We have tested the Mesham version in code listing 3, which uses a Cooley-Tukey FFT kernel, against the Fastest Fourier Transform in the West version 3 (FFTW3)[5] library. FFTW is a very commonly used and mature FFT calculation framework which looks to optimise the computational aspect of FFT by selecting the most appropriate solver kernel based upon parameters of the data. Performance testing has been carried out on HECToR, the UK National Supercomputer, a Cray XE6 with 32 cores per node, 32 GB RAM per node and interconnection via the Gemini router. Data distribution in both test codes is even, cyclical distribution with one block of data per process. The results presented in this section are the average of three runs.

Figure 1: Performance of Mesham FFT version compared to FFTW3

Figure 1 illustrates the performance of the FFT example in Mesham compared with the same problem solved using FFTW3. It can be seen that on small numbers of processes the performance is very similar and both exhibit good scalability as the number of cores is increased initially. There is some instability in the FFTW3 version when comparing runs using an even and an uneven partitioning of data. Previous tests using FFTW2 illustrated that that older version of the library performed poorly when run in parallel with uneven block sizes of data. Ironically, in our tests the latest version, FFTW3, exhibits better performance when run with an uneven partitioning of data compared to an even partitioning. The performance of the Mesham version is more stable and predictable. The rich amount of information available at compile time and runtime means that the language is able to select the most appropriate form of communication for specific situations automatically. The one size fits all approach to communication adopted by many existing libraries is often optimised for specific cases and does not necessarily perform well in all configurations. At medium core counts the performance of the Mesham FFT version is more favourable than that of FFTW3, although as we go to larger numbers of processes the Mesham version does degrade faster. Due to the slightly larger overhead of the presently implemented Mesham parallel runtime system, performance degradation sets in somewhat earlier for this strong scaling scenario than in the highly tuned Cray MPI implementation.

Due to the abstractions provided by the PGAS memory model and our use of types, it is entirely possible to maintain correctness of the code whilst running on different architectures, although this might have a performance impact. The implementation of Mesham is such that all architecture dependent aspects, for example how specific communications are implemented, are directed through a runtime abstraction layer which can be modified to suit different target machines. The runtime abstraction layer used for the experiments in this paper maps each PGAS processor to a single process, with processes connected via MPI. A threading layer also exists which Mesham codes can use unmodified, and an avenue of further work will be to explore how we might optimise performance by selecting or mixing these layers.
As previously noted, by changing types the programmer can very easily change key aspects of their code or experiment with different choices such as data decomposition, and this promotes easy tuning to specific architectures. In contrast, with more traditional approaches such as MPI, porting codes to different architectures or mixing paradigms, such as OpenMP with MPI, often requires substantial and in-depth changes to be made.
The FFT case study that we have considered in listing 3 simply illustrates the code in a single function. It is worth mentioning the suitability to more advanced codes, or even library development, where data using these complex type representations is passed between functions. In the current implementation of Mesham the entire type chain of a variable must be specified in the formal arguments of a function, which means that the compiler has detailed knowledge of the variables passed to a function and can perform appropriate static analysis and optimisations upon them. At runtime, when passed as an actual argument to a function, data will already have been allocated, which occurs as part of a variable's declaration. The Mesham runtime library keeps track of the state of all program variables, which means that during execution functions not only know the exact type of data but also its current state. The result is that, for the FFT example, no redistribution of the data would be required if it were passed to a function.
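As a sketch of such a formal argument (the function declaration syntax shown here is an assumption; the point is simply that the complete type chain appears in the argument):

   function void processRows(var X:array[complex,n,n]::allocated[row[]::horizontal[p]::single[evendist[]]]) {
      ...
   };

With the full type chain visible, the compiler can statically check that an array such as A from listing 3 may be passed directly, and at runtime no redistribution is needed.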
Conclusions

This paper is not intended to describe the entire Mesham language, but to illustrate the central ideas behind the programming paradigm and demonstrate the advantages when it is applied to the PGAS memory model. Aspects of this paradigm could, in the future, be used as part of existing PGAS languages to get the best of both worlds: a solution which parallel programmers are already familiar with, but with the added programmability benefits of our approach.

The rationale behind type oriented parallelism is not only to generate a highly efficient parallel executable but also to enable programmers to write the source program in an intuitive and abstract style. The compiler essentially helps the programmer determine various sophisticated details of parallelisation as long as such details can be derived from the types in the source program. Optimisation algorithms can also benefit from such additional type information. We have used a 2D parallel FFT case study to evaluate the success of our approach, both in terms of programmability with the benefits this affords, and also performance when compared to more traditional solving solutions. It has been seen how the Mesham programmer can architect their code at a high level using language default behaviour and then, by modifying type information, further specialise and tune, whereas existing PGAS solutions often impose specific “best effort” decisions upon the programmer. By using types programmers can even experiment with different choices, such as data decomposition, which traditionally require a much greater effort to modify.

We have compared the performance of the Mesham FFT case study against that of FFTW3. Whereas FFTW3 optimises heavily based upon the computation aspect, our version, where the compiler and runtime optimise the communication based upon the rich amount of type information, performs comparably and in some instances favourably. There is further work to be done investigating why the performance of the Mesham version decreases more severely than FFTW past the optimal number of processes, and we are looking to extend our version to 3D FFT with additional data decompositions such as pencil. We also believe that Mesham would make a good platform for exploring heterogeneous PGAS, where the complexity of managing data stored on different devices can be abstracted via types. As discussed earlier, all machine dependent aspects are currently managed via a runtime abstraction layer, and further development of this could allow for existing codes to run unmodified on these heterogeneous machines.
References

[1] N. Brown. Mesham language specification, v.1.0. [online], 2013. Available at .

[2] UPC Consortium. UPC language specifications, v1.2. Lawrence Berkeley National Lab Tech Report, LBNL-59208, 2005.

[3] P. Hilfinger et al. Titanium language reference manual. U.C. Berkeley Tech Report, UCB/EECS-2005-15, 2005.

[4] G. Luecke and J. Coyle. High performance Fortran versus explicit message passing on the IBM SP-2. Technical Report, Iowa State University, 1997.

[5] M. Frigo and S. Johnson. FFTW: An adaptive software architecture for the FFT. IEEE Conference on Acoustics, Speech, and Signal Processing, 3:1381-1384, 1998.

[6] R. Numrich and J. Reid. Co-array Fortran for parallel programming. ACM SIGPLAN Fortran Forum, 17(2):1-31, 1998.

[7] Cray Inc., Seattle. Chapel language specification (version 0.82). [online], October 2011. Available at http://chapel.cray.com/.

[8] D. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2):123-169, 1998.