Let's Annotate to Let Our Code Run in Parallel
Patrizio Dazzi
ISTI - CNR
[email protected]
ABSTRACT
This paper presents an approach that exploits Java annotations to provide the meta information needed to automatically transform plain Java programs into parallel code that can be run on multicore workstations. Programmers just need to decorate the methods that will eventually be executed in parallel with standard Java annotations. Annotations are automatically processed at launch time and parallel byte-code is derived. Once in execution, the program automatically retrieves the information about the executing platform and evaluates the information specified inside the annotations to transform the byte-code into a semantically equivalent multithreaded version, depending on the target architecture features. The results returned by the annotated methods, when invoked, are futures with a wait-by-necessity semantics.
Keywords
Asynchronous method invocation, wait-by-necessity, annotations, skeletons, grids.
1. INTRODUCTION
Developing parallel applications is, in general, much more complex than developing sequential applications. Besides being in charge of the whole parallel application structure, programmers have to deal with communications, synchronization, mapping and scheduling. As programmers usually write applications interacting directly with the middleware, the whole process is cumbersome and error prone. So far, several efforts have been spent to face this problem, and several approaches have been conceived to design high-level programming languages/environments that can automate most of the tasks required to implement working and efficient parallel applications.

Other approaches offer a lower abstraction level but allow more programming freedom and guarantee a higher level of personalization. In other words, programmers can customize their applications and deal with some aspects related to parallelism such as, for example, the parallelism degree and the parallel program structure. The approaches belonging to this category force programmers to adequately structure the parallel application they want to implement. Typically, such approaches allow the application "business logic" to be separated from the activities required to coordinate and synchronize parallel processes [4]. On the other side, several environments have been proposed that rely on more classical, low-level programming paradigms. However, all these approaches, while leaving the programmer a higher freedom of structuring the parallel application in an arbitrary way, require the programmer to explicitly deal with all the awkward details mentioned above.

In this work we describe the Parallel Abstraction Layer (PAL), originally presented in [16]. To avoid the problems typically present in fully automated parallel approaches [5], PAL leaves to the programmer the responsibility of choosing which parts of the code have to be computed in parallel, through the insertion of non-functional requirements in the source program code. Using the information provided by programmers, PAL transforms the program code into a parallel one.
2. PARALLEL ABSTRACTION LAYER
PAL is an approach conceived around a quite simple but very embraceable, well-known opinion: "...people know the application domain and can better decompose the problem, compilers can better manage data dependence and synchronization" [17]. The PAL approach to parallel programming fundamentally relies on the programmer's knowledge to properly "structure" the parallel schema of an application, and then on the compiler/runtime tool's ability to efficiently implement such a schema. Basically, this almost matches the algorithmic skeletons approach [11]. PAL represents a general-purpose mechanism based on a very simple structuring of applications. In fact, the programmer is only required to specify some hints that are exploited by the runtime support to implement a parallel version of the application code. These hints are specified through the annotation mechanism provided by Java [1]. The programmers are required to give some kind of "parallel structure" to the code directly at the source code level, as happens in the algorithmic skeleton case. However, the approach discussed in this work presents at least two additional advantages.

• First, annotations can be ignored, in which case the semantics of the original sequential code is preserved. This means that the programmer's application code can be run through a classical compiler/interpreter suite and debugged using normal debugging tools.

• Second, annotations are processed at load time, typically exploiting the reflection properties of the hosting language. As a consequence, while handling annotations, knowledge can be exploited which is not available at compile time (e.g. the running machines), and this can lead to more efficient parallel implementations of the user application.

In order to experiment with the feasibility of the proposed approach, we considered the languages that natively support code annotations for developing a validation prototype. Both the Java and .NET frameworks provide an annotation mechanism. They also provide an intermediate language (IL), portable among different computer architectures (compile once, run everywhere), that retains some information typically only available at source code level (e.g. code annotations) which can be used at runtime for optimization purposes. The transformation process is done at load time, namely the time when we have all the information needed to optimize the restructuring process with respect to the available underlying resources. The code transformation works at the IL level, thus it does not require the application source code to be sent to the target architecture. Furthermore, IL transformation in general introduces less overhead than source code transformation followed by re-compilation.

PAL transforms the annotated code into a parallel version by asynchronously executing parts of the original code. The parts to be executed asynchronously are singled out by the user annotations. In particular, we used Java, and therefore the most natural choice was to select method calls as the parts to be asynchronously executed. PAL translates the IL code of the "parallel" parts by structuring them according to the features of the target architecture. Asynchronous execution of method code is achieved by exploiting the concept of future [9, 10]. When a method is called asynchronously it immediately returns a future, that is, a stub "empty" object. The caller can then go on with its own computations and use the future object just when the method call return value is actually needed.
If in the meanwhile the return value has already been computed, the call to reify the future succeeds immediately; otherwise it blocks until the actual return value is computed and then returns it. PAL programmers just have to put a @Parallel annotation on the line right before a method declaration to mark that method as a candidate for asynchronous execution. This allows keeping applications similar to normal sequential applications. Programmers may simply run the application through standard Java tools to verify that it is functionally correct. The PAL approach also avoids the proliferation of source files and classes, as it works by transforming IL code, but it raises several problems related to data sharing management. As an example, methods annotated with @Parallel cannot access class fields: they can only access their own parameters and the local method variables. This is due to the impossibility of intercepting all the accesses to the class fields. PAL then automatically performs at load time the activities aimed at achieving the asynchronous and parallel execution of the PAL-annotated methods and at managing any consistency-related problems, without any further programmer intervention.
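To make the wait-by-necessity semantics concrete, the following minimal sketch reproduces the same behaviour with the standard java.util.concurrent API. It is only an illustration of the semantics that PAL provides automatically through byte-code transformation, not PAL's actual implementation; the class and method names are ours.

    import java.util.concurrent.*;

    // Illustrative sketch: mimics the wait-by-necessity behaviour that PAL
    // gives to @Parallel methods, using standard java.util.concurrent
    // primitives instead of PAL's own load-time byte-code rewriting.
    public class WaitByNecessityDemo {

        // A long-running task standing in for a PAL-annotated method body.
        static int heavyComputation(int n) {
            // ... expensive work would happen here ...
            return n * n;
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // The asynchronous call returns immediately with a future
            // (the "empty" stub object described above).
            Future<Integer> result = pool.submit(() -> heavyComputation(21));

            // The caller proceeds with other work while the method runs.
            doSomethingElse();

            // Reifying the future: returns at once if the value is ready,
            // otherwise blocks until the computation completes.
            System.out.println("result = " + result.get());

            pool.shutdown();
        }

        static void doSomethingElse() { /* unrelated caller-side work */ }
    }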
3. THE PAL PROTOTYPE
To validate our approach, we implemented a PAL prototype in Java, as it provides a manageable intermediate language (Java byte-code [22]) and natively supports code annotations. The prototype takes the program byte-code as input and transforms it into a parallel byte-code. In order to do this it uses ASM [6], a Java byte-code manipulation framework. The current prototype accepts only one kind of attribute to the @Parallel annotation: a parDegree denoting the maximum number of processing elements to be used for the method execution. PAL uses such information to choose between the multithreaded and the distributed version. This choice is driven by the number of processors/cores available on the host machine: if the machine owns a sufficient number of processors, the annotated byte-code directly compiled from the user code is transformed into a semantically equivalent multithreaded version.
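As a rough illustration of how such a load-time inspection can be carried out, the sketch below uses the ASM visitor API to detect methods carrying a @Parallel annotation and to read its parDegree attribute. It is a minimal sketch under our own assumptions (the annotation descriptor and package name are hypothetical, and we use the current ASM 9 API rather than the release available when the prototype was built); the actual prototype does more, since it also rewrites the method bodies.

    import org.objectweb.asm.*;

    // Minimal sketch: scan a class's byte-code with ASM and report methods
    // annotated with a hypothetical pal.Parallel annotation. Descriptor and
    // package names are assumptions, not the prototype's actual ones.
    public class ParallelScanner {

        static final String PARALLEL_DESC = "Lpal/Parallel;"; // assumed

        public static void scan(byte[] classBytes) {
            ClassReader reader = new ClassReader(classBytes);
            reader.accept(new ClassVisitor(Opcodes.ASM9) {
                @Override
                public MethodVisitor visitMethod(int access, String name,
                        String desc, String signature, String[] exceptions) {
                    return new MethodVisitor(Opcodes.ASM9) {
                        @Override
                        public AnnotationVisitor visitAnnotation(
                                String annDesc, boolean visible) {
                            if (PARALLEL_DESC.equals(annDesc)) {
                                // Read the parDegree attribute, if present.
                                return new AnnotationVisitor(Opcodes.ASM9) {
                                    @Override
                                    public void visit(String attr, Object v) {
                                        System.out.println(name + desc
                                                + " is @Parallel, " + attr
                                                + " = " + v);
                                    }
                                };
                            }
                            return super.visitAnnotation(annDesc, visible);
                        }
                    };
                }
            }, 0);
        }
    }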
In order to enable the PAL features, the programmer only has to add a few lines of code. As an example, consider a program computing the Mandelbrot set. The Mandelbrot class uses a @Parallel annotation to state that all the input data (e.g. the createLines calls) should be computed in parallel, with a specified parallelism degree. Unfortunately, due to some Java limitations, the programmer must specify an ad-hoc return type (PFFuture), and consequently return an object of this type. PFFuture is a template defined by the PAL framework. It represents a container needed to enable the future mechanism; the type specified as its argument is the original method return type. Initially, we tried to devise a more transparent mechanism for the future implementation, without any explicit future declaration. It consisted in the load-time substitution of the return type with a PAL type inheriting from the original one. In our idea, the PAL type would have intercepted any dereferentiation of the original type, following the wait-by-necessity [8] semantics. Unfortunately, we had to face two Java limitations that constrain the prototype to the current solution. These limitations are the impossibility of extending some widely used Java BCL classes (String, Integer, ...), because they are declared final, and the impossibility of intercepting all class field accesses.
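The fragment below sketches what such an annotated class could look like. It is our reconstruction from the description above: the method signature, the pal.* package names, and the PFFuture constructor are assumptions rather than the prototype's exact code.

    import pal.Parallel;   // assumed package for the PAL annotation
    import pal.PFFuture;   // assumed package for the PAL future container

    public class Mandelbrot {

        // Candidate for asynchronous execution: PAL may run up to 4
        // createLines calls in parallel. Note that the method only touches
        // its parameters and locals, never class fields.
        @Parallel(parDegree = 4)
        public PFFuture<int[]> createLines(int startLine, int numLines,
                                           int width) {
            int[] pixels = new int[numLines * width];
            for (int i = 0; i < pixels.length; i++) {
                // ... per-pixel Mandelbrot iteration would go here ...
                pixels[i] = 0;
            }
            // The ad-hoc return type wraps the original int[] result
            // (wrapping via constructor is our assumption).
            return new PFFuture<int[]>(pixels);
        }
    }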
Main class, the user just asks to transform the
Main and the
Mandelbrot classes with PAL, that is, to processthe relevant PAL annotations and to produce an executableIL which exploits parallelism according to the features (hwand sw) of the target architecture where the
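A corresponding Main could look roughly as follows. The PAL entry point shown here (a static transform call) is purely hypothetical, since the paper does not spell out the launcher API; the final get() simply mirrors the explicit dereferentiation of PFFuture values discussed above.

    import pal.PFFuture; // assumed package, as above

    public class Main {
        public static void main(String[] args) {
            // Hypothetical PAL entry point: process the annotations in these
            // classes at launch time and rewrite their byte-code accordingly.
            // The real prototype may expose this step differently.
            PAL.transform(Main.class, Mandelbrot.class);

            Mandelbrot m = new Mandelbrot();

            // After transformation this call returns immediately with a future.
            PFFuture<int[]> lines = m.createLines(0, 10, 600);

            // ... other work can proceed here ...

            // Touching the value reifies the future: it blocks only if the
            // asynchronous computation has not finished yet (get() is our
            // assumed accessor name).
            int[] pixels = lines.get();
            System.out.println("computed " + pixels.length + " pixels");
        }
    }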
4. RELATED WORK
PAL offers a simple yet expressive technique for parallel programming. By exploiting "runtime compilation" it adapts the executable code to different architectures. It does not introduce a new or different paradigm; rather, it exploits parallelism at the method call level. A certain number of systems based on similar ideas have been proposed so far. However, although different experiments exist in the scenario of the so-called concurrent object-oriented languages (COOLs) [19], we decided to discuss only those actually very similar to PAL. In [18] the authors propose a Java version of OpenMP, giving programmers the possibility to specify some pragmas inside comments in the source code. These pragmas are eventually used by a specific Java HPC compiler to transform the original program into a different one exploiting parallelism, for instance through loop parallelization. There are three important differences between this approach and ours. First of all, PAL works at the method level, making method invocations asynchronous, while the work presented by Klemm et al. mainly works at the loop-parallelization level. Another very important difference is related to the moment in which the transformation is made: their approach works at compile time starting from source code, while PAL directly transforms the byte-code at load and run time. As a consequence, PAL may optimize its transformation choices exploiting the knowledge available on the features of the computing resources of the target execution platform. Finally, PAL uses Java annotations to enrich the source code, whereas the Java version of OpenMP uses source code comments. The former approach exploits basic Java features, in particular annotations, whose type and syntax are checked by the compiler, with the limitation that annotations cannot be placed everywhere in the source code. The latter solution is instead more "artificial", but it is not limited to classes, methods and class fields (as Java annotations are) and can also be applied to pure Java code blocks. If we limit the discussion to the approaches that transform a sequential object-oriented program into a concurrent one by replacing method invocations with asynchronous calls (where parallelism can be extracted from sequential code without modification, without changing the sequential semantics, and where the wait for return values can be postponed to the next usage, possibly using future objects), the number of approaches similar to PAL is small. However, some other approaches share single points/features with our PAL approach. Java made popular the remote method invocation (RMI) for interaction between objects in disjoint memories. The same properties that apply to parallelizing sequential local calls apply to remote ones, with the advantage that remote calls do not rely on shared memory. Parallelizing RMIs scales much better than local calls, as the number of local processors does not limit the number of parallel tasks. This led to many implementations of asynchronous RMIs. ProActive is a popular object-oriented distributed programming environment supporting asynchronous RMIs [21]. It offers a primitive class that should be extended to create remotely callable active objects, as well as a runtime system to remotely instantiate this type of object. Any call to an active object is done asynchronously, and values are returned using future objects. Compilation is completely standard, but instantiation must be done supplying the new object's location.
All active objects must descend from the primitive active object class, so existing code must be completely encapsulated to become active, as there is no multiple inheritance in Java. Although concurrency is available through asynchronous calls, scalable parallelism is obtained by creating several distributed objects, instead of calling several concurrent methods, which is not always a natural way of structuring the parallelism. Some other systems offer asynchronous remote method calls at different levels, like JavaParty [20], JJPF [13], Muskel [3, 15, 14] and Ibis [23]. They provide a lower level of abstraction with respect to PAL, being more concerned with the performance of RMI and the efficient implementation of asynchronous mechanisms. Usually they offer a good replacement for the original RMI system, either simplifying object declaration or speeding up the communication. These systems rely on specific compilers to generate code, although Ibis generates standard JVM byte-code that can therefore be executed on any standard JVM.

[Figure 1: Mandelbrot computation: efficiency comparison with different image resolutions, processing element numbers and task computational weights (x-axis: parallel degree; y-axis: efficiency).]
5. EXPERIMENTAL RESULTS
To validate our approach we ran some experiments with the current prototype. We conducted our tests on a hyper-threaded dual-processor workstation (Intel Xeon 2 GHz, Linux kernel 2.6). Our test application is a fractal image generator, which computes sections of the Mandelbrot set. We picked the Mandelbrot set computation as it is a very popular benchmark for embarrassingly parallel computation, and PAL addresses exactly these kinds of computations. Most of the time, the implementation of such applications requires a significant programming effort, despite their being "easy", embarrassingly parallel problems; this effort is far greater than the one required to run the same kind of application exploiting PAL. To study in more detail the behavior of the transformed version in several contexts, we ran the fractal generator with different combinations of resolution (600x400, 1200x800, 2400x1600) and task computational weights, from 5 up to 40 lines at a time. Clearly, when the task size (the number of lines to compute) increases, the total number of tasks decreases.
6. CONCLUSION AND FUTURE WORK

In this paper we presented PAL, an approach for easing multicore SPMD programming. PAL exploits the programmer's knowledge, provided through annotations, to restructure Java programs and make them parallel. The whole process is driven by the analysis of the degree of parallelism specified through the annotations. This process is executed at launch time, directly at the intermediate language level. This allows obtaining, and exploiting at the right time, all the information needed to parallelize the applications with respect to the parallel tools available on the target execution environment and to the user-supplied non-functional requirements. A load-time transformation allows hiding most parallelization issues. To validate the approach we developed a PAL prototype that we used to conduct some preliminary experiments. The results are encouraging and show that the overhead introduced by PAL is negligible, while keeping the programmer effort needed to parallelize the code low. Nevertheless, the prototype presents some limitations. Basically, class fields are not accessible from PAL-annotated methods; moreover, the programmer has to include an explicit dereferentiation of the objects returned by PAL-annotated methods. In the near future we plan to refine the implementation to address some of these issues, as well as to extend the approach to be useful also on distributed architectures like Grids or Clouds (and Federations of Clouds too [7, 12]). We think this will be interesting and will support cross-fertilization between these concepts, as happened in [2].
7. REFERENCES
... Making Grids Work, pages 3-15. Springer, 2008.

[3] M. Aldinucci, M. Danelutto, and P. Dazzi. Muskel: an expandable skeleton environment. Scalable Computing: Practice and Experience, 8(4), 2007.

[4] M. Aldinucci, M. Danelutto, and M. Vanneschi. Autonomic QoS in ASSIST grid-aware components. In Proceedings of the 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2006), 2006.

[5] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1989.

[6] E. Bruneton, R. Lenglet, and T. Coupaye. ASM: a code manipulation tool to implement adaptable systems. In Adaptable and Extensible Component Systems, Grenoble, France, Nov. 2002.

[7] E. Carlini, M. Coppola, P. Dazzi, L. Ricci, and G. Righetti. Cloud federations in Contrail. In Euro-Par 2011: Parallel Processing Workshops, pages 159-168. Springer, 2012.

[8] D. Caromel. Service, asynchrony, and wait-by-necessity. Journal of Object-Oriented Programming, Nov/Dec 1989.

[9] D. Caromel and L. Henrio. A Theory of Distributed Objects. Springer-Verlag, 2005.

[10] D. Caromel, L. Henrio, and B. Serpette. Asynchronous and deterministic objects, 2004.

[11] M. Cole. Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming. Parallel Computing, 30(3):389-406, 2004.

[12] M. Coppola, P. Dazzi, A. Lazouski, F. Martinelli, P. Mori, J. Jensen, I. Johnson, and P. Kershaw. The Contrail approach to cloud federations. In Proceedings of the International Symposium on Grids and Clouds (ISGC 12), 2012.

[13] M. Danelutto and P. Dazzi. A Java/Jini framework supporting stream parallel computations. In ParCo 2005, John von Neumann Institute for Computing Series, volume 33, pages 681-688, 2005.

[14] M. Danelutto and P. Dazzi. Joint structured/non structured parallelism exploitation through data flow. 2006.

[15] M. Danelutto and P. Dazzi. Joint structured/unstructured parallelism exploitation in Muskel. In Computational Science - ICCS 2006, pages 937-944. Springer, 2006.

[16] M. Danelutto, M. Pasin, M. Vanneschi, P. Dazzi, D. Laforenza, and L. Presti. PAL: exploiting Java annotations for parallelism. In Achievements in European Research on Grid Systems, pages 83-96. Springer, 2008.

[17] A. S. Grimshaw. The Mentat computation model: data-driven support for object-oriented parallel processing. Technical Report CS-93-30, 1993.

[18] M. Klemm, R. Veldema, M. Bezold, and M. Philippsen. A proposal for OpenMP for Java. In Proceedings of the International Workshop on OpenMP, June 2006.

[19] M. Philippsen. A survey of concurrent object-oriented languages. Concurrency: Practice and Experience, 12(10):917-980, 2000.

[20] M. Philippsen and M. Zenger. JavaParty - transparent remote objects in Java. Concurrency: Practice and Experience, 9(11):1225-1242, Nov. 1997.

[21] OASIS team. ProActive home page, 2006.

[22] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Sun Microsystems Press, second edition, 2004.

[23] R. V. van Nieuwpoort, J. Maassen, G. Wrzesinska, R. Hofman, C. Jacobs, T. Kielmann, and H. E. Bal. Ibis: a flexible and efficient Java-based grid programming environment.