Simplifying Parallelization of Scientific Codes by a Function-Centric Approach in Python
Jon K. Nilsen(1,2), Xing Cai(3,4), Bjørn Høyland(5) and Hans Petter Langtangen(3,4)

(1) USIT, P.O. Box 1059 Blindern, N-0316 Oslo, Norway
(2) Department of Physics, P.O. Box 1048 Blindern, University of Oslo, N-0316 Oslo, Norway
(3) Center for Biomedical Computing, Simula Research Laboratory, P.O. Box 134, N-1325 Lysaker, Norway
(4) Department of Informatics, P.O. Box 1080 Blindern, University of Oslo, N-0316 Oslo, Norway
(5) Department of Political Science, P.O. Box 1097 Blindern, University of Oslo, N-0317 Oslo, Norway

E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract.
The purpose of this paper is to show how existing scientific software can be parallelized using a separate thin layer of Python code where all parallel communication is implemented. We provide specific examples of such layers of code, and these examples may act as templates for parallelizing a wide set of serial scientific codes. The use of Python for parallelization is motivated by the fact that the language is well suited for reusing existing serial codes programmed in other languages. The extreme flexibility of Python with regard to handling functions makes it very easy to wrap up decomposed computational tasks of a serial scientific application as Python functions. Many parallelization-specific components can be implemented as generic Python functions, which may take as input those functions that perform concrete computational tasks. The overall programming effort needed by this parallelization approach is rather limited, and the resulting parallel Python scripts have a compact and clean structure. The usefulness of the parallelization approach is exemplified by three different classes of applications in natural and social sciences.
Submitted to: Computational Science & Discovery
1. Introduction
Due to the limited computing power of standard serial computers, parallel computing has become indispensable for investigating complex problems in all fields of science. A frequently encountered question is how to transform an existing serial scientific code into a new form that is executable on a parallel computing platform. Although portable parallel programming standards, such as MPI and OpenMP, have greatly simplified the programming work, the task of parallelization may still be quite complicated for domain scientists. This is because inserting MPI calls or OpenMP directives directly into an existing serial code often requires extensive code rewrite as well as detailed knowledge of and experience with parallel programming.

The hope for non-specialists in parallel computing is that many scientific applications possess high-level parallelism. That is, the entire computational work can be decomposed into a set of individual (and often collaborative) computational tasks, each of coarse grain, which can be performed by an existing piece of serial code. Depending on the specific application, the decomposition can be achieved by identifying a set of different parameter combinations, or (fully or almost) independent computations, or different data groups, or different geometric subdomains. For a given type of decomposition, the parallelization-induced programming components, such as work partitioning, domain partitioning, communication, load balancing, and global administration, are often generic and independent of specific applications. These generic components can thus be implemented as reusable parallelization libraries once and for all. This is what we exemplify in the present paper.

It is clear that a user-friendly parallelization approach relies on at least two factors: (1) the existing serial code should be extensively reused; (2) the programming effort by the end user must be limited. To achieve these goals we suggest using Python to wrap up pieces of existing serial code (possibly written in other languages), and implementing the parallelization tasks in separate and generic Python functions.

Python [1] is an extremely expressive and flexible programming language at its core. The language has been extended with numerous numerical and visualization modules such as NumPy [2] and SciPy [3]. The two requirements of a user-friendly parallelization mentioned above are actually well met by Python. First of all, Python is good at inter-operating with other languages, especially Fortran, C, and C++, which are heavily used in scientific codes. Using wrapper tools such as F2PY [4], it is easy to wrap up an existing piece of code in Fortran and C and provide it with a Pythonic appearance.

Moreover, among its many strong features, Python is extremely flexible with handling functions. Python functions accept both positional arguments and keyword arguments. The syntax of a variable set of positional and keyword arguments (known as "(*args, **kwargs)" to Python programmers) allows writing library routines that work with any type of user-defined functions. That is, the syntax makes it possible to call a Python function without revealing the exact number of arguments. It is also straightforward to pass functions as input arguments to a Python function. These features enable a function-centric representation of the scientific code. With such a function-centric approach, we can build a general framework in Python for almost automatic parallelization of the program flow in the original scientific code.
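As a small illustration of this flexibility (our own example, not part of the framework developed later), the following library-style routine invokes an arbitrary user-supplied function with whatever arguments the caller provides:

    def call_any(func, *args, **kwargs):
        # forward an arbitrary number of positional and keyword
        # arguments to the user-supplied function
        return func(*args, **kwargs)

    print(call_any(pow, 2, 10))          # prints 1024
    print(call_any(int, '101', base=2))  # prints 5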
Later examples will convey this idea in detail. The performance of the resulting parallel application will closely follow the performance of the serial application, because the overhead of the parallelization layer in Python is just due to a small piece of extra code, as we assume the main computational work to take place in the Python functions that call up pieces of the original scientific code. In the parallelization layer, good performance can be ensured by using efficient array modules in Python (such as numpy [2]) together with light-weight MPI wrappers (such as pypar [5]). For examples of writing efficient Python code segments for some standard serial and parallel numerical computations, we refer the reader to Cai et al. [6].
Related Work.
In C++, generic programming via templates and object-oriented programming has been applied to parallelizing serial scientific codes. Two examples can be found in [7] and [8], where the former uses C++ class hierarchies to enable easy implementation of additive Schwarz preconditioners, and the latter uses C++ templates extensively to parallelize finite element codes. Many scientific computing frameworks have also adopted advanced programming to incorporate parallelism behind the scene. In these frameworks (see, e.g., [9, 10, 11, 12, 13]) the users can write parallel applications in a style quite similar to serial programming, without being exposed to many parallelizing details. The same applies to frameworks that are specially designed to allow coupling of different serial and parallel components, such as Cactus [14] and MpCCI [15]. The Python programming language, however, has not been widely used to parallelize existing serial codes. The Star-P system [16] provides the user with a programming environment where most of the parallelism is kept behind the scene. Hinsen [17] has combined Python with BSP to enable high-level parallel programming. In addition, quite a number of Python MPI wrappers exist, such as pyMPI [18], pypar [5], MYMPI [19], mpi4py [20, 21], and Scientific.MPI [22]. Efforts in incorporating parallelism via language extensions of Python can be found in [23, 24, 25].

The contribution of the present paper is to show by examples that a function-centric approach using Python may ease the task of parallel scientific programming. This result is primarily due to Python's flexibility in function handling and function arguments. As a result, generic tasks that arise in connection with parallelization can often be programmed as a collection of simple and widely applicable Python functions.
2. A Motivating Simple Example
Suppose we want to carry out a parameter analysis that involves a large number of evaluations of a multi-variable mathematical function f(a_1, ..., a_q). The Python implementation of f may use p positional arguments and k keyword arguments, such that the total p + k arguments contain at least the variables a_1, ..., a_q (i.e., q ≤ p + k). As a very simple example, consider the parabola f(x, a, b, c) = ax² + bx + c with the following Python implementation (q = 4, p = 1, k = 3):

    def func(x, a=0, b=0, c=1):
        return a*x**2 + b*x + c

Suppose we want to evaluate func for a particular set of input parameters chosen from a large search space, where x, a, b, and c vary in specified intervals. The complete problem can be decomposed into three main steps: (1) initialize a set of arguments to func; (2) evaluate func for each entry in the set of arguments; (3) process the set of function return values from all the func calls.

Step (1) calls a user-defined function initialize which returns a list of 2-tuples, where each 2-tuple holds the positional and keyword arguments (as a tuple and a dictionary) for a specific call to func. Step (2) iterates over the list from step (1) and feeds the positional and keyword arguments into func. The returned value is stored in a result list. Finally, step (3) processes the result list in a user-defined function finalize which takes this list as input argument.
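To make the data format concrete, one entry in the list returned by initialize could look like this (the numbers are made up for illustration):

    import numpy
    x = numpy.linspace(0, 10, 50)   # the coordinate array
    # positional arguments as a tuple, keyword arguments as a dictionary:
    entry = ((x,), {'a': 0.5, 'b': -0.2, 'c': 5})
    args, kwargs = entry
    # step (2) then evaluates func(*args, **kwargs)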
A generic Python function that implements the three-step parameter analysis can be as follows:

    def solve_problem(initialize, func, finalize):
        input_args = initialize()
        output = [func(*args, **kwargs) for args, kwargs in input_args]
        finalize(output)
Note that the use of list comprehension in the above code gives a very compact implementation of the for-loop that goes through all the evaluations of func. The initialize, func, and finalize functions are passed to solve_problem as input arguments. These three user-defined functions are independent of solve_problem.

As an example, assume that x is a set of n uniformly distributed coordinates in [0, L], and we vary a and b in [−1, 1], each with m values, while c is fixed at the value 5. For each combination of a and b, we call func with the vector x as a positional argument and the a, b, c values as keyword arguments, and store the evaluation results of func in a list named output. The objective of the computations is to extract the a and b values for which func gives a negative value for one or several of the coordinates x ∈ [0, L]. For this very simple example, the concrete implementation of the initialize and finalize functions can be put inside a class named Parabola as follows:

    class Parabola:
        def __init__(self, m, n, L):
            self.m, self.n, self.L = m, n, L

        def initialize(self):
            x = numpy.linspace(0, self.L, self.n)
            a_values = numpy.linspace(-1, 1, self.m)
            b_values = numpy.linspace(-1, 1, self.m)
            c = 5
            self.input_args = []
            for a in a_values:
                for b in b_values:
                    func_args = ((x,), {'a': a, 'b': b, 'c': c})
                    self.input_args.append(func_args)
            return self.input_args

        def func(self, x, a=0, b=0, c=1):
            return a*x**2 + b*x + c

        def finalize(self, output_list):
            self.ab = []
            for input, result in zip(self.input_args, output_list):
                if min(result) < 0:
                    # pick the a and b values from the keyword arguments
                    self.ab.append((input[1]['a'], input[1]['b']))
Now, to find the combinations of a and b values that make ax² + bx + c < 0, we can write the following two lines of code (assuming m = 100, n = 50, and L = 10):

    problem = Parabola(100, 50, 10)
    solve_problem(problem.initialize, problem.func, problem.finalize)

Note that the desired combinations of a and b values will be stored in the list problem.ab. Also note that we have placed func inside class Parabola, to have all the problem-dependent code in one place; whether to implement func as a stand-alone function or a class method is a matter of taste.

Despite the great mathematical simplicity of this example, the structure of the solve_problem function is directly applicable to a wide range of much more advanced problems. Although initialize and finalize are Python functions with very simple arguments (none and a list, respectively), this is not a limitation of their applicability. For example, the initialize step in our simple example needs values for m, n, and L, the a and b intervals, and so on, which cannot be specified in the generic solve_problem function. To overcome this limitation, the information on m, n, and L can be hard-coded (not recommended), or transferred to initialize through global variables (not recommended in general), or carried with initialize as a state, either as class attributes or as a surrounding scope in a closure. We have chosen the class approach, i.e., class attributes store user-dependent data structures such that the initialize and finalize methods can have the simple input argument structure demanded by the generic solve_problem function. Alternatively, a closure as follows can be used instead of a class (this construct requires some knowledge of Python's scoping rules):

    def initialize_wrapper(m, n, L):
        def initialize():
            x = numpy.linspace(0, L, n)
            a_values = numpy.linspace(-1, 1, m)
            ...
            return input_args
        return initialize

Now, the returned initialize function will carry with it the values of m, n, and L in the surrounding scope. The choice between the class approach and the closure approach, or using global variables in a straightforward global initialize function, is up to the programmer. The important point here is that initialize must often do a lot, and the input information to initialize must be handled by some Python construction. Similar comments apply to finalize.

Let us say that we want to utilize several processors to share the work of all the func evaluations, i.e., the for-loop in the generic solve_problem function. This can clearly be achieved by a task-parallel approach, where each evaluation of func is an independent task. The main idea of the parallelization is to split the for-loop into a set of shorter for-loops, each assigned to a different processor. In other words, we need to split the input_args list into a set of sub-lists for the different processors. Note that this partitioning work is generic, independent of both the func function and the actual arguments in the input_args list. Assuming homogeneous processors and that all the function evaluations are equally expensive, we can divide the input_args list into num_procs (number of processors) sub-lists of equal length.
In case the length of input_args is not divisible by num_procs, we adjust the length of some sub-lists by 1:

    def simple_partitioning(length, num_procs):
        sublengths = [length // num_procs]*num_procs
        for i in range(length % num_procs):    # distribute the remainder
            sublengths[i] += 1
        return sublengths

    def get_subproblem_input_args(input_args, my_rank, num_procs):
        sublengths = simple_partitioning(len(input_args), num_procs)
        my_offset = sum(sublengths[:my_rank])
        my_input_args = input_args[my_offset:my_offset+sublengths[my_rank]]
        return my_input_args

Using the above generic get_subproblem_input_args function, each processor gets its portion of the global input_args list, and a shorter for-loop can be executed there. Note that the syntax of Python lists and numpy arrays has made the function very compact.

The next step of parallelization is to collect the function evaluation results from all the processors into a single global output list. Finally, we let finalize(output) run only on the master processor (assuming that this work does not require parallelization). For the purpose of collecting outputs from all the processors, the following generic Python function can be used:

    def collect_subproblem_output_args(my_output_args, my_rank,
                                       num_procs, send_func, recv_func):
        if my_rank == 0:    # master process
            output_args = my_output_args
            for i in range(1, num_procs):
                output_args += recv_func(i)   # append results from process i
            return output_args
        else:
            send_func(my_output_args, 0)      # send local results to master
            return None
The last two input arguments to the above function deserve some attention. Both send_func and recv_func are functions themselves. In the case of using the pypar wrapper of MPI commands, we may simply pass pypar.send as the send_func input argument and pypar.receive as recv_func. Moreover, switching to another MPI module is transparent with regard to the generic function collect_subproblem_output_args. It should also be noted that most Python MPI modules are considerably more user-friendly than the original MPI commands in C/Fortran. This is because (1) the use of keyword arguments greatly simplifies the syntax, and (2) any picklable (marshalable) Python data type can be communicated directly.

Now that we have implemented the generic functions get_subproblem_input_args and collect_subproblem_output_args, we can write a minimalistic parallel solver as follows:

    def parallel_solve_problem(initialize, func, finalize,
                               my_rank, num_procs, send, recv):
        input_args = initialize()
        my_input_args = get_subproblem_input_args(input_args,
                                                  my_rank, num_procs)
        my_output = [func(*args, **kwargs)
                     for args, kwargs in my_input_args]
        output = collect_subproblem_output_args(my_output, my_rank,
                                                num_procs, send, recv)
        if my_rank == 0:
            finalize(output)

We remark that the above function is generic in the sense that it is independent of the actual implementation of initialize, func, and finalize, as well as of the Python MPI module being used. All problems that can be composed from independent function calls can (at least in principle) be parallelized by the shown small pieces of Python code. As a specific example of using this parallel solver, we may address the problem of evaluating the parabolic function (func and class Parabola) for a large number of parameters. Using the pypar MPI module, and having the problem-dependent code in a module named Parabola and the general function-centric tools in a module named function_centric, the program becomes as follows:

    from Parabola import func, Parabola
    from function_centric import parallel_solve_problem
    problem = Parabola(m=100, n=50, L=10)

    import pypar
    my_rank = pypar.rank()
    num_procs = pypar.size()

    parallel_solve_problem(problem.initialize,
                           func,
                           problem.finalize,
                           my_rank, num_procs,
                           pypar.send, pypar.receive)
    pypar.finalize()
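A script like this is started with the system's MPI launcher; e.g., if the code above is saved in a (hypothetically named) file parabola_main.py, the command mpirun -np 4 python parabola_main.py runs it on four processors, while the same script still behaves as a normal serial program when run with a single process.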
To the reader, it should be obvious from this generic example how to parallelizeother independent function calls by the described function-centric approach.
3. Function-Centric Parallelization
We have shown how to parallelize a serial program that is decomposable into three parts: initialize, calls to func (i.e., a set of independent tasks), and finalize. In this section, we describe how function-centric parallelization is helpful for three important classes of scientific applications: Markov chain Monte Carlo simulations, dynamic population Monte Carlo simulations, and the solution of partial differential equations. We use Python to program a set of simple and generic parallelization functions.
3.1. Markov Chain Monte Carlo

The standard Markov chain Monte Carlo algorithms are embarrassingly parallel and have exactly the same algorithmic structure as the example of parameter analysis in Section 2. This means that the functions initialize, func, and finalize can easily be adapted to Monte Carlo problems. More specifically, the initialize function prepares the set of random samples and other input parameters. Some parametric model is then evaluated in the func function, whereas finalize collects the data returned from all the func calls and prepares for further statistical analysis.

Function-centric parallelization of Markov chain Monte Carlo applications closely follows the example in Section 2. We can reuse the three generic functions get_subproblem_input_args, collect_subproblem_output_args, and parallel_solve_problem, assuming that all the func evaluations are equally costly and all the processors are equally powerful, so there is no need for more sophisticated load balancing. In Section 4.1, we will look at a real-life Markov chain problem from political science (Appendix A gives its mathematical description).

3.2. Population Monte Carlo

A more advanced branch of Monte Carlo algorithms is population Monte Carlo, see [26]. Here, a group of walkers, also called the population, is used to represent a high-dimensional vector, and the computation is carried out by a random walk in the state space. During the computation some of these walkers may be duplicated or deleted according to some acceptance/rejection criteria, i.e., the population is dynamic in time. Population Monte Carlo algorithms have proven useful in a number of fields, spanning from polymer science to statistical sciences, statistical physics, and quantum physics.

Unlike the examples so far, where the computational tasks were totally independent and of static size, population Monte Carlo algorithms may be viewed as an iteration in time where we repeatedly do some work on a dynamic population, including moving the walkers of the population and adjusting the population size, which in a parallel context calls for dynamic load balancing.
A serial implementation of the time integration function can be as follows:

    def time_integration(initialize, do_timestep, finalize):
        walkers, timesteps = initialize()
        output = []
        for step in range(timesteps):
            old_walkers_len = len(walkers)
            output.append(do_timestep(walkers))
            walkers.finalize_timestep(old_walkers_len, len(walkers))
        finalize(output)
The input arguments to the generic time_integration function are three functions: initialize, do_timestep, and finalize. This resembles the three-step structure discussed in Section 2. The do_timestep function can have a unified implementation for all the variants of population Monte Carlo algorithms. The other two input functions are typically programmed as methods of a class that implements a particular algorithm (such as diffusion Monte Carlo in Section 4.2). Here, the initialize method sets up a population object walkers (to be explained below) and determines the number of time steps, while the finalize method can, e.g., store the output for later analysis.

The purpose of the do_timestep function is to implement the work for one time step, including propagating the walkers and adjusting the population. An implementation that is applicable for all population Monte Carlo algorithms may have the following form:

    def do_timestep(walkers):
        walkers.move()
        for walker in range(len(walkers)):
            if walkers.get_marker(walker) == 0:
                walkers.delete(walker)
            elif walkers.get_marker(walker) > 1:
                walkers.append(walker, walkers.get_marker(walker)-1)
        return walkers.sample_observables()

The above implementation of time_integration and do_timestep assumes that walkers is an object of a class, say with name Walkers, that has a certain number of methods. Of course, the flexibility of Python allows the concrete implementation of class Walkers to be made afterwards, unlike C++ and Java, which require class Walkers to be written before implementing time_integration and do_timestep. Here, we expect class Walkers to provide a generic implementation of a group of walkers, with supporting methods for manipulating the population. The most important methods of class Walkers are as follows:

• move() carries out the work of moving each walker of the population randomly according to some rule or distribution function.
• get_marker(walker) returns the number of copies belonging to a walker with index walker, where 0 means the walker should be deleted, and 2 or more means that clones should be created.
• append(walker, nchilds) and delete(walker) carry out the actual cloning and removal of a walker with index walker.
• sample_observables() returns the observables at a given time step, e.g., an estimate of the system energy.
• finalize_timestep(old_size, new_size) does some internal bookkeeping at the end of each time step, such as adjusting some internal variables. It takes as input the total number of walkers before and after the walker population has been adjusted by the do_timestep function.
• __len__ is one of Python's special class methods and is in our case meant to return the number of walkers. A call len(walkers) yields the same result as walkers.__len__().

For a real application, such as the diffusion Monte Carlo algorithm (see Section 4.2 and Appendix B), the concrete implementation of the methods should reflect the desired numerical algorithm. For example, the move method of diffusion Monte Carlo uses diffusion and branching as the rule to randomly move each walker, and the finalize_timestep method adjusts the branching ratio.

Parallelism in population Monte Carlo algorithms arises naturally from dividing the walkers among the processors. Therefore, a parallel version of the time_integration function may be as follows:

    import time

    def parallel_time_integration(initialize, do_timestep, finalize,
                                  my_rank, num_procs, send, recv, all_gather):
        my_walkers, timesteps = initialize(my_rank, num_procs)
        old_walkers_len = sum(all_gather(numpy.array([len(my_walkers)])))
        my_output = []
        for step in range(timesteps):
            t_start = time.time()
            my_output.append(do_timestep(my_walkers))
            task_time = time.time() - t_start
            # re-balance the population and obtain all local population sizes
            walkers_per_proc = dynamic_load_balancing(
                my_walkers, task_time, my_rank, num_procs,
                send, recv, all_gather)
            new_walkers_len = sum(walkers_per_proc)
            my_walkers.finalize_timestep(old_walkers_len, new_walkers_len)
            old_walkers_len = new_walkers_len
        my_output = collect_subproblem_output_args(my_output, my_rank,
                                                   num_procs, send, recv)
        if my_rank == 0:
            finalize(my_output)
In comparison with its serial counterpart, the parallel_time_integration function has a few noticeable changes. First, the input arguments have been extended with five new arguments. The two integers my_rank and num_procs are, as before, meant for identifying the individual processors and finding the total number of processors. The other three new input arguments are MPI communication wrapper functions: send, recv, and all_gather, which can be provided by any of the Python wrapper modules of MPI. The only exception is that pypar does not directly provide the all_gather function, but we can easily program it as follows:

    def all_gather(input_array):
        array_gathered_tmp = pypar.gather(input_array, 0)
        array_gathered = pypar.broadcast(array_gathered_tmp, 0)
        return array_gathered

Second, we note that the initialize function is slightly different from the serial case, now accepting my_rank and num_procs as input. This is because the initial division of the walkers is assumed to be carried out here, giving rise to my_walkers on each processor. Third, a new function dynamic_load_balancing is called during each time step. This function will be explained below in detail. Fourth, unlike the serial counterpart, which could simply pass the size of its walkers to finalize_timestep, the parallel implementation needs to collect the global population size before calling finalize_timestep. We remark that each local population knows its own size, but not the global population size. For this purpose, the dynamic_load_balancing function returns the individual local population sizes as a numpy array. Last, the collect_subproblem_output_args function from Section 2 is used to assemble all the individual results onto the master processor before calling the finalize function.

As mentioned before, parallelization of population Monte Carlo algorithms has to take into account that the total number of walkers changes with time. Dynamic re-distribution of the walkers is therefore needed to avoid work load imbalance. The generic dynamic_load_balancing function is designed for this purpose: we evaluate the amount of work for each processor and, if the work distribution is too skew, we move the excess walkers from a busy processor to a less busy one. The function first checks the distribution of local population sizes. If the difference between the smallest number of walkers and the largest number of walkers exceeds some predefined threshold, dynamic_load_balancing finds a better population distribution and redistributes the walkers:

    def dynamic_load_balancing(walkers, task_time, my_rank, num_procs,
                               send, recv, all_gather):
        walkers_per_proc = all_gather(numpy.array([len(walkers)]))
        if imbalance_rate(walkers_per_proc) > walkers.threshold_factor:
            timing_list = all_gather(numpy.array([task_time]))
            rebalanced_work = find_optimal_workload(timing_list,
                                                    walkers_per_proc)
            walkers = redistribute_work(walkers,
                                        walkers_per_proc,
                                        rebalanced_work,
                                        my_rank, num_procs, send, recv)
        return walkers_per_proc

Two helper functions, find_optimal_workload and redistribute_work, are used in the above implementation. Here, find_optimal_workload finds the optimal distribution of work, based on how much time each local population has used, while redistribute_work carries out the re-shuffling of walkers. A straightforward (but not optimal) implementation of find_optimal_workload and redistribute_work goes as follows:

    def find_optimal_workload(timing_list, current_work_per_proc):
        total_work = sum(current_work_per_proc)
        C = total_work/sum(1./timing_list)
        tmp_rebalanced_work = C/timing_list
        rebalanced_work = numpy.array(tmp_rebalanced_work.tolist(), 'i')
        remainders = tmp_rebalanced_work - rebalanced_work
        while sum(rebalanced_work) < total_work:
            maxarg = numpy.argmax(remainders)
            rebalanced_work[maxarg] += 1
            remainders[maxarg] = 0
        return rebalanced_work

    def redistribute_work(my_walkers, work_per_proc, rebalanced_work,
                          my_rank, num_procs, send, recv):
        difference = work_per_proc - rebalanced_work
        diff_sort = numpy.argsort(difference)
        prev_rank_min = diff_sort[0]
        while sum(abs(difference)) != 0:
            diff_sort = numpy.argsort(difference)
            rank_max = diff_sort[-1]
            rank_min = diff_sort[0]
            if rank_min == prev_rank_min and rank_max != diff_sort[1]:
                rank_min = diff_sort[1]
            if my_rank == rank_max:
                send(my_walkers.cut_slice(rebalanced_work[my_rank]),
                     int(rank_min))
            elif my_rank == rank_min:
                my_walkers.paste_slice(recv(int(rank_max)))
            difference[rank_min] += difference[rank_max]
            difference[rank_max] = 0
            prev_rank_min = rank_min
        return my_walkers

Careful readers will notice that two particular methods, my_walkers.cut_slice and my_walkers.paste_slice, provide the capability of migrating work load between processors in the redistribute_work function. These two methods have to be programmed in class Walkers, like the other needed methods described earlier: move, get_marker, append, delete, and so on. The cut_slice method takes away excess work from a local population, and the paste_slice method inserts additional work into a local population. Note that the input argument to the cut_slice method is an index threshold, meaning that local walkers with indices larger than that are to be taken away. The returned slice from cut_slice is a picklable Python object that can be sent and received through MPI calls.

The generic redistribute_work function deserves a few more words. Among its input arguments is the ideal work distribution, rebalanced_work, which is calculated by find_optimal_workload. The redistribute_work function first calculates the difference between the current distribution, work_per_proc, and the ideal distribution. It then iteratively moves walkers from the processor with the most work to the processor with the least work until the difference is evened out.

This load balancing scheme is in fact independent of population Monte Carlo algorithms. As long as you have an algorithm repeatedly doing a task over time, where the amount of work in the task varies over time, this scheme can be reused. The only requirement is that an application-specific implementation of class Walkers, in terms of method names and functionality, matches with dynamic_load_balancing and redistribute_work. It should be noted that the given implementation of the latter function is not optimal.

The algorithm of diffusion Monte Carlo, described in Appendix B, is a typical example of a population Monte Carlo algorithm. The implementation is described in Section 4.2 and Appendix B.
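One detail left unspecified above is the imbalance_rate function used inside dynamic_load_balancing; it is not listed in this paper. A minimal sketch (our assumption, consistent with comparing the smallest and largest local populations against a threshold factor such as the value 1.1 used in Section 4.2) could be:

    def imbalance_rate(walkers_per_proc):
        # ratio between the largest and the smallest local population;
        # 1.0 means a perfectly balanced distribution
        return float(max(walkers_per_proc))/min(walkers_per_proc)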
3.3. Solving Partial Differential Equations

From the perspective of communication between processors, parallelization of the Monte Carlo algorithms is relatively easy. Parallel Markov chain Monte Carlo algorithms only require communication in the very beginning and end, whereas parallel population Monte Carlo algorithms only require communication at the end of each time step. Actually, our function-centric approach to parallelization can allow more frequent communication. To show the versatility of function-centric parallelization, we apply it to solving partial differential equations (PDEs) in parallel by means of domain decomposition. The idea is to divide the global domain, in which the PDEs are to be solved, into P overlapping subdomains. The PDEs can then be solved in parallel on the P subdomains. However, the correct boundary conditions at the internal subdomain boundaries are not known, thus leading to an iterative approach where one applies boundary conditions from the last iteration, solves the P subdomain problems again, and repeats the process until convergence of the subdomain solutions (see e.g. [27, 28]). This algorithm is commonly called additive Schwarz iteration and can successfully be applied to many important classes of PDEs [29, 30, 31]. The great advantage of the algorithm, especially from a software point of view, is that the PDE solver for the global problem can be reused for each subdomain problem. Some additional code is needed for communicating the solutions at the internal boundaries between the subdomains. This code can be implemented in a generic fashion in Python, as we explain later.

Let us first explain the additive Schwarz algorithm for solving PDEs in more detail. We consider some stationary PDE defined on a global domain Ω:

    L(u) = f,  x ∈ Ω,    (1)

subject to some boundary condition involving u and/or its derivatives. Dividing Ω into a set of overlapping subdomains {Ω_s}, s = 1, ..., P, we have the restriction of (1) onto Ω_s, for all s, as

    L(u) = f,  x ∈ Ω_s.    (2)

The additive Schwarz method finds the global solution u by an iterative process that generates a series of approximations u_1, u_2, u_3 and so on. During iteration k, each subdomain computes an improved local solution u_{s,k} by locally solving (2) for u = u_{s,k}, with u_{s,k} = u_{k−1} as (an artificial) boundary condition on Ω_s's non-physical internal boundary that borders with neighboring subdomains. All the subdomains can concurrently carry out the local solution of (2) within iteration k, thus giving rise to parallelism. At the end of iteration k, neighboring subdomains exchange the latest local solutions in the overlapping regions to (logically) form the global field u_k.
The subdomain problems (2) are of the same type as the global problem (1), which implies the possibility of reusing an existing serial code that was originally implemented for (1). The additional code for exchange of local solutions among neighbors can be implemented by generic communication operations, independently of specific PDEs. A generic implementation of the parallel additive Schwarz iteration algorithm can be realized as the following Python function:

    def additive_Schwarz_iterations(subdomain_solve, communicate,
                                    set_BC, max_iter, threshold, solution,
                                    convergence_test=simple_convergence_test):
        iter = 0; not_converged = True
        while not_converged and iter < max_iter:
            iter += 1
            solution_prev = solution.copy()
            set_BC(solution)
            solution = subdomain_solve()
            communicate(solution)
            not_converged = not convergence_test(
                solution, solution_prev, threshold)

In the above function, max_iter represents the maximum number of additive Schwarz iterations allowed, and subdomain_solve is a function that solves the subdomain problem of form (2) and returns an object solution, which is typically a numpy array containing the latest subdomain solution u_{s,k} on a processor (subdomain). However, solution may well be a more complex object, say holding a collection of scalar fields over computational grids, provided that (i) the object has a copy method, (ii) convergence_test and communicate can work with this object type, and (iii) subdomain_solve returns such an object. This flexibility in choosing solution reflects the major dynamic power of Python and provides yet another illustration of the generality of the examples in this paper.

Given an existing serial code, for example in a language like Fortran or C/C++, the subdomain_solve function is easily defined by wrapping up an appropriate piece of the serial code as a Python class (since subdomain_solve does not take any arguments, the function needs a state with data structures, conveniently implemented as class attributes as explained in Section 2).

The communicate argument is a function for exchanging the latest local solutions among the subdomains. After the call, the solution object is updated with recently computed values from the neighboring subdomains, and the contents of solution have been sent to the neighbors. The communicate function is problem independent and can be provided by some library. In our case, the implementation is entirely in Python to take advantage of easy programming of parallel communication in Python.

The set_BC argument is a function for setting boundary conditions on a subdomain's internal boundary. This function depends on the actual serial code and is naturally implemented as part of the class that provides the subdomain_solve function.

The convergence_test function is assumed to perform an appropriate convergence test. The default generic implementation can test

    max_{1≤s≤P} ‖u_{s,k} − u_{s,k−1}‖ / ‖u_{s,k}‖

against a prescribed threshold value. An implementation reads

    def simple_convergence_test(solution, solution_prev, threshold=1E-3):
        diff = solution - solution_prev
        loc_rel_change = vdot(diff, diff)/vdot(solution, solution)
        glob_rel_change = all_reduce(loc_rel_change, MAX)
        return glob_rel_change < threshold

We remark that all_reduce is a wrapper of the MPI_Allreduce command and vdot computes the inner product of two numpy arrays.

This PDE example differs from the previous ones in that the generic algorithm is built around the four functions subdomain_solve, communicate, set_BC, and convergence_test. In other words, it is not natural to divide the work of solving a PDE into initialize, func, and finalize. Nevertheless, function-centric parallelization is also here user-friendly and gives a straightforward implementation of additive Schwarz iterations as above. The convergence_test function shown above is clearly generic, and so is the communicate function in the sense that it does not depend on the PDE. Both functions can be reused for different PDEs. The other two functions are PDE dependent; however, subdomain_solve normally wraps an existing serial code, while the implementation of set_BC is typically very simple.
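The communicate function itself is not listed in this paper. Purely as an illustration, the following sketch (our assumption) shows a closure-based communicate for a one-dimensional partitioning of a numpy array with one overlapping strip between neighbors; the send and recv arguments can be, e.g., pypar.send and pypar.receive:

    def make_communicate(my_rank, num_procs, send, recv, overlap=1):
        def communicate(solution):
            # pass data "rightwards": send the last interior strip to the
            # right neighbor, receive the left ghost strip from the left
            if my_rank < num_procs-1:
                send(solution[-2*overlap:-overlap], my_rank+1)
            if my_rank > 0:
                solution[:overlap] = recv(my_rank-1)
            # pass data "leftwards": the symmetric operation
            if my_rank > 0:
                send(solution[overlap:2*overlap], my_rank-1)
            if my_rank < num_procs-1:
                solution[-overlap:] = recv(my_rank+1)
        return communicate

A real implementation must of course handle two-dimensional mesh partitionings with several neighbors per subdomain, as in the ocean wave application of Section 4.3.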
4. Applications and Numerical Experiments
In this section we will address three real research projects involving the three classes of algorithms covered in Section 3. The projects have utilized our function-centric approach to parallelizing existing codes. That is, we had some software in Fortran, C++, and R performing the basic computations needed in the projects. The serial software was wrapped in Python and adapted to components such as initialize, func, do_timestep, finalize, subdomain_solve, communicate, and set_BC. Parallelization was then carried out as explained in the previous sections. An important issue to be reported is the parallel efficiency obtained by performing the parallelization in a Python layer that is separate from the underlying serial scientific codes.

The Python-enabled parallel codes have been tested on a Linux cluster of 3.4 GHz Itanium2 processors, which are interconnected through 1 Gbit/s Ethernet. The purpose is to show that the function-centric parallelization approach is easy to use and that satisfactory parallel performance is achievable.

4.1. Markov Chain Monte Carlo for Voting Analysis

The first case is from political science and concerns estimating legislators' ideal points by the Markov chain Monte Carlo (MCMC) method. For a detailed description of the mathematical problem and the numerical method, we refer the reader to Appendix A. This application fits into the setup in Section 3.1. The statistical engine is provided by the PSCL library [32] in R [33], for which there exists a Python wrapper.

To use the function-centric parallelization described in Section 3.1, we have written a Python class named
PIPE. In addition to the constructor of the class (i.e., the __init__ method), there are three methods as follows:

• initialize sets up the functionality of the PSCL library through the Python wrapper of R (named rpy), and prepares the input argument list needed for func.
• func carries out the computation of each task by invoking appropriate functions available through rpy (in short, func is a Python wrapper to the R function ideal from the PSCL library).
• finalize summarizes the output and generates an array in R format.

The resulting parallel Python program is now as short as

    from function_centric import parallel_solve_problem
    import pypar
    my_rank = pypar.rank()
    num_procs = pypar.size()

    from pypipe import PIPE
    problem = PIPE("EP1.RData", "rcvs", "NULL", "NULL")

    parallel_solve_problem(problem.initialize, problem.func,
                           problem.finalize, my_rank, num_procs,
                           pypar.send, pypar.receive)
    pypar.finalize()

The practical importance of a parallel MCMC code is that large and computationally intensive simulations are now easily doable. More specifically, data from the European Parliament between 1979 and 2004 [34] are used for simulation. During the five-year legislative terms, the European Parliament expanded the size of the membership as well as the number of votes taken. (This trend has continued since 2004.) It is hence increasingly computationally intensive to estimate the ideal point model without reducing the length of the Markov chain. We examined the parallel performance by comparing the computing time for each of the five legislatures, running the parallelized code on 32 CPUs. The results are reported in Table 1. When comparing the results, the reader should note that we have not made any attempts to optimize the ideal code (called by our func function) for the purpose of parallelization. This makes it straightforward to switch to new versions of the ideal function. We ran 100,000 MCMC iterations. The parallel efficiency was about 90%.
Table 1. Speedup results associated with voting analysis.

    Legislature   Votes  Members  1 CPU           32 CPUs        Efficiency
    1979 - 1984    810    548      287m 32.560s    10m 13.318s    87.91%
    1984 - 1989   1853    637      783m 59.059s    26m 58.702s    91.06%
    1989 - 1994   2475    597     1006m 59.258s    33m 26.140s    94.11%
    1994 - 1999   3603    721     1905m  0.930s    66m  0.068s    90.20%
    1999 - 2004   5639    696     2898m 45.224s   102m  7.786s    88.70%
4.2. Diffusion Monte Carlo for Bose-Einstein Condensation

As an example of population Monte Carlo methods, we will now look at parallel Diffusion Monte Carlo (DMC) computations (see Appendix B for a detailed numerical description), which are used here to simulate Bose-Einstein condensation. We recall from Section 3.2 that dynamic load balancing is needed in connection with the parallelization, and that this is handled by the generic dynamic_load_balancing function. To utilize the parallel time integration function parallel_time_integration from Section 3.2, we need to program a parallel version of the initialize function. The do_timestep function from Section 3.2 can be used as is.

    def initialize(my_rank, num_procs):
        nwalkers = 1000
        nspacedim = 3
        stepsize = 0.1
        timesteps = 200
        walkers_per_proc = simple_partitioning(nwalkers, num_procs)
        my_walkers = Walkers(walkers_per_proc[my_rank], nspacedim, stepsize)
        my_walkers.threshold_factor = 1.1
        return my_walkers, timesteps
This initialize function is quite similar to its serial counterpart. As noted in Section 3.2, it takes my_rank and num_procs as input. The simple_partitioning function is called to partition the walker population. A my_walkers object is assigned to each processor, and a threshold factor is prescribed to determine when load balancing is needed.

Together with the parallel_time_integration function from Section 3.2, the above initialize function is the minimum programming effort needed to parallelize a serial population Monte Carlo code. For the particular case of our parallel Diffusion Monte Carlo implementation, we also need to know the global number of walkers in every time step to be able to estimate its observables globally. Moreover, the load balancing scheme needs the time usage of each processor during each time step.
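For completeness, a driver program analogous to the one in Section 4.1 could then look as follows. This is our sketch, assuming the functions above live in a module named dmc (a hypothetical name) and that finalize stores or plots the sampled observables:

    from function_centric import parallel_time_integration
    from dmc import initialize, do_timestep, finalize, all_gather
    import pypar

    my_rank = pypar.rank()
    num_procs = pypar.size()
    parallel_time_integration(initialize, do_timestep, finalize,
                              my_rank, num_procs,
                              pypar.send, pypar.receive, all_gather)
    pypar.finalize()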
A class with name Walkers needs to be implemented to match the implementations of parallel_time_integration, dynamic_load_balancing, and the above initialize function. The essential work is to provide a set of methods with already decided names (see Section 3.2), such as move, append, delete, finalize_timestep, cut_slice, and paste_slice. A concrete example of the Walkers class is described in more detail in Appendix B.

We report in Table 2 the timing results of a number of parallel DMC computations. The parallel efficiency was about 85%. We increased the total number of walkers when more processors were used in the simulation, such that the number of walkers assigned to each processor remained 200. Such a use of parallel computers for DMC simulations mimics the everlasting wish of quantum physicists to do larger computations as soon as more computing resources become available. Note that in this scaled scalability test, good speedup performance is indicated by an almost constant time usage independent of the number of processors.
Table 2. Timing results of the parallel DMC simulations, where each processor is constantly assigned 200 walkers, all moved in 5000 time steps.

    CPUs  Time        Efficiency
     1    37m10.389s  N/A
     5    42m32.359s  87.39%
    10    42m00.734s  88.48%
    20    42m29.945s  87.47%
    30    42m33.895s  87.33%
    40    43m30.092s  85.45%
    50    43m39.159s  85.16%

4.3. Parallel Boussinesq Simulations of Ocean Waves

Simulating the propagation of ocean waves is the target of our third and final concrete case. The reader is referred to Appendix C for the mathematical model and the numerical method. The involved equations can be solved in parallel by the additive Schwarz algorithm of Section 3.3.
Our starting point for parallelization is a 25-year-old legacy Fortran 77 code consisting of a set of subroutines. More specifically, the most important subroutines are KONTIT and BERIT, which target the two semi-discretized equations (C.3) and (C.4) of the mathematical model (see Appendix C). These two Fortran 77 subroutines contain intricate algorithms with nested layers of do-loops, which are considered very difficult to parallelize by directly inserting MPI calls in the Fortran code. Performing the parallelization outside the Fortran code is therefore much more convenient. Using the framework proposed in the present paper, the parallelization is a technically quite straightforward task.

The subdomain solver consists of calls to the subroutines KONTIT and BERIT. The implementation of the Python function subdomain_solve (see Section 3.3) requires a Python interface to KONTIT and BERIT, which can easily be produced by the F2PY software. Since a subdomain solver needs to set artificial boundary conditions at non-physical boundaries, we have programmed two light-weight wrapper subroutines in Fortran, WKONTIT and WBERIT, which handle the boundary conditions before invoking KONTIT and BERIT. We then apply F2PY to make WKONTIT and WBERIT callable from Python. Since the Fortran subroutines have lots of input data in long argument lists, and subdomain_solve takes no arguments, we have created a class where the Fortran input variables are stored as class attributes, as sketched below. Note that since there are two PDEs (C.3) and (C.4), we have created two functions: subdomain_solve1 and subdomain_solve2. The main computation of the resulting parallel program is in the following while loop:

    t = 0
    while t < t_stop:
        t = t + dt
        additive_Schwarz_iterations(subdomain_solve1, communicate,
                                    set_BC, 10, 1E-3, sd.Y)
        additive_Schwarz_iterations(subdomain_solve2, communicate,
                                    set_BC, 10, 1E-3, sd.F)
The additive_Schwarz_iterations function from Section 3.3 can be placed in a reusable module. The communicate function is borrowed from a Python library for mesh partitioning and inter-subdomain communication. The set_BC function actually does not do anything for this particular application. Speedup results are reported in Table 3; the size of the global solution mesh was kept fixed, with 1000 mesh points in one spatial direction.

Table 3. The speedup results of the Python-enabled parallel Boussinesq simulations.

    CPUs  Time     Speedup  Efficiency
     1    166.66s  N/A      N/A
     2     83.61s  1.99     99.67%
     4     44.45s  3.75     93.73%
     8     20.16s  8.27     103.33%
    16     11.43s  14.58    91.13%
5. Conclusion
We have shown how serial scientific codes written in various common languages, including Fortran, C, C++, and Python, can be parallelized in a separate, small software unit written in Python. The advantage of such an approach is twofold. First, the existing, often complicated, scientific high-performance code remains (almost) unchanged. Second, the parallel algorithm and its inter-processor communication are conveniently implemented in high-level Python code.

This approach to parallelization has been implemented in a software framework where the programmer needs to implement a few Python functions for carrying out the problem-specific computations: typically initialize for preparing input data to the mathematical model, func for calling up the serial scientific code, and finalize for processing the computational results. Some more functions must be supplied in more complicated problems where the algorithm evolves in time, with a need for dynamic load balancing and more parallel communication.

Our simple software frameworks outlined in this paper are applicable to many different scientific areas, and we have described some common classes of problems: parameter investigation of a mathematical model, standard Monte Carlo simulation, Monte Carlo simulation with need for dynamic load balancing, and numerical solution of partial differential equations. In each of these cases, we have outlined fairly detailed Python code such that most technical details of the parallel implementations are documented. This may ease the migration of the ideas to new classes of problems beyond the scope of this paper.

In particular, the shown frameworks have been used to parallelize three real scientific problems taken from our research. The problems concern Markov chain Monte Carlo models for voting behavior in political science, Diffusion Monte Carlo methods for Bose-Einstein condensation in quantum mechanics, and additive Schwarz and finite difference methods for simulating ocean waves by a system of partial differential equations. The results of our investigations of the parallel efficiency are very encouraging: in all these real science problems, parallelizing serial codes in the proposed Python framework gives almost optimal speedup, showing that there arises no significant loss due to using Python and performing the parallelization "outside" the serial codes.

As a conclusion, we believe that the ideas and code samples from this paper can greatly simplify the parallelization of serial codes, without significant loss of computational efficiency. This is good news for scientists who are non-experts in parallel programming but want to parallelize their serial codes with as small an effort as possible.

Appendix A. Voting in Legislatures
In the spatial model of politics, both actors' preferences over policies (ideal points) and policy alternatives are arranged geometrically in a low-dimensional Euclidean space. An actor receives the highest possible utility if a policy is located at her ideal point; she gains or loses utility as the policy moves towards or away from her ideal point [35]. We adopt the Bayesian approach proposed by Clinton, Jackman and Rivers [36]. Assume there are n legislators who vote on m proposals. On each vote j = 1, ..., m, legislator i = 1, ..., n chooses between a "Yea" position ζ_j and a "Nay" position ψ_j located in the policy space R^d, where d is the number of dimensions. Then, we have y_ij = 1 if legislator i votes "Yea" on roll call j, and y_ij = 0 if she votes "Nay". The model assumes quadratic utility functions. The ideal point of legislator i is x_i ∈ R^d, while η_ij and υ_ij are stochastic elements whose distribution is jointly normal. The variance of the stochastic elements is Var(η_ij − υ_ij) = σ_j². Denoting the Euclidean norm by ‖·‖, utility maximization implies that legislator i votes "Yea" on vote j if

    U_i(ζ_j) = −‖x_i − ζ_j‖² + η_ij > U_i(ψ_j) = −‖x_i − ψ_j‖² + υ_ij    (A.1)

and "Nay" otherwise. Clinton, Jackman and Rivers [36] show that the model can be understood as a hierarchical probit model:

    P(y_ij = 1) = Φ(β_j′ x_i − α_j),    (A.2)

where β_j = 2(ζ_j − ψ_j)/σ_j², α_j = (ζ_j′ζ_j − ψ_j′ψ_j)/σ_j², and Φ(·) is the standard normal distribution function; β_j is the midpoint between the "Yea" and "Nay" positions on proposal j, and x_i is the legislator's ideal point. The sign of α_j indicates the location of the status quo relative to the proposal. If α_j is positive, the new proposal is located higher on the dimension than the status quo. If α_j is negative, the new proposal is located lower on the dimension than the status quo.
The MCMC Algorithm. In the Markov chain Monte Carlo (MCMC) algorithm for the statistical analysis of voting behavior [36], the difference between the utilities of the alternatives on the j-th vote for the i-th legislator is given by y*_ij = β_j x_i − α_j + ε_ij, where β_j and α_j are model parameters, x_i is a vector of regression coefficients, and ε_ij are standard normal errors. If we know β_j and α_j, x_i can be imputed from the regression of y*_ij + α_j on β_j using the m votes of legislator i, and vice versa. If we know x_i, we can use the votes of the n legislators on roll call j to find β_j and α_j. Given x_i, β_j and α_j (either from priors or from the previous iteration), we can find y*_ij by drawing ε_ij randomly from a normal distribution subject to the constraints implied by the actual votes, i.e., y*_ij < 0 if y_ij = 0, and y*_ij > 0 if y_ij = 1. The MCMC algorithm forms a Markov chain to explore as much as possible of the joint density of all β_j and α_j, j = 1, ..., m, and all coefficient vectors x_i, i = 1, ..., n; i.e., letting t index an MCMC iteration,

(i) find y*_ij^t from x_i^{t−1}, β_j^{t−1} and α_j^{t−1},
(ii) find β_j^t and α_j^t using x_i^{t−1} and y*_ij^t,
(iii) find x_i^t from β_j^t, α_j^t and y*_ij^t.

This process must be repeated until convergence, i.e., until the samples have moved away from the priors to the neighborhood of the posterior mode before samples are drawn.

Clinton, Jackman and Rivers [36, p. 369] find that the computation time increases with nmT, where n is the number of legislators, m is the number of votes and T is the number of MCMC iterations. Although they argue that very long runs are normally not necessary, they nevertheless advise long runs to ensure that the MCMC algorithm has converged. It is increasingly time-consuming to estimate the model on a standard desktop computer as the size of the legislature and the number of votes increase.

Appendix B. Bose-Einstein Condensation
The famous experiment of Anderson et al. [37] was about cooling Rb atoms down to temperatures of the order of 100 nK for observing Bose-Einstein condensation in the dilute gas. To model this fascinating experiment in the framework of quantum Monte Carlo, so that numerical simulations can be extended beyond the physical experiments, we may use the governing Schrödinger equation:

    iħ ∂Ψ(R, t)/∂t = H Ψ(R, t).    (B.1)

The most important parts of the mathematical model are a Hamiltonian H and a wave function Ψ, see [38]. The Hamiltonian for N trapped interacting atoms is given by

    H = −(ħ²/2m) Σ_{i=1}^{N} ∇_i² + Σ_{i=1}^{N} V_ext(r_i) + Σ_{i<j} V_int(|r_i − r_j|),    (B.2)

where V_ext is the external trap potential and V_int is the two-body interaction. In the Diffusion Monte Carlo (DMC) method [39], the Schrödinger equation is solved in imaginary time,

    −∂ψ(R, t)/∂t = [H − E] ψ(R, t).    (B.3)

The formal solution of (B.3) is

    ψ(R, t) = e^{−[H−E]t} ψ(R, 0),    (B.4)

where e^{−[H−E]t} is called the Green's function, and E is a convenient energy shift.

The wave function ψ(R, t) in DMC is represented by a set of random vectors {R_1, R_2, ..., R_M}, in such a form that the time evolution of the wave function is actually represented by the evolution of a set of walkers. This feature gives rise to task parallelism. The wave function is positive definite everywhere, as it happens with the ground state of a bosonic system, so it may be considered as a probability distribution function.

The DMC method involves Monte Carlo integration of the Green's function by every walker. The time evolution is done in small time steps τ, using the following approximate form of the Green's function:

    e^{−[H−E]t} = ∏_{i=1}^{n} e^{−[H−E]τ},    (B.5)

where τ = t/n. Assuming that an arbitrary starting state can be expanded in the basis of stationary states,

    ψ(R, 0) = Σ_ν C_ν φ_ν(R),    (B.6)

we have

    ψ(R, t) = Σ_ν e^{−[E_ν−E]t} C_ν φ_ν(R),    (B.7)

in such a way that the lowest energy components will have the largest amplitudes after a long elapsed time, and in the limit of t → ∞ the most important amplitude will correspond to the ground state (if C_0 ≠ 0) ‡.

The Green's function is approximated by splitting it up in a diffusional part,

    G_D = (4πDτ)^{−3N/2} exp{−(R′ − R)²/(4Dτ)},    (B.8)

which has the form of a Gaussian, and a branching part,

    G_B = exp{−((V(R) + V(R′))/2 − E_T) τ}.    (B.9)

While diffusion is taken care of by a Gaussian random distribution, the branching is simulated by creation and destruction of walkers with a probability G_B. The idea of a DMC computation is quite simple; once we have found an appropriate approximation of the short-time Green's function and determined a starting state, the computation consists in representing the starting state by a collection of walkers and letting them independently evolve in time. That is, we keep updating the walker population until a large enough time when all other states than the ground state are negligible.

‡ This can easily be seen by replacing E with the ground-state energy E_0 in (B.7). As E_0 is the lowest energy, we will get lim_{t→∞} Σ_ν exp[−(E_ν − E_0)t] C_ν φ_ν = C_0 φ_0.

Algorithm 1 Diffusion Monte Carlo

    for step in range(0, timesteps):
        for i in range(0, N_walkers):
            # diffusion: propose the move R' = R + ξ
            # branching: calculate the replication factor
            #     n = int(exp{−((V(R) + V(R'))/2 − E_T) τ})
            if n == 0: mark walker i as dying
            if n > 1: mark walker i to make n − 1 clones
        update the number of walkers N_walkers and adjust the trial energy;
        sample contributions to observables
Algorithm 1 Diffusion Monte Carlo

for step in range(0, timesteps):
    for i in range(0, N_walkers):
        Diffusion: propose the move R' = R + ξ
        Branching: calculate the replication factor
            n = int(exp{-((V(R) + V(R'))/2 - E_T) τ})
        if n == 0: mark walker as dying
        if n > 1: mark walker to make n - 1 clones
    update N_walkers and adjust the trial energy
    sample contributions to observables

The Implementation. In Algorithm 1 we summarize the DMC algorithm corresponding to (B.8)-(B.9). In the algorithm, $\xi$ is a Gaussian random variable with zero mean and a variance of $2D\tau$, corresponding to (B.8). The deleting and cloning of walkers are, as mentioned in Section 3.2, performed by the do_timestep function, repeated here for clarity:

def do_timestep(walkers):
    walkers.move()
    for walker in range(len(walkers)):
        if walkers.get_marker(walker) == 0:
            walkers.delete(walker)
        elif walkers.get_marker(walker) > 1:
            walkers.append(walker, walkers.get_marker(walker)-1)
    return walkers.sample_observables()

The main computational work of the DMC algorithm at each time step is implemented in the move function inside class Walkers, together with a helper function branching:

class Walkers:
    ...
    def branching(self, new_positions):
        old_potential = potential(self.positions)
        new_potential = potential(new_positions)
        branch = numpy.exp(-(0.5*(old_potential + new_potential)
                             - self.adjust_branching)*self.stepsize)
        self.markers = numpy.array(branch +
                                   numpy.random.uniform(0, 1, branch.shape),
                                   'i')

    def move(self):
        displacements = numpy.random.normal(0, 2*self.stepsize,
                                            self.positions.shape)
        new_positions = self.positions + displacements
        self.branching(new_positions)
        self.positions = new_positions
    ...

The move function first generates a set of Gaussian (normally) distributed random numbers, corresponding to (B.8). Next, it calls the branching function, which calculates the potential $V(r) = r^2$ for the old and the new positions. (In a more optimized implementation, the old potential would have been stored from the previous move instead of being recalculated every time.) These potentials are used to calculate $G_B$ following (B.9) and to create an integer array self.markers whose average value equals $G_B$ (stored in the branch variable). This array is of the same length as the number of walkers (stored in self.positions) and marks the walkers as dying or clone-able.

It is worth noticing that if the new potential of a walker is much higher than that in the previous time step (i.e., the walker is far from the center of the trap), the value of branch will be close to 0 and the walker will be deleted. However, if the new potential is much lower (i.e., closer to the center of the trap), branch will be greater than 1 and the walker will be cloned. As long as the two-body interaction is ignored, the walkers will only be encouraged to move towards the center of the trap, thus yielding a lower energy than seen in real experiments.
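For completeness, the outer time loop that drives these functions can be written in a few lines of plain Python. The following is only a sketch: the constructor arguments of Walkers are assumptions made for the illustration, not the exact interface of the original code.

timesteps = 10000                                  # hypothetical run length
walkers = Walkers(n_walkers=1000, stepsize=0.01)   # hypothetical arguments
observables = []
for step in range(timesteps):
    # each call moves all walkers, deletes/clones them and samples
    observables.append(do_timestep(walkers))
# after equilibration, averaging the sampled observables estimates
# the ground state energy E_0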
Appendix C. Ocean Wave Propagation

The following two PDEs, normally termed the Boussinesq water wave equations [40], can be used to model wave propagation:

$$\frac{\partial\eta}{\partial t} + \nabla\cdot\left((H+\alpha\eta)\nabla\phi + \epsilon H\left(\frac{1}{6}\frac{\partial\eta}{\partial t} - \frac{1}{3}\nabla H\cdot\nabla\phi\right)\nabla H\right) = 0, \quad \mathrm{(C.1)}$$

$$\frac{\partial\phi}{\partial t} + \frac{\alpha}{2}\nabla\phi\cdot\nabla\phi + \eta - \frac{\epsilon}{2}H\nabla\cdot\left(H\nabla\frac{\partial\phi}{\partial t}\right) + \frac{\epsilon}{6}H^2\nabla^2\frac{\partial\phi}{\partial t} = 0. \quad \mathrm{(C.2)}$$

The primary unknowns of (C.1)-(C.2) are the water surface elevation $\eta(x,y,t)$ and the depth-averaged velocity potential $\phi(x,y,t)$. The symbol $H$ denotes the water depth as a function of $(x,y)$. The advantage of the above Boussinesq wave model, in comparison with the standard shallow water equations, is its capability of modeling waves that are weakly dispersive ($\epsilon > 0$) and/or weakly nonlinear ($\alpha > 0$). A straightforward temporal discretization of (C.1)-(C.2) reads

$$\frac{\eta^\ell - \eta^{\ell-1}}{\Delta t} + \nabla\cdot\left(\left(H + \alpha\frac{\eta^{\ell-1}+\eta^\ell}{2}\right)\nabla\phi^{\ell-1} + \epsilon H\left(\frac{1}{6}\frac{\eta^\ell - \eta^{\ell-1}}{\Delta t} - \frac{1}{3}\nabla H\cdot\nabla\phi^{\ell-1}\right)\nabla H\right) = 0, \quad \mathrm{(C.3)}$$

$$\frac{\phi^\ell - \phi^{\ell-1}}{\Delta t} + \frac{\alpha}{2}\nabla\phi^{\ell-1}\cdot\nabla\phi^{\ell-1} + \eta^\ell - \frac{\epsilon}{2}H\nabla\cdot\left(H\nabla\frac{\phi^\ell - \phi^{\ell-1}}{\Delta t}\right) + \frac{\epsilon}{6}H^2\nabla^2\frac{\phi^\ell - \phi^{\ell-1}}{\Delta t} = 0, \quad \mathrm{(C.4)}$$

where we use $\ell$ to denote the time level, and $\Delta t$ denotes the time step size. The basic idea of the computation at each time step is to first compute $\eta^\ell$ based on $\eta^{\ell-1}$ and $\phi^{\ell-1}$ from the previous time step, and then compute $\phi^\ell$ using the new $\eta^\ell$ and the old $\phi^{\ell-1}$. To carry out the actual numerical computation, we need a spatial discretization of (C.3)-(C.4), using e.g. finite differences or finite elements, so we end up with two systems of linear equations that need to be solved during each time step.
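Schematically, the program flow per time step is therefore two matrix assemblies followed by two linear solves. In the sketch below, assemble_eta_system and assemble_phi_system are hypothetical helper functions standing in for the spatial discretization of (C.3) and (C.4), and eta, phi, H and dt are assumed to be already initialized arrays and parameters:

import numpy

def advance(eta, phi, H, dt, num_timesteps):
    # Hypothetical time loop for (C.3)-(C.4); the two assemble_* helpers
    # are placeholders returning the matrix and right-hand side of the
    # spatially discretized equations.
    for ell in range(1, num_timesteps + 1):
        A, b = assemble_eta_system(eta, phi, H, dt)  # discretized (C.3)
        eta = numpy.linalg.solve(A, b)               # new surface elevation
        A, b = assemble_phi_system(eta, phi, H, dt)  # discretized (C.4)
        phi = numpy.linalg.solve(A, b)               # new velocity potential
    return eta, phi

(For large grids one would of course use sparse solvers rather than numpy.linalg.solve; the point here is only the two-solve structure per time step that a parallelization layer has to respect.)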
References

[1] Guido van Rossum et al. The Python programming language, 1991–.
[2] D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. Oliphant. Numerical Python. Technical report, Lawrence Livermore National Laboratory, CA, 2001.
[3] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–.
[4] F2PY software package. http://cens.ioc.ee/projects/f2py2e.
[5] Pypar – parallel programming with Python. http://sourceforge.net/projects/pypar, 2007.
[6] X. Cai, H. P. Langtangen, and H. Moe. On the performance of the Python programming language for serial and parallel scientific computations. Scientific Programming, 13(1):31–56, 2005.
[7] X. Cai and H. P. Langtangen. Developing parallel object-oriented simulation codes in Diffpack. In Proceedings of the Fifth World Congress on Computational Mechanics, Vienna University of Technology, 2002. ISBN 3-9501554-0-6.
[8] F. Cirak and J. C. Cummings. Generic programming techniques for parallelizing and extending procedural finite element programs. Engineering with Computers, 24:1–16, 2008.
[9] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang. PETSc Web page, 2001.
[10] O. Lawlor, S. Chakravorty, T. Wilmarth, N. Choudhury, I. Dooley, G. Zheng, and L. Kalé. ParFUM: A parallel framework for unstructured meshes for scalable dynamic physics applications. Engineering with Computers, 22(3):215–235, 2006.
[11] J. R. Stewart and H. C. Edwards. A framework approach for developing parallel adaptive multiphysics applications. Finite Elements in Analysis and Design, 40:1599–1617, 2004.
[12] The Trilinos project. http://trilinos.sandia.gov/, 2008.
[13] UG home page. http://sit.iwr.uni-heidelberg.de/~ug/, 2007.
[14] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In J. M. L. M. Palma et al., editor, Proceedings of VECPAR 2002, volume 2565 of Lecture Notes in Computer Science, pages 197–227. Springer Verlag, 2003.
[15] MpCCI 3.0, 2008.
[16] Star-P Overview, 2008.
[17] K. Hinsen. Parallel scripting with Python. Computing in Science & Engineering, 9(6):82–89, 2007.
[18] pyMPI: Putting the py in MPI. http://pympi.sourceforge.net/, 2008.
[19] MYMPI webpage. http://peloton.sdsc.edu/~tkaiser/mympi/, 2008.
[20] L. Dalcín, R. Paz, and M. Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108–1115, 2005.
[21] L. Dalcín, R. Paz, M. Storti, and J. D'Elía. MPI for Python: Performance improvements and MPI-2 extensions. Journal of Parallel and Distributed Computing, 68(5):655–662, 2008.
[22] ScientificPython webpage. http://dirac.cnrs-orleans.fr/plone/software/scientificpython/, 2007.
[23] G. D. Benson and A. S. Fedosov. Python-based distributed programming with Trickle. In H. R. Arabnia, editor, Proceedings of PDPTA'07, pages 30–36. CSREA Press, 2007.
[24] G. Olson. Introduction to concurrent programming with Stackless Python. http://members.verizon.net/olsongt/stackless/why_stackless.html, 2006.
[25] C. E. Rasmussen, M. J. Sottile, J. Nieplocha, R. W. Numrich, and E. Jones. Co-array Python: A parallel extension to the Python language. In M. Danelutto, D. Laforenza, and M. Vanneschi, editors, Proceedings of Euro-Par 2004, Lecture Notes in Computer Science, pages 632–637. Springer Verlag, 2004.
[26] Yukito Iba. Population Monte Carlo algorithms. Transactions of the Japanese Society for Artificial Intelligence, 16:279–286, 2001.
[27] T. F. Chan and T. P. Mathew. Domain decomposition algorithms. In Acta Numerica 1994, pages 61–143. Cambridge University Press, 1994.
[28] B. F. Smith, P. E. Bjørstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
[29] X. Cai, G. K. Pedersen, and H. P. Langtangen. A parallel multi-subdomain strategy for solving Boussinesq water wave equations. Advances in Water Resources, 28(3):215–233, 2005.
[30] X. Cai and H. P. Langtangen. Parallelizing PDE solvers using the Python programming language. In A. M. Bruaset and A. Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture Notes in Computational Science and Engineering, pages 295–325. Springer-Verlag, 2006.
[31] H. P. Langtangen and X. Cai. A software framework for easy parallelization of PDE solvers. In C. B. Jensen, T. Kvamsdal, H. I. Andersson, B. Pettersen, A. Ecer, J. Periaux, N. Satofuka, and P. Fox, editors, Parallel Computational Fluid Dynamics. North-Holland, 2001.
[32] Simon Jackman. PSCL: classes and methods for R developed in the Political Science Computational Laboratory, Stanford University. Technical report, Department of Political Science, Stanford University, 2006.
[33] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3-900051-07-0.
[34] Simon Hix, Abdul Noury, and Gerard Roland. Power to the parties: cohesion and competition in the European Parliament, 1979–2001. British Journal of Political Science, 35(2):209–234, 2005.
[35] Melvin J. Hinich and Michael C. Munger. Analytical Politics. Cambridge University Press, 1997.
[36] Joshua Clinton, Simon Jackman, and Doug Rivers. The statistical analysis of roll call data. American Political Science Review, 98(4):355–370, 2004.
[37] M. H. Anderson, J. R. Ensher, M. R. Matthews, C. E. Wieman, and E. A. Cornell. Observation of Bose-Einstein condensation in a dilute atomic vapor. Science, 269:198, 1995.
[38] J. K. Nilsen, J. Mur-Petit, M. Guilleumas, M. Hjorth-Jensen, and A. Polls. Vortices in atomic Bose-Einstein condensates in the large gas parameter region. Phys. Rev. A, 71, 2005.
[39] R. Guardiola. Monte Carlo methods in quantum many-body theories. In J. Navarro and A. Polls, editors, Microscopic Quantum Many-Body Theories and Their Applications, volume 510 of Lecture Notes in Physics, pages 269–336. Springer Verlag, 1998.
[40] D. M. Wu and T. Y. Wu. Three-dimensional nonlinear long waves due to moving surface pressure. Proc. 14th Symp. Naval Hydrodyn., pages 103–129, 1982.
[41] G. Pedersen and H. P. Langtangen. Dispersive effects on tsunamis. In