Model Coupling between the Weather Research and Forecasting Model and the DPRI Large Eddy Simulator for Urban Flows on GPU-accelerated Multicore Systems
Wim Vanderbauwhede
August-September 2014
School of Computing Science, University of Glasgow, UK
In collaboration with Prof. Tetsuya Takemi, DPRI, Kyoto University, Japan

Abstract

In this report we present a novel approach to model coupling for shared-memory multicore systems hosting OpenCL-compliant accelerators, which we call the Glasgow Model Coupling Framework (GMCF). We discuss the implementation of a prototype of GMCF and its application to coupling the Weather Research and Forecasting Model (WRF) and an OpenCL-accelerated version of the Large Eddy Simulator for Urban Flows (LES) developed at DPRI.

The first stage of this work concerned the OpenCL port of the LES. The methodology used for the OpenCL port is a combination of automated analysis and code generation and rule-based manual parallelization. For the evaluation, the non-OpenCL LES code was compiled using gfortran, ifort and pgfortran, in each case with auto-parallelization and auto-vectorization. The OpenCL-accelerated version of the LES achieves a 7x speed-up on an NVIDIA GeForce GTX 480 GPGPU, compared to the fastest compilation of the original code running on a 12-core Intel Xeon E5-2640.

In the second stage of this work, we built the Glasgow Model Coupling Framework and successfully used it to couple an OpenMP-parallelized WRF instance with an OpenCL LES instance which runs the LES code on the GPGPU. The system requires only very minimal changes to the original code. The report discusses the rationale, aims, approach and implementation details of this work.

1 Introduction
The Weather Research and Forecasting Model (WRF) is a mesoscale numerical weather prediction (NWP) system intended both for forecasting and atmospheric research. It is an Open Source project, used by a large fraction of weather and climate scientists worldwide. The WRF code base is written in Fortran-90 and is quite complex and extensive (about 1,000,000 lines of code).

The Large Eddy Simulator for the Study of Urban Boundary-layer Flows (LES) is developed by Hiromasa Nakayama and Haruyasu Nagai at the Japan Atomic Energy Agency and Prof. Tetsuya Takemi at the Disaster Prevention Research Institute of Kyoto University [7, 8]. It generates turbulent flows by using mesoscale meteorological simulations, and was designed to explicitly represent the urban surface geometry. Its purpose is to conduct building-resolving large-eddy simulations (LESs) of boundary-layer flows over urban areas under realistic meteorological conditions. WRF is used to compute the wind profile as input for the LES, so there is a clear case for coupling both models.

"Model coupling" means using data generated by one model as input for another model, for example coupling an atmosphere model, an ocean model, a land model and an ice model in a climate simulation. In combination with hardware acceleration using GPUs, manycore processors or FPGAs, model coupling could result in much reduced run times and/or higher-accuracy simulations.

Model coupling is of growing importance because models of, for example, the climate need to be as accurate as possible, and therefore include many factors and effects. Creating a single model incorporating all effects is increasingly difficult, and requires very large research teams to cover all specialisms. Combining existing models allows us to model a large variety of very complex systems.

There are a number of existing libraries to support the process of model coupling, e.g. MCT, ESMF and OASIS [6, 3, 2]. However, each of them requires hand modification of the existing model codes, and additional code to control the way the coupling is done needs to be written as well. Furthermore, current model coupling libraries all use MPI.

Creating coupled models using the current approach is very difficult and requires a team of experts in various disciplines. As a result, model coupling is out of reach for most research teams. A better approach will allow more researchers to use model coupling and do better science.
Figure 1: Current and proposed approaches to model coupling

Furthermore, the current approaches were developed for clusters of single-core machines and are not ideal for multicore/GPU nodes. In particular, the current approach to load balancing requires that every model i gets a fixed number of nodes m_i, proportional to its run time, i.e. m_i = α·Δt_i. As a result, data from the m_i nodes of model i need to be transferred to the m_j nodes of model j. This is complex and asymmetrical, especially for more than two models.

Our proposed approach (Fig. 1) is to limit coupling to processes running on a shared-memory node. Each node runs all processes required for coupling as threads; if a thread is not needed for coupling it goes to sleep and does not consume CPU time. As a result, the cluster communication is more symmetrical, and the load in each node can be balanced more easily.

2 The DPRI Large Eddy Simulator (LES)

The DPRI Large Eddy Simulator (LES) is a high-resolution simulator which models flow over urban topologies. The main properties of the DPRI LES are:

• Generates turbulent flows by using mesoscale meteorological simulations.
• Explicitly represents the urban surface geometry.
• Used to conduct building-resolving large-eddy simulations (LESs) of boundary-layer flows over urban areas under realistic meteorological conditions.
• Essentially solves the Poisson equation for the pressure, using Successive Over-Relaxation (SOR).
• Written in Fortran-77, single-threaded, about a thousand lines of code.

Figure 2: LES main program (Fortran-77)
The LES structure consists of sequential calls to the following subroutines for each time step:

velnw: Update the velocity for the current time step
bondv1: Calculate boundary conditions (initial wind profile, inflow, outflow)
velfg: Calculate the body force
feedbf: Calculation of building effects
les: Calculation of viscosity terms (Smagorinsky model)
adam: Adams-Bashforth time integration
press: Solving of the Poisson equation (SOR)

2.2 LES Code Structure – Computational

Subroutine   Block        Parallelisation
velnw        –            data parallel over full domain
bondv1       init_uvw     data parallel over full domain
             calc_uout    reduction over full domain
             calc_uvw     data parallel over boundary planes
velfg        –            data parallel over full domain
feedbf       –            data parallel over full domain
les          calc_sm      data parallel over full domain
             bound_sm     data parallel over boundary planes
             calc_visc    data parallel over full domain
adam         –            data parallel over full domain
press        rhsav        reduction over full domain
             sor          reduction over full domain + iteration
             pav          reduction over full domain
             adj          data parallel over full domain
             boundp       data parallel over boundary planes

Table 1: LES computational structure and parallelization strategy for each block
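Several of the blocks in Table 1 (calc_uout, rhsav, sor and pav) are reductions over the full domain. As an illustration of how such a block maps onto OpenCL, the sketch below shows a conventional two-stage sum reduction: each work-group reduces its part of the domain into local memory and writes one partial result, which a second kernel or the host then combines. The kernel and argument names are illustrative assumptions, not the actual LES code.

    // Stage 1 of a sum reduction: one partial sum per work-group.
    // Assumes the local work-group size is a power of two.
    __kernel void reduce_sum(__global const float *in, __global float *partial,
                             __local float *scratch, const unsigned int n) {
        unsigned int gid = get_global_id(0);
        unsigned int lid = get_local_id(0);
        scratch[lid] = (gid < n) ? in[gid] : 0.0f;   // guard for a padded global range
        barrier(CLK_LOCAL_MEM_FENCE);
        for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s) scratch[lid] += scratch[lid + s];   // tree reduction in local memory
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0) partial[get_group_id(0)] = scratch[0];
    }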
2.3 Methodology

In this section we discuss our novel methodology for porting Fortran NWP applications to OpenCL. The approach used in this work is as follows:

1. Convert the original F77 code to F95, remove common blocks, and if required refactor into subroutines. All this is done using our rf4a tool. The resulting F95 code has fully explicit subroutine arguments.

2. For each subroutine that could become a kernel, we created a wrapper in two stages:

a) We generate a wrapper from the original subroutine using the pragma

    !$ACC KernelWrapper(adam)
    call adam(fgh,fgh_old,im,jm,km)
    !$ACC End KernelWrapper

This generates a template for a module module_adam_ocl which contains a subroutine adam_ocl. We manually edit the generated code if required.

• The template file (module_adam_ocl_TEMPL.f95) is valid Fortran-95: if compiled and run, it is functionally identical to the original subroutine.

b) We generate the OpenCL API from the template using the pragma

    !$ACC Kernel(adam_kernel), GlobalRange(im*jm*km), LocalRange(0)
    call adam(fgh,fgh_old,im,jm,km)
    !$ACC End Kernel

This actually uses even lower-level pragmas underneath, but these are normally not exposed. The code generator requires a Fortran subroutine for adam_kernel, which initially is a copy of the original routine adam, but can be edited if the OpenCL kernel arguments are different. This generates module_adam_ocl.f95, which contains all necessary calls to OpenCL using our OclWrapper library.

• If we compile and run the code at this stage it will fail, because the actual OpenCL kernel does not yet exist.

3. At this stage we have a wrapper with OpenCL API calls and a "kernel" in Fortran, initially a copy of the original subroutine. We convert this kernel to C using our patched version of F2C_ACC (https://github.com/wimvanderbauwhede/RefactorF4Acc/tree/master/F2C-ACC). Then we further convert this C code to OpenCL, with some cleaning up. At the moment this stage is manual, because the conversion script is not yet ready. At this point we have a fully working OpenCL version of the code, albeit a purely sequential one.

4. In this work we manually parallelized the kernels; a sketch of this step is shown after this list. Ideally, we would like to use compiler transformations to assist us with this, especially loop merging.

5. Finally, we merged all kernels into a super-kernel, and also merged the wrappers. In this work we did this manually, but the process is straightforward to automate.
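To illustrate what the manual parallelization of step 4 amounts to, the sketch below shows the general pattern for a pointwise kernel such as adam: the i/j/k triple loop of the sequential version disappears and every work-item handles one grid point, matching the GlobalRange(im*jm*km) of the pragma in step 2. The update formula shown is a generic Adams-Bashforth-style expression and is only an assumption; the actual adam kernel works on the merged LES arrays and uses the real coefficients.

    // One work-item per grid point; kernels that need neighbouring points
    // recover i, j and k from gl_id in the same way as the boundary kernel below.
    __kernel void adam_kernel(__global float *fgh, __global float *fgh_old,
                              const unsigned int im, const unsigned int jm,
                              const unsigned int km) {
        unsigned int gl_id = get_global_id(0);
        if (gl_id >= im * jm * km) return;   // guard, needed once the range is padded
        float fd = fgh[gl_id];
        // Illustrative Adams-Bashforth-style update (assumed, not the LES formula).
        fgh[gl_id] = 1.5f * fd - 0.5f * fgh_old[gl_id];
        fgh_old[gl_id] = fd;
    }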
A very effective way to compute boundary conditions in OpenCL is the boundary range approach, where we create a global range consisting of the sum of the products of the domain dimensions for each direction: if the domain is ip × jp × kp, then the boundary range is ip × jp + jp × kp + ip × kp. In this way, the local range can be set to auto. The kernel has an if-statement to calculate the different boundary conditions:

    if (gl_id < jp*kp) {
        unsigned int k = gl_id / jp;
        unsigned int j = gl_id % jp;
        // do work
    } else if (gl_id < jp*kp + kp*ip) {
        unsigned int k = (gl_id - jp*kp) / ip;
        unsigned int i = (gl_id - jp*kp) % ip;
        // do work
    } else if (gl_id < jp*kp + kp*ip + jp*ip) {
        unsigned int j = (gl_id - jp*kp - kp*ip) / ip;
        unsigned int i = (gl_id - jp*kp - kp*ip) % ip;
        // do work
    }

The global range is padded to a multiple of the product of the suggested number of threads and the number of compute units (based on CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_DEVICE_MAX_COMPUTE_UNITS), as this results in much better load balancing:

    unsigned int m = nthreads*nunits;
    if ((range % m) != 0) {
        range += m - (range % m);
    }

Consequently, the last branch of the if-statement in the kernel must also be guarded.
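The values of nthreads and nunits used in the padding code above can be queried from the OpenCL runtime. A minimal host-side sketch, assuming that the kernel and device_id handles have already been created and that range is the boundary range computed above:

    size_t nthreads;   // preferred work-group size multiple for this kernel
    cl_uint nunits;    // number of compute units on the device
    clGetKernelWorkGroupInfo(kernel, device_id,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(size_t), &nthreads, NULL);
    clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cl_uint), &nunits, NULL);
    unsigned int m = (unsigned int)(nthreads * nunits);
    if ((range % m) != 0) { range += m - (range % m); }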
The LES uses separate arrays for values in the x, y and z directions. We merged those arrays into float4 arrays to improve locality and alignment. Thus the arrays f, g, h, which are actually the values of the force field f in the x, y and z directions, were merged into fgh.

Using Private Scalars Instead of Global Arrays

In several places the LES computes on global arrays, e.g. the f/g/h arrays. Where possible, we replaced the global access fgh(i,j,k) by a local scalar fgh_ijk, in effect a form of manual caching. The original computation of f/g/h (in the velfg routine) first computes the cov and diu arrays for the full domain, and then accesses them at neighbouring points to compute f/g/h:

    covx1 = (dx1(i+1)*cov1(i,j,k)+dx1(i)*cov1(i+1,j,k)) /(dx1(i)+dx1(i+1))
    covy1 = (cov2(i,j,k)+cov2(i,j+1,k))/2.
    covz1 = (cov3(i,j,k)+cov3(i,j,k+1))/2.
    covc = covx1+covy1+covz1
    dfu1(i,j,k) = 2.*(-diu1(i,j,k)+diu1(i+1,j,k))/(dx1(i)+dx1(i+1)) + &
                  (-diu2(i,j,k)+diu2(i,j+1,k))/dy1(j) + &
                  (-diu3(i,j,k)+diu3(i,j,k+1))/dzn(k)
    df = vn*dfu1(i,j,k)
    f(i,j,k) = (-covc+df)

By pre-computing the values of the cov and diu arrays at the points i+1/j+1/k+1, the loops can be merged:

    float covx1 = (dx1[i+2]*cov_ijk.s0+dx1[i+1]*cov_ijk_p1.s0) /(dx1[i+1]+dx1[i+2]);
    float covy1 = (cov_ijk.s1+cov_ijk_p1.s1)/2.0F;
    float covz1 = (cov_ijk.s2+cov_ijk_p1.s2)/2.0F;
    float covc = covx1+covy1+covz1;
    float dfu1_ijk = 2.0F*(-diu_ijk.s0+diu_ijk_p1.s0)/(dx1[i+1]+dx1[i+2]) +
                     (-diu_ijk.s1+diu_ijk_p1.s1)/dy1[j] +
                     (-diu_ijk.s2+diu_ijk_p1.s2)/dzn[k+1];
    float df = vn*dfu1_ijk;
    fgh_ijk.s0 = (-covc+df);
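The float4 layout used in the merged code above follows directly from the merging of the f, g and h arrays described at the start of this subsection: the x, y and z components occupy .s0, .s1 and .s2, with .s3 as padding. A sketch of how such a buffer could be assembled on the host is shown below; the function and variable names are illustrative, and the actual code may organise the packing differently.

    #include <CL/cl.h>

    /* Pack the three force arrays (each npoints floats, identical linear layout)
       into one float4 array; .s[3] is unused padding kept for 16-byte alignment. */
    void pack_fgh(const float *f, const float *g, const float *h,
                  cl_float4 *fgh, size_t npoints) {
        for (size_t idx = 0; idx < npoints; idx++) {
            fgh[idx].s[0] = f[idx];   /* x component */
            fgh[idx].s[1] = g[idx];   /* y component */
            fgh[idx].s[2] = h[idx];   /* z component */
            fgh[idx].s[3] = 0.0f;
        }
    }

The resulting buffer is transferred with clCreateBuffer/clEnqueueWriteBuffer and indexed in the kernels as a __global float4 pointer, as in the merged code above.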
The major bottleneck of the LES is the solver for the Poisson equation, which uses Successive Over-Relaxation (SOR). The original algorithm is implemented using the red-black scheme:

    do nrd = 0,1
      do k = 1,km
        do j = 1,jm
          do i = 1+mod(k+j+nrd,2),im,2
            reltmp = omega*(cn1(i,j,k) *( &
                     cn2l(i)*p(i+1,j,k) + &
                     cn2s(i)*p(i-1,j,k) + &
                     cn3l(j)*p(i,j+1,k) + &
                     cn3s(j)*p(i,j-1,k) + &
                     cn4l(k)*p(i,j,k+1) + &
                     cn4s(k)*p(i,j,k-1) - &
                     hs(i,j,k))-p(i,j,k))
            p(i,j,k) = p(i,j,k) + reltmp
            sor = sor+reltmp*reltmp
          end do
        end do
      end do
    end do

Conceptually, the p array is divided into red and black points so that every red point has black nearest neighbours and vice versa. The new values for p are computed in two iterations (the nrd loop in the code example), one for the red points and one for the black points.

While this is very effective for single-threaded code, and in fact also for parallel code on distributed-memory systems, it suffers from poor locality because the accesses to p are strided, and the correction computation requires access to all six neighbours of p. As a result, the threads in each compute unit cannot perform coalesced reads or writes. In [5], Konstantinidis and Cotronis explore a GPU implementation of the SOR method and conclude that their proposed approach of reordering the matrix elements according to their colour results in considerable performance improvement. However, their approach is not readily applicable to our problem: on the one hand we have a 3-D array, which is much harder to reorder than a 2-D array (i.e. the cost of reordering is higher), and on the other hand we cannot use the reordered array as-is, so we would incur the high reordering cost twice.

Therefore, we developed a new technique which we call "twinned double buffering": rather than using the red-black scheme for updating p, we use a double-buffer update scheme: for nrd = 0, p_1 is updated from the values of p_0, and for nrd = 1 vice versa. If we used two separate arrays for this, locality would still be poor, so instead we use an array of vectors of two floats, which we call a "twinned" array. Using this data structure and the double-buffering scheme, and assigning the compute units to the k index and the threads within a compute unit to the i index of the array, we obtain coalesced memory access. The difference in performance is indeed huge: our approach is 6× faster than the parallelized version of the red-black scheme.
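A minimal sketch of the twinned double-buffering scheme is shown below. Each grid point stores both copies of p in one float2: pass nrd = 0 reads .s0 and writes .s1, pass nrd = 1 does the opposite. Work-groups are assigned to the k index and work-items within a group to the i index, so consecutive threads touch consecutive float2 elements and the accesses are coalesced. The variable-spacing coefficients (cn1, cn2l, ...), the boundary treatment and the residual (sor) reduction of the real kernel are omitted here, and the buffer and parameter names are assumptions.

    // Read the "old" copy of p at a linear index: .s0 on even passes, .s1 on odd ones.
    float p_old(__global const float2 *p, unsigned int idx, int nrd) {
        return (nrd == 0) ? p[idx].s0 : p[idx].s1;
    }

    // Launched with km work-groups of im work-items each (one-point halo assumed).
    __kernel void sor_twinned(__global float2 *p, __global const float *rhs,
                              const float omega, const int nrd,
                              const unsigned int im, const unsigned int jm) {
        unsigned int k = get_group_id(0) + 1;   // compute units <-> k index
        unsigned int i = get_local_id(0) + 1;   // threads within a unit <-> i index
        unsigned int sx = 1, sy = im + 2, sz = (im + 2) * (jm + 2);  // strides incl. halo
        for (unsigned int j = 1; j <= jm; j++) {
            unsigned int idx = i * sx + j * sy + k * sz;
            float pc = p_old(p, idx, nrd);
            // 6-point stencil on the "old" copy; uniform coefficients for brevity.
            float nb = p_old(p, idx - sx, nrd) + p_old(p, idx + sx, nrd)
                     + p_old(p, idx - sy, nrd) + p_old(p, idx + sy, nrd)
                     + p_old(p, idx - sz, nrd) + p_old(p, idx + sz, nrd);
            float reltmp = omega * ((nb - rhs[idx]) / 6.0f - pc);
            if (nrd == 0) p[idx].s1 = pc + reltmp;   // write only the "new" copy
            else          p[idx].s0 = pc + reltmp;
        }
    }

Because each pass only reads the old copy and only writes the new one, no ordering between work-items is needed within a pass, which is what makes the strided red-black traversal unnecessary.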
In this section we present a validation study of the OpenCL LES by comparing it to the original Fortran version. We ran the LES on a domain of 150×150×90 points, with 50 iterations for the SOR (the default setup).
The compilers used for the comparison were gfortran 4.8.2 for the OpenCL code, as well as pgf77 12.5-0 and ifort 12.0.0 for the reference code. We used the following optimizations for auto-vectorization and auto-parallelization:

• gfortran -Ofast -floop-parallelize-all -ftree-parallelize-loops=24
• pgf77 -O3 -fast -Mvect=simd:256
• ifort -O3 -parallel

The run time of the F77 and F95 code is the same with all compilers (to within a few percent).

                         cores   vector size   Clock speed (GHz)   CPI (~FLOPS)   Memory BW (GB/s)
Intel Xeon E5-2640       24      8             2.5                 480            42.6
Nvidia GeForce GTX 480   15      32            1.4                 672            177.4

Table 2: Hardware Performance Indicators

Figure 5: Comparison of the OpenCL LES with the original Fortran code
The host platform is an Intel Xeon E5-2640 0 @ 2.50GHz, a dual-socket 6-core CPU with two-way hyperthreading (i.e. 24 threads), with AVX vector instruction support, 32GB memory, 15MB cache, Intel OpenCL v1.2. The GPU platform is an NVIDIA GeForce GTX 480 @ 1.40 GHz, 15 compute units, 1.5GB memory, 250KB cache, NVIDIA OpenCL 1.1 (CUDA 6.5.12). Table 2 shows the Hardware Performance Indicators for both systems.
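The compute performance indicators in Table 2 are presumably peak single-precision figures: for the dual-socket Xeon E5-2640, 2 sockets × 6 cores × 2.5 GHz × 8 (AVX single-precision width) × 2 (separate multiply and add) = 480 GFLOPS, and for the GTX 480, 15 compute units × 32 cores per unit × 1.4 GHz = 672 GFLOPS (not counting fused multiply-add). For a memory-bound stencil code such as the SOR solver, the roughly four times higher memory bandwidth of the GPU is arguably the more relevant figure.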
The results of the comparison of the OpenCL code with the F77 and F95 reference code are summarized in Figure 5. The OpenCL LES running on the GPU is 7× faster than the fastest reference run time. We implemented all kernels of the LES in OpenCL, because this allows us to keep all data structures in the GPU memory, rather than copying between the host and the GPU.

Figure 6: Proportion of time spent in each routine

Figure 6 shows the breakdown of the run time per subroutine in the original LES code. We see that the press routine, which contains the SOR, takes most of the run time, followed by the velfg and les routines. Figure 7 shows the corresponding times for the actual OpenCL kernels. It should be noted that 50 SOR iterations is actually on the low side to achieve convergence, so the SOR routine will dominate the run time entirely for more iterations.

The most important outcome of the effort to convert the LES to OpenCL is the development of an open-source toolchain to facilitate porting of Fortran-77 code to OpenCL. Using this toolchain considerably reduces the porting effort, as illustrated by our work. Furthermore, the porting of the LES specifically has led to the development of a novel parallel implementation of the 3-D Successive Over-Relaxation algorithm. The resulting performance is very good: the OpenCL LES running on the GPU is seven times faster than the original code running on a multicore CPU.
3 The Glasgow Model Coupling Framework

In this section we introduce the Glasgow Model Coupling Framework. We discuss the longer-term aims of the framework and the status of the current prototype.
Figure 7: Proportion of time spent in each OpenCL kernel (OpenCL LES: contributions of the individual kernels to the total run time, on GPU and CPU)
3.1 Longer-term Aims

Our longer-term aim is to create a system where there is no need for the end user to write any code at all to achieve model coupling. Instead, we envisage that our system will use a scenario, written in a formal way but in a natural language, to express how the models should be coupled. The scenario will serve as the input to a compilation and code generation system that will create the fully integrated coupled model.

Designing such a specification language is one of the major research tasks towards this goal. We plan to use a system based on Multi-Party Session Types [9] to ensure that the scenario results in a correct system, e.g. one that is deadlock-free and in which data exchanges between incompatible senders and receivers are impossible. The Scribble protocol language [4] uses MPST and therefore constitutes a natural starting point for our investigation. However, in order to make the scenarios accessible to a large audience, we want to explore the use of natural language based specification, e.g. the Gherkin DSL as used in behaviour-driven development [1].

Furthermore, apart from the scenario, there is also a need for a formal specification of each model participating in the scenario. One of the reasons why model coupling is currently very difficult is that it takes a lot of effort and skill to determine where e.g. the wind velocity is generated in a given model, what the time loop is, what the variables and their dimensions are, etc. Our aim is to design a model specification that allows a code analysis tool to work out all necessary information – as opposed to a specification that would provide all this information, as such a specification would be difficult and time consuming to produce.

3.2 GMCF Model Coupling Operation

Following the classification terminology used by Michalakes (http://…/wrf/users/workshops/WS2010/presentations/Tutorials/Coupling%20WRF%20Michalakes.pdf), GMCF adopts a component-based coupling model with dataflow-style peer-to-peer interaction. There is no centralized control or scheduling. Instead, models communicate using a demand-driven mechanism, i.e. they send requests to other models for data or control information. At each simulation time step, each model requests time stamps from the models it is coupled with, and waits until it has received the corresponding time steps. This synchronization step ensures that the timing of data exchanges is correct. After syncing, each model can request and/or send data, either before or after computation. When the time loop of a model finishes, the model broadcasts a message to all communicating models.

To demonstrate the main concepts of our approach to model coupling, we created a prototype with the express aim to couple WRF and the DPRI LES model using a producer/consumer pattern. As we will see, the same prototype is readily useable for symmetrical data exchange as well. We created a modified version of our Glasgow Parallel Reduction Machine (GPRM) framework. GPRM is a framework for task-based parallel computing, and one can consider model coupling as a special case of this. We call our new framework the Glasgow Model Coupling Framework (GMCF). The necessary modifications are related to two different aspects:

• GPRM is a C++ framework, and most scientific models are written in Fortran. Therefore we developed an automated approach to integrating Fortran code into GPRM.

• GPRM uses run-to-completion semantics, i.e. once a task is started, it runs to completion and only then produces output and checks for new input data. In a long-running, time-step-based application such as an NWP model, this approach is not practical, as it would be a huge effort to lift the time loop out of the existing model and into GPRM. Therefore we created a new API which allows communication between models from inside the model code, which is obviously essential for model coupling.

Apart from those major changes, we also changed the code organization and the build system to simplify the creation of a coupled model using GMCF.
The run-time architecture of GMCF is illustrated in Figure 8. On startup, the GMCF runtime system creates a fixed number of threads. Each of these threads contains a main loop which blocks on a FIFO. Each thread in the GMCF runtime instantiates an object of the Tile class and calls its run() method, which implements the wait loop. The Tile class contains the FIFOs and the Core class, which takes care of the actual model calls (Figure 9). (The source code is available at https://github.com/wimvanderbauwhede/LES, https://github.com/wimvanderbauwhede/GannetCode and https://github.com/wimvanderbauwhede/gmcf.)

Our approach is to transform the Fortran top-level program unit into a subroutine which takes a pointer to the tile as an argument (Figure 10). All model subroutines become methods of a singleton Core class which inherits from the base Core class which provides the API calls. This approach allows us to hide any state in the object. The Core class has a run() method which selects the model-specific method, based on the thread in which it is called, at the start of the run.

Furthermore, the Fortran API consists of a generic part and a per-model part, both implemented as modules. The generic part requires the tile pointer as an argument for each call. The per-model API stores this pointer in its module so that the final API calls require no additional arguments.

Figure 9: GMCF software architecture

When the Core.run() method calls the model subroutine, this subroutine can use the GMCF C++ API method calls to interact with the FIFOs and so with the other models in the system.
The main purpose of GMCF is to make model coupling easier. Therefore, we try to minimize the necessary changes to the original code, and in fact our long-term aim is to entirely automate the process. Currently, the user has to manually insert the actual model coupling API calls. Apart from that, the build system takes care of the entire integration. This is less obvious than it might seem at first: the GMCF runtime and API are written in C++, so it is necessary to generate code to interface with the Fortran model code. The information required from the user is very limited: the full path of the top-level program unit of each model. The build system analyses this program unit, adds the necessary code for GMCF integration, and generates all required interface code.

Figure 10: GMCF Fortran-C++ integration and code generation

4 Coupling WRF and the OpenCL LES

In this section we present the implementation and preliminary results of coupling WRF and the OpenCL LES using the GMCF.

Figure 11: Coupling WRF and the Large Eddy Simulator

WRF is used to compute the wind profile as input for the LES. In the original version of the LES, this input is manually extracted from WRF and hardcoded in the Fortran source. In order to achieve coupling, some modifications are required to both WRF and LES. On the one hand, the build system needs to be modified so that the model is compiled into a library rather than an executable. For the LES, as we use SCons, this is trivial. WRF uses a complex build infrastructure based on make. The changes are still very small, and some of the changes are actually patches to bugs in the WRF build system. In fact, the modified WRF build system can now build WRF executables using both Fortran and C++ main programs. The source code of the main program unit needs to be modified to make it a subroutine; this is a very minor change. The other main changes are related to the actual coupling. Our approach is to create a per-model API based on the generic GMCF API. Eventually, this per-model API will be automatically generated. The API is then used to achieve synchronization and data exchange.

The changes to main/wrf.F and the LES main.f95 required to be able to use GMCF are shown below; these changes are auto-generated by the build system, and the user only needs to provide the name of the original program and of the new top-level subroutine.

    subroutine program_wrf_gmcf(gmcf_ptr, model_id) ! This replaces 'program wrf'
        use gmcfAPI ! gmcf
        use gmcfAPIwrf ! gmcf
        USE module_wrf_top, only : wrf_init, wrf_dfi, wrf_run, wrf_finalize
        IMPLICIT NONE
        integer(8), intent(In) :: gmcf_ptr ! gmcf
        integer, intent(In) :: model_id ! gmcf
        call gmcfInitWrf(gmcf_ptr, model_id) ! gmcf
        ! Set up WRF model.
        CALL wrf_init
        ! Run digital filter initialization if requested.
        CALL wrf_dfi
        ! WRF model time-stepping. Calls integrate().
        CALL wrf_run
        ! WRF model clean-up. This calls MPI_FINALIZE() for DM parallel runs.
        CALL wrf_finalize
    end subroutine program_wrf_gmcf
And for the LES:

    subroutine program_les_gmcf(gmcf_ptr, model_id)
        ! ... LES-specific use statements ...
        use gmcfAPI
        use gmcfAPIles
        integer(8), intent(In) :: gmcf_ptr
        integer, intent(In) :: model_id
        ! ... LES-specific declarations ...
        call gmcfInitLes(gmcf_ptr, model_id)
        ! ... LES-specific code
    end subroutine program_les_gmcf
The actual synchronization and data exchanges also require very few changes; currently these have to be made manually. In WRF, these are in frame/module_integrate.F:

    MODULE module_integrate
    CONTAINS
    RECURSIVE SUBROUTINE integrate ( grid )
        use gmcfAPI
        use gmcfAPIwrf
        ! ... WRF-specific code ...
        ! WRF main time loop
        DO WHILE ( .NOT. domain_clockisstopsubtime(grid) )
            ! Synchronise simulation times between WRF & LES
            call gmcfSyncWrf
            ! ... Actual model computations ...
            ! Prepare the wind profile
            call gmcfCreateWindprofile(grid%u_2,grid%v_2,grid%w_2)
            ! Send the data to the LES when requested
            call gmcfPostWrf
        END DO
        call gmcfFinishedWrf
        ! ... WRF-specific code ...
    END SUBROUTINE integrate
    END MODULE module_integrate

In the LES, the time loop is in main.f95:

    do n = n0,nmax
        time = float(n-1)*dt
        ! Synchronise simulation times between WRF & LES
        call gmcfSyncLes
        ! Request and receive the data from WRF
        call gmcfPreLes
        if (can_interpolate == 0) then
            can_interpolate = 1
        else
            ! Interpolate the value for the current LES time step
            call gmcfInterpolateWindprofileLes(u,v,w)
            ! ... Actual model computations
        end if ! interpolate guard
    end do
    call gmcfFinishedLes

Here the code is a little more involved, because the LES only takes data from WRF every simulated minute but simulates with a resolution of half a second. The WRF data are interpolated, and the interpolation can only start after two time steps, hence the guard. The LES required other changes as well, because it used a fixed wind profile rather than taking data from WRF. However, these changes are not specific to model coupling so they are not included here.
The changes to the model code are very small because the GMCF operation is abstracted into a per-model API. This API consists of a small number of subroutines:

• The Init call initializes GMCF for the given model.
• The Sync call synchronizes the model with the other models it is communicating with.
• The Pre and Post calls take care of the actual data exchange: Pre is called before computation, Post after computation.
• The Finished call informs all other models that the given model has finished.

These subroutines are currently written manually; the next step is to generate the code based on annotations.

• For Init and Sync, what is required is the time step information relative to some reference, typically the model with the largest time step.
• For the Pre and Post calls, we need to know the names, types and sizes of the variables containing the data to be exchanged.
• The Finished call can be generated without extra information.

There are also some API calls that are not as generic as the ones above: the data received from another model must be assigned to a variable in the given model, and often we only want to send a portion of a multidimensional array, so we may want to create intermediate variables. In the case of WRF and LES, we have gmcfInterpolateWindprofileLes(u,v,w) and gmcfCreateWindprofileWrf(u,v,w), but the u,v,w are not quite identical in both models. Currently, these subroutines have to be written by hand. In order to generate the code for these subroutines, we need to describe exactly how a variable from one model maps to a variable of another model. Our approach is to create an intermediate variable for each exchange and define the mapping to that variable for each model.
Thanks to the extensive code generation and automation, building the GMCF executable is quite simple: first, specify the models to couple using a configuration file. This file is used to build the GMCF framework. Then build each model into a library. Finally, link the GMCF framework library and the model libraries into the top-level executable. Detailed instructions are available on GitHub.

For the evaluation we tested the correctness of the time step synchronization and the data exchange. An actual evaluation of the performance requires the WRF input files used for creating the simulation that generates the wind profile for the LES, and this has not been done yet. However, we have successfully built and run a model coupling of WRF, using OpenMP on a multicore CPU, with the OpenCL LES running on the GPU.
5 Conclusions

The overall conclusions of this two-month pilot project are very encouraging:

• We have demonstrated that our approach to model coupling, aimed at modern heterogeneous manycore nodes, can be used to couple a complex NWP simulator such as WRF, parallelized using OpenMP, with a custom LES simulator running on a GPU using OpenCL.

• The LES is a very high-resolution simulator, therefore GPU acceleration was essential, and we have shown that we can achieve a speed-up of seven times using OpenCL on the GPU.

• Furthermore, our model coupling framework is designed to be useable for automatic generation from scenarios and specifications, and this will be the focus of our future work. Already, using GMCF requires only very minor modifications to the original model source code and build system.
All software developed during this project has been open sourced and is available at https://github.com/wimvanderbauwhede. This work was conducted during a research visit at the Disaster Prevention Research Institute of Kyoto University, supported by an EPSRC Overseas Travel Grant, EP/L026201/.

References

[1] David Chelimsky, Dave Astels, Bryan Helmkamp, Dan North, Zach Dennis, and Aslak Hellesoy. The RSpec Book: Behaviour Driven Development with RSpec, Cucumber, and Friends. Pragmatic Bookshelf, 2010.

[2] Tore Furevik, Mats Bentsen, Helge Drange, IKT Kindem, Nils Gunnar Kvamstø, and Asgeir Sorteberg. Description and evaluation of the Bergen climate model: ARPEGE coupled with MICOM. Climate Dynamics, 21(1):27–51, 2003.

[3] Chris Hill, Cecelia DeLuca, Max Suarez, Arlindo da Silva, et al. The architecture of the Earth System Modeling Framework. Computing in Science & Engineering, 6(1):18–28, 2004.

[4] Kohei Honda, Aybek Mukhamedov, Gary Brown, Tzu-Chun Chen, and Nobuko Yoshida. Scribbling interactions with a formal foundation. In Distributed Computing and Internet Technology, pages 55–75. Springer, 2011.

[5] Elias Konstantinidis and Yiannis Cotronis. Graphics processing unit acceleration of the red/black SOR method. Concurrency and Computation: Practice and Experience, 25(8):1107–1120, 2013.

[6] Jay Larson, Robert Jacob, and Everest Ong. The Model Coupling Toolkit: A new Fortran90 toolkit for building multiphysics parallel coupled models. International Journal of High Performance Computing Applications, 19(3):277–292, 2005.

[7] Hiromasa Nakayama, Tetsuya Takemi, and Haruyasu Nagai. LES analysis of the aerodynamic surface properties for turbulent flows over building arrays with various geometries. Journal of Applied Meteorology and Climatology, 50(8):1692–1712, 2011.

[8] Hiromasa Nakayama, Tetsuya Takemi, and Haruyasu Nagai. Large-eddy simulation of urban boundary-layer flows by generating turbulent inflows from mesoscale meteorological simulations. Atmospheric Science Letters, 13(3):180–186, 2012.

[9] Nicholas Ng, Nobuko Yoshida, and Kohei Honda. Multiparty Session C: Safe parallel programming with message optimisation. In