Efficient Multidimensional Data Redistribution for Resizable Parallel Computations
Rajesh Sudarsan and Calvin J. Ribbens
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061-0106
{sudarsar, ribbens}@vt.edu

Abstract.
Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in under-utilization of idle system resources, thereby decreasing overall system throughput. In our research, we have developed a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executing on distributed memory platforms. The resizing library in ReSHAPE includes support for releasing and acquiring processors and efficiently redistributing application state to a new set of processors. In this paper, we derive an algorithm for redistributing two-dimensional block-cyclic arrays from P to Q processors, organized as 2-D processor grids. The algorithm ensures a contention-free communication schedule for data redistribution if P_r ≤ Q_r and P_c ≤ Q_c. In other cases, the algorithm applies circular row and column shifts to the communication schedule to minimize node contention.

Key words:
Dynamic scheduling, dynamic resizing, data redistribution, dynamic resource management, process remapping, resizable applications
1 Introduction

As terascale supercomputers become more common and as the high-performance computing (HPC) community turns its attention to petascale machines, the challenge of providing effective resource management for high-end machines grows in both importance and difficulty. A fundamental problem is that conventional parallel schedulers are static, i.e., once a job is allocated a set of resources, those resources remain fixed throughout the life of the application's execution. It is worth asking whether a dynamic resource manager, which has the ability to modify resources allocated to jobs at runtime, would allow more effective resource management. The focus of our research is on dynamically reconfiguring parallel applications to use a different number of processes, i.e., on dynamic resizing of applications.

In order to explore the potential benefits and challenges of dynamic resizing, we are developing ReSHAPE, a framework for dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment. (A shorter version of this paper is available in the proceedings of the Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07).) The ReSHAPE framework includes a
programming model and an API, data redistribution algorithms and a runtime library, and a parallel scheduling and resource management system framework. ReSHAPE allows the number of processors allocated to a parallel message-passing application to be changed at run time. It targets long-running iterative computations, i.e., homogeneous computations that perform similar computational steps over and over again. By monitoring the performance of such computations on various processor sizes, the ReSHAPE scheduler can take advantage of idle processors on large clusters to improve the turn-around time of high-priority jobs, or shrink low-priority jobs to meet quality-of-service or advanced reservation commitments.

Dynamic resizing necessitates runtime application data redistribution. Many high performance computing applications and mathematical libraries like ScaLAPACK [1] require block-cyclic data redistribution to achieve computational efficiency. Data redistribution involves four main stages: data identification and index computation, communication schedule generation, message packing and unpacking, and finally, data transfer. Each processor identifies its part of the data to redistribute and transfers the data in the message passing step according to the order specified in the communication schedule. A node contention occurs when two or more processors send messages to a single processor in the same communication step. A redistribution communication schedule aims to minimize these node contentions and maximize network bandwidth utilization. Data is packed or marshalled on the source processor to form a message and is unmarshalled on the destination processor.

In this paper, we present an algorithm for redistributing two-dimensional block-cyclic data from P (P_r rows × P_c columns) to Q (Q_r rows × Q_c columns) processors, organized as 2-D processor grids. We evaluate the algorithm's performance by measuring the redistribution time for different block-cyclic matrices. If P_r ≤ Q_r and P_c ≤ Q_c, the algorithm ensures a contention-free communication schedule for redistributing data from the source processor set P to the destination processor set Q. In other cases, the algorithm minimizes node contentions by performing row or column circular shifts on the communication schedule. The algorithm discussed in this paper supports 2-D block-cyclic data redistribution for only one- and two-dimensional processor topologies. We also discuss in detail the modifications needed to port an existing scientific application to use the dynamic resizing capability of ReSHAPE via the API provided by the framework.

The rest of the paper is organized as follows: Section 2 discusses prior work in the area of data redistribution. Section 3 briefly reviews the architecture of the ReSHAPE framework and discusses in detail the two-dimensional redistribution algorithm and the ReSHAPE API. Section 4 reports our experimental results for the redistribution algorithm within the ReSHAPE framework, tested on the System X cluster at Virginia Tech. We conclude in Section 5, discussing future directions for this research.

2 Related Work

Data redistribution within a cluster using a message passing approach has been extensively studied in the literature. Many of the past research efforts [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] were targeted towards redistributing cyclically distributed one-dimensional arrays between the same set of processors within a cluster on a 1-D processor topology.
To reduce the redistribution overhead cost, Walker and Otto [12] and Kaushik [7] proposed K-step communication schedules based on modulo arithmetic and tensor products, respectively. Ramaswamy and Banerjee [9] proposed a redistribution technique, PITFALLS, that uses line segments to map array elements to a processor. This algorithm can handle any arbitrary number of source and destination processors. However, it does not use communication schedules during redistribution, resulting in node contentions during data transfer. Thakur et al. [11][10] use gcd and lcm methods for redistributing cyclically distributed one-dimensional arrays on the same processor set. The algorithms described by Thakur et al. [10] and Ramaswamy [9] use a series of one-dimensional redistributions to handle multidimensional arrays. This approach can result in significant redistribution overhead cost due to unwanted communication. Kalns and Ni [6] presented a technique for mapping data to processors by assigning logical processor ranks to the target processors. This technique reduces the total amount of data that must be communicated during redistribution. Hsu et al. [5] further extended this work and proposed a generalized processor mapping technique for redistributing data from cyclic(kx) to cyclic(x), and vice versa. Here, x denotes the number of data blocks assigned to each processor. However, this method is applicable only when the number of source and target processors is the same. Chung et al. [2] proposed an efficient method for index computation using the basic-cycle calculation (BCC) technique for redistributing data from cyclic(x) to cyclic(y) on the same processor set. An extension of this work by Hsu et al. [13] uses a generalized basic-cycle calculation method to redistribute data from cyclic(x) over P processors to cyclic(y) over Q processors. The generalized BCC uses a bipartite matching approach for data redistribution. Lim et al. [8] developed a redistribution framework that can redistribute a one-dimensional array from one block-cyclic scheme to another on the same processor set using a generalized circulant matrix formalism. Their algorithm applies row and column transformations to the communication schedule matrix to generate a conflict-free schedule.

Prylli et al. [14], Desprez et al. [3] and Lim et al. [15] proposed efficient algorithms for redistributing one- and two-dimensional block-cyclic arrays. Prylli et al. [14] proposed a simple scheduling algorithm, called Caterpillar, for redistributing data across a two-dimensional processor grid. At each step d of the algorithm, processor P_i (0 < i ≤ P) in the destination processor set exchanges its data with processor P_((P − i − d) mod P). The Caterpillar algorithm does not have global knowledge of the communication schedule and redistributes the data using only local knowledge of the communications at every step. As a result, this algorithm is not efficient for data redistribution using "non-all-to-all" communication. Also, the redistribution time for a step is the time taken to transfer the largest message in that step. Desprez et al. [3] proposed a general solution for redistributing one-dimensional block-cyclic data from a cyclic(x) distribution on a P-processor grid to a cyclic(y) distribution on a Q-processor grid for arbitrary values of P, Q, x, and y. The algorithm assumes the source and target processors are disjoint sets and uses bipartite matching to compute the communication schedule.
However, this algorithm does not ensure a contention-free communication schedule. In a recent work, Guo and Pan [4] described a method to construct schedules that minimize the number of communication steps, avoid node contentions, and minimize the effect of differences in message length in each communication step. Their algorithm focuses on redistributing one-dimensional data from a cyclic(kx) distribution on P processors to a cyclic(x) distribution on Q processors for any arbitrary positive values of P and Q. Lim et al. [15] propose an algorithm for redistributing a two-dimensional block-cyclic array across a two-dimensional processor grid, but the algorithm is restricted to redistributing data across different processor topologies on the same processor set. Park et al. [16] extended the idea described by Lim et al. [15] and proposed an algorithm for redistributing a one-dimensional block-cyclic array with cyclic(x) distribution on P processors to cyclic(kx) on Q processors, where P and Q can be any arbitrary positive value.

To summarize, most of the existing approaches deal with redistribution of block-cyclic arrays across a one-dimensional processor topology, on either the same or a different processor set. The Caterpillar algorithm by Prylli et al. [14] is the closest related work to our redistribution algorithm in that it supports redistribution on a checkerboard processor topology. In our work, we extend the ideas in [15][16] to develop an algorithm to redistribute two-dimensional block-cyclic data distributed across a 2-D processor grid topology. The data is redistributed from P (P_r × P_c) to Q (Q_r × Q_c) processors, where P and Q can be any arbitrary positive value. Our work differs from that of Desprez et al. [3], who assume that there is no overlap among processors in the source and destination processor sets. Our algorithm builds an efficient communication schedule and uses non-all-to-all communication for data redistribution. We apply row and column transformations using the circulant matrix formalism to minimize node contentions in the communication schedule.

3 The ReSHAPE Framework

The ReSHAPE framework, shown in Figure 1(a), consists of two main components. The first component is the application scheduling and monitoring module, which schedules and monitors jobs and gathers performance data in order to make resizing decisions based on application performance, available system resources, resources allocated to other jobs in the system, and jobs waiting in the queue. The second component of the framework consists of a programming model for resizing applications. This includes a resizing library and an API for applications to communicate with the scheduler to send performance data and actuate resizing decisions. The resizing library includes algorithms for mapping processor topologies and redistributing data from one processor topology to another. The individual components in these modules are explained in detail by Sudarsan and Ribbens [17].
3.1 The Resizing Library

The resizing library provides routines for changing the size of the processor set assigned to an application and for mapping processors and data from one processor set to another. An application needs to be re-compiled with the resize library to enable the scheduler to dynamically add or remove processors to/from the application. During resizing, rather than suspending the job, the application execution control is transferred to the resize library, which maps the new set of processors to the application and redistributes the data (if required). Once mapping is completed, the resizing library returns control back to the application and the application continues with its next iteration.
Fig. 1. (a) Architecture of ReSHAPE. (b) State diagram for application expansion and shrinking.

The application user needs to indicate the global data structures and variables so that they can be redistributed to the new processor set after resizing. Figure 1(b) shows the different stages of execution required for changing the size of the processor set for an application. Our API gives programmers a simple way to indicate resize points in the application, typically at the end of each iteration of the outer loop. At resize points, the application contacts the scheduler and provides performance data to it. The metric used to measure performance is the time taken to compute each iteration. The scheduler's decision to expand or shrink the application is passed as a return value. If an application is allowed to expand to more processors, the response from the Remap Scheduler includes the size and the list of processors to which the application should expand. A call to the redistribution routine then remaps the global data to the new processor set. If the Scheduler asks an application to shrink, then the application first redistributes its global data across the smaller processor set, retrieves its previously stored MPI communicator, and creates a new BLACS [18] context for the new processor set. The additional processes are terminated when the old BLACS context is exited. The resizing library notifies the Remap Scheduler about the number of nodes relinquished by the application.

3.2 Application Programming Interface (API)
A simple API allows user codes to access the ReSHAPE framework and library. The core functionality is accessed through the following internal and external interfaces, which are available for use by advanced application programmers. These functions provide the main functionality of the resizing library by contacting the scheduler, remapping the processors after an expansion or a shrink, and redistributing the data:

– reshape_Initialize(global data array, nprocessors, blacs context, iterationCount, processor row, processor column, job id): initializes the iterationCount and the global data array with the initial values and creates a BLACS context for the two-dimensional processor topology. The function returns values for the processor row and column configuration and the job id.
– reshape_ContactScheduler(iteration time, redistribution time, processor row count, processor column count, job id): contacts the scheduler and supplies the last iteration time; on return, the scheduler indicates whether the application should expand, shrink, or continue execution with the current processor size.
– reshape_Expand(): adds the new set of processors (defined by a previous call to reshape_ContactScheduler) to the current set using BLACS.
– reshape_Shrink(): reduces the processor set size (defined by a previous call to reshape_ContactScheduler) to an earlier configuration and relinquishes the additional processors.
– reshape_Redistribute(global data array, current BLACS context, current processor set size, EXPAND/SHRINK): redistributes the global data among the newly spawned or shrunk processor set. The redistribution time is computed and stored for the next resize point.
– reshape_Log(starttime, endtime): computes the average iteration time of the current iteration over all the processors and stores it for the next resize point.

Figure 2(a) shows the source code for a simple MPI application that solves a sequence of linear systems of equations using ScaLAPACK functions. The original code was refactored to identify the global data structures and variables, and the ReSHAPE API calls were inserted at the appropriate locations in the refactored code; Figure 2(b) shows the modified code. A sketch of such a resize loop is given below.

[Fig. 2. (a) Original MPI/ScaLAPACK source code. (b) Code modified to use the ReSHAPE API.]
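The following sketch in C illustrates how an iterative solver might call the resizing library at each resize point. Only the routine names come from the API description above; the prototypes, return codes, and argument types are illustrative assumptions, not the actual ReSHAPE header.

    /* Minimal sketch of a ReSHAPE resize loop (assumed prototypes). */
    #include <mpi.h>

    #define RESHAPE_EXPAND    1   /* assumed scheduler decision codes */
    #define RESHAPE_SHRINK    2
    #define RESHAPE_NO_CHANGE 3

    extern void reshape_Initialize(double **A, int *nprocs, int *ctx,
                                   int *iter, int *prow, int *pcol, int *jobid);
    extern int  reshape_ContactScheduler(double iter_time, double redist_time,
                                         int prow, int pcol, int jobid);
    extern void reshape_Expand(void);
    extern void reshape_Shrink(void);
    extern void reshape_Redistribute(double *A, int ctx, int nprocs, int action);
    extern void reshape_Log(double start, double end);

    int main(int argc, char **argv) {
        double *A;                          /* block-cyclic global matrix */
        int nprocs, ctx, prow, pcol, jobid, iter = 0, maxIterations = 10;
        double redist_time = 0.0;

        MPI_Init(&argc, &argv);
        reshape_Initialize(&A, &nprocs, &ctx, &iter, &prow, &pcol, &jobid);

        for (; iter < maxIterations; iter++) {
            double t0 = MPI_Wtime();
            /* ... one outer iteration, e.g., a ScaLAPACK solve ... */
            double t1 = MPI_Wtime();
            reshape_Log(t0, t1);            /* record the iteration time */

            int decision = reshape_ContactScheduler(t1 - t0, redist_time,
                                                    prow, pcol, jobid);
            if (decision == RESHAPE_EXPAND) {
                reshape_Expand();           /* merge the new BLACS context */
                reshape_Redistribute(A, ctx, nprocs, RESHAPE_EXPAND);
            } else if (decision == RESHAPE_SHRINK) {
                reshape_Redistribute(A, ctx, nprocs, RESHAPE_SHRINK);
                reshape_Shrink();           /* retired processes exit here */
            }                               /* else: continue unchanged    */
        }
        MPI_Finalize();
        return 0;
    }

Compared with Figure 2(a), the only structural change is the bracketing of the outer loop with reshape_Log and reshape_ContactScheduler and the two resize paths, matching the expand and shrink sequences of Figure 1(b).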
3.3 Data Redistribution Algorithm

The data redistribution library in ReSHAPE uses an efficient algorithm for redistributing block-cyclic arrays between processor sets organized in a 1-D (row or column format) or checkerboard processor topology. The algorithm for redistributing a 1-D block-cyclic array over a one-dimensional processor topology was first proposed by Park et al. [16]. We extend this idea to develop an algorithm to redistribute both one- and two-dimensional block-cyclic data across a two-dimensional processor grid. In our redistribution algorithm, we assume the following:

– Source processor configuration: P_r × P_c (rows × columns), P_r, P_c > 0.
– Destination processor configuration: Q_r × Q_c (rows × columns), Q_r, Q_c > 0.
– The data is divided into N × N blocks; we use Mat(x, y) to refer to block (x, y), 0 ≤ x, y < N.
– The data can be equally divided among the source and destination processor sets P and Q, respectively, i.e., N is evenly divisible by P_r, P_c, Q_r, and Q_c. Each processor has an integer number of data blocks.
– The source processors are numbered P(i,j), 0 ≤ i < P_r, 0 ≤ j < P_c, and the destination processors are numbered Q(i,j), 0 ≤ i < Q_r, 0 ≤ j < Q_c.

Problem Definition. We define 2-D block-cyclic distribution as follows: Given a two-dimensional array of n × n elements with block size NB and a set of P processors arranged in a checkerboard topology, the data is partitioned into N × N blocks and distributed across the P processors, where N = n/NB. Using this distribution, a matrix block Mat(x, y) is assigned to the source processor P_c * (x % P_r) + (y % P_c), 0 ≤ x < N, 0 ≤ y < N. Here we study the problem of redistributing a two-dimensional block-cyclic matrix from P processors to Q processors arranged in checkerboard topology, where P ≠ Q and NB is fixed. After redistribution, the block
Mat(x, y) will belong to the destination processor Q_c * (x % Q_r) + (y % Q_c), 0 ≤ x < N, 0 ≤ y < N. A short C sketch of this owner computation follows.
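Both owner formulas are the same modular-arithmetic computation with different grid dimensions. A minimal helper (the row-major rank convention, rank = cols * row + col, is taken directly from the formulas above):

    /* Owner of block (x, y) under a block-cyclic distribution on an
     * r x c processor grid, with processors ranked row-major:
     * rank = c * (x mod r) + (y mod c). */
    static int block_owner(int x, int y, int r, int c) {
        return c * (x % r) + (y % c);
    }

    /* Example: with P = 2 x 2 and Q = 3 x 4 (the grids of Figure 3(a)),
     * block Mat(4, 2) is owned by source processor
     * block_owner(4, 2, 2, 2) = 0, i.e., P(0,0), and moves to destination
     * block_owner(4, 2, 3, 4) = 4*(4 % 3) + (2 % 4) = 6, i.e., Q(1,2). */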
Fig. 3. (a) Data layout in the source (P = 2 × 2) and destination (Q = 3 × 4) processors. (b) Creating the communication schedule table (C_Transfer) from the Initial Data-Processor Configuration (IDPC) table and the Final Data-Processor Configuration (FDPC) table.
Redistribution Terminologies.

(a) Superblock: Figure 3(a) shows the checkerboard distribution of block-cyclic data on the source (2 × 2) and destination (3 × 4) processor grids. The b00 entry in the source layout table indicates that the block of data is owned by processor P(0,0), a block denoted b01 is owned by processor P(0,1), and so on. The number in the top right corner of every block indicates the id of that data block. From this data layout, a periodic pattern can be identified for redistributing data from the source to the destination layout. The blocks Mat(0,0), Mat(0,2), Mat(2,0), Mat(2,2), Mat(4,0) and Mat(4,2), owned by processor P(0,0) in the source layout, are transferred to processors Q(0,0), Q(0,2), Q(2,0), Q(2,2), Q(1,0) and Q(1,2). This mapping pattern repeats itself for the blocks at the same relative positions in the next set of columns, i.e., Mat(0,4), Mat(0,6), Mat(2,4), Mat(2,6), Mat(4,4) and Mat(4,6). Thus we can see that the communication pattern of the blocks Mat(i, j), 0 ≤ i < 6, 0 ≤ j < 4, repeats for the other blocks in the data. A superblock is defined as the smallest set of data blocks whose mapping pattern from source to destination processors can be uniquely identified. For a 2-D processor topology data distribution, each superblock is represented as a table of R rows and C columns, where

    R = lcm(P_r, Q_r)
    C = lcm(P_c, Q_c)

The entire data is divided into multiple superblocks, and the mapping pattern of the data in each superblock is identical to that of the first superblock, i.e., the data blocks located at the same relative position in all the superblocks are transferred to the same destination processor. A 2-D block matrix with
Sup elements is used to represent the entire data, where each element is a superblock. The dimensions of this block matrix are Sup_R and Sup_C, where

    Sup_R = N/R
    Sup_C = N/C
    Sup = (N/R) * (N/C)

(b) Layout: Layout is a 1-D array of Sup_R * Sup_C elements, where each element is a 2-D table which stores the block ids present in that superblock. There are Sup 2-D tables in the Layout array, and each table has dimension R × C.

(c) Initial Data-Processor Configuration (IDPC): This table represents the initial processor layout for the data before redistribution for a single superblock. Since the data-processor mapping is identical over all the superblocks, only one instance of this table is created. The table has R rows × C columns. IDPC(i, j) contains the processor id P(i,j) that owns the block Mat(i, j) located at the same relative position in all the superblocks, 0 ≤ i < R, 0 ≤ j < C.

(d) Final Data-Processor Configuration (FDPC): This table represents the final processor configuration for the data layout after redistribution for a single superblock. Like IDPC, only one instance of this table is created and used for all the data superblocks. The dimensions of this table are R × C. FDPC(i, j) contains the processor id Q(i,j) that owns the block Mat(i, j), located at the same relative position in all the superblocks, after redistribution, 0 ≤ i < R, 0 ≤ j < C.

(e) The source processor for any data block Mat(i, j) in the data matrix can be computed using the formula

    Source(i, j) = P_c * (i % P_r) + (j % P_c)

(f) Communication schedule send table (C_Transfer): This table contains the final communication schedule for redistributing data from the source to the destination layout. It is created by re-ordering the FDPC table. The columns of C_Transfer correspond to the P source processors and the rows correspond to individual communication steps in the schedule. The number of rows in this table is determined by (R * C)/P. The network bandwidth is fully utilized in every communication step, as the schedule involves all the source processors in data transfer. A positive entry in the C_Transfer table indicates that in the i-th communication step, processor j will send data to C_Transfer(i, j), 0 ≤ i < (R * C)/P, 0 ≤ j < (P_r * P_c).

(g) Communication schedule receive table (C_Recv): This table is derived from the C_Transfer table; its columns correspond to the destination processors. The table has the same number of rows as the C_Transfer table. A positive entry at C_Recv(i, j) indicates that processor j will receive data from the source processor C_Recv(i, j) in the i-th communication step, 0 ≤ i < (R * C)/P, 0 ≤ j < (Q_r * Q_c). If (Q_r * Q_c) ≥ (P_r * P_c), the additional entries in the C_Recv table are filled with -1.

A small C sketch illustrating the construction of these tables follows.
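To make the table definitions concrete, the following sketch (a minimal version, assuming the row-major processor ranking used in the formulas above) computes the superblock dimensions and fills the IDPC and FDPC tables:

    /* Build the R x C IDPC and FDPC tables for a redistribution from a
     * Pr x Pc grid to a Qr x Qc grid, where R = lcm(Pr, Qr) and
     * C = lcm(Pc, Qc). Tables are stored row-major in flat arrays. */
    #include <stdlib.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
    static int lcm(int a, int b) { return a / gcd(a, b) * b; }

    void build_config_tables(int Pr, int Pc, int Qr, int Qc,
                             int **idpc_out, int **fdpc_out,
                             int *R_out, int *C_out) {
        int R = lcm(Pr, Qr), C = lcm(Pc, Qc);
        int *idpc = malloc(R * C * sizeof *idpc);
        int *fdpc = malloc(R * C * sizeof *fdpc);
        for (int i = 0; i < R; i++) {
            for (int j = 0; j < C; j++) {
                idpc[i * C + j] = Pc * (i % Pr) + (j % Pc); /* owner before */
                fdpc[i * C + j] = Qc * (i % Qr) + (j % Qc); /* owner after  */
            }
        }
        *idpc_out = idpc; *fdpc_out = fdpc; *R_out = R; *C_out = C;
    }

For the grids of Figure 3(a) (P = 2 × 2, Q = 3 × 4), build_config_tables yields R = 6 and C = 4, so each superblock is 6 × 4 and Sup = (N/6) * (N/4) for an N × N block matrix.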
Algorithm.

Step 1: Create the Layout table
The Layout array of tables is created by traversing all the data blocks in the matrix Mat(i, j), 0 ≤ i, j < N. The superblocks in Mat are traversed in row-major order. A helper for locating a block within this layout is sketched after the pseudocode.

Pseudocode:
for superblockcount ← 0 to Sup − 1 do
  for i ← 0 to R/P_r − 1 do
    for j ← 0 to C/P_c − 1 do
      for k ← 0 to P_r − 1 do
        for l ← 0 to P_c − 1 do
          Layout[superblockcount](i * C/P_c + k, j * R/P_r + l) =
              Mat(superblockid_row * R + i * P_c + k,
                  superblockid_col * C + j * P_r + l)
  if (reached the last superblock column) then
    increment superblockid_row; superblockid_col ← 0
  else
    increment superblockid_col
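Independent of the exact traversal order above, a block's superblock id and its relative position inside that superblock follow directly from the definitions: superblocks are R × C and are numbered row-major. A minimal sketch (the function name and struct are illustrative, not from the paper):

    /* Map a global block index (x, y), 0 <= x, y < N, to its superblock
     * and to its relative position inside that R x C superblock. */
    typedef struct { int superblock; int rel_row; int rel_col; } BlockPos;

    BlockPos locate_block(int x, int y, int N, int R, int C) {
        BlockPos p;
        int sb_row = x / R, sb_col = y / C;       /* superblock coordinates  */
        p.superblock = sb_row * (N / C) + sb_col; /* row-major id, Sup_C=N/C */
        p.rel_row = x % R;                        /* position within the     */
        p.rel_col = y % C;                        /* superblock table        */
        return p;
    }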
Step 2: Create the IDPC and FDPC tables

An entry IDPC(i, j) is calculated using the indices i and j of the table and the size of the source processor set P, 0 ≤ i < R, 0 ≤ j < C. The Source function returns the processor id of the owner of the data stored at that location before redistribution. Similarly, an entry FDPC(i, j) is computed using the i and j coordinates of the table and the size of the destination processor set Q, 0 ≤ i < R, 0 ≤ j < C; the analogous formula returns the processor id of the owner of the redistributed data stored at that location.

Pseudocode:
for i ← 0 to R − 1 do
  for j ← 0 to C − 1 do
    IDPC(i, j) ← Source(i, j) = P_c * (i % P_r) + (j % P_c)
for i ← 0 to R − 1 do
  for j ← 0 to C − 1 do
    FDPC(i, j) ← Q_c * (i % Q_r) + (j % Q_c)
Step 3: Create the communication schedule tables (C_Transfer and C_Recv)

The C_Transfer table stores the final communication schedule for transferring data between the source and the destination processors. The columns in C_Transfer correspond to the source processors P(i,j). The table has C_TransferRows rows and (P_r * P_c) columns, where

    C_TransferRows = (R * C) / (P_r * P_c)

Each entry in the C_Transfer table is filled by sequentially traversing the FDPC table in row-major order. The entry corresponding to each processor is inserted in the appropriate column at the next available location; an integer counter per processor keeps track of the next available location (next row) for that processor.

Pseudocode:
for i ← 0 to R − 1 do
  for j ← 0 to C − 1 do
    processor_id ← IDPC(i, j)
    C_Transfer(counter[processor_id], processor_id) ← FDPC(i, j)
    increment counter[processor_id]

Each row in the C_Transfer table forms a single communication step in which all the source processors send data to a unique destination processor. The C_Recv table is used by the destination processors to learn the source of their data in a particular communication step:

    C_Recv(i, C_Transfer(i, j)) = j

where 0 ≤ i < C_TransferRows and 0 ≤ j < (P_r * P_c). Node contention can occur in the C_Transfer communication schedule if any one of the following conditions is true: (i) P_r > Q_r, (ii) P_c > Q_c, (iii) P_r > Q_r and P_c > Q_c. If there are node contentions in the communication schedule, create a
Processor Mapping (PM) table of dimension R × C and initialize it with the values from the FDPC table. To reduce node contentions, the PM table is circularly shifted along its rows or columns. To maintain data consistency, the same operations are performed on the IDPC table and on the superblock tables within the Layout array. The C_Transfer table is then created from the modified PM table. We identify three situations where node contentions can occur. Cases 1 and 2 are applicable during both expansion and shrinking of an application, while Case 3 can occur only when an application is shrinking to a smaller destination processor set. The following operations are performed on IDPC, on PM, and on each 2-D table in the Layout array (a C sketch of the Case 1 row shift is given after the case list).

Case 1: If P_r > Q_r and P_c < Q_c then
  1. Create (R/P_r) groups with P_r rows in each group.
  2. For 0 ≤ i < P_r, perform a circular right shift on each row i by P_c * i elements in each group.
  3. Create the C_Transfer table from the resulting PM table.

Case 2: If P_r < Q_r and P_c > Q_c then
  1. Create (C/P_c) groups with P_c columns in each group.
  2. For 0 ≤ j < P_c, perform a circular down shift on each column j by P_r * j elements in each group.
  3. Create the C_Transfer table from the resulting PM table.

Case 3: If P_r > Q_r and P_c > Q_c then
  1. Create (C/P_c) groups with P_c columns in each group.
  2. For 0 ≤ j < P_c, perform a circular down shift on each column j by P_r * j elements in each group.
  3. Create (R/P_r) groups with P_r rows in each group.
  4. For 0 ≤ i < P_r, perform a circular right shift on each row i by P_c * i elements in each group.
  5. Create the C_Transfer table from the resulting PM table.

The C_Recv table is not used when the schedule is not contention-free; node contention results in overlapping entries in the C_Recv table, rendering it unusable.
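As an illustration of the transformation in Case 1, the following sketch (a minimal version; the in-place rotation helper is our own, not from the paper) performs the circular right shifts on one R × C table stored row-major:

    /* Case 1 shift: within each group of Pr consecutive rows, circularly
     * right-shift row i of the group by (Pc * i) elements. */
    #include <stdlib.h>
    #include <string.h>

    static void rotate_right(int *row, int C, int shift) {
        int *tmp = malloc(C * sizeof *tmp);
        for (int j = 0; j < C; j++)
            tmp[(j + shift) % C] = row[j];     /* element j moves right */
        memcpy(row, tmp, C * sizeof *row);
        free(tmp);
    }

    void case1_row_shifts(int *table, int R, int C, int Pr, int Pc) {
        for (int g = 0; g < R / Pr; g++)       /* R/Pr groups of Pr rows */
            for (int i = 0; i < Pr; i++)
                rotate_right(table + (g * Pr + i) * C, C, (Pc * i) % C);
    }

The same routine must be applied to the PM and IDPC tables and to every table in the Layout array, as required by the consistency rule above; Case 2 is the analogous column rotation, and Case 3 composes the two.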
Step 4: Data marshalling and unmarshalling
If a processor's rank equals the value at IDPC(i, j), then the processor collects the data from the corresponding relative indices of all the superblocks in the Layout array. Each such collection of data over all the superblocks forms a single message for communication. If there are no node contentions in the schedule, each source processor stores (R * C) / (P_r * P_c) messages, each of size (N * N / (R * C)) blocks, in the original order of the data layout. The messages received on the destination processor are unpacked into individual blocks and stored at an offset of (R/Q_r) * (C/Q_c) elements from the previous data block in the local array; the first data block is stored at the zeroth location of the local array. If the communication schedule has node contentions, the order of the messages is shuffled according to the row or column transformations. In such cases, the destination processor performs the reverse index computation and stores the data at the correct offset. A sketch of the contention-free unpacking loop follows.
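The stride arithmetic above is the only subtle part of unpacking. A minimal sketch of the contention-free receive path (the flat block-slot layout and the start_slot parameter are illustrative assumptions; the paper specifies only the (R/Q_r) * (C/Q_c) stride and the zeroth starting location for the first block):

    /* Unpack one received message of nblocks blocks into the local array.
     * Contention-free case: consecutive blocks of a message are stored
     * (R/Qr)*(C/Qc) block-slots apart; start_slot is 0 for the message
     * whose first block belongs at the zeroth location. */
    void unpack_message(const double *msg, double *local, int nblocks,
                        int block_elems, int start_slot,
                        int R, int C, int Qr, int Qc) {
        int stride = (R / Qr) * (C / Qc);   /* offset between data blocks */
        for (int b = 0; b < nblocks; b++) {
            const double *src = msg + (long)b * block_elems;
            double *dst = local +
                ((long)start_slot + (long)b * stride) * block_elems;
            for (int e = 0; e < block_elems; e++)
                dst[e] = src[e];            /* copy one NB x NB block */
        }
    }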
Step 5: Data transfer

The message size in each send communication is equal to (N * N) / (R * C) data blocks. Each row in the C_Transfer table corresponds to a single communication step. In each communication step, the total volume of messages exchanged between the processors is P * (N * N / (R * C)) data blocks. This volume includes cases where data is locally copied on a processor without performing MPI_Send and MPI_Recv operations. In a single communication step j, a source processor P_i sends its marshalled message to the destination processor given by C_Transfer(j, i), where 0 ≤ j < C_TransferRows and 0 ≤ i < (P_r * P_c). A sketch of this communication loop follows.
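The following sketch shows one way to drive the schedule tables with MPI point-to-point calls. It is a minimal illustration under several assumptions not fixed by the paper: all processes live in one communicator, ranks 0..P−1 are the sources, ranks 0..Q−1 are the destinations (the sets may overlap), and a nonblocking send is used so that a step's send and receive cannot deadlock.

    /* Execute the schedule: in step s, source i sends its s-th packed
     * message to C_Transfer[s][i]; destination j consults C_Recv[s][j]
     * for its sender (-1 means no message this step). */
    #include <mpi.h>
    #include <string.h>

    void run_schedule(int steps, int P, int Q, int my_rank,
                      int **c_transfer, int **c_recv,
                      double **send_buf, double **recv_buf, int msg_elems) {
        for (int s = 0; s < steps; s++) {
            int dest = (my_rank < P) ? c_transfer[s][my_rank] : -1;
            int src  = (my_rank < Q) ? c_recv[s][my_rank]     : -1;
            MPI_Request req = MPI_REQUEST_NULL;

            if (dest >= 0 && dest != my_rank)   /* post my send for step s */
                MPI_Isend(send_buf[s], msg_elems, MPI_DOUBLE, dest, s,
                          MPI_COMM_WORLD, &req);
            if (dest == my_rank)                /* local copy, no MPI call */
                memcpy(recv_buf[s], send_buf[s], msg_elems * sizeof(double));
            else if (src >= 0)                  /* scheduled receive       */
                MPI_Recv(recv_buf[s], msg_elems, MPI_DOUBLE, src, s,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (req != MPI_REQUEST_NULL)
                MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }

Each iteration of the outer loop is one communication step, i.e., one row of C_Transfer; the -1 entries in C_Recv mark destinations that receive nothing in a step, matching the definition in Section 3.3.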
Data Transfer Cost. Every communication call using MPI_Send and MPI_Recv has a latency overhead associated with it. Let λ denote the time to initiate a message, and let τ denote the time taken to transmit a unit of message from a source to a destination processor. The time taken to send a message from a source processor in a single communication step is then ((N * N) / (R * C)) * τ, and the total data transfer cost for redistributing the data across the destination processors is

    C_TransferRows * (λ + ((N * N) / (R * C)) * τ)

4 Experimental Results

This section presents experimental results which demonstrate the performance of our two-dimensional block-cyclic redistribution algorithm. The experiments were conducted on 50 nodes of a large homogeneous cluster (System X). Each node has dual 2.3 GHz PowerPC 970 processors and 4 GB of main memory. Message passing was done using MPICH2 [19] over a Gigabit Ethernet interconnection network. We integrated the redistribution algorithm into the resizing library and evaluated its performance by measuring the total time taken by the algorithm to redistribute block-cyclic matrices from P to Q processors. We present results from two sets of experiments. The first set of experiments evaluates the performance of the algorithm for resizing and compares it with the Caterpillar algorithm. The second set of experiments focuses on the effects of processor topology on the redistribution cost. Table 1 shows the possible processor configurations for the various processor topologies; configurations for the one-dimensional processor topology (1 × (Q_r * Q_c) or (Q_r * Q_c) × 1) are not shown in the table. The experiments in this section use square matrices, including 8000 × 8000, 12000 × 12000, 16000 × 16000, 20000 × 20000 and 24000 × 24000; a problem size of 8000 denotes the matrix 8000 × 8000. The processor configurations listed in Table 1 evenly divide these problem sizes.
Table 1. Processor configurations for various topologies

Topology               Processor configurations
Nearly-square          …
Skewed-rectangular     …

Every time an application acquires or releases processors, the globally distributed data has to be redistributed to the new processor topology. Thus, the application incurs a redistribution overhead each time it expands or shrinks. We assume a nearly-square processor topology for all the processor sizes used in this experiment. The matrix stores data as double precision floating point numbers. Figure 4(a) shows the overhead for redistributing large dense matrices of different sizes using our redistribution algorithm. Each data point in the graph represents the data redistribution cost incurred when increasing the size of the processor configuration from the previous (smaller) configuration. Problem sizes 8000 and 12000 start execution with 2 processors, problem sizes 16000 and 20000 start with 4 processors, and the 24000 case starts with 6 processors. The starting processor size is the smallest size which can accommodate the data. The trend shows that the redistribution cost increases with matrix size, but for a fixed matrix size the cost decreases as we increase the number of processors. This makes sense because for a small processor size, the amount of data per processor that must be transferred is large. Also, the communication schedule developed by our redistribution algorithm is independent of the problem size and depends only on the source and destination processor set sizes.
Fig. 4. Redistribution overhead incurred while resizing using ReSHAPE: (a) expansion, (b) shrinking.

Figure 4(b) shows the overhead cost incurred while shrinking large matrices from P processors to Q processors. In this experiment, we assign the values for P from the set {25, 40, 50} and for Q from the set {4, 8, 10, 25, 32}. Each data point in the graph represents the redistribution overhead incurred while shrinking at that problem size. From the graph, it is evident that the redistribution cost increases as we increase the problem size. Typically, a large difference between the source and destination processor sets results in a higher redistribution cost. The rate at which the redistribution cost increases depends on the sizes of the source and destination processor sets. But we note that a smaller destination processor set size has a greater impact on the redistribution cost than the difference between the processor set sizes. This is shown in the graph, where the redistribution cost for shrinking from P = 50 to Q = 32 is lower than the cost when shrinking from P = 25 to Q = 10 or from P = 25 to Q = 8.

Figures 5(a) and 5(b) compare the total redistribution cost of our algorithm and the Caterpillar algorithm. We have not compared the redistribution costs with the bipartite redistribution algorithm because our algorithm assumes that data redistribution from P to Q processors includes an overlapping set of processors from the source and destination processor sets. The total redistribution time is the sum of the schedule computation time, the index computation time, the time for packing and unpacking the data, and the data transfer time. In each communication step, each sender packs a message before sending it and the receiver unpacks the message after receiving it. The Caterpillar algorithm does not attempt to schedule communication operations or send equal-sized messages in each step. Figure 5(a) shows experimental results for redistributing block-cyclic two-dimensional arrays from a 2 × 4 processor grid to a 5 × 8 processor grid. On average, the total redistribution time of our algorithm is 12.7 times less than that of the Caterpillar algorithm. In Figure 5(b), the total redistribution time of our algorithm is about 32 times less than that of the Caterpillar algorithm. In our algorithm, the total number of communication calls for redistributing from 8 to 40 processors is 80, whereas in Caterpillar the number is 160. Similarly, the number of MPI communication calls in our algorithm for redistributing a 2-D block-cyclic array from 8 processors to 50 processors is 196, as compared to 392 calls in the Caterpillar algorithm.
Fig. 5. Comparing the total redistribution time for data redistribution of our algorithm and the Caterpillar algorithm (block size 100 × 100): (a) redistribution overhead while resizing from 8 (2 × 4) to 40 (5 × 8) processors, (b) redistribution overhead while resizing from 8 (2 × 4) to 50 processors.

Fig. 6. Effects of skewed processor topologies on total redistribution time, for topologies 1 × (P*Q), (P*Q) × 1, P × Q (P > Q) and P × Q (P < Q): (a) 20000 × 20000 matrix, (b) 24000 × 24000 matrix.
In this experiment, we report the performance of our redistribution algorithm with four different processor topologies: one-dimensional row (row-major), one-dimensional column (column-major), skewed-rectangular row (P_r × P_c, P_r > P_c) and skewed-rectangular column (P_r × P_c, P_r < P_c). The processor configurations used for the skewed-rectangular topologies are listed in Table 1. Figures 6(a) and 6(b) show the overhead for redistributing problem sizes 20000 and 24000, respectively, across different processor topologies using our redistribution algorithm. The total cost for redistributing a 20000 × 20000 matrix across a one-dimensional topology is comparable to the total redistribution cost on a nearly-square processor topology (see Figure 4(a)). In the case of skewed-rectangular topologies, the total redistribution time is slightly higher compared to the redistribution cost with nearly-square processor topologies. We ran this experiment on other problem sizes as well and observed results similar to Figure 6(a). An increase in the total redistribution time for a skewed-rectangular topology can be due to one of two situations: (1) there is an increase in the total number of messages to be transferred using the communication schedule, or (2) node contention in the communication schedule is high.

Since the dimensions of a superblock depend upon the source and destination processor rows and columns, a change in the processor topology can change the number of elements in a superblock. As a result, the number of messages exchanged between processors will also vary, thereby increasing or decreasing the total redistribution time. Figure 6(b) shows that the total redistribution cost for a skewed processor topology suddenly increases when the processor size increases from 30 to 36; in this case the number of elements in a superblock increases to 540. Table 2 shows the total MPI send/receive counts for redistributing between different processor sets on different topologies. From Table 2, we note that data redistribution using a skewed-rectangular processor topology requires exactly half the number of send/receive operations compared to a nearly-square topology: the algorithm uses only 18 MPI send/receive operations to redistribute data from 4 to 20 processors and 36 to redistribute from 8 to 40 processors, as compared to 36 and 72, respectively, required for a nearly-square topology. In Figure 6(a), the cost of redistribution in a P < Q topology is more than the redistribution cost for a
P > Q topology. The reason for this additional overhead can be attributed to an increased number of node contentions in the communication schedule for the P < Q topology. The node contentions reduce as the processor size increases and the topology is maintained in subsequent iterations. When data is redistributed from a square topology on P processors to a skewed topology on Q processors, node contentions in the communication schedule for a destination grid with Q_r < Q_c are higher compared to the schedule for redistribution to a grid with Q_r > Q_c.

Table 2. Counting topology-dependent send/receives. (P, Q) = sizes of the source and destination processor sets.

Redistribution   Communication    Nearly-square      1-Dimensional      Skewed-rectangular
configuration    steps            Copy  Send/Recv    Copy  Send/Recv    Copy  Send/Recv
(2, 4)           2                2     2            2     2            2     2
(4, 6)           3                3     9            4     8            3     9
(4, 8)           2                2     6            4     4            2     6
(6, 9)           3                6     12           6     12           3     15
(8, 16)          2                8     8            8     8            4     12
(9, 12)          4                6     30           9     27           3     33
(12, 16)         4                12    36           12    36           12    36
(16, 20)         5                10    70           16    64           16    64
(20, 25)         5                20    80           20    80           5     95
(25, 30)         6                15    135          25    125          4     146
(25, 40)         8                7     193          20    180          25    175
(30, 36)         6                30    150          30    150          15    525
(36, 48)         4                12    132          36    108          36    108
(4, 20)          10, 5 (skewed)   2     38           4     36           2     18
(8, 40)          10, 5 (skewed)   8     72           8     72           4     36
(8, 50)          25               8     192          8     192          8     192

5 Conclusions and Future Work

In this paper we have introduced a framework, ReSHAPE, that enables parallel message passing applications to be resized during execution. We have extended the functionality of the resizing library in ReSHAPE to support redistribution of 2-D block-cyclic matrices distributed across a 2-D processor topology. We build upon the work by Park et al. [16] to derive an efficient 2-D redistribution algorithm. Our algorithm redistributes a two-dimensional block-cyclic data distribution on a 2-D grid of P (P_r × P_c) processors to a two-dimensional block-cyclic data distribution on a 2-D grid of Q (Q_r × Q_c) processors, where P and Q can be any arbitrary positive value. The algorithm ensures a contention-free communication schedule if P_r ≤ Q_r and P_c ≤ Q_c. For all other conditions involving P_r, P_c, Q_r and Q_c, the algorithm minimizes node contention in the communication schedule by performing a sequence of row or column circular shifts. We also show the ease of use of the API provided by the framework to port and execute applications that make use of ReSHAPE's dynamic resizing capability. Currently the algorithm can redistribute N × N blocks of data on P processors to Q processors only if Q_r and Q_c evenly divide N, so that all the processors have an equal, integer number of blocks. We plan to generalize this assumption so that the algorithm can redistribute data between P and Q processors for any arbitrary values of P and Q.

We are currently evaluating the ReSHAPE framework with different scheduling strategies for processor reallocation, quality-of-service, and advanced reservation services. We are also working towards adding resizing capabilities to several production scientific codes and adding support for a wider array of distributed data structures and other data redistribution algorithms. Finally, we plan to make ReSHAPE a more extensible framework so that support for heterogeneous clusters, grid infrastructure, shared memory architectures, and distributed memory architectures can be implemented as individual plug-ins to the framework.

References
1. Blackford, L.S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK Users' Guide. SIAM, Philadelphia (1997)
2. Chung, Y.C., Hsu, C.H., Bai, S.W.: A Basic-Cycle Calculation Technique for Efficient Dynamic Data Redistribution. IEEE Trans. Parallel Distrib. Syst. (1998) 359–377
3. Desprez, F., Dongarra, J., Petitet, A., Randriamaro, C., Robert, Y.: Scheduling Block-Cyclic Array Redistribution. In: Proceedings of the Conference ParCo'97. Volume 12. (1998) 227–234
4. Guo, M., Pan, Y.: Improving communication scheduling for array redistribution. J. Parallel Distrib. Comput. (2005) 553–563
5. Hsu, C.H., Chung, Y.C., Yang, D.L., Dow, C.R.: A Generalized Processor Mapping Technique for Array Redistribution. IEEE Trans. Parallel Distrib. Syst. (2001) 743–757
6. Kalns, E.T., Ni, L.M.: Processor Mapping Techniques Toward Efficient Data Redistribution. IEEE Trans. Parallel Distrib. Syst. (1995) 1234–1247
7. Kaushik, S.D., Huang, C.H., Johnson, R.W., Sadayappan, P.: An approach to communication-efficient data redistribution. In: ICS '94: Proceedings of the 8th International Conference on Supercomputing. (1994) 364–373
8. Lim, Y.W., Bhat, P.B., Prasanna, V.K.: Efficient Algorithms for Block-Cyclic Redistribution of Arrays. In: SPDP '96: Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing. (1996) 74
9. Ramaswamy, S., Simons, B., Banerjee, P.: Optimizations for efficient array redistribution on distributed memory multicomputers. Journal of Parallel and Distributed Computing (1996) 217–228
10. Thakur, R., Choudhary, A., Fox, G.: Runtime Array Redistribution in HPF Programs. In: Scalable High Performance Computing Conference, Knoxville, Tenn. (1994) 309–316
11. Thakur, R., Choudhary, A., Ramanujam, J.: Efficient Algorithms for Array Redistribution. IEEE Trans. Parallel Distrib. Syst. (1996) 587–594
12. Walker, D.W., Otto, S.W.: Redistribution of block-cyclic data distributions using MPI. Concurrency: Practice and Experience (1996) 707–728
13. Hsu, C.H., Bai, S.W., Chung, Y.C., Yang, C.S.: A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution. IEEE Trans. Parallel Distrib. Syst. (2000) 1201–1216
14. Prylli, L., Tourancheau, B.: Efficient Block-Cyclic Data Redistribution. In: Proceedings of EuroPar'96. Volume 1123 of Lecture Notes in Computer Science, Springer-Verlag (1996) 155–164
15. Lim, Y.W., Park, N., Prasanna, V.K.: Efficient Algorithms for Multi-dimensional Block-Cyclic Redistribution of Arrays. In: ICPP '97: Proceedings of the International Conference on Parallel Processing. (1997) 234–241
16. Park, N., Prasanna, V.K., Raghavendra, C.S.: Efficient Algorithms for Block-Cyclic Array Redistribution Between Processor Sets. IEEE Trans. Parallel Distrib. Syst. 10 (1999)
17. Sudarsan, R., Ribbens, C.J.: ReSHAPE: A Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment. In: ICPP '07: Proceedings of the International Conference on Parallel Processing. (2007)
18. Dongarra, J., Whaley, R.C.: A User's Guide to the BLACS v1.1. LAPACK Working Note 94 (1997)
19. Gropp, W.: MPICH2: A New Start for MPI Implementations. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface. Volume 2474 of Lecture Notes in Computer Science, Springer-Verlag (2002)