Unleashing the Power of Distributed CPU/GPU Architectures: Massive Astronomical Data Analysis and Visualization Case Study
A. H. Hassan, C. J. Fluke, and D. G. Barnes
[email protected], Centre for Astrophysics and Supercomputing, Swinburne University of Technology, PO Box 218, Hawthorn, Australia, 3122
Monash e-Research Centre, Monash University, Clayton, VIC 3800, Australia
Abstract.
Upcoming and future astronomy research facilities will systematically generate terabyte-sized data sets, moving astronomy into the petascale data era. While such facilities will provide astronomers with unprecedented levels of accuracy and coverage, the increases in dataset size and dimensionality will pose serious computational challenges for many current astronomy data analysis and visualization tools. With such data sizes, even simple data analysis tasks (e.g. calculating a histogram or computing the data minimum/maximum) may not be achievable without access to a supercomputing facility. To effectively handle such dataset sizes, which exceed today's single machine memory and processing limits, we present a framework that exploits the distributed power of GPUs and many-core CPUs, with the goal of providing data analysis and visualization tasks as a service for astronomers. By mixing shared and distributed memory architectures, our framework effectively utilizes the underlying hardware infrastructure to handle both batched and real-time data analysis and visualization tasks. Offering such functionality as a service in a "software as a service" manner will reduce the total cost of ownership, provide an easy-to-use tool to the wider astronomical community, and enable more optimized utilization of the underlying hardware infrastructure.
1. Introduction
Since they were first introduced for general purpose computing, graphics processing units (GPUs) have become a science-enabling technology across a wide variety of scientific fields, e.g. bioinformatics (Schatz et al. 2007) and weather forecasting (Michalakes & Vachharajani 2008). The lower cost per floating point operation, the low power consumption, and the sustainable speedup are all motivations to utilize GPUs as a practical high performance computing architecture, despite their being somewhat harder to program than CPUs. Within the astronomical community, astronomers have adopted GPUs to approach many data processing and simulation problems [see Fluke (2012)]. With energy and power consumption a major obstacle to further performance increases in current multi-core CPU computing architectures (Bergman et al. 2008), it is anticipated that GPUs and other many-core architectures (e.g. field-programmable gate arrays and Cell processors) will be one of the main ways to address expected petascale data analysis and visualization problems. With datasets exceeding current single machine memory limits, and with GPU memory currently relatively small (e.g. 6 GB), it is vital to effectively address the problem of data handling and synchronization over heterogeneous distributed CPU/GPU architectures.

Within this work, we present a general purpose framework to effectively utilize heterogeneous multi-core CPUs and GPUs to address data intensive high performance computing problems in astronomy.
2. Distributed GPU architecture
Figure 1 shows the main framework components. Each GPU device within a node is managed by a CPU core, which is responsible for preparing the input data, invoking the GPU kernel in a synchronous manner, performing any necessary pre/post-processing, and sharing the data with the other threads. Each thread works as an independent process with two-way communication with the master thread, which handles the communication between different threads (if needed) and the communication with the other nodes in a master-slave pattern. Communication between the master thread and the other threads is performed asynchronously using a custom message queue at each thread. Different threads can access a shared memory space, allocated by the master thread, to share data and/or update their status, which is utilized by the scheduler sub-module for task allocation. This access is controlled via one or more semaphores to ensure exclusive memory writes; atomic operations and concurrent access prevention usually degrade GPU performance significantly. Recently, GPU drivers have started to support a unified address space between GPU and CPU memory (e.g. NVIDIA CUDA 4.0), which can be utilized in this case as long as control of concurrent access to this shared memory is minimal or not required. Another hardware feature which may be beneficial to speed up data movement between different levels of the memory hierarchy is the use of multiple execution queues (or streams) to overlap GPU computation with data I/O.

All communication between distributed nodes is performed through the master threads only. Data scattering and gathering operations are performed in two stages: a local stage between GPUs and CPUs using shared memory, and a global stage over the network using the message passing interface (MPI) protocol. This partitioning, as long as it suits the problem, reduces the amount of communication by a factor of N, where N is the number of GPUs per node.

To demonstrate the performance of the presented framework, we use interactive volume rendering of larger-than-memory spectral data cubes as a case study [see Hassan et al. (2011) for the problem description and motivation]. With data exceeding single machine memory limits, real-time processing demands, and a relatively high communication overhead that scales linearly with the number of processing elements, volume rendering (and interactive visualization in general) presents one of the worst-case performance demands. It is a perfect example to demonstrate the power of mixing shared and distributed memory to achieve the highest possible performance.
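To make the two gathering stages concrete, the following is a minimal sketch, not the framework's actual code, of how per-GPU worker threads might merge partial results (here, a data minimum) into a shared, lock-guarded host value before the node's master thread issues a single MPI reduction. The constants (NUM_GPUS, CHUNK), the synthetic data, and the use of Thrust for the on-GPU reduction are illustrative assumptions.

```cpp
// Sketch: two-stage minimum reduction, assuming one worker thread per GPU.
// Stage 1: each GPU reduces its chunk; workers merge into a shared host
// value guarded by a mutex (standing in for the paper's semaphores).
// Stage 2: node masters combine their local results with MPI_Allreduce.
#include <mpi.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cfloat>
#include <mutex>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int NUM_GPUS = 2;          // GPUs per node (N in the text), illustrative
    const size_t CHUNK = 1 << 20;    // voxels per GPU, illustrative
    float node_min = FLT_MAX;        // shared-memory accumulator
    std::mutex guard;                // exclusive-write control

    // Stage 1: local gather between the GPUs and the CPU master thread.
    std::vector<std::thread> workers;
    for (int g = 0; g < NUM_GPUS; ++g) {
        workers.emplace_back([&, g] {
            cudaSetDevice(g);
            // Synthetic chunk so the sketch is self-contained.
            thrust::device_vector<float> chunk(CHUNK, 1.0f * (g + 1));
            float local = thrust::reduce(chunk.begin(), chunk.end(),
                                         FLT_MAX, thrust::minimum<float>());
            std::lock_guard<std::mutex> lock(guard);
            node_min = std::min(node_min, local);
        });
    }
    for (auto& w : workers) w.join();

    // Stage 2: one MPI message per node, not per GPU.
    float global_min;
    MPI_Allreduce(&node_min, &global_min, 1, MPI_FLOAT, MPI_MIN,
                  MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Because only the per-node result crosses the network, each node sends one message instead of N, which is the factor-of-N communication saving noted above.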
Figure 1. Schematic diagram showing the main components of the framework.The framework is utilized to synchronize the communication between K distributednodes with N GPUs each.
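The use of multiple execution queues (streams) to overlap GPU computation with data I/O, mentioned above, can be sketched as follows. This is an illustrative pattern rather than the framework's actual configuration: process_chunk is a hypothetical stand-in kernel, and the two-stream split is an arbitrary choice.

```cpp
// Sketch: overlapping kernel execution with host<->device copies using
// two CUDA streams; the copy issued in one stream can proceed while the
// kernel of the other stream executes.
#include <cuda_runtime.h>

__global__ void process_chunk(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // stand-in for real per-voxel work
}

int main() {
    const size_t N = 1 << 24, HALF = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned memory enables async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    // Each stream copies its half in, processes it, and copies it back.
    for (int i = 0; i < 2; ++i) {
        size_t off = i * HALF;
        cudaMemcpyAsync(d + off, h + off, HALF * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process_chunk<<<(HALF + 255) / 256, 256, 0, s[i]>>>(d + off, HALF);
        cudaMemcpyAsync(h + off, d + off, HALF * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```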
The presented framework is developed as a server-side rendering back-end, with a remote visualization QT desktop viewer (http://qt.nokia.com/products/) to enable user interactivity and result display. The CUDA driver API was utilized to implement the GPU part, with MPI as the main communication backbone between nodes.

Table 1 shows the framework performance [presented as the number of frames per second (fps)] against the dataset size in GB for a fixed number of GPUs (128) and processing nodes (64 nodes with 2 GPUs each). The maximum achieved performance is 2.5 teravoxels processed per second. The amount of data exchanged is theoretically related to the output resolution in megapixels per GPU. Due to the communication optimizations and the two-level gathering described before, the amount of data exchanged is reduced by at least 50% (Hassan et al. 2011). The main distributed communication processing pattern was master-slave communication with no data compression.

Table 1. Performance output of the larger-than-memory volume rendering problem with different datasets ranging from 4 to 204 GB cubes over 128 GPUs and 64 nodes (2 GPUs per node). Columns: Dimensions (Data Points); File Size; Tesla C1060 (240 cores and 4 GB memory); Tesla C2050 (448 cores and 3 GB memory).
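As a rough illustration of the per-frame master-slave cycle behind these numbers, the sketch below assumes a maximum-intensity-style composite in which each node contributes one partial frame per rendered frame; render_subvolume is a hypothetical placeholder for the GPU ray-casting step, and the resolution constants are illustrative, not the published configuration.

```cpp
// Sketch of a per-frame master-slave cycle for the volume renderer.
#include <mpi.h>
#include <algorithm>
#include <vector>

static const int W = 1024, H = 1024;  // output resolution, illustrative

// Placeholder for ray-casting this node's sub-cube; a real implementation
// would launch a CUDA kernel here.
void render_subvolume(int rank, const float* view, std::vector<float>& frame) {
    std::fill(frame.begin(), frame.end(), static_cast<float>(rank));
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float view[16] = {0};  // camera matrix, driven by the desktop viewer
    // Master (rank 0) broadcasts the current camera state each frame.
    MPI_Bcast(view, 16, MPI_FLOAT, 0, MPI_COMM_WORLD);

    std::vector<float> partial(W * H), composite(W * H);
    render_subvolume(rank, view, partial);

    // Per-pixel maximum across nodes fuses the partial images; each node
    // sends a single W x H frame regardless of how many GPUs it hosts.
    MPI_Reduce(partial.data(), composite.data(), W * H, MPI_FLOAT, MPI_MAX,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```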
3. Discussion
The presented framework mainly aims to address the design and processing constraints of real-time problems, or of problems which need different processing elements to communicate and exchange data in order to produce the final output (e.g. global-view data visualization or calculating the data median). This framework can address the data exchange and synchronization demands of different data analysis tasks for datasets exceeding current single machine memory limits, especially when in-situ data analysis is required to minimize I/O overhead. If we take, for example, an expected Australian Square Kilometre Array Pathfinder (ASKAP) spectral data cube (around 1 TB), any processing on it using a single GPU would require partitioning the cube into 170 sub-cubes, with a separate data load for each one of them. This might be possible for a single-pass accumulative operation like calculating the data minimum and maximum (a sketch of such an operation is given at the end of this section), but cannot solve multiple-pass problems such as calculating the median or standard deviation. More sophisticated data analysis tasks usually require the whole dataset to be in memory to measure global properties, and that is where our framework is most useful.

Addressing data analysis and visualization processes for such data volumes will need clever resource utilization and data movement minimization to achieve reasonable computational performance. We think distributed GPUs can play a key role in enabling such tasks with reasonable response times. We showed in Hassan et al. (2011) that for a computationally intensive problem like volume rendering, replacing CPUs with GPUs as the main processing element can dramatically reduce the number of processing nodes required. Consequently, this reduction decreases the communication overhead and the size of the computing facility required to address such problems.

Another aspect is working in a multi-user environment. With such data intensive problems we need a configurable, on-demand resource sharing model which can fit our future needs [see Ostberg (2011) for a review of different available high performance computing resource management models]. We think the private cloud, service-oriented architecture may be a better resource sharing paradigm, with software, infrastructure, and data offered as a service to the user via a remote thin-client. Preparing our software to integrate with such a model can offer large scale distributed architectures in a more affordable way, reduce the total cost of ownership for both software tools and infrastructure, and enhance access to large datasets.
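The single-pass accumulative case referenced above can be sketched as follows: the cube is streamed through the GPU in sub-cube chunks while the global minimum and maximum are accumulated on the host. The chunk count and synthetic data are illustrative assumptions; a real pipeline would read each chunk from the spectral cube on disk.

```cpp
// Sketch: single-pass min/max over a cube streamed in sub-cube chunks,
// the kind of accumulative operation that survives the 170-sub-cube
// partitioning described above. A median, by contrast, cannot be
// accumulated chunk-by-chunk: it needs a global view (or multiple
// passes) over all voxels.
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <algorithm>
#include <cfloat>
#include <cstdio>
#include <vector>

int main() {
    const size_t CHUNK = 1 << 24;   // voxels per sub-cube load, illustrative
    const int NUM_CHUNKS = 170;     // ~1 TB cube vs. ~6 GB of GPU memory
    float global_min = FLT_MAX, global_max = -FLT_MAX;

    std::vector<float> host(CHUNK);
    for (int c = 0; c < NUM_CHUNKS; ++c) {
        // Synthetic data so the sketch is self-contained; in practice each
        // chunk would be read from the spectral cube on disk.
        for (size_t i = 0; i < CHUNK; ++i) host[i] = float(c) + 0.5f;

        thrust::device_vector<float> d(host);   // one data load per chunk
        auto mm = thrust::minmax_element(d.begin(), d.end());
        global_min = std::min(global_min, float(*mm.first));
        global_max = std::max(global_max, float(*mm.second));
    }
    std::printf("min=%g max=%g\n", global_min, global_max);
    return 0;
}
```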
Acknowledgments. A. Hassan thanks the Astronomical Society of Australia for their financial support.