Application of Graphics Processing Units to Search Pipeline for Gravitational Waves from Coalescing Binaries of Compact Objects
Shin Kee Chung, Linqing Wen‡, David Blair, Kipp Cannon, Amitava Datta
School of Physics, The University of Western Australia, 35 Stirling Highway, Crawley WA 6009, Australia
Division of Physics, Mathematics and Astronomy, California Institute of Technology, 1200 California Blvd., Pasadena CA 91125, USA
School of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley WA 6009, Australia
‡ Also, International Centre for Radio Astronomy Research, UWA, 35 Stirling Hwy, Crawley, WA 6009, Australia. Email: [email protected]
Abstract.
We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16 fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.

PACS numbers: 04.25.Nx, 04.30.Db, 04.80.Nn, 95.55.Ym
1. Introduction
It is an exciting time for gravitational-wave astronomy. Several ground-based gravitational-wave (GW) detectors have reached (or approached) their design sensitivity, and are coordinating to operate as a global array. These include the three LIGO detectors in Louisiana and Washington states of the USA, and the Virgo detector in Italy. The LIGO detectors have already completed their ground-breaking fifth science run. An integrated full year's worth of science data, summing to more than 10 terabytes, has been accumulated from all three interferometers in coincidence. Enhanced LIGO started to operate in June 2009 and plans to continue until the end of 2011 with similar sensitivity [1]. Starting in 2011, an upgrade of Enhanced LIGO to Advanced LIGO (expected to operate in 2014) will enable a 10-fold improvement in sensitivity, allowing the detectors to monitor a volume of the universe 1000 times larger than current detectors. The detection rate of signals from coalescing binaries of compact objects for these advanced detectors is estimated to be tens to hundreds of events per year [1]. The detection of the first GW is virtually assured with Advanced LIGO.
1.1. Matched Filtering

Coalescing binaries of compact objects consisting of neutron stars and black holes are among the most important GW sources targeted by current large-scale GW detectors [2, 3], as these sources produce a very distinct pattern of gravitational waves. The optimal way to detect known waveforms in noisy data is to perform matched filtering. The matched filtering technique is performed by calculating the correlation between the gravitational wave data and a set of known or predicted waveform templates [4, 5]. The post-Newtonian expansion method is used to approximate the non-linear equations that describe the motion of coalescing binaries and wave generation in the creation of waveform templates [6]. For spinless, circular binary systems, each waveform is specified by a set of parameters including a pair of individual masses $I = (m_1, m_2)$, a constant orbital phase offset at formal coalescence $\alpha$, and an effective distance $D_{\rm eff}$ from the detector. In our tests, second-order post-Newtonian orbital phases and Newtonian amplitude were used.

The application of the matched filtering technique to on-going searches for gravitational waves from coalescing binaries of compact objects is described in [2] and is summarized as follows. For a given mass pair $I = (m_1, m_2)$, the template waveforms with $\alpha = 0$ and $\alpha = \pi/2$, denoted $h_{Ic}$ and $h_{Is}$, are approximately related by $\tilde{h}_{Ic}(f) = -i\,\tilde{h}_{Is}(f)$, where $\tilde{h}$ is the Fourier transform of $h$. Exploiting this, the matched-filter output $z(t)$ is a complex time series defined as

$$ z(t) = x(t) + i\,y(t) = 4 \int_0^{\infty} \frac{\tilde{h}_{Ic}(f)\,\tilde{s}^*(f)}{S_n(f)}\, e^{2\pi i f t}\, {\rm d}f \qquad (1) $$

where $S_n(f)$ is the one-sided strain noise power spectral density, $\tilde{s}^*(f)$ is the complex conjugate of the Fourier transform of the detector data $s(t)$, $x(t)$ is the matched-filter response of $h_{Ic}$, and $y(t)$ is the matched-filter response of $h_{Is}$. For the initial LIGO detectors, $S_n(f)$ is modelled as

$$ S_n(f) = 9 \times 10^{-46} \left[ \left( 4.49\,\frac{f}{f_0} \right)^{-56} + 0.16 \left( \frac{f}{f_0} \right)^{-4.52} + 0.52 + 0.32 \left( \frac{f}{f_0} \right)^{2} \right] \qquad (2) $$

where $f_0 = 150$ Hz [8]. In practice, the noise power spectral density from the Science Requirement Document is used [9]. Maximizing over the coalescence phase $\alpha$, the signal-to-noise ratio (SNR) $\rho(t)$ is the absolute value of the scalar product between a normalized template and the detector output in the frequency domain [2],

$$ \rho(t) = \frac{|z(t)|}{\sigma} \qquad (3) $$

where the normalization factor $\sigma$ is calculated from the variance

$$ \sigma^2 = 4 \int_0^{\infty} \frac{|\tilde{h}_{Ic}(f)|^2}{S_n(f)}\, {\rm d}f. \qquad (4) $$

For stationary and Gaussian noise, this $\rho$ is the optimal detection statistic for a single detector.

The number of templates needed depends on the parameter volume to be searched. The two masses of the compact binary objects were used as the main parameters in our experiment, as we were focusing on spinless and circular binaries of compact objects. The low-frequency cutoff was set to 40 Hz, while the high-frequency cutoff is the Nyquist frequency (half of the sampling frequency, 2 kHz in our case) or the frequency of the innermost stable circular orbit (ISCO, e.g. [10]) for the analyzed template, whichever is lower. In our experiments, the mass ranges were varied and the number of templates corresponding to each mass range was calculated. The mass ranges and their corresponding numbers of templates are listed in Table 1. In order to achieve $< 5\%$ mismatch (i.e., $1 - \rho/\rho_{\rm exp} < 5\%$, where $\rho_{\rm exp}$ is the expected SNR when the template waveform exactly matches the signal in the data), thousands of templates are required [4] to analyze a data segment for mass ranges of 1.4–11 solar masses for each individual member of the binary. In the currently running search pipeline described in [2], each data segment is made up of 256 seconds of detector data down-sampled to 4096 Hz, giving $2^{20}$ data points. This means that thousands of FFTs, each of approximately 1 million data points, are required to filter one data segment through the template bank.
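As an illustration of Eqs. (1)–(3), the sketch below computes the complex matched-filter series $z(t)$ with the CUDA FFT library (cuFFT): a small kernel forms the weighted product $4\,\tilde{h}_{Ic}(f)\tilde{s}^*(f)/S_n(f)$ bin by bin, and a single inverse FFT returns $z(t)$. This is a minimal sketch under our own naming and data layout (weight_template, matched_filter), not the pipeline's actual LAL code, and the normalization conventions are simplified.

    #include <cufft.h>
    #include <cuComplex.h>

    /* Form the integrand of Eq. (1), 4 * htilde(f) * conj(stilde(f)) / Sn(f),
     * one frequency bin per thread. */
    __global__ void weight_template(const cuFloatComplex *htilde,
                                    const cuFloatComplex *stilde,
                                    const float *Sn,
                                    cuFloatComplex *integrand, int nbins)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < nbins) {
            cuFloatComplex w = cuCmulf(htilde[k], cuConjf(stilde[k]));
            integrand[k] = make_cuFloatComplex(4.0f * cuCrealf(w) / Sn[k],
                                               4.0f * cuCimagf(w) / Sn[k]);
        }
    }

    /* One inverse FFT turns the weighted product into the complex time series
     * z(t) of Eq. (1); |z(t)| / sigma then gives the SNR of Eq. (3). */
    void matched_filter(cufftComplex *d_integrand, cufftComplex *d_z, int npoints)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, npoints, CUFFT_C2C, 1);   /* 2^20 points in the pipeline */
        cufftExecC2C(plan, d_integrand, d_z, CUFFT_INVERSE);
        cufftDestroy(plan);
    }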
1.2. χ² Consistency Test
In order to verify the signals and reject non-Gaussian transient noise, the χ² consistency test [11] is used as a time–frequency veto. The integral in Eq. (1) is split into $p$ frequency bands such that each contributes an equal amount to the SNR, and this yields $p$ time series, $z_l(t)$, where $l$ ranges from 1 to $p$. In stationary Gaussian noise, with or without a gravitational wave signal, the statistic [2]

$$ \chi^2(t) = \frac{p}{\sigma^2} \sum_{l=1}^{p} \left| z_l(t) - z(t)/p \right|^2 \qquad (5) $$

is a χ²-distributed random variable with $\nu = 2p - 2$ degrees of freedom. Non-Gaussian noise transients produce large values of the χ² statistic, and this can be used to reject such noise events [2].
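The count $\nu = 2p - 2$ can be made plausible with a short counting argument; the following sketch is ours, not the paper's.

    % Each band's output z_l(t) is complex, contributing 2 real degrees of
    % freedom, so the p terms |z_l - z/p|^2 start with 2p in total.
    % The complex constraint below removes 2 of them, leaving nu = 2p - 2:
    \sum_{l=1}^{p} \left( z_l(t) - \frac{z(t)}{p} \right) = z(t) - z(t) = 0 .
    % For the pipeline's p = 16 bands, nu = 2(16) - 2 = 30.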
Table 1. Mass ranges and their corresponding numbers of templates, obtained by running the actual template-generating program in the LIGO search pipeline, valid for the LIGO Hanford detector H1.
Mass Range (solar masses)    Number of templates
10.0 – 11                    7
9.0 – 11                     15
8.0 – 11                     26
7.0 – 11                     48
6.0 – 11                     85
5.0 – 11                     163
4.0 – 11                     317
3.0 – 11                     718
2.0 – 11                     2111
1.6 – 11                     3734
1.4 – 11                     5222

As the above discussion implies, much computing power is required to search for GWs from these sources. With current technology, more than 50 CPU cores are required to finish the detection phase of the analysis within the duration of the data. The χ² waveform consistency test requires another 50 CPU cores of processing power in order to complete the analysis in real time. Furthermore, determination of the sky directions of these sources requires hours to days of CPU-core time. The long time scales required for detection, verification, and localization pose a serious problem for prompt follow-up observations of these sources by optical telescopes. Such follow-up observations are expected not only to provide firm proof that a gravitational wave has been detected but also to provide insight into the physics associated with the events. Much faster processing is therefore required for real-time detection and determination of source directions.

In this paper, we propose a cost-effective and user-friendly alternative to reduce the computational cost in GW searches by using the graphics processing unit (GPU) to accelerate the data processing. The on-going searches for gravitational-wave signals in detector data are ideal for these massively parallel processors. This is due to the fact that the same algorithm is applied to different data segments independently and repeatedly, and that the latency of transferring data between GPU and CPU is negligible compared to the analysis time. We report here the result of a first test, using a GPU in a real search pipeline (the inspiral search pipeline). A previous report can be found in [12].
2. Graphics Processing Units and CUDA
Graphics processing units (GPUs) were originally designed to render detailed real-time visual effects in computer games. The demands for GPUs in the gaming industry have enabled GPUs to become low-cost but very efficient computing devices. Due to the nature of their hardware architecture, it is advantageous to use GPUs for solving parallel problems that fit the single-instruction-multiple-data (SIMD) model. While the capability of GPUs in high-performance computing has been recognized since 2005 [13], general-purpose GPU (GPGPU) parallel computing has really become viable only recently. This is due to the release of the C-programming interface CUDA (Compute Unified Device Architecture) by the NVIDIA Corporation in February 2007 [14]. The introduction of CUDA enabled scientists in a much broader community to program on GPUs by calling C libraries. Previously, one would have to translate a general problem into graphical pixel models in order to make use of GPUs. Remarkable speed-ups of up to a factor of hundreds have been reported in many applications, including important astronomical applications [13, 15]. A sizable CUDA library is now available for basic scientific computations. This includes linear algebraic computation, FFT, and tools for Monte-Carlo simulation. The use of GPGPU techniques as an alternative to distributed computer clusters has also become a real possibility.

One successful application of GPU acceleration was in the computation of molecular dynamics. Anderson, Lorenz and Travesset [16] implemented CUDA algorithms to handle the core calculations of molecular dynamics. Anderson et al. in fact slightly altered one of the core algorithms of molecular dynamics for CUDA so that some of the calculations are repeated instead of accessing the same information from memory. This avoided the problem of CUDA's inefficiency in accessing random memory locations. The CUDA program, running in a system with one single GPU and one single CPU, was found to perform at the level of a fast computer cluster with 36 cores. Such a cluster consumes more power than a single computer with a GPU [16].

There exist several studies of CUDA implementations in the field of astronomy. Belleman et al. [17] studied CUDA implementations of N-body simulations, following the studies of Zwart et al. [13], who used GPUs in N-body simulations before the release of CUDA. The CUDA implementation of N-body simulations developed by Belleman et al. was able to achieve up to 100 times speed-up compared to a CPU. Harris et al. [15] conducted CUDA implementations for calculating signal convolutions for a radio astronomy array. In this application, signals from each antenna are combined using convolutions. This allows an array of antennas to be used to achieve high angular resolution. The CUDA implementation of this process showed a two-orders-of-magnitude speed-up.
2.1. Programming in CUDA
Programming for the GPU with CUDA is different from general-purpose programming on the CPU due to the extremely multi-threaded nature of the device. For an algorithm to execute efficiently on the GPU, it must be cast into a data-parallel form with many thousands of independent threads of execution running the same lines of code, but on different data. Because of this simultaneous execution, one thread cannot depend upon the output of another, which can pose a serious challenge when trying to cast some algorithms into a data-parallel implementation.

Specifically, in CUDA, these independent threads are organized into blocks which can contain anywhere from 32 to 512 threads each, but all blocks must have the same number of threads. Each block executes identical lines of code and is given an index to identify which piece of the data it is to operate on. Within each block, threads are numbered to identify the location of each thread within the block. Any real hardware can only have a finite number of processing elements that operate in parallel. In particular, the single GeForce 8800 Ultra GPU that was used in our work contains 16 multiprocessors. A single multiprocessor can execute a number of blocks concurrently (up to resource limits) in warps of 32 threads.

CUDA programs are divided into two parts: host functions and kernel functions. The host functions are the code that runs on the CPU and is able to invoke kernel functions. Kernel functions are the code that runs on the GPU. These functions are automatically executed by the threads in each block of the GPU. CUDA programmers need to specify the number of blocks and the number of threads per block for the kernel functions. Threads in the same block can be synchronized (hold and wait until all threads have finished previous tasks) and can communicate (access the data or output of other threads). However, threads in different blocks cannot be synchronized. This imposes a great restriction in CUDA programming. We need to choose the number of blocks and threads carefully to obtain the highest performance.

The memory structure of a GPU is organized in a convenient way. There is a global memory (768 MB for the GeForce 8800 Ultra) that can implement read-and-write operations simultaneously. This hides the latency of data access between processors and memory. Meanwhile, each block of threads executing on a multiprocessor has access to a smaller but faster shared memory (16 KB for the GeForce 8800 Ultra GPU). Proper use of the threads and memory access is therefore crucial for optimizing the performance of GPU computing. It is also necessary to copy data between the CPU and the GPU. This introduces a latency that needs to be taken into account when optimizing GPU computing. Specifically, this means that the algorithms must be designed so that the computational time is much larger than this latency. A minimal example in the spirit of this description is sketched below.
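The following self-contained example illustrates the host/kernel split, block and thread indexing, and the host–device copies just described. The kernel and all names are ours, chosen only for illustration.

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Kernel function: every thread runs the same code on a different element,
     * locating its datum from its block and thread indices (SIMD-style). */
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)               /* guard: n need not divide the thread count */
            x[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20;   /* ~1 million points, as in the pipeline */
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        /* Host-to-device copy: this is the latency that must stay small
         * compared with the compute time. */
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;       /* a multiple of the warp size, 32 */
        int blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d, 2.0f, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("x[0] = %f\n", h[0]);
        cudaFree(d);
        free(h);
        return 0;
    }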
3. Application of GPUs to Inspiral Search Pipeline
The search pipelines for GWs from coalescing compact binaries have been under development since 2000 [19]. The current pipeline [2] has been used for the past five successful science runs on real data from the LIGO detectors [20], and the source code is publicly available in the LSC Algorithm Library (LAL) [21].

According to our experiments, the most time-consuming part (80% of the total run time) of the existing search pipeline is the forward Fourier transform and its inverse. Our implementation replaces the Fastest Fourier Transform in the West (FFTW) [22] used by the existing pipeline with the CUDA Fast Fourier Transform [14], denoted CUDA FFT in this paper. The CUDA FFT also provides functions that can calculate several FFTs in parallel, as in FFTW. We identified modules in the pipeline that perform FFTs in sequence and rewrote them using batched CUDA FFTs.

Our second task was to accelerate the χ² waveform consistency test described in section 1.2. This is by far the most computationally intensive module in the pipeline. Within the program, the most time-consuming part lies in a loop of FFTs that operate on different data segments in series, and a double loop that calculates the χ² statistics from the output of the FFTs. We took advantage of data parallelism by copying large segments of data into the global memory of the GPU card and performing the χ² calculation in parallel on these data segments.

In summary, our application of GPUs to the inspiral search pipeline includes the use of the existing CUDA FFTs for the SNR and χ² calculations, and the direct implementation of parallel computation for the χ² calculation. A stand-alone version of the inspiral search pipeline (which normally runs on computer clusters) was run on a single core of a Dell Inspiron 530 computer with a 2.5 GHz Intel Core 2 Quad 9300 CPU. The GPU card used for the timing comparison is an NVIDIA GeForce 8800 Ultra installed in the same computer, using CUDA version 1.1. Stationary coloured Gaussian noise with a spectrum matching the Initial LIGO Science Requirement Document [9] for the Hanford detector H1 was used for the test. In all our tests and implementations, we purposely kept all relevant search parameters close to a real search as implemented in the past few science runs. Further acceleration could be expected if we considered flexible search parameters or rewrote a much larger fraction of the code.

As described in the previous subsection, the GW search pipeline spends the majority of its time performing FFTs. LAL uses the Fastest Fourier Transform in the West (FFTW) for performing FFTs. FFTW was developed by Frigo and Johnson to improve the performance of FFT calculations on CPUs [22]. The main feature of FFTW is that it uses a planner to learn the fastest way to perform an FFT on a given computer. This planner constructs plans for executing FFTs in a fast way, and the plans are re-used for each execution of the FFT on that particular computer [22]. Similarly, CUDA has a complete FFT library, developed by NVIDIA, which also uses a planner following the design of FFTW; a batched-plan sketch is given below.
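As a sketch of how such batching looks at the CUDA FFT level, one plan created with a batch count executes many transforms in a single call, and the expensive planner-like step is paid only once. This is illustrative only; the actual LAL interface wraps these calls differently.

    #include <cufft.h>

    /* Create one plan for 'nbatch' transforms of length N and reuse it:
     * plan creation is the costly, planner-like step; execution is cheap. */
    void batched_ffts(cufftComplex *d_data, int N, int nbatch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, nbatch);            /* one plan, nbatch FFTs */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   /* in-place, all at once */
        cufftDestroy(plan);
    }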
Figure 1 shows that the CUDA FFT can execute about 40 FFTs per second on $2^{20}$ data points, while FFTW can execute about 8 FFTs per second. That means the CUDA FFT performs 5 times faster than FFTW. Figure 2 shows about 7.5 times speed-up for $4 \times 2^{20}$ data points. In the inspiral search pipeline, the FFTs are performed on $2^{20}$ data points. Our results indicate that more speed-up can be achieved if more data points are used. Overall, it is only worth the effort to use CUDA FFTs if a large number of FFTs are to be executed.

We developed an interface to the CUDA FFT to be used by the GW search community in general.
Figure 1. FFTs executed per second as a function of the total number of FFTs executed, with $2^{20}$ data points each. The green solid line shows the number of FFTs executed by CUDA, while the red dashed line shows the number of FFTs executed by FFTW.
Figure 2. FFTs executed per second as a function of the total number of FFTs executed, with $4 \times 2^{20}$ data points each. The green solid line shows the number of FFTs executed by CUDA, while the red dashed line shows the number of FFTs executed by FFTW.
Figure 3. Flow chart comparing the sequential and the batched FFT execution in the χ² calculation.

This was done by adding an interface into LAL for calling the CUDA FFT library, which can be manually activated or deactivated by LAL users. This is the first time that a GPU interface has been successfully developed for LAL and tested with real applications.

We applied GPU data parallelism to the most computationally intensive and time-consuming stage of the gravitational-wave search pipeline, the χ² test. The χ² test splits the inspiral template into 16 pieces in the frequency domain and convolves the Fourier transform of the input data with the split pieces, yielding 16 time series that represent the contribution to the template's net SNR (as a function of time) from each of its 16 pieces. Suitably normalized, the sum of the squared magnitudes of these 16 time series is χ²-distributed when the input data contain stationary Gaussian noise and a possibly-absent gravitational-wave signal, and testing for this forms the basis of a waveform consistency test.

The conversion of the χ² test to a GPU implementation was done in two parts. Firstly, 16 sequential inverse FFTs were replaced with 16 parallel inverse FFTs. This part was implemented by calling existing CUDA functions from the host code. The comparison of this implementation to the original one is shown in Figure 3. Secondly, we implemented GPU data parallelism in the χ² test described in section 1.2. Table 2 shows the χ² implementation in C and the GPU-accelerated implementation. Here $a_{i,l}$ and $b_{i,l}$ represent the real and imaginary parts of $z_l(t) - z(t)/p$ in Eq. (5), respectively; $N$ is the number of data points being analyzed, $i$ ranges from 0 to $N - 1$, and $p$ is the total number of frequency bands. In the original implementation, a double loop calculates the χ² values sequentially, with $p = 16$ and $N = 2^{20}$. A total of $16 \times 2^{20}$ χ² values were thus calculated independently. For each time $t$, the values from all 16 frequency bands were then added (cf. Eq. (5)), yielding a total of $2^{20}$ outputs. In the CUDA implementation, we replaced this double loop with a single loop in the parallel threads. This part was implemented in a custom kernel function in which $4 \times 2^{10}$ blocks of $2^{9}$ threads each were used. Each thread calculates eight χ² values and sums them together sequentially. Adjacent threads were used to calculate χ² values of the same frequency band. The results from every two adjacent threads are then summed at the end of this single loop, after a synchronization is executed. This approach was found to give optimal performance. The thread numbers were chosen to be multiples of the GPU warp size, 32 (explained in Section 2.1), and to divide the loop number exactly.
Table 2. Algorithm comparison of the χ² implementation in C and in CUDA.

    Original C implementation:
        for(i=0; i<N; i++)
            for(l=0; l<p; l++)
                chisq[i] += a[i][l]*a[i][l] + b[i][l]*b[i][l];

    CUDA implementation (thread i, in parallel over i):
        for(l=0; l<p; l++)
            chisq[i] += a[i][l]*a[i][l] + b[i][l]*b[i][l];
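A sketch of such a kernel is shown below: each thread accumulates eight squared magnitudes from one frequency band into shared memory, the block synchronizes, and the results of every two adjacent threads are summed, following the description above. The array names, the band-major data layout, and the helper names (d_a, d_b, d_partial) are ours, not the pipeline's.

    /* Assumes 512 threads per block, as inferred in the text, and band-major
     * storage so that adjacent threads work on the same frequency band. */
    __global__ void chisq_partial(const float *d_a, const float *d_b,
                                  float *d_partial, int samples_per_thread)
    {
        __shared__ float acc[512];                 /* one slot per thread */
        int tid  = threadIdx.x;
        int gid  = blockIdx.x * blockDim.x + tid;  /* global thread index */
        int base = gid * samples_per_thread;       /* eight samples per thread */

        float sum = 0.0f;
        for (int k = 0; k < samples_per_thread; k++) {
            float a = d_a[base + k];               /* Re{ z_l(t) - z(t)/p } */
            float b = d_b[base + k];               /* Im{ z_l(t) - z(t)/p } */
            sum += a * a + b * b;
        }
        acc[tid] = sum;
        __syncthreads();                           /* wait for the whole block */

        /* Sum the results of every two adjacent threads, as in the text. */
        if ((tid & 1) == 0)
            d_partial[gid / 2] = acc[tid] + acc[tid + 1];
    }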
Figure 4. The run time of the inspiral search pipeline without performing the χ² test. The green solid line shows the run time of the inspiral search with the GPU, while the red dashed line shows the run time of the inspiral search without the GPU.

Figure 5. The speed-up factor, calculated as the run time without the GPU divided by the run time with the GPU.

Figure 6. The run time for executing the inspiral search with the χ² veto enabled, both with and without GPU acceleration. The green solid line shows the run time of the inspiral search with the GPU, while the red dashed line shows the run time of the inspiral search without the GPU.

About 16 times speed-up in total was observed with the GPU implementation. This means that the number of computers needed to perform the analysis in the same amount of time can be significantly reduced. A normal computer with integrated graphics consumes about 220 W of power, or 3520 W for 16 single-core computers, or 880 W for 4 quad-core computers. In comparison, a single computer with a GeForce 8800 Ultra consumes about 340 W of power. We could therefore save some hardware costs and also reduce power consumption by a significant amount.

An accuracy test was performed by calculating the fractional difference between the outputs produced by the new inspiral search, with the CUDA FFT and the parallel χ² implementation, and the original inspiral search with FFTW, in the mass range of 3.0–11.0 solar masses. About 5 × events were identified from the data, each with measured SNR and χ² values. We found that more than 99% of the SNRs had less than 0.03% difference, while 99% of the χ² values had less than 0.5% difference.

4. Conclusion

We have shown that GPUs can significantly improve the speed of gravitational wave data analysis. A speed-up of 4 to 5 fold in the existing inspiral search pipeline can be achieved by simply enabling the CUDA FFT. Note that the CUDA FFT has already been introduced into LAL, meaning that other GW search pipelines can use it provided GPUs are available. We achieved a 16 fold speed-up in total by using a specially-written parallel GPU implementation of the χ² test, a waveform consistency test used within the pipeline.

Figure 7. The speed-up factor (see text in Section 3.3) using the CUDA implementation of the χ² calculation.

We expect further speed-ups if we are allowed to change some of the search parameters. For instance, if we change the number of data points per FFT from the currently used 1 million to 4 million, another factor of 2 speed-up can be achieved. Also, further acceleration is expected if we replace more components of the pipeline with specially-written GPU implementations.

Our experiments were performed using a single GPU, while current new personal computers can be equipped with more than 3 GPUs. We would expect more than a 48 fold speed-up using a 3-GPU system when running a single-threaded search pipeline. Furthermore, if we could use the newest GPU on the market, which has about 1 TFLOPS of computing power, and assuming that the performance of these GPUs scales linearly, we would expect more than a 100 fold speed-up on a single-core desktop computer.

Acknowledgments

We are grateful to Chad Hanna, Drew Keppel, Phil Ehrens, Stuart Anderson, Alan Weinstein, Yanbei Chen, Karen Haines, Shaun Hooper, Adam Mercer and Oliver Bock for discussions of this work.
This work is supported in part by the David and Barbara Groce start-up fund at Caltech, the Caltech SURF program, the Alexander von Humboldt Foundation's Sofja Kovalevskaja Programme (funded by the German Federal Ministry of Education and Research), the Australian Research Council, and the WA Centres of Excellence Program.

5. References