Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs
Xiangyu Guo, Qi Chu‡, Shin Kee Chung, Zhihui Du§ and Linqing Wen‖

Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
School of Physics, The University of Western Australia, M468, 35 Stirling Hwy, Crawley, WA 6009, Australia
Abstract.
Low-latency detections of gravitational waves (GWs) are crucial to enable prompt follow-up observations of astrophysical transients by conventional telescopes. We have developed a low-latency pipeline using a technique called Summed Parallel Infinite Impulse Response (SPIIR) filtering, realized on a Graphics Processing Unit (GPU). In this paper, we exploit the new Maxwell memory access architecture in NVIDIA GPUs, namely the read-only data cache, warp-shuffle, and cross-warp atomic techniques. We report a 3-fold speed-up over our previous implementation of this filtering technique. To tackle SPIIR with relatively few filters, we develop a new GPU thread configuration with a nearly 10-fold speedup. In addition, we implement a multi-rate scheme of SPIIR filtering using Maxwell GPUs. We achieve more than 100-fold speed-up over a single-core CPU for the multi-rate filtering scheme. This results in an overall 21-fold CPU usage reduction for the entire SPIIR pipeline.

PACS numbers: 04.80.Nn, 95.75.-z, 97.80.-d, 97.60.Gb
1. Introduction
We are entering an exciting time of gravitational wave (GW) astronomy. The first GW signal was detected in September 2015 by the Laser Interferometer Gravitational-Wave Observatory (LIGO) [1]. This opens up a new window to multi-messenger astronomy with unprecedented power of discovery, probing some of the most enigmatic transients in the sky through their emissions in both the electromagnetic and gravitational wave spectra, e.g., short or long gamma-ray bursts produced in binary coalescences and core-collapse supernovae [2, 3]. In particular, low-latency detection and localization of GW sources are gaining priority in order to enable prompt electromagnetic (EM) follow-up observations of GW sources. Capture of transient EM events triggered by low-latency GW alerts has been studied in, e.g., [4].

‡ Contributed equally to this work with Xiangyu Guo. Corresponding Author's Email: [email protected]
§ Corresponding Author's Email: [email protected]
‖ Corresponding Author's Email: [email protected]

Low-latency GW searches are currently implemented in several CBC search pipelines, which additionally allow input from multiple detectors and employ template-based statistical tests to veto transient noise and produce GW alerts to the community. As of this writing, these pipelines have achieved a medium latency of less than one minute. This paper is focused on the SPIIR method. Compared to the FFT technique, it is expected to be more efficient at latencies lower than tens of seconds for advanced detectors [9].

Our previous work on the GPU-accelerated SPIIR filtering method used NVIDIA's Fermi GPUs, with a speedup of the order of 50-fold over a single-core Intel i7 CPU [12]. However, that GPU optimization targeted SPIIR filtering with 128 to 256 filters. In the multi-rate filtering scheme, the total set of filters is split across data streams at different sampling rates, so the number of filters at some sampling rates can be as small as a few. In this paper, we extend the GPU optimization of SPIIR filtering to various numbers of filters and explore the features of the recent Maxwell GPUs. Our optimization here achieves a 3-fold improvement over the previously targeted range and up to a 10-fold speedup beyond that range, compared to the previous work. Compared to SPIIR filtering on a single-core CPU, our GPU acceleration is now 60 to 125 times faster, depending on the number of filters to be applied.

In addition to the SPIIR filtering, we have extended the GPU acceleration to other components of the SPIIR pipeline. Another computational bottleneck of the SPIIR pipeline is the sampling-rate alteration of multi-rate filtering. In multi-rate filtering, a set of filters is applied to data at different sampling rates, the results of which are combined back at the initial rate. Our acceleration of the multi-rate filtering scheme realizes a 100-fold reduction in CPU resources for this component, and hence a 21-fold reduction in CPU resources for the entire SPIIR pipeline. The filtering process in this paper uses the single-precision floating-point format, the results of which differ negligibly from those using the double-precision format.
2. Multi-rate SPIIR filtering
Matched filtering is known to be the optimal detection method for deep searches for signals in stationary Gaussian noise. Here, we consider a CBC waveform template in the time domain, $h(t)$. The optimal detection output, known as the signal-to-noise ratio (SNR), is given by

$z(t) = \int_{-\infty}^{t} x(t')\, h(t' - t)\, dt'$,   (1)

i.e. the cross-correlation between the whitened waveform template $h(t)$ and the whitened detector strain $x(t)$, which is given by

$x(t) = \int_{-\infty}^{\infty} \frac{\tilde{s}(f)}{\sqrt{S_n(f)}}\, e^{2\pi i f t}\, df$,   (2)

where $\tilde{s}(f)$ is the Fourier transform of the detector data $s(t)$ and $S_n(f)$ is the one-sided noise spectral density of a detector, defined through the expectation $E$:

$E(\tilde{n}(f)\,\tilde{n}^*(f')) = \frac{1}{2} S_n(f)\, \delta(f - f')$.   (3)

By sampling at discrete instances $t = k\Delta t$ at a sampling rate $1/\Delta t$ ($k = 0, 1, \cdots$), Eq. 1 can be rewritten in a discrete form,

$z_k = \sum_{j=-\infty}^{k} x_j\, h_{j-k}\, \Delta t$.   (4)

CBC searches usually adopt the matched filter $h^*(-t)$, which turns Eq. 1 into a convolution integral. Quite different from the common FIR method, the SPIIR method utilizes IIR filters to reconstruct the matched filter. A first-order IIR filter has the simple form shown in Eq. 5 and Eq. 6:

$y_k = a_1 y_{k-1} + b_0 x_k$,   (5)

where $y_k$ is the filter output at time step $k$ ($t_k = k\Delta t$), $x_k$ is the filter input, and $a_1$ and $b_0$ are complex coefficients. A solution to this first-order linear inhomogeneous difference equation is

$y_k = \sum_{j=-\infty}^{k} x_j\, b_0\, a_1^{k-j}$.   (6)

For a target as complex as a CBC signal, a group of first-order IIR filters is constructed, with each filter representing a small segment of the matched filter [9, 11]. The number of IIR filters needed to reconstruct a highly accurate matched filter varies from a dozen to several hundred, depending on the complexity of the waveform and the limit of the detection band. The filter construction procedure can be found in [9, 11]. Here, we express the output of a SPIIR filter as

$y_{k,l} = a_{1,l}\, y_{k-1,l} + b_{0,l}\, x_{k-d_l}$,   (7)

where $y_{k,l}$ is the output of the $l$th filter and $d_l$ is the time delay for this filter. The discrete form of the SNR output from a group of SPIIR filters is given by

$z_k \simeq \sum_{l} y_{k,l}$.   (8)

Figure 1: Schematic diagram of the multi-rate implementation of the SPIIR filtering scheme. The input and output sampling rates are $R$ Hz. $M_1, M_2, \ldots, M_H$ represent the numbers of filters to be applied at the corresponding rates.

Our implementation of the multi-rate filtering scheme for GPU acceleration is shown in Fig. 1. For implementation convenience, the data is downsampled by factors of 2 in succession. Each sample-rate stream is filtered using the corresponding SPIIR filters. The filtering output of the lowest rate is upsampled and added to the filtering output of the next rate up, and so on, until the full-rate SNR is recovered.
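To make Eqs. 7 and 8 concrete, the following minimal serial sketch (plain C++; the function name and data layout are ours, not the pipeline's actual code) advances every filter state by one sample and accumulates the complex SNR:

#include <complex>
#include <vector>

// One SPIIR step, Eqs. (7)-(8): each first-order IIR filter l keeps a single
// state value y[l]; the SNR sample z_k is the sum of all filter outputs.
// Assumes k - d[l] >= 0, i.e. x holds enough history for every delay d[l].
std::complex<float> spiir_step(const std::vector<std::complex<float> >& a1,
                               const std::vector<std::complex<float> >& b0,
                               const std::vector<int>& d,   // per-filter delays d_l
                               const float* x, int k,       // whitened input, step k
                               std::vector<std::complex<float> >& y)
{
    std::complex<float> z(0.0f, 0.0f);
    for (std::size_t l = 0; l < a1.size(); ++l) {
        y[l] = a1[l] * y[l] + b0[l] * x[k - d[l]];  // Eq. (7)
        z += y[l];                                  // Eq. (8)
    }
    return z;
}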
The sub-rate at which each SPIIR filter is applied is determined by the $a_{1,l}$ coefficient of the filter. The $a_{1,l}$ coefficient determines the upper bound of the frequency band of the filter, and thus the Nyquist rate [13] for the filter to function. We round the Nyquist rate of each filter up to the nearest available rate for the filter to work on.

To avoid the known spectral-leakage problems caused by the squared window, we adopt the popular Kaiser-window-tapered low-pass filter implemented in the open-source gstreamer library¶ to perform the interpolation for resampling. The resampling formula is shown in Eq. 9,

$x(mT_s') = \sum_{n=-(K-1)}^{K} x(nT_s)\, \mathcal{K}(mT_s' - nT_s)$,   (9)

where $x$ is the data and $\mathcal{K}$ represents the Kaiser-window-tapered low-pass filter. $m$ and $n$ are discrete sampling points. $F_s = 1/T_s$ is the original sampling rate and $F_s' = 1/T_s'$ is the resampled rate. $x$ is assumed to be band-limited to $\pm F_s/2$, and $K$ is the length of the filter. We call the Kaiser-window-tapered low-pass filter the 'Kaiser filter' in the rest of the paper.

A single parameter controls the quality of a Kaiser filter, measured by its stop-band attenuation. The gstreamer library provides 11 Kaiser filters with stop-band attenuation of up to 100 decibels (dB). To avoid band aliasing in downsampling, we choose the Kaiser filter with a stop-band attenuation of ∼100 dB. For upsampling, we choose the Kaiser filter with a stop-band attenuation of ∼60 dB, which we found works most efficiently while maintaining signal recovery quality in practice.

¶ gstreamer library: http://gstreamer.freedesktop.org/

The expected computational efficiency of our multi-rate scheme can be estimated as follows. We denote the total number of SPIIR filters of a given template as $M$, the full rate under consideration as $R$, and the number of search templates as $N$. According to Eq. 7 and Eq. 8, filtering one data point requires 12 floating-point operations. Thus the total number of floating-point operations per second (flops) for SPIIR filtering at the full rate $R$ is $12NMR$. The 100 dB downsampling Kaiser filter has 384 steps in gstreamer and the 60 dB upsampling Kaiser filter has 32 steps; the cost of resampling and of summing the filtering results across sample rates is negligible in comparison. If half of the filters can be applied at the sub-rate $R/2$, the cost is reduced by 25%. A saving of a factor of a few in computational cost is expected if more filters can be applied at even lower rates.
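As a concrete check of this estimate (our arithmetic, using the quantities defined above): if half of the $M$ filters per template run at the full rate $R$ and the other half at the sub-rate $R/2$, the filtering cost becomes

$12N\frac{M}{2}R + 12N\frac{M}{2}\frac{R}{2} = 12NMR\left(\frac{1}{2} + \frac{1}{4}\right) = 0.75 \times 12NMR,$

i.e. 25% less than running all filters at the full rate.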
3. Optimization of Multi-rate SPIIR filtering on Maxwell GPUs
There are several interfaces for general-purpose GPU programming, including NVIDIA's Compute Unified Device Architecture (CUDA) [14] language, the Khronos Group's Open Computing Language (OpenCL) [15], and Microsoft's C++ Accelerated Massive Parallelism (C++ AMP) [16]. In this work we use CUDA, and we exploit new features of the Maxwell GPUs, namely the warp-shuffle and atomic operation techniques (Sec. 3.1.3).

We provide a general explanation of the relation between the GPU hardware and the CUDA semantics for reference here. A GPU chip consists of several Streaming Multiprocessors (SMs). The Maxwell GPU features an improved SM architecture renamed SMM. One SMM has many processing cores and is capable of creating, managing, scheduling, and executing CUDA threads. CUDA threads are executed in groups of 32, called "warps". A group of CUDA warps aggregates into a CUDA block. While one block is limited to one SMM, one SMM can have several blocks running concurrently. A GPU has a hierarchy of memory spanning a range of access speeds and storage sizes; the choice of memory to use may greatly affect the overall GPU performance.
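As a minimal illustration of this hierarchy (a toy kernel of ours, not part of the pipeline), each CUDA thread can derive its global index, its warp, and its lane within the warp as follows:

#include <cstdio>

// Toy kernel: every thread computes its global index, the warp it belongs to,
// and its lane (position within the 32-thread warp); lane 0 reports per warp.
__global__ void whoami()
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / warpSize;                // warp index within block
    int lane = threadIdx.x % warpSize;                // lane within the warp
    if (lane == 0)
        printf("block %d warp %d covers threads %d..%d\n",
               blockIdx.x, warp, gid, gid + warpSize - 1);
}

int main()
{
    whoami<<<2, 128>>>();       // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();    // wait for device printf to flush
    return 0;
}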
Our previous GPU optimization [12] considers SPIIR filtering with a few hundred filters. It is not highly optimized for filtering with a small number of filters, which is likely the case in the multi-rate filtering scheme. Here, we develop one CUDA kernel for templates with ≤ 32 filters (a warp-based kernel) and one for templates with > 32 filters (a block-based kernel). The guideline of our thread configuration is: number of threads = multiple of warp size. This is considered optimal as it helps avoid idle threads. For a template with N > 32 filters, we assign M threads to this template, where M is N rounded up to the next multiple of 32. For a template with N ≤ 32 filters, we assign M threads, where M is N rounded up to the next power of 2. Multiple templates may be executed in one warp if they fit into the warp. For instance, a template with 513 filters will be assigned 544 threads, while a template with 5 filters will be assigned 8 CUDA threads, so that four such small templates will be executed within a single warp. With this assignment, the number of idle threads is guaranteed to be at most 31 in the worst case.

Figure 2: Schematic of the SPIIR template-filters hierarchy mapped onto the CUDA block-warp-thread hierarchy. 'Mul' means 'Multiple' in this figure. N is the number of filters of any given SPIIR template.

We use the Maxwell-architecture GeForce GTX 980 GPU as our test machine; each of its SMs supports a maximum of 2048 active threads. To maximize the number of active threads in each SM, the minimum number of threads in one block (the maximum number of blocks per SM is 32) must be no smaller than 2048/32 = 64. For our warp-based CUDA kernel, we chose to use 256 threads for each block, as recommended in the CUDA Programming Guide [14]. For our block-based CUDA kernel, only one template is mapped to a single block.
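The thread-assignment rule above can be summarized by a short host-side helper (our illustration; the function name is hypothetical):

// Threads assigned to one template of N filters: round up to the next
// multiple of the warp size (32) for N > 32, and to the next power of two
// for N <= 32 so that several small templates can share one warp.
int threadsForTemplate(int N)
{
    if (N > 32)
        return ((N + 31) / 32) * 32;   // next multiple of 32
    int m = 1;
    while (m < N)
        m <<= 1;                       // next power of 2, m >= N
    return m;
}
// Examples: threadsForTemplate(513) == 544; threadsForTemplate(5) == 8,
// so four 5-filter templates (4 x 8 = 32 threads) fill one warp.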
Data access is a critical aspect that can greatly affect GPU performance. For unoptimized GPU programs, memory access may take longer than the core calculation. We analyzed the data access pattern of SPIIR filtering and applied an efficient data mapping method to improve the GPU performance. Three types of memory are investigated in our implementation.

Registers are the fastest accessible memory. They are local to individual threads and cannot be accessed by other threads. At the block level, there is shared memory, accessible by all the threads of the same block (not just those of the same warp). For the GTX 980 GPU, one Maxwell streaming multiprocessor features 96 KB of dedicated shared memory. Although shared memory allows broader memory access, it is much slower than registers and usually requires synchronization within the block. Global memory is located off the chip, and its speed is the slowest in the GPU memory hierarchy. A new feature introduced with the Kepler GK110 architecture (a predecessor of the Maxwell architecture), the read-only data cache, can be used to cache read-only global memory loads, which reduces the global memory access time. The GTX 980 has 4 GB of global memory shared by all threads. To achieve high global memory throughput, we applied a coalesced memory access technique, which can combine multiple memory accesses into as few as one cache transaction.

Fig. 3 is a schematic of how we organized and mapped the data of one SPIIR template onto the GPU memory hierarchy in order to achieve low-latency data access.
Figure 3: A schematic of how the data are mapped onto the GPU memory hierarchy to achieve low-latency data access for SPIIR filtering. The left part shows the color-coded GPU memory hierarchy: from top to bottom, the data access speed decreases but the data storage capacity increases. The right diagram illustrates the memory types created for the SPIIR filtering variables. $x_k$ is the input data to be filtered, $a, b$ are filter coefficients, $Y$ is the intermediate output of each filter, $Z^{-1}$ represents the iterative process, and SNR is the filtering result.

The output of the SPIIR method is given by Eq. 7 and Eq. 8. Due to the iterative nature of the IIR filter, the left-hand side of Eq. 7, $y_{k,l}$, is reused in the next iteration. As there is a huge number of iterations, we chose to store $y$ in a register to utilize the fastest memory access on the GPU. The input value $x_{k-d_l}$ cannot be reused by the same filter, but other filters of the same template may need to access it. The access pattern of $x_k$ has good temporal and spatial locality, so these values are stored in the CUDA read-only data cache. Other input parameters, such as $a_{1,l}$ and $b_{0,l}$, cannot be reused, so they are stored in the slower global memory, for which we utilize the efficient coalesced memory access feature. Finally, the final SNR outputs are stored in global memory.

Obtaining the final SNR in Eq. 8 requires a summation over the results of all filters. This implies synchronization when individual results are output in parallel. Previously, we used the implicit synchronization feature of the CUDA warp to significantly reduce the cost of parallel thread synchronization, and a multiple-thread parallel sum reduction method to reduce the cost of the summation steps [12]. Here, the summation of all results within a warp is further improved by the warp-shuffle technique introduced with the Kepler microarchitecture GPUs. The warp-shuffle technique allows threads to read registers of other threads in the same warp. This is a great improvement over the previous, higher-latency shared memory exchange within a warp. A sketch combining this memory mapping with a warp-shuffle summation is given below.
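The fragment below is a minimal sketch of this mapping for the warp-based kernel (our illustration with hypothetical names and a simplified layout, not the pipeline's actual kernel): the filter state stays in registers, the input is loaded through the read-only data cache, the coefficients are read with coalesced global loads, and the per-filter outputs are summed with warp shuffles. We use the Maxwell-era __shfl_down(); CUDA 9 and later renamed it __shfl_down_sync().

// One thread per SPIIR filter; the 32 threads of a warp cooperate on the sum
// (for simplicity, one template per warp is assumed here). x is assumed to
// point past a history buffer long enough for every delay d[f].
__global__ void spiir_warp_kernel(const float* __restrict__ x,   // whitened input
                                  const float2* a1, const float2* b0,
                                  const int* d, int nsamples,
                                  float2* snr)                    // one series per warp
{
    int f    = blockIdx.x * blockDim.x + threadIdx.x;  // filter index: a1/b0/d
    float2 c = a1[f], g = b0[f];                       // loads are coalesced
    int dl   = d[f];
    float2 y = make_float2(0.f, 0.f);                  // filter state in registers

    for (int k = 0; k < nsamples; ++k) {
        float xin = __ldg(&x[k - dl]);                 // read-only data cache
        // Eq. (7), complex update y = a1*y + b0*x (x is real)
        y = make_float2(c.x * y.x - c.y * y.y + g.x * xin,
                        c.x * y.y + c.y * y.x + g.y * xin);

        // Eq. (8): warp-shuffle tree reduction of the 32 partial outputs
        float zre = y.x, zim = y.y;
        for (int off = warpSize / 2; off > 0; off >>= 1) {
            zre += __shfl_down(zre, off);
            zim += __shfl_down(zim, off);
        }
        if ((threadIdx.x & 31) == 0)                   // lane 0 stores the SNR
            snr[(f / warpSize) * nsamples + k] = make_float2(zre, zim);
    }
}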
For our block-based CUDA kernel, where the template size is larger than 32, the output cannot be calculated without cross-warp communication. We consider three different cross-warp summation methods to calculate the final SNR from the partial SNRs.

The first method is the Direct Atomic (DA) method, which applies atomic operations directly in global memory; on Maxwell GPUs, the atomic operations have been improved significantly.

The second method is the Shared memory Warp-shuffle (SW) method. It collects the partial SNRs of N iterations into the shared memory of one warp (the batched computation model proposed in [12]) and performs the warp-shuffle operation for the final SNR. We therefore need additional shared memory operations to calculate the final SNR from the partial SNRs, plus one explicit synchronization operation to synchronize all the threads.

The last method is the Shared memory Atomic summation (SA) method. It collects partial SNRs into the shared memory of one warp and performs atomic operations to compute the final SNR. This method involves the same shared memory loading overhead and synchronization overhead as the SW method. Later we will show that the simple atomic operations of the DA method give the best performance compared with the SW and SA methods; a minimal sketch of the DA step follows.
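The sketch below is our illustration of the DA step (names are hypothetical): after each warp has reduced its partial SNR with warp shuffles, lane 0 of every warp accumulates directly into the final SNR in global memory with one atomic add per component, requiring no shared memory and no __syncthreads():

// Cross-warp combination, DA method: zre/zim hold this warp's reduced
// partial SNR; lane 0 adds it atomically to the global accumulator snr_k.
__device__ void da_combine(float zre, float zim, float2* snr_k)
{
    if ((threadIdx.x & 31) == 0) {     // one thread per warp
        atomicAdd(&snr_k->x, zre);     // real part
        atomicAdd(&snr_k->y, zim);     // imaginary part
    }
}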
Figure 4: Schematic of the downsampling data hierarchy mapped onto the CUDA block-thread hierarchy. 'Mul' means 'Multiple', and '1s' means 'one second' in this figure.
Figure 5: Schematic of the downsampling GPU kernel function and how data are mapped onto the CUDA GPU memory hierarchy. Each color represents a type of GPU memory. Green: shared memory; red: register memory; blue: global memory.
We optimize the usage of CUDA threads and the various types of memory for the downsampling process as in Fig. 4 and Fig. 5. As shown in Fig. 4, we map the production of one downsampled data point to one CUDA thread. The number of downsampled data points can vary over a power-of-2 series ranging from 32 to 2048. For downsampling with more output data points than GPU cores, the GPU cores will be fully occupied. It is inevitable that for some sub-rate downsamplings there will be idle GPU cores. As downsampling is the least computational process, being performed only once for all templates in our multi-rate scheme, we do not further optimize for the idle cores. The number of blocks allocated is $N/\min\{1024, N\}$, where $N$ is the number of downsampled points. For the three types of GPU memory, we map the input and the downsampling Kaiser filter to shared memory, as they will be reused a number of times. The Kaiser filtering is calculated iteratively, and the intermediate filtering result is stored in register memory for the fastest data access. Finally, the global memory stores the downsampling output.

Figure 6: Schematic of the upsampling input data hierarchy mapped onto the CUDA block-thread hierarchy. 'Mul' means 'Multiple', and '1s' means 'one second' in this figure.

Figure 7: Schematic of the upsampling-and-combination GPU kernel function and how data are mapped onto the CUDA GPU memory hierarchy. The left part shows the GPU memory hierarchy with the color code explained in Fig. 5.

Similarly, Fig. 6 and Fig. 7 show our CUDA design for the combined function of upsampling and upstream summation. As a one-second SNR series has a real and an imaginary component, we map each component of the SNR series to one CUDA block. Each thread is mapped to $2N/\min\{1024, N\}$ upsampled SNR points, where $N$ is the total number of SNR points in a second. Unlike the downsampling, which is performed only once, the upsampling and upstream summation must be performed for each template, so the number of blocks is twice the number of templates. The number of threads will most likely exceed the number of GPU cores, ensuring full occupancy of the GPU cores. For the memory mapping, the upsampling Kaiser filter is mapped to shared memory and the intermediate filtering output to register memory, the same as in the downsampling memory mapping. The input SNR series cannot fit in shared memory, so it is stored in GPU global memory. A sketch of the downsampling kernel follows.
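Below is a minimal sketch of the downsampling kernel under this mapping (our illustration; the names, the fixed tap count, and the padding convention are assumptions). Each thread produces one output point by applying the Kaiser filter taps, staged in shared memory together with the block's input window, and accumulating in a register:

#define TAPS 384   // length of the ~100 dB Kaiser downsampling filter (per the text)

// Downsample by 2, one thread per output point. Launch with dynamic shared
// memory of (TAPS + 2*blockDim.x + TAPS) * sizeof(float); 'in' is assumed
// padded so every block can read its full window.
__global__ void downsample2x(const float* __restrict__ in,
                             const float* __restrict__ taps,
                             float* __restrict__ out, int nout)
{
    extern __shared__ float sh[];
    float* s_taps = sh;            // filter taps, reused by all threads
    float* s_in   = sh + TAPS;     // this block's input window

    int base = blockIdx.x * blockDim.x;            // first output of this block
    for (int i = threadIdx.x; i < TAPS; i += blockDim.x)
        s_taps[i] = taps[i];
    for (int i = threadIdx.x; i < 2 * blockDim.x + TAPS; i += blockDim.x)
        s_in[i] = in[2 * base + i];
    __syncthreads();

    int m = base + threadIdx.x;                    // global output index
    if (m >= nout) return;

    float acc = 0.f;                               // register accumulator
    for (int t = 0; t < TAPS; ++t)                 // FIR dot product
        acc += s_taps[t] * s_in[2 * threadIdx.x + t];
    out[m] = acc;                                  // global memory output
}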
4. Results of multi-rate SPIIR filtering
In this section, we first show the GPU performance of SPIIR filtering and the improvement from each optimization step (Sec. 4.1). In the second part (Sec. 4.2), we show the GPU performance of several scenarios of multi-rate SPIIR filtering. This promises a lucrative performance improvement by GPU acceleration for our SPIIR pipeline.

All GPU implementations are tested on a desktop computer equipped with a GeForce GTX 980 (Maxwell microarchitecture) GPU; Tab. 1 shows its configuration. As the CPU counterparts are implemented in a single-CPU-thread fashion, we use the elapsed time as our performance measurement to reflect the usage of CPU resources. We run each experiment ten times and use the average time as the timing result. We set the number of templates to 4096 for performance-testing purposes in this section. Note that the number of templates for a search can range from a few hundred to hundreds of thousands.
Table 1: Testbed configuration

Hardware
  CPU: Intel Core i7-3770, 3.40 GHz
  Host memory: 8 GB DDR3
  GPU: NVIDIA GeForce GTX 980
  GPU memory: 4 GB GDDR5
Software
  Operating system: Fedora 20 64-bit
  Host compilation: gcc 4.8.3 -O2
  CUDA compilation: nvcc 6.5
4.1. GPU-accelerated SPIIR filtering

We first show the overall improved performance of our new GPU acceleration (denoted New Kernel) over the previous GPU acceleration [12] (denoted Pre-Kernel). We then show the breakdown of the improvement gained by exploiting new features of the Maxwell GPU.

New Kernels and Pre-Kernels are tested against the filtering using a single-core CPU. Tab. 2 shows the speedup ratios of the New Kernels in comparison to the Pre-Kernels for different numbers of filters. For numbers of filters ≥ 32, where both kernels use the same thread configuration, the improvement of the New Kernels over the Pre-Kernels increases with the number of filters, as shown by Tab. 2. This is mainly because the New Kernels improve the data access and exchange speed, the effect of which is more manifest when more filters are used. For the previously targeted optimization range (i.e. 128 to 256 filters), the New Kernels have about a 3-fold speed improvement over the Pre-Kernels.
For templates whose size (i.e. number of filters) is relatively low (template size < 32), the New Kernels use the new thread configuration and achieve up to a 10-fold speedup.

Table 2: Speedup ratios of different GPU kernels compared with the CPU counterpart.

Template size | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512
New Kernel    |
Pre-Kernel    |

Figure 8: Performance comparison of parallel summation reduction between shared memory and warp-shuffle, using 4096 templates.

Figure 9: Memory access performance comparison between caching parameter x in the L2 cache and in the read-only data cache, using 4096 templates.
Fig. 8 shows the performance of two different implementations, with and without warp-shuffle. The first implementation uses warp-shuffle to access data within a warp to calculate the summation. The second implementation uses shared memory to access data and calculate the summation. It shows that the warp-shuffle summation performs significantly faster than the shared memory summation.

Fig. 9 shows the effect of using the read-only data cache. The Maxwell GPU loads its global memory data into the L2 cache rather than the L1 cache. In Fig. 9, 'GPU L2 cache' denotes the kernel performance when parameter x of Eq. 7 is kept in the L2 cache, the default cache for global memory. Because variables such as x of Eq. 7 have favorable temporal and spatial locality, using the read-only data cache is very efficient, as shown in the figure.

Fig. 10 illustrates the performance of the block-based SPIIR kernels using the three different cross-warp summation methods: DA (directly using atomic operations in global memory), SW (using warp-shuffle and shared memory), and SA (using atomic operations in shared memory).
Figure 10: Execution time of different cross-warp summation methods with 4096 templates in different execution configurations.

To ensure an objective comparison, we attempted to optimize each implementation as much as possible; GPU bank conflicts are entirely avoided in the SW and SA methods, where shared memory usage is involved. As one would expect shared memory access to be much faster than global memory access, the SW and SA methods could be expected to be faster than the DA method. Surprisingly, the DA method surpasses the other two methods, as shown in Fig. 10.

We designed two additional experiments to find the reason for the inferior performance of the shared memory methods (SW and SA): whether it is the cost of the shared memory accesses or the explicit synchronization. To reduce the total synchronization cost, both the SW and SA methods take advantage of shared memory to execute many iterations before one explicit synchronization operation. Fig. 11 clearly illustrates that the cost of synchronization can be significantly reduced, amortized over many iterations: the larger the iteration number, the better the performance. However, once the iteration number reaches a certain point, the performance benefit stays almost unchanged as the iteration number continues to grow. Fig. 12 shows that the reason is that the synchronization overhead is completely hidden by calculation in such circumstances. In Fig. 12 we disabled the synchronization operation in these shared memory GPU kernels to observe the performance impact of explicit synchronization at a sufficiently large iteration number. Though the computation result may not be correct, the comparison itself does show the impact of explicit synchronization. As can be seen, the performance with and without synchronization is almost indistinguishable, illustrating that the overhead of explicit synchronization can be removed entirely by increasing the iteration number. Therefore, it is the additional shared memory operations introduced in the SW and SA methods that lead to their lower performance compared with the DA method.

The impact of atomic operations is also tested for the DA method. Atomic operations are typically considered costly and to be avoided whenever possible; however, Fig. 13 shows that the cost of the atomic operations can almost be ignored, because the execution times of the two kernels with and without atomic operations are very close. All these experiments explain why the DA method is better than the SW and SA methods.

Figure 11: The optimization effect of changing the iteration number. A template size of 256 is used (with 4096 templates).

Figure 12: The overhead of explicit synchronization can be ignored when a large iteration number is used. An iteration number of 256 is used here (with 4096 templates).

Figure 13: Execution time of atomic and non-atomic kernels with 4096 templates in different execution configurations.
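For reference, here is a minimal sketch of the batching idea probed by these experiments (our illustration; names are hypothetical): partial results are accumulated in shared memory over a batch of iterations before a single __syncthreads(), so the synchronization cost is amortized as in Fig. 11:

#define ITER 256   // iterations batched between synchronizations (cf. Fig. 12)

// Illustrative SW-style fragment: a block stages ITER partial results in
// shared memory, synchronizes once per batch, and lets its first warp
// finish the sum with warp shuffles.
__global__ void sw_batched(const float* partial, float* out, int nbatches)
{
    __shared__ float buf[ITER];
    int lane = threadIdx.x & 31;

    for (int b = 0; b < nbatches; ++b) {
        for (int i = threadIdx.x; i < ITER; i += blockDim.x)
            buf[i] = partial[b * ITER + i];    // stage one batch
        __syncthreads();                       // one sync per ITER values

        if (threadIdx.x < 32) {                // first warp reduces the batch
            float s = 0.f;
            for (int i = lane; i < ITER; i += 32)
                s += buf[i];
            for (int off = 16; off > 0; off >>= 1)
                s += __shfl_down(s, off);      // warp-shuffle tree sum
            if (lane == 0)
                out[b] = s;
        }
        __syncthreads();                       // protect buf for the next batch
    }
}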
4.2. GPU-accelerated multi-rate SPIIR filtering

The results shown in the last subsection are for SPIIR filtering at a single rate. Here we show the performance of the GPU-accelerated multi-rate implementation of SPIIR filtering, which includes the GPU-accelerated rate alterations. We test several multi-rate scenarios, denoted by the "number of rates" in Tab. 3. A "number of rates" of 1 in the table denotes the single-rate case at the full rate (4096 Hz). The number of filters at the full rate for the 4096 templates is set to 1024. The sub-rates are formed according to the design shown in Sec. 2.2. The initial filters are divided equally among the sub-rates for testing purposes. Tab. 3 shows the elapsed times for the CPU and the GPU-accelerated multi-rate filtering in the different scenarios. The efficiency of our multi-rate design shown here is consistent with the expectation given in Sec. 2.
Table 3: Elapsed times of the CPU and GPU-accelerated multi-rate filtering scheme in different scenarios.

Number of rates        | 1      | 2      | 3      | 4      | 8
CPU multi-rate scheme  | 34815s | 26379s | 20817s | 16680s | 8841s
SPIIR filtering        | 34815s | 26187s | 20517s | 16342s | 8453s
GPU multi-rate scheme  | 287s   | 220s   | 199s   | 143s   | 85s
Speedup                | 121x   | 120x   | 104x   | 116x   | 104x
5. Results of the GPU-accelerated low-latency SPIIR pipeline
Figure 14: Schematic flowchart of the low-latency SPIIR pipeline gstlal_iir_inspiral with data input from two detectors. Solid blocks depict main components of this pipeline. Dashed blocks depict external processes.
The low-latency SPIIR detection pipeline we consider here is publicly available through the distribution of the gstlal software library+. gstlal provides a variety of components from the LIGO Algorithm Library (LAL) for LIGO data processing and uses the gstreamer framework to control streaming data. We use the existing gstlal components for reading calibrated data, data whitening, and coincident post-processing in our pipeline, and our own multi-rate SPIIR filtering for template filtering. The pipeline has the code name gstlal_iir_inspiral in gstlal. A schematic flowchart of the pipeline is shown in Fig. 14.

The gstreamer framework inherently employs multi-threading to take advantage of the multiple cores of a CPU, so the CPU implementation of the SPIIR pipeline is multi-threaded by default. We therefore propose a different criterion, the CPU time used by all CPU cores, rather than the elapsed time of a single-threaded CPU implementation, to measure the performance of the GPU-accelerated pipeline versus the original CPU pipeline.

+ gstlal library: https://wiki.ligo.org/DASWG/GstLAL

Table 4: Computation profiling of the CPU low-latency SPIIR pipeline gstlal_iir_inspiral.

                Multi-rate SPIIR filtering               | Other
SPIIR filtering | Resampling | Upstream summation        |
93%             | 3.1%       | 0.9%                      | 3%

We present the CPU time used by the main components of the CPU pipeline in percentages in Tab. 4. This computation profiling is performed on the platform given in Tab. 1. The data we process is a short segment of recolored LIGO 5th Science Run data, which represents a set of clean data and produces a reasonable number of single events for the coincidence analysis. The SPIIR filtering dominates the computation, taking over 93% of the CPU time, while the resampling and summation come next at 4%.

We can now estimate the expected CPU resource reduction of the pipeline if part or all of the multi-rate filtering component is executed on a GPU instead of a CPU. The expected resource reduction $\kappa_{app}$ is determined by the resource reduction $\kappa_{mod}$ brought by the accelerated module and the computational fraction $P_{mod}$ of this module in the pipeline. It is given by:

$\kappa_{app} = \frac{1}{(1 - P_{mod}) + P_{mod}/\kappa_{mod}}$.   (10)

If we apply GPU acceleration only on the SPIIR filtering module of our pipeline, taking an approximate 100-fold speedup from Tab. 2 for this component ($P_{mod} = 0.93$, $\kappa_{mod} = 100$), the total resource reduction ratio of the pipeline will be no more than $1/(0.07 + 0.0093) \approx 13$-fold. If, on top of that, we apply GPU acceleration to the resampling and upstream summation components, taking an approximate 100-fold speedup from Tab. 3 for the whole multi-rate SPIIR filtering ($P_{mod} = 0.97$), the total resource reduction of the pipeline will be about $1/(0.03 + 0.0097) \approx 25$-fold.

We measure the CPU resource reduction efficiency and also the elapsed-time gain from GPU acceleration of the multi-rate SPIIR filtering. The setup of the performance test is explained in detail here. We simulate two sets of data using Advanced LIGO noise for the twin LIGO detectors, respectively; the duration of each data set is 1000 seconds. A simulated binary neutron star coalescence GW signal is injected coherently into these data sets. To prepare the templates used by our pipeline for the search, we generate a template bank for each detector, covering the parameters of the injection. Each bank consists of 1024 geometric templates. We generate SPIIR filters for each bank and divide the filters into four groups corresponding to the four designated sub-rates.
We insert filters with zero coefficients into each group so that the number of filters of each group in a bank is a power of 2. While the filter insertion is not necessary for the search, it is done here for performance-testing purposes. The number of SPIIR filters of each group in each bank is shown in Tab. 5.

Table 5: Number of filters in each frequency band of a bank and the data sampling rates required by the Nyquist theorem.

Sampling rate (Hz)  | 4096   | 2048   | 1024  | 512
Frequency band (Hz) | < 2048 | < 1024 | < 512 | < 256

Tab. 6 shows the CPU resource reduction and the elapsed-time gain of the GPU-accelerated SPIIR pipeline. It achieves a 21-fold improvement in CPU resource reduction, which is close to our expectation shown earlier. Besides, we have significantly reduced the running time of the pipeline, by 11-fold. The difference of the SNRs between the GPU pipeline and the CPU pipeline on the injection event is within 0.

Table 6: Performance of the low-latency SPIIR pipeline using CPU power only (CPU pipeline) and with GPU acceleration of multi-rate SPIIR filtering (GPU pipeline).

Pipeline     | Used CPU time | CPU resource reduction ratio | Elapsed time | Speed-up
CPU pipeline | 31610s        | 1x                           | 4560s        | 1x
GPU pipeline | 1520s         | 21x                          | 430s         | 11x
6. Conclusions and Future Work
Low-latency and real-time detection of gravitational-wave (GW) signals is gaining priority for its potential to enable prompt electromagnetic follow-up observations. The low-latency SPIIR detection pipeline is a CBC detection pipeline with latencies of tens of seconds.

In this paper, we developed GPU acceleration for the main computational component of the SPIIR pipeline, the multi-rate SPIIR filtering. We first improved the GPU optimization of the filtering part by employing a new kind of GPU thread configuration and exploiting new memory features of Maxwell GPUs. This provides a notable improvement over our previous implementation on Fermi GPUs. We then implemented GPU acceleration of the resampling and upstream summation parts of the multi-rate filtering procedure. Our tests show that the multi-rate filtering is accelerated by over 100-fold across the different filtering scenarios. This leads to a 21-fold CPU resource reduction for the entire pipeline and an 11-fold reduction in elapsed time.
Acknowledgments
We would like to thank Maurice H. P. M. van Putten for discussions and comments on the details of the paper. This research is supported in part by the National Natural Science Foundation of China (Grants No. 61440057, 61272087, 61363019 and 61073008), the Beijing Natural Science Foundation (Grants No. 4082016 and 4122039), the Sci-Tech Interdisciplinary Innovation and Cooperation Team Program of the Chinese Academy of Sciences, the Specialized Research Fund for State Key Laboratories, and the Australian Research Council Discovery Grants and Future Fellowship programs. QC gratefully acknowledges the support of an International Postgraduate Research Scholarship funded by the Australian Government.
References

[1] B. P. Abbott, R. Abbott, T. D. Abbott, et al. Observation of gravitational waves from a binary black hole merger. Phys. Rev. Lett., 116:061102, Feb 2016.
[2] Bangalore Suryanarayana Sathyaprakash and Bernard F. Schutz. Physics, astrophysics and cosmology with gravitational waves. Living Reviews in Relativity, 12(2):18–19, 2009.
[3] Maurice H. P. M. van Putten, Gyeong Min Lee, Massimo Della Valle, Lorenzo Amati, and Amir Levinson. On the origin of short GRBs with extended emission and long GRBs without associated SN. Monthly Notices of the Royal Astronomical Society: Letters, 444(1):L58–L62, 2014.
[4] Q. Chu, E. J. Howell, A. Rowlinson, et al. Capturing the electromagnetic counterparts of binary neutron star mergers through low-latency gravitational wave triggers. MNRAS, 459:121–139, June 2016.
[5] B. J. Owen and B. S. Sathyaprakash. Matched filtering of gravitational waves from inspiraling compact binaries: Computational cost and template placement. PRD, 60(2):022002, July 1999.
[6] B. Allen, W. G. Anderson, P. R. Brady, D. A. Brown, and J. D. E. Creighton. FINDCHIRP: an algorithm for detection of gravitational waves from inspiraling compact binaries. ArXiv General Relativity and Quantum Cosmology e-prints arXiv:gr-qc/0509116, September 2005.
[7] D. Buskulic, Virgo Collaboration, and LIGO Scientific Collaboration. Very low latency search pipeline for low mass compact binary coalescences in the LIGO S6 and Virgo VSR2 data. Classical and Quantum Gravity, 27(19):194013, October 2010.
[8] K. Cannon, R. Cariou, A. Chapman, et al. Toward early-warning detection of gravitational waves from compact binary coalescence. ArXiv e-prints arXiv:1107.2665, July 2011.
[9] J. Luan, S. Hooper, L. Wen, and Y. Chen. Towards low-latency real-time detection of gravitational waves from compact binary coalescences in the era of advanced detectors. ArXiv e-prints arXiv:1108.3174, August 2011.
[10] S. Hooper, L. Wen, D. Blair, et al. Low-latency detection of gravitational waves. In American Institute of Physics Conference Series, volume 1246, pages 211–214, June 2010.
[11] S. Hooper, S. K. Chung, J. Luan, et al. Summed parallel infinite impulse response filters for low-latency detection of chirping gravitational waves. PRD, 86(2):024012, July 2012.
[12] Yuan Liu, Zhihui Du, Shin Kee Chung, et al. GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence. Classical and Quantum Gravity, 29(23):235018, 2012.
[13] H. S. Black. Modulation Theory. Van Nostrand, New York, 1953.
[14] NVIDIA. NVIDIA CUDA C Programming Guide. NVIDIA Corporation, 120, 2011.
[15] Khronos OpenCL Working Group et al. The OpenCL specification. Version, 1(29):8, 2008.
[16] Kate Gregory and Ade Miller. C++ AMP: Accelerated Massive Parallelism with Microsoft® Visual C++®. Microsoft Press, 2012.