Acceleration of low-latency gravitational wave searches using Maxwell-microarchitecture GPUs
Xiangyu Guo, Qi Chu‡, Shin Kee Chung, Zhihui Du§ and Linqing Wen‖

Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
School of Physics, The University of Western Australia, M468, 35 Stirling Hwy, Crawley, WA 6009, Australia
Abstract.
Low-latency detections of gravitational waves (GWs) are crucial to enable prompt follow-up observations of astrophysical transients by conventional telescopes. We have developed a low-latency pipeline using a technique called Summed Parallel Infinite Impulse Response (SPIIR) filtering, realized on a Graphics Processing Unit (GPU). In this paper, we exploit the new Maxwell memory access architecture in NVIDIA GPUs, namely the read-only data cache, warp-shuffle, and cross-warp atomic techniques. We report a 3-fold speed-up over our previous implementation of this filtering technique. To tackle SPIIR with relatively few filters, we develop a new GPU thread configuration with a nearly 10-fold speedup. In addition, we implement a multi-rate scheme of SPIIR filtering using Maxwell GPUs. We achieve more than 100-fold speed-up over a single-core CPU for the multi-rate filtering scheme. This results in an overall 21-fold CPU usage reduction for the entire SPIIR pipeline.

PACS numbers: 04.80.Nn, 95.75.-z, 97.80.-d, 97.60.Gb
1. Introduction
We are entering an exciting time of gravitational wave (GW) astronomy. The first GW signal was detected in September 2015 by the Laser Interferometer Gravitational-Wave Observatory (LIGO) [1]. This opens up a new window to multi-messenger astronomy with unprecedented power of discovery, probing some of the most enigmatic transients in the sky through their emissions in both the electromagnetic and gravitational wave spectra, e.g., short or long gamma-ray bursts produced in binary coalescences and core-collapse supernovae [2, 3]. In particular, low-latency detection and localization of GW sources are gaining priority in order to enable prompt electromagnetic (EM) follow-up observations of GW sources. Capture of transient EM events triggered by low-latency GW alerts has been studied in, e.g., [4].

‡ Contributed equally to this work with Xiangyu Guo. Corresponding Author's Email: [email protected]
§ Corresponding Author's Email: [email protected]
‖ Corresponding Author's Email: [email protected]

Low-latency GW searches are currently implemented in several CBC search pipelines, which additionally allow input from multiple detectors and employ template-based statistical tests to veto transient noise and produce GW alerts to the community. As of this writing, these pipelines have achieved a medium latency of less than one minute. This paper is focused on the SPIIR method. Compared to the FFT technique, it is expected to be more efficient at latencies lower than tens of seconds for advanced detectors [9].

Our previous work on the GPU-accelerated SPIIR filtering method used NVIDIA's Fermi GPUs, with a speedup of the order of 50-fold over a single-core Intel i7 CPU [12]. However, that GPU optimization targeted SPIIR filtering with 128 to 256 filters. In the multi-rate filtering scheme, the total set of filters is split across data streams at different sampling rates, so the number of filters at some sampling rates can be as small as a few. In this paper, we extend the GPU optimization of SPIIR filtering to various numbers of filters and explore the features of the recent Maxwell GPUs. Our optimization here achieves a 3-fold improvement over the previously targeted range and up to a 10-fold speedup beyond that range, compared to the previous work. Compared to SPIIR filtering on a single-core CPU, our GPU acceleration is now 60 to 125 times faster, depending on the number of filters to be applied.

In addition to the SPIIR filtering, we have extended the GPU acceleration to other components of the SPIIR pipeline. Another computational bottleneck of the SPIIR pipeline is the sampling-rate alteration of multi-rate filtering. In multi-rate filtering, a set of filters is applied to data at different sampling rates, the results of which are combined back at the initial rate. Our acceleration of the multi-rate filtering scheme realizes a 100-fold reduction in CPU resources for this component, and hence a 21-fold reduction in CPU resources for the entire SPIIR pipeline. The filtering process in this paper uses the single-precision floating-point format, the results of which differ negligibly from those using the double-precision format.
2. Multi-rate SPIIR filtering
Matched filtering is known to be the optimal detection method for deep searches for signals in stationary Gaussian noise. Here, we consider a CBC waveform template in the time domain, $h(t)$. The optimal detection output, known as the signal-to-noise ratio (SNR), is given by

$z(t) = \int_{-\infty}^{t} x(t')\, h(t' - t)\, dt'$,   (1)

i.e. the cross-correlation between the whitened waveform template $h(t)$ and the whitened detector strain $x(t)$, which is given by

$x(t) = \int_{-\infty}^{\infty} \frac{\tilde{s}(f)}{\sqrt{S_n(f)}}\, e^{2\pi i f t}\, df$,   (2)

where $\tilde{s}(f)$ is the Fourier transform of the detector data $s(t)$ and $S_n(f)$ is the one-sided noise spectral density of a detector, defined through the expectation $E$:

$E(\tilde{n}(f)\,\tilde{n}^*(f')) = \frac{1}{2} S_n(f)\, \delta(f - f')$.   (3)

By sampling at discrete instances $t = k\Delta t$ at a sampling rate $1/\Delta t$ ($k = 0, 1, \cdots$), Eq. 1 can be rewritten in a discrete form,

$z_k = \sum_{j=-\infty}^{k} x_j\, h_{j-k}\, \Delta t$.   (4)

CBC searches usually adopt the matched filter $h^*(-t)$, which turns Eq. 1 into a convolution integral. Quite different from the common FIR method, the SPIIR method utilizes IIR filters to reconstruct the matched filter. A first-order IIR filter has the simple form shown in Eq. 5 and Eq. 6:

$y_k = a_1 y_{k-1} + b_0 x_k$,   (5)

where $y_k$ is the filter output at time step $k$ ($t_k = k\Delta t$), $x_k$ is the filter input, and $a_1$ and $b_0$ are complex coefficients. A solution to this first-order linear inhomogeneous difference equation is

$y_k = \sum_{j=-\infty}^{k} x_j\, b_0\, a_1^{k-j}$.   (6)

For a target as complex as a CBC signal, a group of first-order IIR filters is constructed, with each filter representing a small segment of the matched filter [9, 11]. The number of IIR filters needed to reconstruct a highly accurate matched filter varies from a dozen to several hundred, depending on the complexity of the waveform and the limit of the detection band. The filter construction procedure can be found in [9, 11]. Here, we express the output of a SPIIR filter as

$y_{k,l} = a_{1,l}\, y_{k-1,l} + b_{0,l}\, x_{k-d_l}$,   (7)

where $y_{k,l}$ is the output of the $l$th filter and $d_l$ is the time delay for this filter. The discrete form of the SNR output from a group of SPIIR filters is given by

$z_k \simeq \sum_{l} y_{k,l}$.   (8)

Figure 1: Schematic diagram of the multi-rate implementation of the SPIIR filtering scheme. The input and output sampling rates are $R$ Hz. $M_1, M_2, \ldots, M_H$ represent the numbers of filters to be applied at the corresponding rates.

Our implementation of the multi-rate filtering scheme for GPU acceleration is shown in Fig. 1. For implementation convenience, the data is downsampled by factors of 2 in succession. Each sample-rate stream is filtered using the corresponding SPIIR filters. The filtering output of the lowest rate is upsampled and added to the filtering output of the next rate up, and so on, until the full-rate SNR is recovered.
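To make Eqs. 7 and 8 concrete, the following minimal serial sketch (plain C++; the function name and data layout are ours, not the pipeline's actual code) advances every filter state by one sample and accumulates the complex SNR:

#include <complex>
#include <vector>

// One SPIIR step, Eqs. (7)-(8): each first-order IIR filter l keeps a single
// state value y[l]; the SNR sample z_k is the sum of all filter outputs.
// Assumes k - d[l] >= 0, i.e. x holds enough history for every delay d[l].
std::complex<float> spiir_step(const std::vector<std::complex<float> >& a1,
                               const std::vector<std::complex<float> >& b0,
                               const std::vector<int>& d,   // per-filter delays d_l
                               const float* x, int k,       // whitened input, step k
                               std::vector<std::complex<float> >& y)
{
    std::complex<float> z(0.0f, 0.0f);
    for (std::size_t l = 0; l < a1.size(); ++l) {
        y[l] = a1[l] * y[l] + b0[l] * x[k - d[l]];  // Eq. (7)
        z += y[l];                                  // Eq. (8)
    }
    return z;
}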
The sub-rate at which each SPIIR filter is applied is determined by the $a_{1,l}$ coefficient of the filter. The $a_{1,l}$ coefficient determines the upper bound of the frequency band of the filter, and thus the Nyquist rate [13] for the filter to function. We round the Nyquist rate of each filter up to the nearest available rate for the filter to work on.

To avoid the known spectral-leakage problems caused by the squared window, we adopt the popular Kaiser-window-tapered low-pass filter implemented in the open-source gstreamer library¶ to perform the interpolation for resampling. The resampling formula is shown in Eq. 9,

$x(mT_s') = \sum_{n=-(K-1)}^{K} x(nT_s)\, \mathcal{K}(mT_s' - nT_s)$,   (9)

where $x$ is the data and $\mathcal{K}$ represents the Kaiser-window-tapered low-pass filter. $m$ and $n$ are discrete sampling points. $F_s = 1/T_s$ is the original sampling rate and $F_s' = 1/T_s'$ is the resampled rate. $x$ is assumed to be band-limited to $\pm F_s/2$, and $K$ is the length of the filter. We call the Kaiser-window-tapered low-pass filter the 'Kaiser filter' in the rest of the paper.

A single parameter controls the quality of a Kaiser filter, measured by its stop-band attenuation. The gstreamer library provides 11 Kaiser filters with stop-band attenuation of up to 100 decibels (dB). To avoid band aliasing in downsampling, we choose the Kaiser filter with a stop-band attenuation of ∼100 dB. For upsampling, we choose the Kaiser filter with a stop-band attenuation of ∼60 dB, which we found works most efficiently while maintaining signal recovery quality in practice.

¶ gstreamer library: http://gstreamer.freedesktop.org/

The expected computational efficiency of our multi-rate scheme can be estimated as follows. We denote the total number of SPIIR filters of a given template as $M$, the full rate under consideration as $R$, and the number of search templates as $N$. According to Eq. 7 and Eq. 8, filtering one data point requires 12 floating-point operations. Thus the total number of floating-point operations per second (flops) for SPIIR filtering at the full rate $R$ is $12NMR$. The 100 dB downsampling Kaiser filter has 384 steps in gstreamer and the 60 dB upsampling Kaiser filter has 32 steps; the cost of resampling and of summing the filtering results across sample rates is negligible in comparison. If half of the filters can be applied at the sub-rate $R/2$, the cost is reduced by 25%. A saving of a factor of a few in computational cost is expected if more filters can be applied at even lower rates.
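As a concrete check of this estimate (our arithmetic, using the quantities defined above): if half of the $M$ filters per template run at the full rate $R$ and the other half at the sub-rate $R/2$, the filtering cost becomes

$12N\frac{M}{2}R + 12N\frac{M}{2}\frac{R}{2} = 12NMR\left(\frac{1}{2} + \frac{1}{4}\right) = 0.75 \times 12NMR,$

i.e. 25% less than running all filters at the full rate.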
3. Optimization of Multi-rate SPIIR filtering on Maxwell GPUs
There are several interfaces for general-purpose GPU programming, including NVIDIA's Compute Unified Device Architecture (CUDA) [14] language, the Khronos Group's Open Computing Language (OpenCL) [15], and Microsoft's C++ Accelerated Massive Parallelism (C++ AMP) [16]. In this work we use CUDA, and we exploit new features of the Maxwell GPUs, namely the warp-shuffle and atomic operation techniques (Sec. 3.1.3).

We provide a general explanation of the relation between the GPU hardware and the CUDA semantics for reference here. A GPU chip consists of several Streaming Multiprocessors (SMs). The Maxwell GPU features an improved SM architecture renamed SMM. One SMM has many processing cores and is capable of creating, managing, scheduling, and executing CUDA threads. CUDA threads are executed in groups of 32, called "warps". A group of CUDA warps aggregates into a CUDA block. While one block is limited to one SMM, one SMM can have several blocks running concurrently. A GPU has a hierarchy of memory spanning a range of access speeds and storage sizes; the choice of memory to use may greatly affect the overall GPU performance.
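As a minimal illustration of this hierarchy (a toy kernel of ours, not part of the pipeline), each CUDA thread can derive its global index, its warp, and its lane within the warp as follows:

#include <cstdio>

// Toy kernel: every thread computes its global index, the warp it belongs to,
// and its lane (position within the 32-thread warp); lane 0 reports per warp.
__global__ void whoami()
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / warpSize;                // warp index within block
    int lane = threadIdx.x % warpSize;                // lane within the warp
    if (lane == 0)
        printf("block %d warp %d covers threads %d..%d\n",
               blockIdx.x, warp, gid, gid + warpSize - 1);
}

int main()
{
    whoami<<<2, 128>>>();       // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();    // wait for device printf to flush
    return 0;
}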
Our previous GPU optimization [12] considers SPIIR filtering with a few hundred filters. It is not highly optimized for filtering with a small number of filters, which is likely the case in the multi-rate filtering scheme. Here, we develop one CUDA kernel for templates with ≤ 32 filters (a warp-based kernel) and one for templates with > 32 filters (a block-based kernel). The guideline of our thread configuration is: number of threads = multiple of warp size. This is considered optimal as it helps avoid idle threads. For a template with N > 32 filters, we assign M threads to this template, where M is N rounded up to the next multiple of 32. For a template with N ≤ 32 filters, we assign M threads, where M is N rounded up to the next power of 2. Multiple templates may be executed in one warp if they fit into the warp. For instance, a template with 513 filters will be assigned 544 threads, while a template with 5 filters will be assigned 8 CUDA threads, so that four such small templates will be executed within a single warp. With this assignment, the number of idle threads is guaranteed to be at most 31 in the worst case.

Figure 2: Schematic of the SPIIR template-filters hierarchy mapped onto the CUDA block-warp-thread hierarchy. 'Mul' means 'Multiple' in this figure. N is the number of filters of any given SPIIR template.

We use the Maxwell-architecture GeForce GTX 980 GPU as our test machine; each of its SMs supports a maximum of 2048 active threads. To maximize the number of active threads in each SM, the minimum number of threads in one block (the maximum number of blocks per SM is 32) must be no smaller than 2048/32 = 64. For our warp-based CUDA kernel, we chose to use 256 threads for each block, as recommended in the CUDA Programming Guide [14]. For our block-based CUDA kernel, only one template is mapped to a single block.
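The thread-assignment rule above can be summarized by a short host-side helper (our illustration; the function name is hypothetical):

// Threads assigned to one template of N filters: round up to the next
// multiple of the warp size (32) for N > 32, and to the next power of two
// for N <= 32 so that several small templates can share one warp.
int threadsForTemplate(int N)
{
    if (N > 32)
        return ((N + 31) / 32) * 32;   // next multiple of 32
    int m = 1;
    while (m < N)
        m <<= 1;                       // next power of 2, m >= N
    return m;
}
// Examples: threadsForTemplate(513) == 544; threadsForTemplate(5) == 8,
// so four 5-filter templates (4 x 8 = 32 threads) fill one warp.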
Data access is a critical aspect that can greatly affect GPU performance. For unoptimized GPU programs, memory access may take longer than the core calculation. We analyzed the data access pattern of SPIIR filtering and applied an efficient data mapping method to improve the GPU performance. Three types of memory are investigated in our implementation.

Registers are the fastest accessible memory. They are local to individual threads and cannot be accessed by other threads. At the block level, there is shared memory, accessible by all the threads of the same block (not just those of the same warp). For the GTX 980 GPU, one Maxwell streaming multiprocessor features 96 KB of dedicated shared memory. Although shared memory allows broader memory access, it is much slower than registers and usually requires synchronization within the block. Global memory is located off the chip, and its speed is the slowest in the GPU memory hierarchy. A new feature introduced with the Kepler GK110 architecture (a predecessor of the Maxwell architecture), the read-only data cache, can be used to cache read-only global memory loads, which reduces the global memory access time. The GTX 980 has 4 GB of global memory shared by all threads. To achieve high global memory throughput, we applied a coalesced memory access technique, which can combine multiple memory accesses into as few as one cache transaction.

Fig. 3 is a schematic of how we organized and mapped the data of one SPIIR template onto the GPU memory hierarchy in order to achieve low-latency data access.
Figure 3: A schematic of how the data are mapped onto the GPU memory hierarchy to achieve low-latency data access for SPIIR filtering. The left part shows the color-coded GPU memory hierarchy: from top to bottom, the data access speed decreases but the data storage capacity increases. The right diagram illustrates the memory types created for the SPIIR filtering variables. $x_k$ is the input data to be filtered, $a, b$ are filter coefficients, $Y$ is the intermediate output of each filter, $Z^{-1}$ represents the iterative process, and SNR is the filtering result.

The output of the SPIIR method is given by Eq. 7 and Eq. 8. Due to the iterative nature of the IIR filter, the left-hand side of Eq. 7, $y_{k,l}$, is reused in the next iteration. As there is a huge number of iterations, we chose to store $y$ in a register to utilize the fastest memory access on the GPU. The input value $x_{k-d_l}$ cannot be reused by the same filter, but other filters of the same template may need to access it. The access pattern of $x_k$ has good temporal and spatial locality, so these values are stored in the CUDA read-only data cache. Other input parameters, such as $a_{1,l}$ and $b_{0,l}$, cannot be reused, so they are stored in the slower global memory, for which we utilize the efficient coalesced memory access feature. Finally, the final SNR outputs are stored in global memory.

Obtaining the final SNR in Eq. 8 requires a summation over the results of all filters. This implies synchronization when individual results are output in parallel. Previously, we used the implicit synchronization feature of the CUDA warp to significantly reduce the cost of parallel thread synchronization, and a multiple-thread parallel sum reduction method to reduce the cost of the summation steps [12]. Here, the summation of all results within a warp is further improved by the warp-shuffle technique introduced with the Kepler microarchitecture GPUs. The warp-shuffle technique allows threads to read registers of other threads in the same warp. This is a great improvement over the previous, higher-latency shared memory exchange within a warp. A sketch combining this memory mapping with a warp-shuffle summation is given below.
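The fragment below is a minimal sketch of this mapping for the warp-based kernel (our illustration with hypothetical names and a simplified layout, not the pipeline's actual kernel): the filter state stays in registers, the input is loaded through the read-only data cache, the coefficients are read with coalesced global loads, and the per-filter outputs are summed with warp shuffles. We use the Maxwell-era __shfl_down(); CUDA 9 and later renamed it __shfl_down_sync().

// One thread per SPIIR filter; the 32 threads of a warp cooperate on the sum
// (for simplicity, one template per warp is assumed here). x is assumed to
// point past a history buffer long enough for every delay d[f].
__global__ void spiir_warp_kernel(const float* __restrict__ x,   // whitened input
                                  const float2* a1, const float2* b0,
                                  const int* d, int nsamples,
                                  float2* snr)                    // one series per warp
{
    int f    = blockIdx.x * blockDim.x + threadIdx.x;  // filter index: a1/b0/d
    float2 c = a1[f], g = b0[f];                       // loads are coalesced
    int dl   = d[f];
    float2 y = make_float2(0.f, 0.f);                  // filter state in registers

    for (int k = 0; k < nsamples; ++k) {
        float xin = __ldg(&x[k - dl]);                 // read-only data cache
        // Eq. (7), complex update y = a1*y + b0*x (x is real)
        y = make_float2(c.x * y.x - c.y * y.y + g.x * xin,
                        c.x * y.y + c.y * y.x + g.y * xin);

        // Eq. (8): warp-shuffle tree reduction of the 32 partial outputs
        float zre = y.x, zim = y.y;
        for (int off = warpSize / 2; off > 0; off >>= 1) {
            zre += __shfl_down(zre, off);
            zim += __shfl_down(zim, off);
        }
        if ((threadIdx.x & 31) == 0)                   // lane 0 stores the SNR
            snr[(f / warpSize) * nsamples + k] = make_float2(zre, zim);
    }
}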
For our block-based CUDA kernel, where the template size is larger than 32, the output cannot be calculated without cross-warp communication. We consider three different cross-warp summation methods to calculate the final SNR from the partial SNRs.

The first method is the Direct Atomic (DA) method, which applies atomic operations directly in global memory; on Maxwell GPUs, the atomic operations have been improved significantly.

The second method is the Shared memory Warp-shuffle (SW) method. It collects the partial SNRs of N iterations into the shared memory of one warp (the batched computation model proposed in [12]) and performs the warp-shuffle operation for the final SNR. We therefore need additional shared memory operations to calculate the final SNR from the partial SNRs, plus one explicit synchronization operation to synchronize all the threads.

The last method is the Shared memory Atomic summation (SA) method. It collects partial SNRs into the shared memory of one warp and performs atomic operations to compute the final SNR. This method involves the same shared memory loading overhead and synchronization overhead as the SW method. Later we will show that the simple atomic operations of the DA method give the best performance compared with the SW and SA methods; a minimal sketch of the DA step follows.
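The sketch below is our illustration of the DA step (names are hypothetical): after each warp has reduced its partial SNR with warp shuffles, lane 0 of every warp accumulates directly into the final SNR in global memory with one atomic add per component, requiring no shared memory and no __syncthreads():

// Cross-warp combination, DA method: zre/zim hold this warp's reduced
// partial SNR; lane 0 adds it atomically to the global accumulator snr_k.
__device__ void da_combine(float zre, float zim, float2* snr_k)
{
    if ((threadIdx.x & 31) == 0) {     // one thread per warp
        atomicAdd(&snr_k->x, zre);     // real part
        atomicAdd(&snr_k->y, zim);     // imaginary part
    }
}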
Figure 4: Schematic of the downsampling data hierarchy mapped onto the CUDA block-thread hierarchy. 'Mul' means 'Multiple', and '1s' means 'one second' in this figure.
Figure 5: Schematic of the downsampling GPU kernel function and how data are mapped onto the CUDA GPU memory hierarchy. Each color represents a type of GPU memory. Green: shared memory; red: register memory; blue: global memory.
We optimize the usage of CUDA threads and the various types of memory for the downsampling process as in Fig. 4 and Fig. 5. As shown in Fig. 4, we map the production of one downsampled data point to one CUDA thread. The number of downsampled data points can vary over a power-of-2 series ranging from 32 to 2048. For downsampling with more output data points than GPU cores, the GPU cores will be fully occupied. It is inevitable that for some sub-rate downsamplings there will be idle GPU cores. As downsampling is the least computational process, being performed only once for all templates in our multi-rate scheme, we do not further optimize for the idle cores. The number of blocks allocated is $N/\min\{1024, N\}$, where $N$ is the number of downsampled points. For the three types of GPU memory, we map the input and the downsampling Kaiser filter to shared memory, as they will be reused a number of times. The Kaiser filtering is calculated iteratively, and the intermediate filtering result is stored in register memory for the fastest data access. Finally, the global memory stores the downsampling output.

Figure 6: Schematic of the upsampling input data hierarchy mapped onto the CUDA block-thread hierarchy. 'Mul' means 'Multiple', and '1s' means 'one second' in this figure.

Figure 7: Schematic of the upsampling-and-combination GPU kernel function and how data are mapped onto the CUDA GPU memory hierarchy. The left part shows the GPU memory hierarchy with the color code explained in Fig. 5.

Similarly, Fig. 6 and Fig. 7 show our CUDA design for the combined function of upsampling and upstream summation. As a one-second SNR series has a real and an imaginary component, we map each component of the SNR series to one CUDA block. Each thread is mapped to $2N/\min\{1024, N\}$ upsampled SNR points, where $N$ is the total number of SNR points in a second. Unlike the downsampling, which is performed only once, the upsampling and upstream summation must be performed for each template, so the number of blocks is twice the number of templates. The number of threads will most likely exceed the number of GPU cores, ensuring full occupancy of the GPU cores. For the memory mapping, the upsampling Kaiser filter is mapped to shared memory and the intermediate filtering output to register memory, the same as in the downsampling memory mapping. The input SNR series cannot fit in shared memory, so it is stored in GPU global memory. A sketch of the downsampling kernel follows.
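Below is a minimal sketch of the downsampling kernel under this mapping (our illustration; the names, the fixed tap count, and the padding convention are assumptions). Each thread produces one output point by applying the Kaiser filter taps, staged in shared memory together with the block's input window, and accumulating in a register:

#define TAPS 384   // length of the ~100 dB Kaiser downsampling filter (per the text)

// Downsample by 2, one thread per output point. Launch with dynamic shared
// memory of (TAPS + 2*blockDim.x + TAPS) * sizeof(float); 'in' is assumed
// padded so every block can read its full window.
__global__ void downsample2x(const float* __restrict__ in,
                             const float* __restrict__ taps,
                             float* __restrict__ out, int nout)
{
    extern __shared__ float sh[];
    float* s_taps = sh;            // filter taps, reused by all threads
    float* s_in   = sh + TAPS;     // this block's input window

    int base = blockIdx.x * blockDim.x;            // first output of this block
    for (int i = threadIdx.x; i < TAPS; i += blockDim.x)
        s_taps[i] = taps[i];
    for (int i = threadIdx.x; i < 2 * blockDim.x + TAPS; i += blockDim.x)
        s_in[i] = in[2 * base + i];
    __syncthreads();

    int m = base + threadIdx.x;                    // global output index
    if (m >= nout) return;

    float acc = 0.f;                               // register accumulator
    for (int t = 0; t < TAPS; ++t)                 // FIR dot product
        acc += s_taps[t] * s_in[2 * threadIdx.x + t];
    out[m] = acc;                                  // global memory output
}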
4. Results of multi-rate SPIIR filtering
In this section, we first show the GPU performance of SPIIR filtering and the improvement from each optimization step (Sec. 4.1). In the second part (Sec. 4.2), we show the GPU performance of several scenarios of multi-rate SPIIR filtering. This promises a lucrative performance improvement by GPU acceleration for our SPIIR pipeline.

All GPU implementations are tested on a desktop computer equipped with a GeForce GTX 980 (Maxwell microarchitecture) GPU; Tab. 1 shows its configuration. As the CPU counterparts are implemented in a single-CPU-thread fashion, we use the elapsed time as our performance measurement to reflect the usage of CPU resources. We run each experiment ten times and use the average time as the timing result. We set the number of templates to 4096 for performance-testing purposes in this section. Note that the number of templates for a search can range from a few hundred to hundreds of thousands.
Table 1: Testbed configuration

Hardware
  CPU: Intel Core i7-3770, 3.40 GHz
  Host memory: 8 GB DDR3
  GPU: NVIDIA GeForce GTX 980
  GPU memory: 4 GB GDDR5
Software
  Operating system: Fedora 20 64-bit
  Host compilation: gcc 4.8.3 -O2
  CUDA compilation: nvcc 6.5
4.1. GPU-accelerated SPIIR filtering

We first show the overall improved performance of our new GPU acceleration (denoted New Kernel) over the previous GPU acceleration [12] (denoted Pre-Kernel). We then show the breakdown of the improvement gained by exploiting new features of the Maxwell GPU.

New Kernels and Pre-Kernels are tested against the filtering using a single-core CPU. Tab. 2 shows the speedup ratios of the New Kernels in comparison to the Pre-Kernels for different numbers of filters. For numbers of filters ≥ 32, where both kernels use the same thread configuration, the improvement of the New Kernels over the Pre-Kernels increases with the number of filters, as shown by Tab. 2. This is mainly because the New Kernels improve the data access and exchange speed, the effect of which is more manifest when more filters are used. For the previously targeted optimization range (i.e. 128 to 256 filters), the New Kernels have about a 3-fold speed improvement over the Pre-Kernels.
For templates whose size (i.e. number of filters) is relatively low (template size < 32), the New Kernels use the new thread configuration and achieve up to a 10-fold speedup.

Table 2: Speedup ratios of different GPU kernels compared with the CPU counterpart.

Template size | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512
New Kernel    |
Pre-Kernel    |

Figure 8: Performance comparison of parallel summation reduction between shared memory and warp-shuffle, using 4096 templates.

Figure 9: Memory access performance comparison between caching parameter x in the L2 cache and in the read-only data cache, using 4096 templates.
Fig. 8 shows the performance of two different implementations, with and without warp-shuffle. The first implementation uses warp-shuffle to access data within a warp to calculate the summation. The second implementation uses shared memory to access data and calculate the summation. It shows that the warp-shuffle summation performs significantly faster than the shared memory summation.

Fig. 9 shows the effect of using the read-only data cache. The Maxwell GPU loads its global memory data into the L2 cache rather than the L1 cache. In Fig. 9, 'GPU L2 cache' denotes the kernel performance when parameter x of Eq. 7 is kept in the L2 cache, the default cache for global memory. Because variables such as x of Eq. 7 have favorable temporal and spatial locality, using the read-only data cache is very efficient, as shown in the figure.

Fig. 10 illustrates the performance of the block-based SPIIR kernels using the three different cross-warp summation methods: DA (directly using atomic operations in global memory), SW (using warp-shuffle and shared memory), and SA (using atomic operations in shared memory).
Figure 10: Execution time of different cross-warp summation methods with 4096 templates in different execution configurations.

To ensure an objective comparison, we attempted to optimize each implementation as much as possible; GPU bank conflicts are entirely avoided in the SW and SA methods, where shared memory usage is involved. As one would expect shared memory access to be much faster than global memory access, the SW and SA methods could be expected to be faster than the DA method. Surprisingly, the DA method surpasses the other two methods, as shown in Fig. 10.

We designed two additional experiments to find the reason for the inferior performance of the shared memory methods (SW and SA): whether it is the cost of the shared memory accesses or the explicit synchronization. To reduce the total synchronization cost, both the SW and SA methods take advantage of shared memory to execute many iterations before one explicit synchronization operation. Fig. 11 clearly illustrates that the cost of synchronization can be significantly reduced, amortized over many iterations: the larger the iteration number, the better the performance. However, once the iteration number reaches a certain point, the performance benefit stays almost unchanged as the iteration number continues to grow. Fig. 12 shows that the reason is that the synchronization overhead is completely hidden by calculation in such circumstances. In Fig. 12 we disabled the synchronization operation in these shared memory GPU kernels to observe the performance impact of explicit synchronization at a sufficiently large iteration number. Though the computation result may not be correct, the comparison itself does show the impact of explicit synchronization. As can be seen, the performance with and without synchronization is almost indistinguishable, illustrating that the overhead of explicit synchronization can be removed entirely by increasing the iteration number. Therefore, it is the additional shared memory operations introduced in the SW and SA methods that lead to their lower performance compared with the DA method.

The impact of atomic operations is also tested for the DA method. Atomic operations are typically considered costly and to be avoided whenever possible; however, Fig. 13 shows that the cost of the atomic operations can almost be ignored, because the execution times of the two kernels with and without atomic operations are very close. All these experiments explain why the DA method is better than the SW and SA methods.

Figure 11: The optimization effect of changing the iteration number. A template size of 256 is used (with 4096 templates).

Figure 12: The overhead of explicit synchronization can be ignored when a large iteration number is used. An iteration number of 256 is used here (with 4096 templates).

Figure 13: Execution time of atomic and non-atomic kernels with 4096 templates in different execution configurations.
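For reference, here is a minimal sketch of the batching idea probed by these experiments (our illustration; names are hypothetical): partial results are accumulated in shared memory over a batch of iterations before a single __syncthreads(), so the synchronization cost is amortized as in Fig. 11:

#define ITER 256   // iterations batched between synchronizations (cf. Fig. 12)

// Illustrative SW-style fragment: a block stages ITER partial results in
// shared memory, synchronizes once per batch, and lets its first warp
// finish the sum with warp shuffles.
__global__ void sw_batched(const float* partial, float* out, int nbatches)
{
    __shared__ float buf[ITER];
    int lane = threadIdx.x & 31;

    for (int b = 0; b < nbatches; ++b) {
        for (int i = threadIdx.x; i < ITER; i += blockDim.x)
            buf[i] = partial[b * ITER + i];    // stage one batch
        __syncthreads();                       // one sync per ITER values

        if (threadIdx.x < 32) {                // first warp reduces the batch
            float s = 0.f;
            for (int i = lane; i < ITER; i += 32)
                s += buf[i];
            for (int off = 16; off > 0; off >>= 1)
                s += __shfl_down(s, off);      // warp-shuffle tree sum
            if (lane == 0)
                out[b] = s;
        }
        __syncthreads();                       // protect buf for the next batch
    }
}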
4.2. GPU-accelerated multi-rate SPIIR filtering

The results shown in the last subsection are for SPIIR filtering at a single rate. Here we show the performance of the GPU-accelerated multi-rate implementation of SPIIR filtering, which includes the GPU-accelerated rate alterations. We test several multi-rate scenarios, denoted by the "number of rates" in Tab. 3. A "number of rates" of 1 in the table denotes the single-rate case at the full rate (4096 Hz). The number of filters at the full rate for the 4096 templates is set to 1024. The sub-rates are formed according to the design shown in Sec. 2.2. The initial filters are divided equally among the sub-rates for testing purposes. Tab. 3 shows the elapsed times for the CPU and the GPU-accelerated multi-rate filtering in the different scenarios. The efficiency of our multi-rate design shown here is consistent with the expectation given in Sec. 2.
Table 3: Elapsed times of the CPU and GPU-accelerated multi-rate filtering scheme in different scenarios.

Number of rates        | 1      | 2      | 3      | 4      | 8
CPU multi-rate scheme  | 34815s | 26379s | 20817s | 16680s | 8841s
SPIIR filtering        | 34815s | 26187s | 20517s | 16342s | 8453s
GPU multi-rate scheme  | 287s   | 220s   | 199s   | 143s   | 85s
Speedup                | 121x   | 120x   | 104x   | 116x   | 104x
5. Results of the GPU-accelerated low-latency SPIIR pipeline
Figure 14: Schematic flowchart of the low-latency SPIIR pipeline gstlal_iir_inspiral with data input from two detectors. Solid blocks depict main components of this pipeline. Dashed blocks depict external processes.
The low-latency SPIIR detection pipeline we consider here is publicly available through the distribution of the gstlal software library+. gstlal provides a variety of components from the LIGO Algorithm Library (LAL) for LIGO data processing and uses the gstreamer framework to control streaming data. We use the existing gstlal components for reading calibrated data, data whitening, and coincident post-processing in our pipeline, and our own multi-rate SPIIR filtering for template filtering. The pipeline has the code name gstlal_iir_inspiral in gstlal. A schematic flowchart of the pipeline is shown in Fig. 14.

The gstreamer framework inherently employs multi-threading to take advantage of the multiple cores of a CPU, so the CPU implementation of the SPIIR pipeline is multi-threaded by default. We therefore propose a different criterion, the CPU time used by all CPU cores, rather than the elapsed time of a single-threaded CPU implementation, to measure the performance of the GPU-accelerated pipeline versus the original CPU pipeline.

+ gstlal library: https://wiki.ligo.org/DASWG/GstLAL

Table 4: Computation profiling of the CPU low-latency SPIIR pipeline gstlal_iir_inspiral.

                Multi-rate SPIIR filtering               | Other
SPIIR filtering | Resampling | Upstream summation        |
93%             | 3.1%       | 0.9%                      | 3%

We present the CPU time used by the main components of the CPU pipeline in percentages in Tab. 4. This computation profiling is performed on the platform given in Tab. 1. The data we process is a short segment of recolored LIGO 5th Science Run data, which represents a set of clean data and produces a reasonable number of single events for the coincidence analysis. The SPIIR filtering dominates the computation, taking over 93% of the CPU time, while the resampling and summation come next at 4%.

We can now estimate the expected CPU resource reduction of the pipeline if part or all of the multi-rate filtering component is executed on a GPU instead of a CPU. The expected resource reduction $\kappa_{app}$ is determined by the resource reduction $\kappa_{mod}$ brought by the accelerated module and the computational fraction $P_{mod}$ of this module in the pipeline. It is given by:

$\kappa_{app} = \frac{1}{(1 - P_{mod}) + P_{mod}/\kappa_{mod}}$.   (10)

If we apply GPU acceleration only on the SPIIR filtering module of our pipeline, taking an approximate 100-fold speedup from Tab. 2 for this component ($P_{mod} = 0.93$, $\kappa_{mod} = 100$), the total resource reduction ratio of the pipeline will be no more than $1/(0.07 + 0.0093) \approx 13$-fold. If, on top of that, we apply GPU acceleration to the resampling and upstream summation components, taking an approximate 100-fold speedup from Tab. 3 for the whole multi-rate SPIIR filtering ($P_{mod} = 0.97$), the total resource reduction of the pipeline will be about $1/(0.03 + 0.0097) \approx 25$-fold.

We measure the CPU resource reduction efficiency and also the elapsed-time gain from GPU acceleration of the multi-rate SPIIR filtering. The setup of the performance test is explained in detail here. We simulate two sets of data using Advanced LIGO noise for the twin LIGO detectors, respectively; the duration of each data set is 1000 seconds. A simulated binary neutron star coalescence GW signal is injected coherently into these data sets. To prepare the templates used by our pipeline for the search, we generate a template bank for each detector, covering the parameters of the injection. Each bank consists of 1024 geometric templates. We generate SPIIR filters for each bank and divide the filters into four groups corresponding to the four designated sub-rates.
We insert filters with zero coefficients into each group so that the number of filters of each group in a bank is a power of 2. While the filter insertion is not necessary for the search, it is done here for performance-testing purposes. The number of SPIIR filters of each group in each bank is shown in Tab. 5.

Table 5: Number of filters in each frequency band of a bank and the data sampling rates required by the Nyquist theorem.

Sampling rate (Hz)  | 4096   | 2048   | 1024  | 512
Frequency band (Hz) | < 2048 | < 1024 | < 512 | < 256

Tab. 6 shows the CPU resource reduction and the elapsed-time gain of the GPU-accelerated SPIIR pipeline. It achieves a 21-fold improvement in CPU resource reduction, which is close to our expectation shown earlier. Besides, we have significantly reduced the running time of the pipeline, by 11-fold. The difference of the SNRs between the GPU pipeline and the CPU pipeline on the injection event is within 0.

Table 6: Performance of the low-latency SPIIR pipeline using CPU power only (CPU pipeline) and with GPU acceleration of multi-rate SPIIR filtering (GPU pipeline).

Pipeline     | Used CPU time | CPU resource reduction ratio | Elapsed time | Speed-up
CPU pipeline | 31610s        | 1x                           | 4560s        | 1x
GPU pipeline | 1520s         | 21x                          | 430s         | 11x
6. Conclusions and Future Work
Low-latency and real-time detection of gravitational-wave (GW) signals is gaining priority for its potential to enable prompt electromagnetic follow-up observations. The low-latency SPIIR detection pipeline is a CBC detection pipeline with latencies of tens of seconds.

In this paper, we developed GPU acceleration for the main computational component of the SPIIR pipeline, the multi-rate SPIIR filtering. We first improved the GPU optimization of the filtering part by employing a new kind of GPU thread configuration and exploiting new memory features of Maxwell GPUs. This provides a notable improvement over our previous implementation on Fermi GPUs. We then implemented GPU acceleration of the resampling and upstream summation parts of the multi-rate filtering procedure. Our tests show that the multi-rate filtering is accelerated by over 100-fold across the different filtering scenarios. This leads to a 21-fold CPU resource reduction for the entire pipeline and an 11-fold reduction in elapsed time.
Acknowledgments
We would like to thank Maurice H. P. M. van Putten for discussions and comments on the details of the paper. This research is supported in part by the National Natural Science Foundation of China (Grants No. 61440057, 61272087, 61363019 and 61073008), the Beijing Natural Science Foundation (Grants No. 4082016 and 4122039), the Sci-Tech Interdisciplinary Innovation and Cooperation Team Program of the Chinese Academy of Sciences, the Specialized Research Fund for State Key Laboratories, and the Australian Research Council Discovery Grants and Future Fellowship programs. QC gratefully acknowledges the support of an International Postgraduate Research Scholarship funded by the Australian Government.
References

[1] B. P. Abbott, R. Abbott, T. D. Abbott, et al. Observation of gravitational waves from a binary black hole merger. Phys. Rev. Lett., 116:061102, Feb 2016.
[2] Bangalore Suryanarayana Sathyaprakash and Bernard F. Schutz. Physics, astrophysics and cosmology with gravitational waves. Living Reviews in Relativity, 12(2):18–19, 2009.
[3] Maurice H. P. M. van Putten, Gyeong Min Lee, Massimo Della Valle, Lorenzo Amati, and Amir Levinson. On the origin of short GRBs with extended emission and long GRBs without associated SN. Monthly Notices of the Royal Astronomical Society: Letters, 444(1):L58–L62, 2014.
[4] Q. Chu, E. J. Howell, A. Rowlinson, et al. Capturing the electromagnetic counterparts of binary neutron star mergers through low-latency gravitational wave triggers. MNRAS, 459:121–139, June 2016.
[5] B. J. Owen and B. S. Sathyaprakash. Matched filtering of gravitational waves from inspiraling compact binaries: Computational cost and template placement. PRD, 60(2):022002, July 1999.
[6] B. Allen, W. G. Anderson, P. R. Brady, D. A. Brown, and J. D. E. Creighton. FINDCHIRP: an algorithm for detection of gravitational waves from inspiraling compact binaries. ArXiv General Relativity and Quantum Cosmology e-prints arXiv:gr-qc/0509116, September 2005.
[7] D. Buskulic, Virgo Collaboration, and LIGO Scientific Collaboration. Very low latency search pipeline for low mass compact binary coalescences in the LIGO S6 and Virgo VSR2 data. Classical and Quantum Gravity, 27(19):194013, October 2010.
[8] K. Cannon, R. Cariou, A. Chapman, et al. Toward early-warning detection of gravitational waves from compact binary coalescence. ArXiv e-prints arXiv:1107.2665, July 2011.
[9] J. Luan, S. Hooper, L. Wen, and Y. Chen. Towards low-latency real-time detection of gravitational waves from compact binary coalescences in the era of advanced detectors. ArXiv e-prints arXiv:1108.3174, August 2011.
[10] S. Hooper, L. Wen, D. Blair, et al. Low-latency detection of gravitational waves. In American Institute of Physics Conference Series, volume 1246, pages 211–214, June 2010.
[11] S. Hooper, S. K. Chung, J. Luan, et al. Summed parallel infinite impulse response filters for low-latency detection of chirping gravitational waves. PRD, 86(2):024012, July 2012.
[12] Yuan Liu, Zhihui Du, Shin Kee Chung, et al. GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence. Classical and Quantum Gravity, 29(23):235018, 2012.
[13] H. S. Black. Modulation Theory. Van Nostrand, New York, 1953.
[14] NVIDIA. NVIDIA CUDA C Programming Guide. NVIDIA Corporation, 120, 2011.
[15] Khronos OpenCL Working Group et al. The OpenCL specification. Version, 1(29):8, 2008.
[16] Kate Gregory and Ade Miller. C++ AMP: Accelerated Massive Parallelism with Microsoft® Visual C++®. Microsoft Press, 2012.