Deep searches for broadband extended gravitational-wave emission bursts by heterogeneous computing
arXiv [astro-ph.IM] — Draft version July 2, 2018
Typeset using LaTeX preprint style in AASTeX61
Maurice H.P.M. van Putten
Yeongsil-Gwan, Room 614, Physics and Astronomy, Sejong University, Seoul, South Korea
ABSTRACT

We present a heterogeneous search algorithm for broadband extended gravitational-wave emission (BEGE), expected from gamma-ray bursts and energetic core-collapse supernovae. It searches the (f, ḟ)-plane for long duration bursts from inner engines slowly exhausting their energy reservoir, by matched filtering on a Graphics Processing Unit (GPU) over a template bank of millions of one-second duration chirps. Parseval's Theorem is used to predict the standard deviation σ of the filter output, taking advantage of near-Gaussian noise in LIGO S6 data over 350-2000 Hz. Tails exceeding a multiple of σ are communicated back to a Central Processing Unit (CPU). This algorithm attains about 65% efficiency overall, normalized to the Fast Fourier Transform (FFT). At about one million correlations per second over data segments of 16 s duration (N = 2^16 samples), better than real-time analysis is achieved on a cluster of about a dozen GPUs. We demonstrate its application to the capture of high frequency hardware LIGO injections. This algorithm serves as a starting point for deep all-sky searches in both archive data and real-time analysis in current observational runs.

1. INTRODUCTION

Gravitational radiation offers a potentially powerful new channel to discover the physical nature and population statistics of core-collapse supernovae and their association with neutron stars and black holes. Recently, LIGO identified a black hole binary progenitor of GW150914 (LIGO-Virgo 2016) with remarkably low spin. Stellar mass black holes are believed to be remnants of extreme transient events such as gamma-ray bursts and core-collapse supernovae. Shortly after birth in the latter, possibly including the superluminous variety, black holes may encounter strong interactions with high density matter. This outlook opens a window to release their angular momentum in gravitational radiation, leaving slowly spinning remnants with small a/M, relevant to the long gamma-ray bursts observed by BeppoSAX (Frontera et al. 2009).
EM-triggers obtained from transient surveys allow off-line GW-analysis of LIGO-Virgo archive data. On the other hand, GW-triggers require relatively low latency in EM follow-up, which may be challenging given the modest localisation of LIGO-Virgo detections.

While core-collapse supernovae are relatively numerous, only a small fraction is known to be associated with extreme events. For instance, the true event rate (corrected for beaming) of long GRBs is about 1 per year within a distance of 100 Mpc. Achieving sensitivity at tens of Mpc to emissions limited to E_GW = O(1 M⊙c²), where c denotes the velocity of light, poses a challenge for deep searches in gravitational wave data.

Broadband extended gravitational-wave emission (BEGE) from the aforementioned extreme events may be produced with durations lasting up to tens of seconds. In chirp-based spectrograms, such emission may appear as trajectories marked by frequencies slowly wandering in time, featuring ascending and descending chirps (Levinson et al. 2015). To search for these signatures in the (f, ḟ)-plane, we recently devised a dedicated butterfly filtering using chirp templates of intermediate duration τ of, e.g., one second, targeting a time scale of phase coherence that may capture tens to hundreds of wave periods associated with non-axisymmetric accretion flows. Using millions of chirp templates, it can detect complex signals such as Kolmogorov scaling in noisy time-series, recently demonstrated in BeppoSAX gamma-ray light curves with an average photon count down to 1.26 photons per 0.5 ms bin (van Putten et al. 2014a). This kind of sensitivity suggests exploring further applications to strain amplitude gravitational wave data (van Putten 2016).

Deep searches covering a complete science run of LIGO require considerable computing resources in the application of butterfly filtering with a dense bank of templates. We here report on a novel algorithm by heterogeneous computing comprising both Graphics and Central Processing Units (GPUs, respectively, CPUs) using the Open Computing Language (OpenCL) (Gaster et al. 2011; Khronos 2015). A primary challenge in heterogeneous computing is circumventing GPU-CPU bottlenecks arising from potentially vast discrepancies in data throughput over the Peripheral Component Interconnect (PCI). Our algorithm exploits near-Gaussian noise in the high frequency bandwidth of 350-2000 Hz in LIGO data, whereby the output of matched filtering is essentially Gaussian as well. Near-optimal efficiency is obtained by returning to the CPU only the tails of relatively high signal-to-noise ratio in the GPU output, with a cut-off predicted by Parseval's Theorem. Including overhead in the latter, our algorithm achieves about 65% efficiency normalized to GPU-accelerated Fast Fourier Transform (FFT) in complex-to-complex (C2C), single precision (SP) and interleaved out-of-place memory allocation.

Our choice of chirp templates is guided by inner engines involving black holes described by the Kerr metric, interacting with high density matter, expected in core-collapse of massive stars and mergers involving neutron stars, the latter envisioned in association with short GRBs with Extended Emission (SGRBEE) and long GRBs with no supernovae (LGRBN) such as GRB060614 (van Putten et al. 2014b).

In the application to LIGO S6, we give a detailed description of our data-base, which comprises bandpass filtering to the aforementioned 350-2000 Hz (over 64 s segments of data, N = 2^18 samples) and a restriction to simultaneous H1-L1 detector output (29.4% of total S6 data). On a modern GPU, we realize approximately 80,000 correlations per second over 16 s segments of data (N = 2^16 samples). On a cluster of about a dozen GPUs, about 1 million correlations per second realizes better than real-time analysis. As such, the presented method is applicable to both archive analysis and low-latency searches in essentially real-time, pioneered following GW150914 (Abbott et al. 2016a) and for current Advanced LIGO runs (e.g. Guo et al. 2017). For the present archive analysis of LIGO S6, however, our focus is on an exhaustive deep search, all-sky and blind without triggers from electromagnetic data.

Existing comprehensive, blind searches for bursts (Abbott et al. 2004, 2007, 2009a,b; Abadie et al. 2010, 2012; Ando et al. 2005; Aasi et al. 2013) cover various broadband emissions over 16-500 Hz (Abbott et al. 2017a), 40-1000 Hz (Abbott et al. 2016b) and, for short bursts, 32-4096 Hz (Abbott et al. 2017b). Around a rotating black hole of mass M newly formed in core-collapse supernovae, gravitational wave emission from non-axisymmetric mass-flow around the Inner Most Stable Circular Orbit (ISCO) is expected to be potentially luminous (van Putten 2001), featuring a broadband descending chirp (van Putten 2008, 2016) with late-time frequency (van Putten et al. 2011)

f_GW ≃ (595 − …) Hz (M⊙/M),  (1)

where the range in frequency refers to the dependency on initial black hole spin. This motivates our present focus on the high-frequency bandwidth 350-2000 Hz in LIGO S6. In this frequency bandwidth, LIGO noise is essentially Gaussian, which shall be exploited in our GPU-CPU method of analysis.

In § 2 we describe our bandpass filtered H1 and L1 data-base of S6. In § 3, our heterogeneous computing algorithm is described with use of pre- and post-callback functions. Benchmark results are given in § 4, filter output with tails and LIGO burst injections in § 5, and conclusions in § 6.

2. BANDPASS FILTERED H1 AND L1 DATA IN S6

Figure 1.
Overview of LIGO S6, showing standard deviations of strain noise over 64 s data segments (N = 2^18 samples) of H1 ∧ L1. H1 ∧ L1 ranged from 0-68% with an average yield of 29.4% of H1 ∨ L1 (mean H1 = 6.26 × 10^-21, mean L1 = 1.36 × 10^-20). Strain noise in H1 and L1 was better than 10^-20 over 89%, respectively, 64% of the time. Performance of H1 and L1 became somewhat more consistent after the first 3000 hours.

Table 1.
Overview of the data-base of H1 ∧ L1 when both H1 and L1 were taking data (measured over 64 s data segments), extracted from a total of 12726 LIGO S6 frames. Frames on the LIGO Open Science Center (LOSC) comprise 4096 s (N = 2^24 samples) of H1 or L1 data, here bandpass filtered to 350-2000 Hz over 64 s data segments (N = 2^18 samples). H1 ∧ L1 data for analysis is in 36 files of 4096 × 64 s segments (Table 2).

Data      64 s segments   LOSC frames (4096 s)   Memory    Source, Target
H1        422912          6608                   -         LOSC
L1        391552          6118                   -         LOSC
H1 ∨ L1   499712          7867                   1.05 TB   Disk
H1 ∧ L1   147000          -                      305 GB    Disk
File      4096            -                      8.59 GB   Compute node
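The memory column of Table 1 follows directly from the segment counts at the downsampled rate of 4096 samples per second; the 8 bytes per sample (complex single precision) used below is our inference from the C2C SP processing described later, so this is a sketch of the accounting rather than the source's own:

```python
# Cross-check of the Table 1 memory column. Assumes the downsampled
# rate of 4096 samples/s and 8 bytes per sample (complex single
# precision, as used for the C2C FFTs); the byte size is an inference.
RATE = 4096              # samples per second after downsampling
SEG = 64 * RATE          # samples per 64 s segment (N = 2**18)
BYTES = 8                # complex single precision

def gbytes(n_segments):
    """Storage in GB for a given number of 64 s segments."""
    return n_segments * SEG * BYTES / 1e9

print(f"H1 v L1: {gbytes(499712) / 1e3:.2f} TB")  # Table 1: 1.05 TB
print(f"H1 ^ L1: {gbytes(147000):.0f} GB")        # close to Table 1: 305 GB
print(f"File:    {gbytes(4096):.2f} GB")          # Table 1: 8.59 GB
```

The H1 ∨ L1 and per-file figures reproduce Table 1 exactly; the H1 ∧ L1 figure agrees to within about 1%.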
LIGO S6 covers the period July 7, 2009 through October 20, 2010. In our analysis of LIGO S6, we focus on epochs when H1 and L1 are both taking data. These H1 ∧ L1 data represent 29.4% of data when either H1 or L1 was taking data (H1 ∨ L1), measured over 64 second segments (Fig. 1).

In our search for gravitational wave emission from core-collapse supernovae associated with stellar mass black holes, we focus on the frequency bandwidth of 350-2000 Hz. Upon bandpass filtering (over 64 s segments of data, N = 2^18 samples), LIGO noise is essentially Gaussian (e.g. van Putten 2016). This bandwidth may contain gravitational wave emission from non-axisymmetric mass motion about the Inner Most Stable Circular Orbit (ISCO) around stellar mass black holes (van Putten et al. 2011; van Putten 2016).

3. BUTTERFLY FILTERING BY HETEROGENEOUS COMPUTING

To search for slowly evolving trajectories in time-frequency space, we consider matched filtering over a large bank of chirp templates covering a range in f and time rate-of-change of frequency ḟ, i.e., a butterfly

0 < δ1 < |ḟ| < δ2  (2)

for some δ2 > δ1 > 0. Over a finite bandwidth of frequencies, the resulting output is a chirp-based spectrogram. The chirps are generated from a long duration template, produced by solving a pair of ordinary differential equations modeling black hole spin down against high density matter at the ISCO (van Putten et al. 2014a), the results of which are illustrated in Fig. 2.
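A grid of templates over the butterfly (2) can be sketched as follows. The linear-chirp phase model and the grid spacings here are simplifying assumptions for illustration, not the black hole spin-down model of van Putten et al. (2014a):

```python
import numpy as np

# Illustrative (f, df) template grid with time-symmetric chirps of
# one-second duration, in the spirit of Fig. 2. Linear chirps and the
# grid spacings are simplifying assumptions for this sketch.
RATE = 4096                        # samples per second
t = np.arange(RATE) / RATE         # one-second template, tau = 1 s

def chirp(f0, df):
    """Unit-norm linear chirp sweeping from f0 to f0 + df over 1 s."""
    phase = 2 * np.pi * (f0 * t + 0.5 * df * t**2)
    w = np.sin(phase)
    return w / np.linalg.norm(w)

def symmetric_chirp(f0, df):
    """Superpose the chirp forward and backward in time, giving a
    template sensitive to both ascending and descending chirps."""
    w = chirp(f0, df)
    s = w + w[::-1]
    return s / np.linalg.norm(s)

# A coarse bank over 350-2000 Hz and df up to 150 Hz; Fig. 2 shows
# the same construction, scaled up to millions of templates.
bank = [symmetric_chirp(f0, df)
        for f0 in np.arange(350, 2000, 50)
        for df in np.arange(5, 150, 25)]
print(len(bank), "templates")
```

Shrinking the grid steps toward the 1/N Hz spacing discussed in § 3.1 grows this toy bank to the O(1M) templates used in the actual analysis.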
Figure 2.
Overview of template banks of one-second duration chirps (dots) covering 350-2000 Hz, shown by frequency f and their change ∆f in frequency, illustrated with a bank of small size. The chirps used in butterfly filtering are symmetric in time, obtained by superposition of chirps forward and backward in time, suitable in searches for both ascending and descending chirps. Banks A and B are similar, except Bank A is larger by including more pronounced chirps (larger ∆f) at the lower bound of 350 Hz.

Matched filtering of a time series y(t) against chirp templates w(t) is defined by the correlations

ρ(t) = ∫_{-∞}^{∞} w(s) y(t + s) ds.  (3)

In the present application to LIGO strain data, y(t) and w(t) have zero mean. This integral is conveniently evaluated in the Fourier domain as ρ̃(k) = w̃*(k) ỹ(k), where

f̃(k) = (1/2π) ∫_{-∞}^{∞} f(t) e^{-ikt} dt,   f(t) = ∫_{-∞}^{∞} f̃(k) e^{ikt} dk.  (4)

Discretizing (3) to samples at equidistant instances t_n (n = 0, 1, ..., N − 1), we evaluate (4) by FFT. This is more efficient than direct evaluation of (3) in the time domain whenever the number of samples N exceeds a few hundred. This may be readily observed by comparing compute times, convolving two vectors u and v by FFT versus direct evaluation in, e.g., MatLab; see also Smith (2016).
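The equivalence of the two evaluations of (3) can be checked in a few lines; a sketch with white-noise stand-ins for data and template (in the pipeline, templates are zero-padded to the segment length, as described in § 3.2):

```python
import numpy as np

# Circular correlation (3) evaluated via the Fourier domain, as in (5):
# forward FFTs, pointwise product with the conjugate template transform,
# one inverse FFT. White-noise stand-ins for data and template.
rng = np.random.default_rng(0)
N = 4096
y = rng.standard_normal(N)       # data segment (zero mean)
w = rng.standard_normal(N)       # template (zero mean)

# Direct evaluation: rho[n] = sum_s w[s] * y[(n + s) mod N]
rho_direct = np.array([np.dot(w, np.roll(y, -n)) for n in range(N)])

# Fourier-domain evaluation in three steps
rho_fft = np.fft.ifft(np.conj(np.fft.fft(w)) * np.fft.fft(y)).real

print(np.allclose(rho_direct, rho_fft))   # True
```

The direct sum costs O(N²) multiplications per template against O(N log N) for the FFT route, the basis of the batched GPU implementation below.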
Data + Templates → FFT → ρ̃ (pre-callback) → Parseval: σ → IFFT → tails ρ > κσ (post-callback); M-sized batch iterated over M templates

Figure 3.
Butterfly filtering by heterogeneous computing applied to M H1 ∧ L1 data segments of 16 s (N = 2^16 samples) by CPU (thin lines) and GPU (thick lines). Parseval's Theorem computes M standard deviations σ of the essentially Gaussian correlations ρ(t) obtained by matched filtering on the GPU. Potentially relevant results are contained in the tails of ρ(t). Retaining tails ρ(t) > κσ for a threshold κ realizes near-optimal heterogeneous GPU-CPU computing, effectively circumventing PCI bandwidth limitations when κ is a few.

For reference, recall that correlating vectors y and w comprises three steps: twice forward FFT, pointwise products ρ̃ involving complex conjugation, and one inverse FFT:

{w̃, ỹ} = FFT{w, y},   ρ̃ = w̃* · ỹ,   ρ = FFT^{-1}{ρ̃}.  (5)

For LIGO S6, the (downsampled) sampling rate is 4096 s^-1, whence N = 2^16 for 16 s data segments. With vanishing mean values, the standard deviation of ρ(t_n),

σ = (1/√N) [Σ_{n=0}^{N-1} ρ(t_n)²]^{1/2},  (6)

satisfies Parseval's Theorem,

σ = √2 [Σ_{n=1}^{N/2-1} |c_n|²]^{1/2},  (7)

Table 2.
Partitioning files of the H1 ∧ L1 data-base on a heterogeneous compute node into blocks allocated in Global Memory on a GPU for processing by FFT_{N,M} with transforms of size N = 2^16 in batch mode of size M = 2048.

Unit               Array length    Memory size   Target
File               8 blocks        8.59 GB       Disk storage, host
Block              NM (M = 2048)   1.1 GB        FFT/GPU
FFT data segment   N = 2^16        -             -

where c_n denote the Fourier coefficients of ρ(t) according to the FFT pair

c_k = (1/N) Σ_{n=0}^{N-1} f_n e^{-ikt_n},   f_n = Σ_{k=0}^{N-1} c_k e^{ikt_n}.  (8)

Bandpass filtered to 350-2000 Hz, H1 ∧ L1 (Table 1) has noise which is essentially Gaussian (e.g. van Putten 2016). This property is inherited by ρ(t) in (3). Hence, ρ(t) is effectively described by σ in (7) for a given pair of data segment and template. Therefore, (7) provides a predictive step for the output of (5). In processing (5) on a GPU, a threshold in a post-callback function can be used to retain only the tails (Fig. 3)

ρ(t_n) > κσ  (9)

for feedback to the CPU over the PCI. In (9), we implicitly apply the inequality to the absolute value of ρ_n = ρ(t_n). Thus, (9) circumvents vast discrepancies in throughput of GPUs and CPUs whenever κ is on the order of a few. This step is essential for an optimal heterogeneous computing algorithm, to be benchmarked further below.

It should be mentioned that below 350 Hz, LIGO data is non-Gaussian, giving rise to distributions of ρ(t) that occasionally show multiple peaks. (This depends on the pair of data segment and template.) In this event, σ inadequately describes ρ(t_n), whereby the tails defined by (9) become less meaningful in defining candidate detections.

Processing is applied to batches of M = 2048 segments of 16 s of H1 ∧ L1 data. Such a block, comprising about 9 hours of data in about 1 GByte, is suitable for allocation in Global Memory of a typical GPU. Chirp templates are extracted by time slicing from a model of black hole spindown (van Putten et al. 2014a). While these emissions are of relatively high frequency when the black hole spins rapidly, late time emission following spin down reaches an asymptotic frequency satisfying (1). Analysis is performed in groups of M such templates by FFT in batch mode. Batch mode operation is essential to reaching optimal FFT performance on a GPU.

3.1. Teraflops compute requirements
Sensitivity to arbitrary, slowly varying transients is realised by banks sufficiently large to densely cover the (f, df/dt)-parameter space. For matched filtering, a bank of chirps of one second duration covering f = O(N) Hz with frequency changes O(f) will be dense with step sizes of order 1/N Hz in f and df/dt, setting a minimum bank size of order O(N²). For f on the order of one kHz, the minimum bank size is O(1M), needed to ensure a reasonable probability to match a signal (a "hit" when ρ > κσ).

For a better than real-time analysis by butterfly filtering of data segments of duration T over a template bank of size K1M, the required compute performance is

ṅ = 5N log2 N × K1M T^-1 = 2.75 teraflops,  (10)

where the right hand side refers to our choice of T = 16 seconds and a template bank of α = 1, 2, ..., K1 sets of size M = 2048 each.

Hardware requirements are considerably higher, since FFTs tend to be memory limited (not compute limited) on GPUs, especially when FFT array sizes exceed the size of Local Memory private to individual Compute Units (CU). At typical efficiencies of η ≃ 7% in these cases, (10) points to a minimum requirement of about 50 teraflops of maximal GPU compute-performance, assuming (10) is realized at approximately optimal efficiency normalized to FFT.

In what follows, we consider partitioning the template bank into α = 1, 2, ..., K1 sets and, respectively, the data into β = 1, 2, ..., K2 blocks,

W_α = {w_αk}_{k=1}^{M},   Y_β = {y_βk}_{k=1}^{M}.  (11)

In our application, K1M = 2^23 (up to 8 million) templates and K2 = 288 data blocks for LIGO S6. The total number of correlations for a full LIGO S6 analysis is

K1K2M² = 5 × 10^12.  (12)

For our choice of 16 second segments (N = 2^16), (12) defines a compute requirement of 2.6 × 10^19 floating point operations for a complete LIGO S6 analysis over a bank of 8M templates.

3.2. Batch mode with pre- and post-callback functions
Fig. 3 shows the butterfly filtering by our GPU-CPU heterogeneous computing algorithm, based on detailed partitioning of data and work listed in Table 2. For (5), we choose FFT with C2C, SP and with interleaved out-of-place memory allocation by one-dimensional FFT_{N,M} of length N = 2^16 in batch mode of size M = 2048:

(i) FFT_{N,M} of M pairs of 16 s data segments of H1 ∧ L1 comprising a block of MN complex single precision (CSP) values in Allocatable Memory of size 1 GByte. (FFT is applied to arrays of complex numbers, merging pairs of real H1 and L1 data.) Transforms Z̃ = (H̃, L̃) comprise M sub-arrays Z̃^(k) (k = 1, 2, ..., M), each of length N;

(ii) A chirp template w of duration τ = 1 s is extended by zeros to length N and its transform w̃ is loaded into Global Memory. A pre-callback function computes M transforms ρ̃ from M pointwise array multiplications ρ̃^(k) = Z̃^(k) · w̃* (k = 1, 2, ..., M);

(iii) Inverse FFT_{N,M} applied to ρ̃^(k) produces M correlations ρ_k(t_n) over N samples, representing the computationally most intensive (but memory limited) step on the GPU;

(iv) (ii) and (iii) are repeated M times, once for each of M chirp templates w, at a total computational effort of M² inverse-FFT_N.

At 5N log2 N flops per FFT_N, these combined steps for the above mentioned N and M involve ∼ M × M correlations ρ̃^(km), i.e. candidate events exceeding a multiple of σ_km, one for each 16 s segment k of data and chirp template m (k, m = 1, 2, ..., M).

The σ_km are pre-computed by Parseval's Theorem (7). As norms of complex Fourier coefficients, (7) is computationally demanding, requiring off-loading to the GPU as well (Fig. 3). For κ = 5.5, feedback of tails is reduced to a relatively small data rate, well below the PCI bandwidth of several GByte s^-1, allowing near-optimal computing at about 65% efficiency overall (including Parseval's step), normalised to FFT alone. Retaining tails over the PCI by the CPU is realized as follows.

3.3. Gathering GPU-tails over the PCI
The tails of correlations satisfying (9) are gathered in two steps (Figs. 3-4). In correlations of a template w_k ∈ W_α and a segment y_m ∈ Y_β (1 ≤ α ≤ K1, 1 ≤ β ≤ K2, k, m = 1, 2, ..., M), (9) is obtained for each σ_km. Thus, w_k gives M tails in correlation with Y_β, referenced by the time

t_m = t_{n_m} + mN  (13)

of maximal correlation satisfying

ρ_m ≥ κσ_km,  (14)

where ρ_m = ρ(t_m). To circumvent limited PCI bandwidth, (13-14) are converted to pointers projected into an array A_αk of size N,

A_αk = {(t_m, ρ_m) | ρ_m ≥ ρ_m′ ≥ κσ_km′ (all m′)}.  (15)

We evaluate (15) by a post-callback function on the GPU, updating (t_m, ρ_m) with (t_m′, ρ_m′) whenever ρ_m′ > ρ_m and ρ_m′ > κσ_km′. As an asynchronous read/write by pointwise index on Global Memory, this may lead to indeterministic behavior when two processors operate concurrently on the same index. When κ is appreciable, A_αk is sparse, and this anomalous behavior is exceedingly rare. Repeating (15) for all w_k ∈ W_α obtains M tails by pointers,

A_α = ∪_{k=1}^{M} A_αk.  (16)

Collecting all pointers in A_α is evaluated by the CPU.

Gathering results over the complete template bank obtains by repeating (16) for all 1 ≤ α ≤ K1, each time dereferencing A_α into an array B of block size NM on the host,

B = ∪_{α=1}^{K1} *A_α,  (17)

evaluated by the CPU. In collecting B, we select data with maximal ρ values at t_n + mN from the *A_α.

Gathering all hits, by removing the selection of maximal ρ in collecting B in (17), produces extended output with up to two orders of magnitude more output in case of a signal. For a burst injection discussed below (Fig. 7), for instance, this increases output to tens of GByte for a bank of 8M templates. Such extended output may be of interest to second runs, following up on selected data segments covering candidate events, but less so to first runs through all data such as LIGO S6.
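The Parseval prediction (7) and the thresholding (9) underlying this tail gathering can be sketched with white-noise stand-ins (the data segment and chirp template below are hypothetical Gaussian series, not S6 strain):

```python
import numpy as np

# Predicting sigma of the filter output by Parseval's Theorem and
# retaining only tails above kappa*sigma, as in (6)-(9). White
# Gaussian noise stands in for bandpassed strain; kappa = 5.5.
rng = np.random.default_rng(1)
N = 2**16                          # 16 s at 4096 samples/s
y = rng.standard_normal(N)         # stand-in for a data segment
w = rng.standard_normal(N)         # stand-in for a chirp template
w /= np.linalg.norm(w)

rho = np.fft.ifft(np.conj(np.fft.fft(w)) * np.fft.fft(y)).real

# (6): time-domain standard deviation of rho
sigma_time = np.sqrt(np.sum(rho**2) / N)

# (7): the same sigma from the Fourier coefficients c_k of rho,
# without leaving the Fourier domain (DC and Nyquist terms kept
# here for an exact identity; both are negligible for zero mean)
c = np.fft.fft(rho) / N
sigma_parseval = np.sqrt(np.sum(np.abs(c)**2))
print(abs(sigma_time - sigma_parseval) < 1e-8 * sigma_time)  # True

# (9): only the sparse tails are returned to the CPU
kappa = 5.5
tails = np.flatnonzero(np.abs(rho) > kappa * sigma_parseval)
print(len(tails), "of", N, "samples cross the threshold")
```

For Gaussian output, the expected crossing fraction at κ = 5.5 is about 4 × 10^-8, which is what makes the GPU-to-CPU feedback sparse.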
Figure 4.
Tails of correlations between chirps and a block of NM samples of H1 ∧ L1 (thick lines), projected by pointer into an array A of size N in Global Memory, which will be sparse whenever κ is on the order of a few. Gathered over the PCI, the *A are stored in an array B of size NM on the host. B is stored to disk after completing correlations with a complete bank of chirp templates.

4. BENCHMARKS UNDER OPENCL AND FILTER OUTPUT

The algorithm shown in Figs. 3-4 is implemented in Fortran90 and C++ using AMD's clFFT (in C99) under OpenCL. Following Table 2, clFFT operates on bandpass filtered H1 ∧ L1 data in 1 GByte blocks allocatable in Global Memory for clFFT_{N,M} (C2C, SP) with interleaved out-of-place memory storage.

Under OpenCL, a GPU is partitioned into CUs with fast but private Local Memory and registers. Only Global Memory is shared across all CUs. Performance hereby critically depends on efficient use of Local Memory and minimal use of Global Memory, since access to the latter is relatively slow. With a Local Memory size of typically 32 kByte, clFFT performance for C2C SP is essentially maximal for N ≤ 2^12. In our application, N = 2^16, whereby clFFT performance is practically memory limited.

Fig. 5 shows clFFT performance on GPUs with varying numbers of Stream Processors, CUs and Global Memory bus bandwidth (GByte s^-1), namely the R9 Nano (4096, 64, 512), the R9 390 (2560, 40, 384) and the D700 (2048, 32, 264). For the first, performance is over 600 Gflop s^-1 for N > 2^12 (about 1000 Gflop s^-1 for N ≤ 2^12). This is a direct result of the 32 kByte Local Memory size with 8N bytes in complex single precision storage, and the need to access Global Memory when N > 2^12. For N = 2^16, the net result is an overall efficiency of about 7% of peak floating point compute-performance by Stream Processors alone.

We implement Parseval's Theorem by partial sums off-loaded to the GPU, the results of which are summed by the CPU. At a few hundred Gflop s^-1 performance thus achieved, wall clock compute
Figure 5.
Performance of FFT_{N,M} by clFFT under OpenCL on the AMD D700, R9 390 and R9 Nano GPUs, expressed in Gflop s^-1 as a function of transform array size N in C2C SP with interleaved out-of-place data storage and no output back to the CPU. Results are shown for two different batch sizes with correspondingly different allocations in Global Memory (MEM = 512 MB and 1024 MB). These results define a practical limit on performance in FFT-based correlations, which involve additional communications to a CPU over a PCI.

time is about 25% compared to that of clFFT on the GPU. Including overhead in (i)-(iv) of § 3.2, we realize ∼ 8 × 10^4 correlations per second per GPU. On a cluster of about a dozen GPUs, we hereby realise about 1 million correlations per second, sufficient for a real-time analysis with up to 16 million templates according to (10).

Filter output stored to disk is listed by block in files B_n, n = 1, 2, ..., 288.

5. TAILS AND LIGO BURST INJECTIONS

To illustrate a full analysis, Fig. 6 shows a pseudo-spectrum of the tails (9) of H1 ∧ L1 LIGO S6, obtained by averaging results of blocks using a template bank of intermediate size of 0.5M chirps. The detailed structure shown represents the non-Gaussian features that carry any potentially relevant information, visible only by zooming in on the tails of an otherwise overall near-Gaussian PDF of the internal GPU output ρ (Fig. 3). This has been verified numerically, in obtaining completely smooth spectra of the tails of ρ following time-randomisation of H1 or L1 data (Fig. 6).

Fig. 6 shows various pronounced features, some of which are probably associated with unsteady behaviour in various instrumental lines familiar from conventional Fourier spectra of S6 strain noise (LOSC 2017a). The details remain to be understood, especially so given the non-trivial residual spectrum of simultaneous hits with frequency pairs (f1, f2) of H1 and L1 that are relatively close, here shown with |f1 − f2| < ∆, ∆ = 50 Hz. (A similar spectrum obtains for ∆ = 100 Hz.)
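The throughput quoted above and the requirement (10) can be cross-checked directly from the segment and bank sizes; a minimal sketch of the arithmetic:

```python
import math

# Cross-check of the compute requirement (10) and the full-S6
# totals (12), at 5 N log2 N flops per length-N FFT.
N = 2**16                # samples per 16 s segment
M = 2048                 # batch size
K1M = 8 * 2**20          # bank of 8M templates (K1 sets of M each)
K2 = 288                 # data blocks of M segments each
T = 16                   # segment duration [s]

flops_per_fft = 5 * N * math.log2(N)             # ~5.2e6 flops
teraflops = flops_per_fft * K1M / T / 1e12
print(f"(10): {teraflops:.2f} teraflops")        # 2.75, as quoted

correlations = K1M * K2 * M                      # templates x segments
print(f"(12): {correlations:.1e} correlations")  # ~5e12
print(f"total: {correlations * flops_per_fft:.1e} flops")  # ~2.6e19
```

At the ~7% FFT efficiency measured in Fig. 5, the 2.75 teraflop requirement translates into the roughly 50 teraflops of peak GPU performance cited in § 3.1.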
For the analysis with a bank of 0.5M templates shown, the total counts per block for H1 and

Table 3.
Butterfly filtering output B_n of a block n (n = 1, 2, ...) with ρ_i > κσ (i = 1, 2) lists the data sample offset, 1000 × ρ_i and f_i, the latter the initial frequency of the associated chirp template. Multiplication of ρ_i by 1000 allows storage of all entries in 4 byte integers. The sample shown of B161 (6,388,647 rows produced by a bank of 4M templates) highlights some simultaneous hits. Zeros represent no hit.

Sample offset   1000 × ρ_i (H1)   1000 × ρ_i (L1)   f_i (H1) [Hz]   f_i (L1) [Hz]
· · ·           · · ·             · · ·             · · ·           · · ·
Sample of a LIGO S6 injection in H1 and L1 (LOSC 2017b), comprising a sequence of sine-Gaussian signals (Mottin et al. 2010) stepwise covering 50-2000 Hz, with injection signal-to-noise ratio SNR listed by the LOSC.

GPS time [s]    strain amplitude (h_rss)   Waveform (f [Hz], Q)      SNR (LOSC)
H1
958413408.20    3.× 10^-…                  sine-Gaussian (393, 9)    93.27
958413413.50    4.× 10^-…                  sine-Gaussian (554, 9)    92.37
958413418.30    6.× 10^-…                  sine-Gaussian (850, 9)    98.23
958413423.40    9.× 10^-…                  sine-Gaussian (1304, 9)   91.20
958413428.70    7.× 10^-…                  sine-Gaussian (2000, 9)   23.15
L1
958413408.20    3.× 10^-…                  sine-Gaussian (393, 9)    65.37
958413413.50    4.× 10^-…                  sine-Gaussian (554, 9)    68.50
958413418.30    7.× 10^-…                  sine-Gaussian (850, 9)    72.22
958413423.40    1.× 10^-…                  sine-Gaussian (1304, 9)   67.98
958413428.70    8.× 10^-…                  sine-Gaussian (2000, 9)   25.85
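The sine-Gaussian family of Table 4 is parameterized by a central frequency f and quality factor Q. A minimal sketch, assuming the common burst-search convention τ = Q/(√2 π f) for the Gaussian envelope width; the h_rss normalisation of the actual LOSC injections is not reproduced here:

```python
import numpy as np

# Sine-Gaussian burst with central frequency f and quality factor Q,
# the waveform family of the Table 4 injections. The envelope width
# tau = Q / (sqrt(2) pi f) is an assumed (common) convention; the
# h_rss normalisation of the actual injections is not reproduced.
RATE = 4096

def sine_gaussian(f, Q, duration=1.0):
    t = np.arange(-duration / 2, duration / 2, 1.0 / RATE)
    tau = Q / (np.sqrt(2.0) * np.pi * f)
    return np.exp(-t**2 / (2.0 * tau**2)) * np.sin(2.0 * np.pi * f * t)

h = sine_gaussian(554.0, 9)      # the 554 Hz, Q = 9 injection row
print(len(h), "samples; burst peaks near the segment center")
```

With Q = 9 the burst lasts only a few milliseconds, i.e., far shorter than the one-second chirp templates, so detections comprise many approximate template matches rather than one exact match (cf. Fig. 8).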
Figure 6. (Left panel.) Pseudo-spectra of simultaneous hits in tails > κσ with κ = 5.5 of H1 ∧ L1, using a bank of type A of 8M chirp templates, along with baseline results following time-randomized data. (Right panel.) Pseudo-spectra as an average over all 288 blocks of S6 H1 ∧ L1, using a bank of type A of 0.5M chirp templates, of H1 and L1 by independent counts and by simultaneous counts with frequency pairs within ∆ = 50 Hz.
Figure 7.
Detection of a high-frequency LIGO injection in block 161 by butterfly filtering with a bank of type B of 4M chirp templates at large injection SNR (Table 4), seen in simultaneous hits in H1 and L1 (left panels; ρ and frequency refer to geometric means of those of H1 and L1). Hits in H1 (red) and L1 (blue) practically overlap (right panel), shown as a function of time based on data sample offset in output file B161 (Table 3).

L1 are (2.… × 10^…, … × 10^…) with simultaneous counts (19.73%, 13.01%), reduced to (6.82%, 4.49%) for frequency pairs within ∆ = 50 Hz ((7.55%, 4.98%) for frequency pairs within ∆ = 100 Hz).

LIGO detectors are routinely given a variety of hardware injections to test the detectors and various signal detection pipelines. Of interest to the present analysis are burst injections that cover
Selected high frequency injections at 1053 Hz (top panels) and 554 Hz (bottom panels) in H1 ∧ L1 LIGO S6 at high to low injection SNR (LOSC) (left-to-right columns), here detected by H1 (blue circles) and L1 (red dots) using a bank of type A of 8M chirp templates. The 1053 Hz (554 Hz) injections shown are at the respective GPS times 932380188.50, 935143367.60, 946193522.10 (959322411.40, 934962011.00, 946205393.30).

the relatively high frequency range 350-2000 Hz. The following uses some LIGO injections for a formal test and validation of the software implementation.

Fig. 7 shows an injection to both H1 and L1 captured by our algorithm at large injection SNR (LOSC), detected using a bank of 4M templates in a partial analysis of LIGO S6.

For high-confidence detections, correlated H1-L1 output such as illustrated in Fig. 7 is essential. While signal injections are often applied at the same GPS time, astrophysical sources will impact H1 and L1 along some finite viewing angle. In the time-domain, this is commonly identified by maximising correlations over some finite time shift, here 0-10 ms given the distance between H1 and L1. Here, we make use of the fact that a difference in arrival time between H1 and L1 from a putative astrophysical source with finite time rate-of-change in f(t) is equivalent to a frequency shift, allowing searches in simultaneous H1-L1 filter output such as plotted in Fig. 7.

Figure 9.
The number of hits ρ(t) > κσ and maximal values of ρ in Fig. 8 about the injections at 554 Hz and 1053 Hz in H1 and L1 show a generic trend with SNR in the injection process. On average, counts improve by a factor of 1.69 and ρ increments by 0.29 with the template bank of 8M compared to 0.5M chirps. Hits are counted with a frequency margin of ± 50 Hz about these injection frequencies.
Figs. 8-9 show a validation of sensitivity (see also earlier analyses of van Putten et al. (2014a); van Putten (2016)), here a priori limited to ρ exceeding 5.5σ by choice of κ in (9), obtained in a partial LIGO S6 analysis using 8M templates. Overall, it appears that the sensitivity of H1 is slightly better than that of L1 when signals are small. Searches for signals fainter than those shown would require a re-run of the analysis with κ < 5.5, e.g., retaining a band κ2σ > ρ(t_n) > κ1σ with thresholds κ1 < κ2. The preceding § 5 illustrated the number of hits and maximal ρ in the detection of a sample of high frequency burst injections.

6. CONCLUSIONS AND OUTLOOK

Probing the inner engines of gamma-ray bursts and core-collapse supernovae requires deep searches in LIGO data. Taking full advantage of modern GPU hardware, we present a GPU-CPU implementation of butterfly filtering to search for broadband extended emission in gravitational waves from accreting flows around black holes, potentially relevant to the most extreme transient events.

Our benchmarks demonstrate near-optimal performance using banks of up to millions of chirp templates at better than real-time analysis, facilitating deep searches in LIGO archive data such as S6, Advanced LIGO O1 and the currently ongoing O2 run.

Specific applications of the proposed method include correlation analysis of the H1 and L1 detectors and identification of mysterious or peculiar events of interest to further analysis. A leading order indication of correlations may derive, for instance, from counting statistics of hits, comparing simultaneous hit counts with total hit counts in H1 and L1. Specific events of interest may be followed up by second runs, gathering all hits by removing the selection of maxima in collecting B in (17).

In butterfly filtering, signal detection typically comprises a large number of hits, representing approximate matches with no single template providing a perfect match to the full signal at hand, as illustrated in Fig. 8. This combined output can in principle recover essentially maximal sensitivity (van Putten 2016).
For automated searches of candidate events, clustering algorithms might apply (e.g. George et al. 2017), which may also facilitate quantifying the level of confidence in such complex detection output.

Acknowledgments.
The author gratefully acknowledges detailed constructive comments from the referee and J.B. Kanner. This work was partially supported by the National Research Foundation of Korea under grants 2015R1D1A1A01059793 and 2016R1A5A1013277 and made use of LIGO S6 data from the LIGO Open Science Center (losc.ligo.org), provided by the LIGO Laboratory and LIGO Scientific Collaboration. LIGO is funded by the U.S. National Science Foundation. Additional support is acknowledged from MEXT, JSPS Leading-edge Research Infrastructure Program, JSPS Grant-in-Aid for Specially Promoted Research 26000005, MEXT Grant-in-Aid for Scientific Research on Innovative Areas 24103005, JSPS Core-to-Core Program, A. Advanced Research Networks, and the joint research program of the Institute for Cosmic Ray Research.

REFERENCES

Levinson, A., van Putten, M.H.P.M., & Pick, G., 2015, ApJ, 812, 124
LIGO-Virgo Collaboration, 2016, Phys. Rev. Lett., 116, 241102
Sathyaprakash, B.S., & Schutz, B.F., 2009, Living Rev. Relativity, 12, 2
van Putten, M.H.P.M., 2001, Phys. Rev. Lett., 87, 091101
van Putten, M.H.P.M., 2008, ApJ, 684, L91
van Putten, M.H.P.M., Kanda, N., Tagoshi, H., et al., 2011, PhRvD, 83, 044046
van Putten, M.H.P.M., Guidorzi, C., & Frontera, F., 2014a, ApJ, 786, 146
van Putten, M.H.P.M., Lee, G.M., Della Valle, M., Amati, L., & Levinson, A., 2014b, MNRAS, 444, L58
van Putten, M.H.P.M., 2016, ApJ, 819, 169
van Putten, M.H.P.M., & Della Valle, M., 2017, MNRAS, 464, 3219
Smith, J.O., 2016, https://ccrma.stanford.edu/~jos/ReviewFourier/FFT-Convolution-vs-Direct.html
S6 Instrumental Lines, losc.ligo.org/s6speclines
S6 Burst Injections, losc.ligo.org/s/injections/s6/burst/H1-s6burst-simple.txt; losc.ligo.org/s/injections/s6/burst/L1-s6burst-…