A GPU-based Correlator X-engine Implemented on the CHIME Pathfinder
Nolan Denman, Mandana Amiri, Kevin Bandura, Jean-François Cliche, Liam Connor, Matt Dobbs, Mateus Fandino, Mark Halpern, Adam Hincks, Gary Hinshaw, Carolin Höfer, Peter Klages, Kiyoshi Masui, Juan Mena Parra, Laura Newburgh, Andre Recnik, J. Richard Shaw, Kris Sigurdson, Kendrick Smith, Keith Vanderlinde
Affiliations: Dunlap Institute, University of Toronto; Department of Astronomy and Astrophysics, University of Toronto; Department of Physics and Astronomy, University of British Columbia; Department of Physics, McGill University; Canadian Institute for Theoretical Astrophysics; Canadian Institute for Advanced Research, CIFAR Program in Cosmology and Gravity; Perimeter Institute for Theoretical Physics
Contact E-Mail: [email protected]
Abstract—We present the design and implementation of a custom GPU-based compute cluster that provides the correlation X-engine of the CHIME Pathfinder radio telescope. It is among the largest such systems in operation, correlating 32,896 baselines (256 inputs) over 400 MHz of radio bandwidth. Making heavy use of consumer-grade parts and a custom software stack, the system was developed at a small fraction of the cost of comparable installations. Unlike existing GPU backends, this system is built around OpenCL kernels running on consumer-level AMD GPUs, taking advantage of low-cost hardware and leveraging packed integer operations to double algorithmic efficiency. The system achieves the required 105 TOPS in a 10 kW power envelope, making it one of the most power-efficient X-engines in use today.
I. INTRODUCTION
The Canadian Hydrogen Intensity Mapping Experiment (CHIME) is an interferometric radio telescope, presently under construction at the Dominion Radio Astrophysical Observatory (DRAO) in British Columbia, Canada, which will map the northern sky over a radio band from 400 to 800 MHz. With over 2000 inputs and a 400 MHz bandwidth, the correlation task (measured as the bandwidth-baselines product) on CHIME will be an order of magnitude larger than on any currently existing telescope array. The correlator follows an FX split design, with a first-stage Field Programmable Gate Array (FPGA)-based F-engine which digitizes, Fourier transforms (channelizes), and bundles the data into independent frequency bands, followed by a second-stage Graphics Processing Unit (GPU)-based X-engine which produces a spatial correlation matrix consisting of the integrated pairwise products of all the inputs at each frequency.

The CHIME Pathfinder instrument [1] features 128 dual-polarization feeds and a reduced-scale prototype of the full CHIME correlator. This paper describes the X-engine of the Pathfinder's 256-input hybrid FPGA/GPU FX correlator, among the largest such systems in operation. Through extensive use of off-the-shelf consumer-grade hardware and heavy optimization of custom data handling and processing software, the system achieves high performance at a small fraction of the hardware cost of comparable installations.

This paper focuses on the system architecture and implementation, while two companion papers describe the custom software stacks, one focusing on an innovative OpenCL-based X-engine GPU kernel [2], and one on the handling of the vast data volume flowing through the system [3].

The paper is structured as follows: design considerations and constraints are discussed in § II; the hardware components of the system are described in § III, and the software in § IV; the scaling of the X-engine to the full-size CHIME telescope is described in § V, and a summary and conclusion follow in § VI.

II. DESIGN CONSIDERATIONS
While most components in CHIME scale linearly with the number of inputs N, the computational cost of pairwise correlation scales as N², making efficiency in the X-engine a primary concern. There are correlation techniques which rely on the redundancy of CHIME feed separations to scale as N log N, but the real-time calibrations these require for precision observations remain largely unproven in an astrophysical context. Design decisions were guided by the need to produce an inexpensive system capable of scaling to support full CHIME, and which would support rapid development and deployment of new data processing algorithms. These requirements of computational power and ease of development drove the decision to build the X-engine around GPUs rather than Application-Specific Integrated Circuits (ASICs) or FPGAs.

The computational cost η of pairwise element correlation for N elements across a bandwidth of ∆ν is

    η = ∆ν · N · (N + 1)/2,    (1)

measured in complex multiply-accumulate (cMAC) operations per second; for the CHIME Pathfinder, η = 13 TcMAC/s. For large N this dominates the cost of any other processing proposed for the X-engine. Top-end GPUs in 2014 provided of order 4 TFLOPS of processing power per chip, equivalent to 0.5 TcMAC/s; a naïve but efficient X-engine implementation for the CHIME Pathfinder would require of order 26 GPUs, although network topology considerations favor a baseline target of 32 GPUs in the cluster.

Several factors favor a densely-packed configuration for the X-engine. The proximity of the Pathfinder correlator system to the telescope itself, in a national radio-quiet zone, requires substantial Faraday shielding around the entire system, the cost of which increases rapidly with size. The 10 GbE 4xSFP+ ↔ QSFP+ cabling chosen for the Pathfinder system shows a steep increase in cost above a length of 7 m, where active cabling becomes necessary. Together, these apply substantial pressure to minimize the physical dimensions of the correlator system. The nodes hosting the GPUs also form the bulk of the hardware cost, so maximizing the density of GPUs within each node is an efficient cost control measure.
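As a quick sanity check of these numbers, the short sketch below (illustrative only; the 0.5 TcMAC/s-per-GPU equivalence is the figure quoted above) evaluates Equation 1 for the Pathfinder parameters and the implied naïve GPU count.

    # Evaluate Eq. (1) for the CHIME Pathfinder and estimate the naive GPU count,
    # using the ~0.5 TcMAC/s-per-GPU figure quoted in the text (illustrative only).

    def correlation_cost(n_inputs, bandwidth_hz):
        """Pairwise correlation cost of Eq. (1), in cMAC operations per second."""
        return bandwidth_hz * n_inputs * (n_inputs + 1) / 2

    eta = correlation_cost(256, 400e6)   # CHIME Pathfinder: ~1.3e13 cMAC/s
    gpus = eta / 0.5e12                  # at ~0.5 TcMAC/s per GPU
    print(f"eta = {eta/1e12:.1f} TcMAC/s, naive GPU count ~ {gpus:.0f}")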
A. I/O Requirements
Data arrive from the F-engine as 4+4-bit offset-encoded complex numbers (that is, a range of [-8,7] is set to [0,15] by adding 8 to the values), arranged into 1024 independent frequency sub-bands. Assuming the baseline 32-GPU layout, each GPU independently processes 32 of these bands, corresponding to a 12.5 MHz band of radio data. In this layout each GPU requires a continuous 25.6 Gbps input stream of sky data drawn from four 10 GbE connections.

If limited to standard bus widths, 8-lane PCI Express Revision 3 becomes the baseline interconnect, meaning 16 lanes of PCIe3 are required for each of the 32 GPUs: 8 to feed the GPU, and 8 to receive data from a NIC. While PCIe3 x4 or PCIe2 x8 links could theoretically support this transfer rate, tests showed them to be unreliable and prone to lost packets and bottlenecking. The use of PCIe3 interconnects restricts the choice of CPU to Intel varieties, and the Ivy Bridge line of consumer processors provides up to 40 lanes of PCIe3, allowing each host system to support two GPUs, fed from eight 10 GbE links.

The rate at which correlated output data are produced depends linearly on the choice of accumulation period, but is not a significant driver in the design; for the Pathfinder system's default 21 s integration, the output data rate is modest.
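As a concrete illustration of this encoding, the sketch below packs and unpacks 4+4-bit offset-encoded complex samples. It is ours, not F-engine firmware or kotekan code, and the choice of which nibble holds the real part is an assumption.

    # Illustrative sketch of the 4+4-bit offset encoding described above: real and
    # imaginary parts in [-8, 7] are offset by +8 into [0, 15] and packed into one
    # byte per complex sample. Nibble order (real in the high nibble) is assumed.

    import numpy as np

    def pack_4plus4(real, imag):
        """Pack 4-bit offset-encoded real/imag parts into one byte per sample."""
        r = (np.asarray(real, dtype=np.int8) + 8).astype(np.uint8)   # [-8,7] -> [0,15]
        i = (np.asarray(imag, dtype=np.int8) + 8).astype(np.uint8)
        return (r << 4) | i

    def unpack_4plus4(packed):
        """Recover signed real/imag parts from packed bytes."""
        packed = np.asarray(packed, dtype=np.uint8)
        r = (packed >> 4).astype(np.int16) - 8
        i = (packed & 0x0F).astype(np.int16) - 8
        return r, i

    r, i = unpack_4plus4(pack_4plus4([-8, 0, 7], [3, -2, 1]))
    assert r.tolist() == [-8, 0, 7] and i.tolist() == [3, -2, 1]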
B. GPU Selection

Consumer-grade hardware was generally preferred to professional for its increased availability, interoperability of components, and drastically reduced cost. Dedicated scientific GPUs are differentiated by their use of error-correcting memory and the availability of native double-precision functions, neither of which is a significant consideration in a radio correlator. The AMD offerings feature higher computational throughput per unit cost than the comparable NVIDIA GPUs, which led to an early preference for AMD-brand consumer-grade GPUs. This is a significant departure from prior GPU X-engines, which typically use the xGPU [4] software package, a highly optimized X-engine implementation built on NVIDIA's proprietary CUDA programming framework. A custom software X-engine was developed in its place (§ IV-C1, [2]), built on the vendor-independent OpenCL standard [5].

Proof-of-concept software was developed on 2012-era AMD Southern Islands GPUs similar to those later adopted for the CHIME Pathfinder. This software was able to process all 32,896 baselines of simulated Pathfinder data at the required 12.5 MHz/GPU rate. Further software optimization brought the per-GPU throughput up to ∼19 MHz, leaving ample headroom for additional processing (§ IV-C).

Consumer-grade GPU boards are manufactured by a variety of third parties, with proprietary modifications to the reference design which can result in higher or lower power draw, faster or slower default clocking, and variable power-saving or high-temperature operation; the boards also vary dramatically in their solutions for heat dissipation. The cooling systems are generally not designed for the close packing required for the CHIME Pathfinder: of the six varieties of R9 280X boards tested in a packed 4U case, many reached die temperatures close to 100 °C. Operation at these temperatures lowers the GPUs' efficiency and increases their failure rate while drawing additional power for cooling fans. Table I shows the maximum die temperature, mean power draw, and processing throughput of GPUs running an unoptimized 'stress test' kernel in a closely-packed case similar to those used in the Pathfinder X-engine. These measurements were all performed at a common ambient air temperature, and differences in power draw and throughput are attributable to a combination of changes in clocking, efficiency, and variable fan speeds.

TABLE I. OPERATIONAL THERMAL PARAMETERS OF DENSE-PACKED AMD R9 280X GPUS FROM DIFFERENT MANUFACTURERS

Manufacturer            Max T (°C)       Power (W)   BW (MHz)
MSI OC Edition          >99              825         13.6
XFX                     >99              800         13.0
Club3D RoyalQueen       96               875         14.2
TurboDuo                96               800         13.6
HIS IceQ                89               730         14.5
Sapphire Dual-X OC      85               715         13.5
Sapphire, Watercooled   T_water + 10     510         14.5

The best performers were Sapphire-branded "Dual-X OC" boards, which maintained the lowest average die temperatures in the test configuration while matching the computational performance of the other boards and consuming less power. Other strong performers (e.g., the HIS-branded boards) carried higher pricing markups and proved more difficult to source.

III. IMPLEMENTATION - HARDWARE
The X-engine is a cluster of 16 nodes, each housed in a 4U case. It fits in 64U spread across two 42U racks, dubbed East and West, containing 6 and 10 nodes respectively. Each rack is powered by a Tripp Lite PDU3VSR10H50 managed PDU, allowing remote monitoring and power cycling of all nodes. The remaining space in the East rack contains the FPGA-based F-engine as well as the 1 GbE interchange switch, while the West rack additionally houses a 1U control system dubbed gamelan (named after the Indonesian percussion ensemble featuring chimes; see http://en.wikipedia.org/wiki/Gamelan). A diagram of the X-engine computational cluster is shown in Figure 1.
Fig. 1. A schematic of the data flow between the F- and X-engine portions of the correlator: the 16 FPGA digitizer/channelizer boards of the F-engine feed the 16 GPU nodes of the X-engine. The interchange switch and the gamelan control node are also shown.
A. Node Description
A schematic of the hardware layout in each GPU node is shown in Figure 2. The processing nodes are built on EVGA X79 Dark motherboards, Intel i7 4820k CPUs, and 4x4GB DDR3 RAM overclocked to 2133 MHz. They are powered by 1000 W high-efficiency (80 Plus Platinum) power supplies and housed in 4U Chenbro RM41300-FS81 cases. Each node is fed by eight 10 GbE lines connected to a pair of Silicom PE310G4SPi9 quad-10GbE network interface cards.

Three AMD R9-series GPUs, two 280X and one 270X, receive and process a total of 25 MHz of radio bandwidth from all the spatial inputs. The 270X is not required for the pairwise correlation, but was added to ensure surplus computational power was available to explore additional real-time processing and alternate correlation algorithms (§ IV-C).

The nodes are diskless, both booting off and streaming data back to the gamelan control node via onboard GbE, through a Cisco SRW2048-K9-NA switch. A second on-board GbE connection allows for future expansion and data exchange between nodes, not required in standard operation but available for planned algorithmic upgrades.

The average per-node power usage is ∼680 W for the air-cooled and ∼630 W for the liquid-cooled nodes; the total system exceeds the 105 TOPS requirement with a direct power consumption of ≈10 kW. The X-engine therefore achieves ∼11 GOPS/W, though we note that these are 4-bit arithmetic operations, not directly comparable to single-precision floating-point operations.

Fig. 2. The internal components of the GPU nodes, including network interfaces, connected over 5x PCIe3 x8 links, as described in § III-A.
B. Liquid Cooling Upgrade
In the summer of 2014, 10 of the GPU nodes were retrofitted to cool their CPU and GPUs via direct-to-chip liquid cooling systems. This upgrade significantly eases the strain upon the existing A/C system, and is intended as a proving ground for the full CHIME correlator, where traditional HVAC solutions become prohibitively expensive.

Liquid cooling was implemented using aftermarket heatsinks attached to each GPU, and a Swiftech Apogee II combination pump/heatsink attached to the CPU. Lab testing showed a reduction in GPU die temperatures to ∼10 °C above the input water temperature (see Table I), with minimal dependence on the specific variety of heatsink. Varieties from several vendors were ultimately deployed, with similar performance. In operation, the watercooled GPUs maintained stable temperatures of 40-60 °C, considerably cooler than the air-cooled GPUs (cf. Table I). The cooler operating point and the removal of the low-efficiency GPU cooling fans significantly dropped the power draw per node with no impact on performance, and are expected to reduce the long-term failure rates.

Figure 3 shows the overall structure of the watercooling system as currently deployed. A sealed loop circulates a coolant consisting of 50% water and 50% ethylene glycol through the GPU nodes in parallel. Each node has a small pump contained in the CPU heat sink and passive water blocks on each of the GPUs, while a constant pressure differential is held across the nodes by a larger external pump. The coolant runs through a heat exchanger which transfers heat from the node-cooling loop to an external liquid loop. That loop travels out of the RFI enclosure and is then cooled to the ambient external air temperature in a liquid-air heat exchanger. A temperature-dependent mixing valve permits some fraction of the hot coolant to immediately recirculate, to avoid cooling the system below the dew point inside the RFI enclosure during winter operation.

Fig. 3. A diagram showing the liquid-cooling structure in the CHIME Pathfinder, with red and blue indicating hot and cold coolant, respectively. The heat exchanger uses a large fan to cool the liquid using ambient outside air. The object marked 'M' is a temperature-controlled mixing valve, which regulates the temperature of the 'cold' sections of the loop.

IV. IMPLEMENTATION - SOFTWARE
A. Data Flow
Channelized data, flags, and metadata from the F-engine arrive at the GPU nodes on eight 10 GbE lines. Each 10 GbE link carries all spatial inputs for 8 of the 1024 total frequency channels. With eight links per node, each node processes data covering 1/16 of the full CHIME frequency band. The kotekan software pipeline (named for a style of playing fast interlocking parts in Balinese Gamelan music; see http://en.wikipedia.org/wiki/Kotekan) manages the data flow and processing within GPU nodes. Due to the high I/O demands (∼820 Gb/s in total), the system must make maximally efficient use of the available bandwidth at each stage. A packetized and loss-tolerant data handling system, similar to that in operation in the PAPER [6] correlator [7], ensures that momentary faults do not impede long-term data gathering. Recnik et al. [3] discuss the data handling in detail; a brief description follows.

Data arrive as UDP packets and are buffered by the host CPU in system memory for inspection and staging prior to transfer into the GPUs for processing. Packet loss, though rare, is tracked along with other flags from the F-engine, e.g. from saturation of the ADCs or from FFT channelizing. The count of missing or saturated data is used to renormalize the post-integration correlation matrices.

A series of OpenCL kernels are executed on the GPUs; these pre-condition the data, compute and integrate correlation matrices, and post-process the data if necessary (see [2] for more details). Computed correlation matrices are assembled by the CPU and forwarded to gamelan, which stitches the full 400 MHz band back together using data from all active nodes. This reassembly is robust against individual node failures or outages; they simply result in loss of data from the inactive nodes.

Integrated correlation matrices are recorded onto an array of disks in gamelan, and these data are asynchronously copied to a remote archive server hosting a much larger array of drives, and then copied off-site for scientific analyses.
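The renormalization step described above can be written compactly; the sketch below rescales an integrated visibility by the ratio of expected to valid samples, assuming flagged samples contribute zero to the accumulation. Names and structure are ours, not kotekan's.

    # Sketch of renormalizing integrated correlation matrices for missing or
    # flagged samples (assuming flagged samples contribute zero to the sums).

    import numpy as np

    def renormalize(vis_accum, n_expected, n_lost):
        """vis_accum: integrated visibilities for one frequency.
        n_expected: samples per integration; n_lost: missing or saturated samples."""
        n_valid = n_expected - n_lost
        if n_valid <= 0:
            return np.zeros_like(vis_accum)        # nothing usable this integration
        return vis_accum * (n_expected / n_valid)  # rescale to nominal integration length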
B. Monitoring and Control
The X-engine software pipeline (composed of the kotekan instances on each node, along with the collection server software) is launched and controlled through scripts run on gamelan. The nodes run CentOS Linux 6.5 and can be accessed by remote shell login, while the PDUs allow remote power cycling to aid in recovery of crashed systems.

The status of each of the GPU nodes is tracked by the gamelan control node, and made available via a web interface; see Figure 4 for an example of the tracking display. In addition, the last few hours of data are streamed over TCP to a second server, where they are available for live analysis and monitoring. An example of the live-monitoring webpage is shown in Figure 5.

Fig. 4. Screenshot showing the correlator status webpage. The monitoring system displays the status of each FPGA and GPU node, and of the correlator software; per-GPU temperatures and per-node power consumption are also available.

Fig. 5. Screenshot showing the live view webpage. The triangle is the full correlation matrix for a particular frequency, with colour indicating the complex value's phase; the website may be queried for any of the associated data.
C. GPU Data Processing Tasks
The most computationally expensive operation performed on the GPU nodes is the mission-critical pairwise feed correlation. In parallel with this, the Pathfinder correlator will explore alternate correlation methods which leverage the redundant layout of the CHIME baselines. Supplemental tasks include beamforming, gating, time-shifting of inputs, and RFI excision. Brief descriptions of these tasks follow.

1) Full Correlation:
The primary responsibility of the X-engine is to calculate and integrate the correlation matrix of all the spatial inputs. This involves accumulation of 32,896 pairwise products for each of 1024 frequency bands. The default integration period is 2^23 samples, corresponding to 21.47 s.
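For reference, a minimal NumPy sketch of this naive per-channel operation follows; it illustrates the computation itself and is not the optimized packed-integer OpenCL kernel described in [2]. The conjugation convention is a choice.

    # Minimal sketch of the naive pairwise correlation/integration for one
    # frequency channel: accumulate conj(x_i) * x_j for all i <= j over an
    # integration period (reference implementation, illustration only).

    import numpy as np

    def correlate(samples):
        """samples: complex array of shape (n_time, n_inputs).
        Returns the integrated upper-triangular correlation matrix entries."""
        n_inputs = samples.shape[1]
        vis = samples.conj().T @ samples        # V[i, j] = sum_t conj(x_i) x_j
        iu = np.triu_indices(n_inputs)          # unique baselines: N(N+1)/2
        return vis[iu]

    x = (np.random.randn(4096, 256) + 1j * np.random.randn(4096, 256)).astype(np.complex64)
    print(correlate(x).shape)                   # (32896,) for N = 256 inputs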
2) Alternate Correlation Techniques:
Interferometric arrays with highly redundant baselines can take advantage of correlation techniques that are more efficient than the naïve pairwise method. In the case of feeds which are evenly spaced, FFT-based transformations can be used to increase the efficiency of the correlation to N log N, at the cost of strict calibration requirements [8][9][10]. These correlation strategies will be tested in parallel with the pairwise N² correlation; additionally, they may be used in hybrid form with some N² and some N log N stages.
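To illustrate the idea behind these N log N methods, the sketch below computes redundant-baseline-summed visibilities for a one-dimensional, evenly spaced, perfectly calibrated array via spatial FFTs. It demonstrates the scaling argument only and is not the algorithm planned for CHIME.

    # For calibrated, evenly spaced feeds, the inverse FFT of the spatial power
    # spectrum gives the visibilities summed over redundant baselines of each
    # separation, in O(N log N) per time sample (cf. [8][9][10]).

    import numpy as np

    def redundant_visibilities_fft(x):
        """x: complex feed voltages (one time sample) for N evenly spaced feeds.
        Returns V[d] = sum_i x[i+d] * conj(x[i]) for separations d = 0..N-1."""
        n = len(x)
        X = np.fft.fft(x, 2 * n)                # zero-pad to avoid circular wrap-around
        r = np.fft.ifft(np.abs(X) ** 2)         # circular autocorrelation of padded input
        return r[:n]

    # Check against the direct O(N^2) sum for a small random example.
    x = np.random.randn(64) + 1j * np.random.randn(64)
    direct = np.array([np.sum(x[d:] * np.conj(x[:len(x) - d])) for d in range(len(x))])
    assert np.allclose(redundant_visibilities_fft(x), direct)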
3) Discrete Beamforming:
The CHIME Pathfinder is a stationary telescope that cannot physically point at a specific source or location on the sky. When observing localized sources, it is desirable to form one or more beams, 'pointing' the telescope digitally to an arbitrary location within the main beam. This is accomplished in the GPUs by phase-shifting and summing the data from all antennas, so that signals originating in one region of the sky interfere constructively. This signal is then written out at very high cadence, allowing examination of a localized source with very fine time resolution.
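A minimal sketch of this phase-and-sum operation for a single frequency channel follows; the geometric delay model, sign convention, and function names are illustrative assumptions rather than the CHIME implementation.

    # Phase-and-sum beamforming for one frequency channel, assuming known feed
    # positions and an ideal geometric delay toward the desired sky direction.

    import numpy as np

    C = 299792458.0  # speed of light, m/s

    def form_beam(samples, feed_positions_m, pointing_unit_vec, freq_hz):
        """samples: complex array (n_time, n_feeds) for one frequency channel.
        feed_positions_m: (n_feeds, 3) positions in metres.
        pointing_unit_vec: unit vector toward the desired sky direction.
        Returns the beamformed time stream of shape (n_time,)."""
        delay_s = feed_positions_m @ pointing_unit_vec / C        # geometric delay per feed
        phases = np.exp(-2j * np.pi * freq_hz * delay_s)          # compensating phases
        return samples @ phases                                   # phase-shift and sum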
4) Output Gating:
The CHIME Pathfinder will observe periodic sources such as astronomical pulsars and injected calibration signals. These sources generally vary faster than the default ∼21 s integration period, but high-cadence gating may be used to observe sub-integration signal structure. Gating consists of partitioning the output into a set of sub-buffers based on the time relative to the period of the source, so that independent 'on' and 'off' signals may be constructed.
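The sketch below shows this folding logic in its simplest form, binning short correlator accumulations by phase within an assumed source period; the bin count and frame cadence are illustrative, not CHIME parameters.

    # Fold correlator output frames into phase bins of a known source period so
    # that 'on' and 'off' accumulations can be separated afterwards.

    import numpy as np

    def gate(frames, frame_times_s, period_s, n_bins=16):
        """frames: array (n_frames, ...) of short correlator accumulations.
        frame_times_s: frame timestamps in seconds.
        Returns per-bin sums and counts, binned by phase within the period."""
        frames = np.asarray(frames)
        phase_bins = ((np.asarray(frame_times_s) % period_s) / period_s * n_bins).astype(int)
        sums = np.zeros((n_bins,) + frames.shape[1:], dtype=frames.dtype)
        counts = np.zeros(n_bins, dtype=int)
        for b, frame in zip(phase_bins, frames):
            sums[b] += frame
            counts[b] += 1
        return sums, counts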
5) Time Shifting:
Signals from outlying telescope stations can be fed into the correlator. Large spatial separations introduce decorrelation between inputs, which can be corrected for by time-shifting samples within the GPUs. The current implementation permits the correction of any input by up to 168 ms, and has been tested with the nearby John A. Galt 26 m radio telescope at DRAO.
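As a sketch of the simplest form of this correction, the code below delays one input's channelized stream by an integer number of samples; the 2.56 µs sample period follows from the 400 MHz / 1024-channel design, while the function itself is illustrative (the production system may also apply finer, sub-sample corrections, which are not shown).

    # Integer-sample delay correction applied to one input's channelized stream
    # before correlation; vacated samples are zeroed (flagged invalid in practice).

    import numpy as np

    SAMPLE_PERIOD_S = 1024 / 400e6             # 2.56 us per channelized sample

    def apply_delay(stream, delay_s):
        """Delay a complex sample stream by the nearest whole number of samples."""
        n = int(round(delay_s / SAMPLE_PERIOD_S))
        out = np.zeros_like(stream)
        if n >= 0:
            out[n:] = stream[:len(stream) - n]
        else:
            out[:n] = stream[-n:]
        return out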
6) RFI Cleaning:
Anthropogenic radio frequency interference (RFI) introduces a significant source of additional noise to the astronomical signal. These signals are generally narrow-band and intermittent, coming and going on timescales much shorter than the default 21 s integration period, but with relatively low duty cycles. High-cadence identification and excision of RFI can be performed within the GPUs, and a variety of algorithms are under development, including robust outlier and higher-moment statistical tests [11][12].
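As one example of a robust-outlier test of the kind mentioned above, the sketch below flags short power accumulations deviating from a per-channel median by several robust standard deviations; the threshold is illustrative, and this is not the production algorithm (which also draws on spectral-kurtosis statistics [11][12]).

    # Median/MAD-based outlier flagging of short per-channel power integrations.

    import numpy as np

    def flag_rfi(power, threshold=5.0):
        """power: real array (n_time, n_channels) of short power integrations.
        Returns a boolean mask (True = flagged as RFI)."""
        med = np.median(power, axis=0)                 # per-channel median over time
        mad = np.median(np.abs(power - med), axis=0)   # median absolute deviation
        robust_sigma = 1.4826 * mad                    # MAD -> sigma for Gaussian data
        return np.abs(power - med) > threshold * robust_sigma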
V. X-ENGINE SCALABILITY
The X-engine described here was designed for the CHIME Pathfinder, and must be scaled up significantly for the full CHIME instrument. Given a scaled F-engine providing channelized data, the X-engine's design allows it to scale straightforwardly to a broader band or larger-N arrays.

Additional radio bandwidth is trivially added through additional nodes; increasing the number of inputs adds to the computational demand on each node, and can be addressed through newer, more powerful GPUs. At the time of writing, the computational power per node could be roughly tripled by simply replacing the GPUs.

To support larger N requirements, the bandwidth handled in each node can be reduced, in exchange for proportionally more nodes. The bandwidth fed to each GPU can similarly be reduced, and for very large N, when even a single frequency band is beyond the capacity of a single processing node, data can be time-multiplexed across multiple GPUs.

The expansion to full CHIME (N = 2048) yields an N² computational requirement η an order of magnitude greater than any system currently in existence. Using current technology, a straightforward scaling of the current system (256 nodes, each containing 2 dual-chip R9 295X2 GPUs) could handle the entire pairwise correlation task, without additional software development. This density of processors is easily achievable with the liquid cooling demonstrated, and would occupy a modest physical footprint, at a very low hardware cost of ∼$1M. However, it is not expected that the full CHIME instrument will rely on a complete N² correlation, instead pursuing a fast alternate correlation technique as discussed in § IV-C2.

VI. CONCLUSION
We have implemented a low-cost, high-efficiency GPU-based correlator X-engine for the CHIME Pathfinder. Capable of correlating 32,896 baselines over 400 MHz of radio bandwidth, it makes efficient use of consumer-grade parts and executes a highly optimized software stack. Measured by the computational requirement of a naïve N² correlation, the bandwidth-baseline product η defined by Equation 1, the CHIME Pathfinder correlator is among the largest in the world. Aspects of the system such as the cooling systems have been substantially modified, optimizing the X-engine's efficiency and ensuring economical scaling to the full-size CHIME instrument.

ACKNOWLEDGEMENTS
We are very grateful for the warm reception and skillful help we have received from the staff of the Dominion Radio Astrophysical Observatory, which is operated by the National Research Council of Canada.

We acknowledge support from the Canada Foundation for Innovation, the Natural Sciences and Engineering Research Council of Canada, the B.C. Knowledge Development Fund, le Cofinancement gouvernement du Québec-FCI, the Ontario Research Fund, the CIFAR Cosmology and Gravity program, the Canada Research Chairs program, and the National Research Council of Canada. PK thanks IBM Canada for funding his research and work through the Southern Ontario Smart Computing Innovation Platform (SOSCIP).

We thank Xilinx University Programs for their generous support of the CHIME project, and AMD for donation of test units.
REFERENCES
[1] K. Bandura et al., "Canadian Hydrogen Intensity Mapping Experiment (CHIME) Pathfinder," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 9145, Jul. 2014, p. 22.
[2] P. Klages et al., "Data Packing for High-Speed 4-Bit GPU Correlators," 2015, in press; accepted to IEEE ASAP 2015.
[3] A. Recnik et al., "An Efficient Real-time Data Pipeline for the CHIME Pathfinder Radio Telescope X-Engine," 2015, in press; accepted to IEEE ASAP 2015.
[4] M. A. Clark, P. C. La Plante, and L. J. Greenhill, "Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units," ArXiv e-prints, Jul. 2011.
[5] Khronos Group: The OpenCL Specification. [Online]. Available: khronos.org/opencl
[6] A. R. Parsons et al., "New Limits on 21 cm Epoch of Reionization from PAPER-32 Consistent with an X-Ray Heated Intergalactic Medium at z = 7.7," The Astrophysical Journal, vol. 788, p. 106, Jun. 2014.
[7] A. Parsons et al., "A Scalable Correlator Architecture Based on Modular FPGA Hardware, Reuseable Gateware, and Data Packetization," Publications of the Astronomical Society of the Pacific, vol. 120, pp. 1207–1221, Nov. 2008.
[8] J. D. Bunton, "Antenna Array Geometries to Reduce the Compute Load in Radio Telescopes," IEEE Transactions on Antennas and Propagation, vol. 59, pp. 2041–2046, Jun. 2011.
[9] M. Tegmark and M. Zaldarriaga, "Fast Fourier transform telescope," Physical Review D, vol. 79, no. 8, p. 083530, Apr. 2009.
[10] M. Tegmark and M. Zaldarriaga, "Omniscopes: Large area telescope arrays with only NlogN computational cost," Physical Review D, vol. 82, no. 10, p. 103501, Nov. 2010.
[11] G. M. Nita, D. E. Gary, Z. Liu, G. J. Hurford, and S. M. White, "Radio Frequency Interference Excision Using Spectral-Domain Statistics," Publications of the Astronomical Society of the Pacific, vol. 119, pp. 805–827, Jul. 2007.
[12] G. M. Nita and D. E. Gary, "Statistics of the Spectral Kurtosis Estimator,"