High-Performance Image Synthesis for Radio Interferometry
Daniel Muscat
Supervisor: Dr Kristian Zarb Adami
University of Malta
Faculty of Science
August 2013

The research work disclosed in this publication is partially funded by the Strategic Educational Pathways Scholarship (Malta). This Scholarship is part-financed by the European Union – European Social Fund (ESF) under Operational Programme II – Cohesion Policy 2007-2013, “Empowering People for More Jobs and a Better Quality of Life”. Co-financing rate: 85% EU Funds; 15% National Funds.
Statement of Authenticity
The undersigned declare that this dissertation is based on work carried out under the auspices of the Department of Physics by the candidate as part fulfilment of the requirements of the degree of M.Sc.

Daniel Muscat
Candidate

Dr Kristian Zarb Adami
Supervisor

Contents
List of Figures
List of Tables
List of Algorithms
Acknowledgements
Abstract

1. Introduction to Radio Interferometry and Image Synthesis
1.1 The delay centre
1.2 The UVW co-ordinate system
1.3 The measurement equation
1.4 Image Synthesis
1.4.1 Non-uniform sampling of the visibility plane
1.4.2 Incompleteness
1.4.3 The w-term
1.5 Motivation, aims and objectives of the thesis
1.6 Outline of thesis
2. Gridding and W-Projection
3. The General Array Framework
4. The Malta-imager
4.2 The Configuration Manager
4.3 The Convolution Function Generator
4.3.1 General points
4.3.2 Mathematical treatment and outputs
4.3.3 Data Generation
4.4 The WImager Component
4.4.1 Nomenclature and Terminology
4.4.2 Input, outputs and mathematical equations
4.4.3 Compression
4.4.4 The gridding phase
4.4.5 The Preparation Phase
4.5 The Visibility Manager Component
4.5.1 Loading of Measurement Data from Permanent Storage
4.5.2 Data sorting and preparation
4.5.3 Conversion of frequency
4.6 The Image Finaliser Component
4.7 The Image Manager component
4.8 The Statistics Manager component
5. Performance analysis
5.3.1 Real gridding rate (Giga grid point updates per second)
5.3.2 Record gridding rate (Mega records/second)
5.3.3 Compression ratio
5.3.4 Record preparation rate (Mega records/second)
5.4 Experiments
5.4.1 Experiment batch 1: Compression
5.4.2 Experiment batch 2: Effects of UV-grid pixel length and width
5.4.3 Experiment batch 3: Effects of oversampling factor
5.4.4 Experiment batch 4: Number of polarisations
5.4.5 Experiment batch 5: Overall performance analysis
5.4.6 Experiment batch 6: Multi-GPU analysis
5.5 Analysis and discussion
5.5.1 Compression
5.5.2 Real gridding rate and gridder scalability
5.5.3 Number of polarisations
5.5.4 Main limiting factor of the gridder
5.5.5 WImager preparation phase performance
5.5.6 Overall performance
5.5.7 Multi-GPU scalability
5.5.8 Summary
6. Conclusion
References

List of Figures
3.1 General Array Framework calculation tree
3.2 Layers of the General Array Framework
3.3 High-level design of the engine
3.4 Flow chart of the validation algorithm
3.5 The CUDA Device Manager
4.1 High-level design of the imaging tool
4.2 Calculation tree of the WImager algorithm
4.3 Calculation tree of the Index Creation and Filtering block
4.4 Calculation tree of the Plan Creation block
4.5 Calculation tree of the Visibility Data Handling and Compression block
4.6 CPU thread concurrency of the Visibility Manager
5.1 Real and total gridding rate results for experiment batch 1
5.2 Real gridding rate results of experiment batch 2 (First Data Set)
5.3 Real gridding rate results of experiment batch 2 (Second Data Set)
5.4 Real gridding rate results of experiment batch 2 (Third Data Set)
5.5 Real gridding rate results of experiment batch 3 (First Data Set)
5.6 Real gridding rate results of experiment batch 3 (Second Data Set)
5.7 Real gridding rate results of experiment batch 3 (Third Data Set)
5.8 Average gridder performance results
5.9 Real gridding rate results of experiment batch 4
5.10 Multiple increase in real gridding rate results of experiment batch 4
5.11 Execution timelines of mt-imager for experiment batches 5 and 6
5.12 Compression ratios obtained for experiment batch 5
5.13 Average convolution function support for experiment batch 5
5.14 Gridding rate results for experiment batch 5
5.15 Preparation and gridding phase record rate results for experiment batch 5
5.16 Ratio of WImager preparation phase execution time to gridder execution time for experiments in batch 5

List of Tables

mt-imager performance summary

List of Algorithms

4.1 Convolution Function Generator component algorithm
4.2 Algorithm of the Index Create and Filter operator
4.3 Algorithm for the Plan Create operator
4.4 Algorithm of the Compress Index Generate operator
4.5 Algorithm for the creation of the sorting index
4.6 Algorithm used for image finalisation

Acknowledgements
As in most ambitious and successful projects, success is only possible with the help of others.

I wish to thank my supervisor, Dr Kristian Zarb Adami, for his 24x7 support, dedication and patience. Special thanks to Dr Oleg Smirnov and Dr Filippe Abdalla, who provided valuable insights and advice. Thanks also to Mr Alessio Magro, who gave advice and help on GPU programming. A big thanks goes to Dr John W. Romein for supplying me with the test code and data set related to his work, used in this thesis. Thanks also to Dr Anna Scaife and Dr Cyril Tasse for their help.

My family also gave their contribution. A big thank you goes to my parents Joseph and Maria Dolores Muscat for their support, given during a period when they also had to face a number of health problems. My father also helped to create some of the graphics included. A last but not least heartfelt thanks goes to my girlfriend, Ms Nicola Attard, who, notwithstanding the challenging environment to our relationship, stood by me and gave me her full support.

Abstract
A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore, there are distortions that have to be corrected.

In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the back-end handling numerical calculations is generalised in a new framework. A new feature termed compression arbitrarily increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A CPU-based component prepares data for the GPU; it is multi-threaded to ensure maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system work in parallel, so the imaging tool is designed such that disk I/O and work on the CPU and GPUs are done concurrently.

Test cases show that the imaging tool performs nearly 100× faster than another general CPU-based imaging tool. The tool is limited in use, since deconvolution and A-projection are not yet supported; it is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways of overcoming the memory limitation.

Chapter 1

Introduction to Radio Interferometry and Image Synthesis
The first interferometer used in astronomy dates back to 1921, when Albert Abraham Michelson together with Francis G. Pease made the first diameter measurements of the star Betelgeuse [1]. The setting used, known as the Michelson interferometer, is today the basis of modern interferometers. Michelson had been discussing the use of interferometry in astronomy for at least 30 years before the experiment was done [2, 3]. Throughout the century, the technology of radio interferometers has made enormous advancement; up to this day, extraordinarily ambitious projects have been commissioned, and more are in the pipeline. An example of a recently commissioned radio interferometer is the LOw-Frequency ARray (LOFAR) [4]. The Square Kilometre Array (SKA) [5–9] is the most ambitious project currently under way.

Radio interferometers can be used for various purposes, even for applications outside the scope of astronomy, such as geophysics [10, 11]. The scope of this thesis is limited to the use of the interferometer as a measuring device for the intensity distribution of the sky over the celestial sphere. The measured quantity of the interferometer is known as visibility, and the thesis' main focus is on how to recover the said intensity distribution from such measured visibility data.

An interferometer is made up of an array of $N \geq 2$ antennas. The angular resolution $\theta_m$ of a single-dish antenna is limited by its diffraction limit. The limit is inversely proportional to the diameter $D$ of the dish, that is $\theta_m \propto \lambda/D$. On the other hand, the angular resolution of the interferometer is limited by the distance $B_{max}$ between the furthest two antennas in the array in the same inversely proportional way, that is $\theta_m \propto \lambda/B_{max}$. Achieving sub-arcsecond resolution with single-dish antennas requires diameters that are prohibitively large. For interferometers, achieving sub-arcsecond resolution is just a matter of having antennas as far away as possible from each other. If necessary, part of the array can be orbiting in space, as in the case of the Very Long Baseline Interferometry (VLBI) Space Observatory Programme (VSOP) [12].

The basic measurement device (based on the Michelson interferometer) is composed of just two elements. Each possible two-element combination of the $N$-element antenna array constitutes such a device, giving $N(N-1)/2$ simultaneous visibility readings, with each reading differing in the geographical set-up of the basic device. The geographical set-up is described by the baseline, which is the vector covering the distance between the two antennas. Earth rotation changes the direction of the baseline (assuming a frame of reference stationary with the celestial sphere), and this is taken advantage of by taking subsequent visibility readings [13]. As will be shown later on in this chapter, a Fourier relationship exists between visibility as a function of baseline and the intensity distribution, provided that certain conditions are met. Taking advantage of such a relationship is not an easy computational task, and it is today an active area of research of which this thesis is part.

This introduction aims to give a brief on the theory of interferometry, defines the measurement equation of the interferometer, and discusses an image synthesis pipeline commonly used. The brief serves as a preamble for the discussion on the motivation, aims and objectives of this thesis, given in the penultimate section. The chapter is concluded by giving an outline of the thesis.

The theory presented in this chapter is based on Thompson et al. [14] and some online sources, notably presentations given in a workshop [15] organised by the National Radio Astronomy Observatory (NRAO), and a course [16] given by the same organisation.
A basic two-element interferometer is depicted in Figure 1.1. It is observing an arbitrary cosmic source in the sky, in the direction of the unit vector $\hat{s}$. The baseline is defined with the vector $\vec{B}$. The interferometer is assumed to have a quasi-monochromatic response at a central frequency $\nu_c = \omega_c (2\pi)^{-1}$. Cosmic sources are in the far field of the interferometer, and so the incoming waves at the interferometer can be considered planar over the baseline. Clearly, the output of antenna 1 ($V_1$) is the same as that of antenna 2 ($V_2$), except that $V_1$ is lagging behind by, say, $\tau_g$ seconds. $\tau_g$ is called the geometric delay. The wave needs to travel an extra distance $\vec{B} \cdot \hat{s}$ to reach antenna 1; thus, letting $c$ represent the speed of light, $\tau_g$ equates to:

$$\tau_g = \frac{\vec{B} \cdot \hat{s}}{c} = \frac{|\vec{B}| \sin\theta}{c} \tag{1.1}$$
Figure 1.1: A basic two-element interferometer, with baseline $\vec{B}$ observing an arbitrary cosmic source in the direction of the unit vector $\hat{s}$.
Since the interferometer is quasi-monochromatic, the voltage outputs of antenna 1 and antenna 2, denoted by $V_1$ and $V_2$ respectively, can be written as follows:

$$V_1 = V \cos(\omega_c t) \tag{1.2a}$$
$$V_2 = V \cos(\omega_c [t + \tau_g]) \tag{1.2b}$$

where $t$ is time and $V$ is proportional to the intensity of the source.

To produce a visibility reading, the two voltages get correlated by first multiplying and then averaging over a period of time known as the integration time. As will be shown in this section, such a correlation is not enough, since some intensity data is lost. A complex correlator is used to make a second correlation between the outputs of the antennas, with one of the outputs phase-delayed by 90 degrees. The two outputs of the complex correlator are presented as one complex quantity, which is the visibility. Visibility is defined formally in equation 1.9.

Multiplication of $V_1$ with $V_2$ results in:

$$V_1 V_2 = V^2 \cos(\omega_c t) \cos(\omega_c [t + \tau_g]) = \frac{V^2}{2}\left[\cos(2\omega_c t + \omega_c \tau_g) + \cos(\omega_c \tau_g)\right] \tag{1.3}$$

$V_1 V_2$ has a constant term and a sinusoidal term with respect to $t$. Provided that averaging is done over a long enough time interval, the sinusoidal term averages out to 0, implying that the average $\langle V_1 V_2 \rangle$ equates to:
$$\langle V_1 V_2 \rangle = \frac{V^2}{2}\cos(\omega_c \tau_g) = \frac{V^2}{2}\cos\left(\frac{\omega_c |\vec{B}|}{c}\sin\theta\right) \tag{1.4}$$

Figure 1.2: Plot of the fringe of equation 1.4 with $V = 1$ and $\omega_c |\vec{B}|/c = 30$

The correlator response is dependent on the direction $\theta$ of the source, with a sinusoidal behaviour known as a fringe. The $\omega_c \tau_g$ term is known as the fringe phase. Figure 1.2 shows a plot of a fringe. It clearly indicates that, for some values of $\theta$, the correlator gives no output. This implies loss of data. If an intensity distribution $I(\theta)$ is considered, then it is easy to show that the even part of the distribution is lost.

Let $I_e(\theta)$ and $I_o(\theta)$ be the even and odd parts of $I(\theta)$. Using equation 1.4 and integrating over $\theta$, $\langle V_1 V_2 \rangle$ results to be proportional to:

$$\langle V_1 V_2 \rangle \propto \int I(\theta)\cos(\omega\tau_g)\,d\theta = \int I_o(\theta)\cos(\omega\tau_g)\,d\theta \tag{1.5a}$$

since

$$\int I_e(\theta)\cos(\omega\tau_g)\,d\theta = 0 \tag{1.5b}$$

The second correlation, made over the two antenna outputs with one of the outputs phase-delayed by 90 degrees, generates an even fringe and will respond to the even intensity distribution. This implies that the complex correlator responds to all of the intensity distribution.

The delay centre

If it is assumed that the interferometer is responsive over a bandwidth $\Delta\nu$, whereby intensity is constant throughout, then, based on equation 1.4, the correlator output is:

$$\langle V_1 V_2 \rangle = (\Delta\nu)^{-1}\int_{\nu_c - \Delta\nu/2}^{\nu_c + \Delta\nu/2} \frac{V^2}{2}\cos(2\pi\nu\tau_g)\,d\nu \tag{1.6a}$$

$$\langle V_1 V_2 \rangle = \frac{V^2}{2}\cos(\omega_c\tau_g)\,\mathrm{sinc}(\pi\tau_g\Delta\nu) \tag{1.6b}$$

where

$$\mathrm{sinc}(\pi\tau_g\Delta\nu) = \frac{\sin(\pi\tau_g\Delta\nu)}{\pi\tau_g\Delta\nu} \tag{1.6c}$$

The term $\mathrm{sinc}(\pi\tau_g\Delta\nu)$ is known as the fringe washing function and degrades the response of the interferometer. The degradation is dependent on the frequency bandwidth $\Delta\nu$, which should be kept as narrow as possible, especially for wide-field imaging. An interferometer splits up a wide bandwidth into many frequency channels, each of narrow bandwidth, and treats each channel independently.
The fringe washing function has no effect on the interferometer response for $\tau_g = 0$, because signals of different frequencies then reach the correlator in phase. The geometric delay $\tau_g$ can be controlled by inserting an extra delay $\tau$ in the circuit between the antenna receiving the leading signal and the complex correlator. For $\tau = \tau_g$, the fringe washing function is nullified. Since $\tau_g$ is a function of $\hat{s}$, one can choose a direction for which the fringe washing function is nullified. This direction is termed the delay centre.

The UVW co-ordinate system
Many variables in interferometry are expressed against a reference direction known as the phase reference centre or phase centre. The phase centre is fixed to the sky, and it is common practice to set it to point in the same direction as the delay centre.

The general co-ordinate system used in interferometry is the $UVW$ co-ordinate system, depicted in Figure 1.3. It is a right-handed Cartesian system where the $U$- and $V$-axes are on a plane normal to the phase centre, and the $W$-axis is in the direction of the phase centre. The $U$-axis is in the East-West direction, while the $V$-axis is in the North-South direction.

One of the main uses of the co-ordinate system is to measure the baseline against the phase centre. The baseline components expressed over the $UVW$-axes, defined by $(u, v, w)$, are depicted in Figure 1.3. The components are normally given in units of number of wavelengths, such that:

$$(u, v, w) = \vec{B}_\lambda = \frac{\vec{B}}{\lambda} \tag{1.7}$$

where $\lambda$ is the channel wavelength, and $\vec{B}_\lambda$ is the baseline expressed in number of wavelengths.

Figure 1.3: The UVW co-ordinate system. The diagram on the left shows how a baseline is expressed in its $(u, v, w)$ components, while the diagram on the right shows how the direction cosines $l$ and $m$ express a position on the celestial sky pointed to by the unit vector $\hat{s}$. The unit vector $\hat{s}_0$ represents the phase centre.

As visibility measurements are taken consecutively in time, the baseline vector rotates in the UVW-space, since the interferometer resides on the Earth's surface. The UV-coverage of an observation is defined as the set of $(u, v)$ values for which the interferometer makes a visibility measurement. Figure 1.4 depicts the UV-coverage of a true LOFAR observation. Each baseline forms an elliptical trajectory while the interferometer samples (measures) visibility in time. A short baseline tends to form a trajectory near the centre, while a longer one tends to form a trajectory further out.

Figure 1.4: UV-coverage of a true LOFAR observation

A clarification on the units of the $(u, v, w)$ components is now proper. Expressing $(u, v, w)$ in units of number of wavelengths is mathematically convenient and is the general approach found in the literature [14]. Nevertheless, it is not always practical to use, since values are dependent on channel frequency. The MeasurementSet [17] is a common format used to store visibility measurements. It specifies that $(u, v, w)$ data be stored in metres, rendering it valid over all frequency channels. In this thesis, the $(u, v, w)$ components are expressed in number of wavelengths unless stated otherwise.

When describing a position on the celestial sphere, the direction cosines $l$ and $m$ are used, as shown in Figure 1.3. They are defined with respect to the $U$ and $V$ axes, such that the direction vector $\hat{s}$ is described as:

$$\hat{s} = (l, m, \sqrt{1 - l^2 - m^2}) \tag{1.8}$$

Figure 1.5: Interferometer diagram used in section 1.3
The measurement equation

Consider Figure 1.5. $\hat{s}_0$ represents the phase reference position. The unit vector $\hat{s}$ retains its definition, that is, a pointer towards an arbitrary position on the celestial sphere. $d\Omega$ is a solid angle, and $\vec{B}_\lambda$ is the baseline expressed in number of wavelengths.

Not considered in the previous section is the non-uniform, polarisation-dependent reception behaviour of the antennas, known as the A-term. Let $A_N(\hat{s})$ represent a normalised version of the A-term such that $A_N(\hat{s}_0) = 1$. The complex visibility $V(\vec{B}_\lambda)$ is defined as:

$$V(\vec{B}_\lambda) = \int_{4\pi} A_N(\hat{s})\, I(\hat{s})\, e^{-j2\pi \vec{B}_\lambda \cdot (\hat{s} - \hat{s}_0)}\,d\Omega \tag{1.9}$$

This definition is in line with that given in [14], which follows the sign convention of the exponent used in [18, 19]. $I(\hat{s})$ is the intensity distribution.

The complex visibility, as defined in equation 1.9, has dimensions of flux density ($\mathrm{W\,m^{-2}\,Hz^{-1}}$). The dimensions of the intensity distribution are $\mathrm{W\,m^{-2}\,Hz^{-1}\,sr^{-1}}$. In astronomy, the Jansky (Jy) is a commonly used unit of flux density, where $1\,\mathrm{Jy} = 10^{-26}\,\mathrm{W\,m^{-2}\,Hz^{-1}}$.

As described in [14], $d\Omega$ equates to:

$$d\Omega = \frac{dl\,dm}{\sqrt{1 - l^2 - m^2}} \tag{1.10}$$

Using equations 1.7 and 1.8, it results that:

$$\vec{B}_\lambda \cdot \hat{s} = ul + vm + w\sqrt{1 - l^2 - m^2} \tag{1.11a}$$
$$\vec{B}_\lambda \cdot \hat{s}_0 = w \tag{1.11b}$$

Substituting in equation 1.9, the following results:

$$V(u, v, w) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{A_N(l, m)\, I(l, m)}{\sqrt{1 - l^2 - m^2}}\, e^{-j2\pi\left(ul + vm + w\left(\sqrt{1 - l^2 - m^2} - 1\right)\right)}\,dl\,dm \tag{1.12}$$

Equation 1.12 is the measurement equation of the interferometer. It defines the relationship between the interferometer's measured quantities $V(u, v, w)$ and the intensity distribution of the sky.
This thesis gives little importance to the effects of $A_N(l, m)$ and the quotient term $\sqrt{1 - l^2 - m^2}$ in the measurement equation 1.12. For convenience, these two terms are subsumed in the intensity distribution, which will be referred to as the measured intensity distribution $I_{meas}(l, m)$:

$$I_{meas}(l, m) = \frac{I(l, m)\, A_N(l, m)}{\sqrt{1 - l^2 - m^2}} \tag{1.13}$$

The measurement equation 1.12 is re-written as:

$$V(u, v, w) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I_{meas}(l, m)\, e^{-j2\pi\left(ul + vm + w\left(\sqrt{1 - l^2 - m^2} - 1\right)\right)}\,dl\,dm \tag{1.14}$$

Image Synthesis

This section reviews how $I_{meas}(l, m)$ can be recovered from visibility measurements made by the interferometer. The term $2\pi w(\sqrt{1 - l^2 - m^2} - 1)$ is known as the w-term. If it equates to 0, then a Fourier relationship between visibility expressed in terms of $u$ and $v$ ($w$ ignored) and the measured intensity distribution is obtained, that is:

$$V(u, v) = \mathcal{F}\, I_{meas}(l, m) \quad \text{if} \quad w\left(\sqrt{1 - l^2 - m^2} - 1\right) = 0 \tag{1.15}$$

where $\mathcal{F}$ is the Fourier operator.

Most image synthesis techniques exploit this Fourier relationship in order to use the computationally efficient Inverse Fast Fourier Transform (IFFT) algorithms. However, there are various issues that need to be circumvented in order to apply such algorithms, which mandates some pre- and post-processing of data. The next subsections give a brief on the main issues and how they are commonly handled. Based on this brief, Figure 1.6 depicts an image synthesis pipeline that is commonly used.
Figure 1.6: A simple imaging pipeline for interferometric data. Interferometric sampling of the real sky produces non-uniformly sampled visibility data; the w-projection algorithm produces a uniformly sampled UV-grid; an Inverse Fast Fourier Transform yields the dirty image; and deconvolution (e.g. CLEAN), aided by a priori knowledge, produces the synthesised sky.
Non-uniform sampling of the visibility plane

Fast Fourier Transform algorithms require that the visibility UV-plane (also referred to as the UV-grid) is uniformly sampled. As already discussed in section 1.2, the UV-coverage of the interferometer is not uniform, implying that some processing is required to generate a uniformly sampled UV-grid.

Convolutional gridding is the technique used in interferometry to transform the non-uniform UV-coverage into a uniform one. Details on this technique are given in section 2.1. Measured data is convolved with an appropriately chosen function, and the result is sampled so as to generate a uniform UV-grid. After an IFFT takes place, the convolution is reversed by a simple element-by-element division on the intensity image. One notes that convolutional gridding is a topic actively researched, especially in its application in the medical sciences [20–22].
Incompleteness

The non-uniform UV-coverage implies that the measured visibility data is incomplete. The intensity distribution can never be recovered in full by relying solely on the measured visibility samples. The output intensity image generated by the convolutional gridding algorithm is called a dirty image, in view of the fact that it contains artefacts caused by the non-uniform, incomplete UV-coverage.

Incompleteness is handled by iterative methods commonly known as deconvolution. Based on a priori knowledge of the intensity distribution, deconvolution is applied over the dirty image in order to recover the intensity distribution. A classic example of a deconvolution algorithm is CLEAN [23, 24], which assumes that the intensity is made up of point sources. Starck et al. [25] give a review of other deconvolution methods.

The w-term
As pointed out by Cornwell et al. [26], if the w-term is much less than unity, it can be neglected and a two-dimensional Fourier relationship results. When synthesising a narrow field of view, the w-term can, in many cases, be neglected. Neglecting the w-term when synthesising a wide field of view causes substantial distortions.

An advanced method that corrects the w-term is the w-projection [26] algorithm. A visibility sample is projected onto the $w = 0$ plane by means of a convolution. The w-term for the $w = 0$ plane is 0, so the effects of the w-term are nullified by the projection. The algorithm is integrated with the convolutional gridding algorithm and is explained in detail in section 2.2.

Other alternatives to w-projection exploit other characteristics of the measurement equation. For example, it is possible to express the measurement equation as a 3-dimensional Fourier transform [27, 28]. Snapshots [26, 28–30] consider a coplanar array, whereby at any given time $w$ is a linear combination of $u$ and $v$ for all the measurements made by the coplanar array [28, 31]. Thus, for short periods of observation time, the w-term causes only a distortion in the lm-plane co-ordinates, which is corrected by a linear transform. Facets consider the wide-field intensity image as a sum of smaller images (facets) over which the w-term can be neglected. There are two types of facets: image-plane facets and uvw-space facets. These are reviewed in [26].

Hybrid algorithms using w-projection with any of these alternative algorithms are also possible and are discussed in section 2.2.2.
Motivation, aims and objectives of the thesis

As new powerful radio telescopes are being built, the computational demand of the imaging pipeline described in Figure 1.6 is increasing to exuberant levels. It is estimated that Phase 2 of the Square Kilometre Array (SKA) will need as much as 4 Exaflops of computational power [32], with 90% of computational resources taken by the gridding algorithm [33]. Phase 2 is expected to be commissioned by around 2020, and it is predicted that, though technology will have advanced by that time, a super-computer delivering computational rates in Exaflops would be at the top of the TOP 500 [34] list [32, 35].

Motivated by such computational requirements, this thesis aims to give a contribution to the application of high-performance computing in radio interferometry. The main objective is to develop a new high-performance imaging tool, called the malta-imager, or mt-imager for short. It relies on CUDA [36] compatible NVIDIA Graphics Processing Units (GPUs) to synthesise images by means of w-projection. Deconvolution is outside the scope of this work.

The infrastructure handling all numerical calculations is generalised in a framework, called the General Array Framework (GAFW). It is designed to handle different hardware, such as GPUs and Field Programmable Gate Arrays (FPGAs). It is implemented for CUDA-compatible GPUs only, so as to serve the main objective. The mt-imager is built over this framework to perform all GPU-based computation.
Outline of thesis

This introductory chapter discussed the main concepts of image synthesis in radio interferometry. The next chapter, Chapter 2, gives a literature review. The focal point of the thesis is the implementation of w-projection and convolutional gridding; a mathematical treatment and a review of reported implementations of these algorithms are given. The chapter is concluded with a detailed description of a gridding algorithm over GPUs proposed by Romein [37]. It is a pivotal algorithm for this thesis and will often be referred to as Romein's algorithm.

The General Array Framework is the subject of Chapter 3. The framework is discussed in detail and terms are defined.

Chapter 4 is the main chapter, whereby the mt-imager is discussed in detail. Note that the chapter uses the terms and definitions given in previous chapters. The imaging tool is divided into components that are independent of each other. The high-level design is first discussed, and then details of the tool are given through a discussion on each component. (CUDA stands for Compute Unified Device Architecture and is a parallel programming framework for GPUs developed by NVIDIA.)

Chapter 5 reports on the performance obtained by the tool and shows that the main objective of the thesis has been achieved with success. The chapter also reports some detailed experimental analyses of the performance of Romein's algorithm as implemented in the imaging tool. These analyses are meant to enrich knowledge of the algorithm. The chapter is concluded by proposing future work.

The thesis is concluded in Chapter 6.

Chapter 2
2. Gridding and W-Projection
This chapter reviews convolutional gridding and the w-projection algorithm. Some mathematical treatment is given, together with other background knowledge. A literature review on the performance and implementation of these algorithms is included, together with some notes on GPU programming.
In interferometry, convolutional gridding is the method used to transform the non-uniformly sampled UV-plane (visibility plane) into a uniformly sampled one, such that Inverse Fast Fourier Transform (IFFT) algorithms can be applied (refer to subsection 1.4.1). Visibility measurements are convolved with an appropriately chosen convolution function C(u, v). The resultant UV-plane is then sampled uniformly, and an IFFT is applied. The result is an image in the lm-plane (intensity distribution plane), which is corrected for the convolution by a simple element-by-element division. The output is a dirty image of the intensity distribution, which is the true intensity aliased by the non-uniform, incomplete UV-coverage.
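As an end-to-end illustration of the process just described, the following NumPy sketch grids point-source visibilities onto a regular UV-grid and inverse-transforms the grid into a dirty image. All names and sizes are illustrative assumptions; a simple pillbox kernel stands in for a proper anti-aliasing function, and no oversampling or taper correction is applied.

```python
import numpy as np

def grid_visibilities(us, vs, vis, n, du, kernel):
    """Convolve each visibility sample onto a regular n x n UV-grid.

    us, vs : sample coordinates in wavelengths; vis : complex visibilities;
    du     : grid cell size; kernel : small real convolution function C(u, v)
             sampled at the grid spacing (no oversampling in this sketch).
    """
    grid = np.zeros((n, n), dtype=complex)
    half = kernel.shape[0] // 2
    for u, v, V in zip(us, vs, vis):
        # nearest grid cell for this sample (grid centre at n // 2)
        iu = int(round(u / du)) + n // 2
        iv = int(round(v / du)) + n // 2
        # add the convolved sample to the neighbourhood of (iv, iu)
        grid[iv - half:iv + half + 1, iu - half:iu + half + 1] += V * kernel
    return grid

# A point source at the phase centre has V(u, v) = 1 everywhere.
rng = np.random.default_rng(0)
n, du = 64, 1.0
us = rng.uniform(-20, 20, 200)
vs = rng.uniform(-20, 20, 200)
vis = np.ones(200, dtype=complex)

# 3x3 pillbox kernel as a crude stand-in for a prolate spheroidal function
kernel = np.ones((3, 3)) / 9.0
uv = grid_visibilities(us, vs, vis, n, du, kernel)

# IFFT of the gridded UV-plane gives the dirty image (taper correction omitted)
dirty = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(uv))).real
peak = np.unravel_index(np.argmax(dirty), dirty.shape)
```

Since all samples of a point source at the phase centre add in phase, the dirty image peaks at the image centre, with the sidelobe structure set by the UV-coverage.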
In this section, convolutional gridding is discussed via a mathematical treatment. The concept of weighting is introduced, and some topics on the convolution function C(u, v) are reviewed. The mathematical treatment given in this section is based on, and adapted from, Jackson et al. [21].

The measured intensity distribution I_meas(l, m) considered here has been defined in section 1.4 as follows:

I_meas(l, m) = I(l, m) A_N(l, m) / √(1 − l² − m²)   (2.1)

For this treatment, the w-term is ignored, such that the Fourier relationship expressed in equation 1.15 is true, that is:

V(u, v) = F[I_meas(l, m)]   (2.2)

Let n be the number of visibility measurements that the interferometer samples, and let i be a zero-based index such that u_i and v_i are the u and v values of a measured visibility point. The sampling function P(u, v) of the interferometer is defined as:

P(u, v) = Σ_{i=0}^{n−1} δ(u − u_i, v − v_i)   (2.3)

where δ(u, v) is the two-dimensional Dirac function.

As discussed further on, it is desirable to control the shape of the sampling function. This is done by introducing a weight ε_i for each sampled visibility, such that the weighted sampling function P_ε(u, v) equates to:

P_ε(u, v) = Σ_{i=0}^{n−1} δ(u − u_i, v − v_i) ε_i   (2.4)

F⁻¹[P_ε(u, v)] is known as the synthesised beam, dirty beam or point spread function (PSF). It is here denoted as psf(l, m):

psf(l, m) = F⁻¹[P_ε(u, v)]   (2.5)

With interferometric sampling and the use of weights, the visibility plane V(u, v) is transformed into a non-uniformly sampled plane P_ε(u, v) · V(u, v). The desire is to transform the non-uniformly sampled plane into a uniform UV-grid V_grid(u, v), sampled at regular intervals Δu and Δv on the U and V axes respectively. This is achieved by convolving P_ε(u, v) · V(u, v) with an appropriately chosen convolution function C(u, v) and then uniformly sampling the output.
V_grid(u, v) = {[P_ε(u, v) · V(u, v)] ∗ C(u, v)} · III(u/Δu, v/Δv)   (2.6)

where the operator ∗ denotes a two-dimensional convolution, and III(u/Δu, v/Δv) is the Shah function, defined as:

III(u/Δu, v/Δv) = Σ_{i=−∞}^{∞} Σ_{k=−∞}^{∞} δ(u − iΔu, v − kΔv)   (2.7)

An inverse Fourier transform is applied to V_grid(u, v), which equates to:

F⁻¹[V_grid] = {[psf(l, m) ∗ I_meas(l, m)] · c(l, m)} ∗ III(lΔu, mΔv)   (2.8)

where c(l, m) = F⁻¹[C(u, v)] is referred to as the tapering function.
Convolution with the Shah function causes a replication over F⁻¹[V_grid(u, v)] at intervals of (Δu)⁻¹ in l and (Δv)⁻¹ in m. The synthesised field of view is thus dependent on the choice of Δu and Δv. Only one replica is actually calculated, which is mathematically equivalent to multiplying F⁻¹[V_grid(u, v)] by a rectangular function. To complete the process, the image is divided by c(l, m) so as to reverse the convolution and produce the non-normalised dirty image I_dirty(l, m):

I_dirty(l, m) = ({[psf(l, m) ∗ I_meas(l, m)] · c(l, m)} ∗ III(lΔu, mΔv)) · rect(lΔu, mΔv) / c(l, m)   (2.9)

where

rect(l, m) = 1 if |l| < 0.5 and |m| < 0.5; 0 otherwise   (2.10)

Equation 2.9 describes the output of the convolutional gridding algorithm.
The PSF aliases I_meas(l, m) in such a way that I_meas(l, m) cannot be recovered through direct methods. This is the effect of data incompleteness. Iterative methods known as deconvolution are applied to try to recover I_meas(l, m) to the best possible accuracy (refer to section 1.4.2). Their effectiveness depends on the form of the PSF, which is desired to be as compact as possible. If too few visibility measurements are considered, then the PSF might be too wide for a proper recovery. In the extreme case, where only one measurement is considered, the PSF is infinitely wide.

The weight introduced in equation 2.4 plays a pivotal role in controlling the form of the PSF. As visible in Figure 1.4, there is a higher density of visibility measurements in regions covered by shorter baselines. This tends to overemphasise the long spatial wavelengths of the PSF, causing wide skirts. The uniform weighting scheme caters for this issue by weighting each measurement with a value inversely proportional to the neighbourhood density [38]. In another scheme, called natural weighting, the density issue is given second priority, and visibility data is weighted inversely proportionally to its variance so as to obtain the best signal-to-noise ratio [14]. Short-baseline data will still be overemphasised, but an in-between scheme, known as robust weighting [39], tries to compromise between the natural and uniform weighting schemes.
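The two density-handling schemes can be made concrete with a short sketch. The function names and the cell-counting density proxy below are illustrative assumptions, not a prescribed implementation; real imagers use more refined density estimates.

```python
import numpy as np

def uniform_weights(us, vs, du):
    """Uniform weighting: each sample is weighted by the inverse of the
    number of samples falling in the same UV cell (a simple density proxy)."""
    cells = [(int(round(u / du)), int(round(v / du))) for u, v in zip(us, vs)]
    counts = {}
    for c in cells:
        counts[c] = counts.get(c, 0) + 1
    return np.array([1.0 / counts[c] for c in cells])

def natural_weights(sigmas):
    """Natural weighting: inverse-variance weights for best signal-to-noise."""
    return 1.0 / np.asarray(sigmas) ** 2

# A dense short-baseline region: three samples share one cell, one lies far out
us = np.array([0.1, 0.2, -0.1, 30.0])
vs = np.array([0.0, 0.1, 0.1, 5.0])
w = uniform_weights(us, vs, du=1.0)   # down-weights the crowded cell
```

Under uniform weighting, the three clustered samples each receive weight 1/3 while the isolated long-baseline sample keeps weight 1, suppressing the over-represented short spacings.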
The form of the tapering function c(l, m) is crucial, since the replication caused by the convolution with the Shah function forces c(l, m) to alias the image. Aliasing can be fully suppressed if the tapering function covers the whole field of view defined by rect(lΔu, mΔv) and goes to 0 outside that region. The infinite sinc function (in the UV-plane) is the best choice for C(u, v) [20], but it is computationally unattainable. Various functions have been studied [21], and the general choice in interferometry is the prolate spheroidal function [40–42].

The convolution function C(u, v) has to be finite, or in other words it has to have a finite support. The support of the convolution function is defined as the full width of the function in grid pixels. Note that in the literature, support might be defined differently, as explained by Humphreys and Cornwell [43].

It is a common approach that the convolution function is not calculated during gridding but is pre-generated prior to the gridding process [37, 43] and stored in memory. The function has to be sampled at a higher rate than V_grid(u, v). Such sampling is referred to as oversampling. The oversampling factor β is defined as:

β = Δu/Δu_conv = Δv/Δv_conv   (2.11)
where Δu_conv and Δv_conv represent the sampling intervals at which the convolution function is sampled. For convenience, it is assumed that the oversampling factor is invariant over the U and V axes.

In general, an implementation of a gridder does not perform any interpolation on the numerical data of the convolution function. It merely chooses the nearest numerical values while gridding a record.¹ This changes the form of the convolution function, which can be modelled by the following equation:

C̃(u, v) = [C(u, v) · III(u/(Δu/β), v/(Δv/β))] ∗ rect(u/(Δu/β), v/(Δv/β))   (2.12)

where C̃(u, v) is the oversampled version of C(u, v). Applying the inverse Fourier transform, the following is obtained:

F⁻¹[C̃(u, v)] = [c(l, m) ∗ III(l/(β/Δu), m/(β/Δv))] · sinc(πlΔu/β) sinc(πmΔv/β)   (2.13)

The sinc functions change the form of the convolution function. Their effects are corrected for by dividing the image by the sinc functions after the inverse Fourier transform takes place.

In subsection 2.1.1, the w-term was ignored while describing the convolutional gridding algorithm. When the term is much less than unity, such an assumption is acceptable [26]. Otherwise, ignoring the w-term leads to substantial distortions.

¹ In this thesis, the term record is used to describe a single- or multi-polarised visibility measurement.
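In one dimension, the nearest-value lookup of equation 2.12 amounts to striding through the stored oversampled kernel: with an oversampling factor β, the gridder stores β samples per grid cell and, for each record, selects every β-th value starting from the offset closest to the record's sub-cell position. The following sketch illustrates this; the triangle kernel, β = 4 and a support of 3 cells are illustrative assumptions.

```python
import numpy as np

beta = 4       # oversampling factor (illustrative)
support = 3    # kernel width in grid cells (illustrative)

# 1-D triangle kernel sampled at intervals of du/beta, standing in for C(u)
x = np.linspace(-support / 2, support / 2, support * beta, endpoint=False)
kernel_os = np.maximum(0.0, 1.0 - np.abs(x) / (support / 2))

def taps_for(u_frac):
    """Nearest-neighbour lookup: no interpolation is performed; the stored
    samples closest to the record's sub-cell offset u_frac are selected."""
    off = int(round(u_frac * beta)) % beta
    return kernel_os[off::beta]   # one tap per grid cell of the support

taps = taps_for(0.25)             # record sits a quarter-cell off a grid line
```

Whatever the sub-cell offset, the lookup always yields exactly one kernel value per grid cell of the support; the error introduced by skipping interpolation shrinks as β grows.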
This section discusses a recent algorithm known as w-projection [26] that handles the w-term through convolution. A visibility record can be projected onto the w = 0 visibility plane by convolving it with a w-dependent function. W-projection builds over the convolutional gridding algorithm by applying this fact. A mathematical treatment of w-projection is the subject of the next subsection. It is followed by other w-projection-related topics, namely performance and hybrid algorithms (with w-projection).

Frater and Docherty [44] show that a relationship exists between any constant-w plane and the w = 0 plane. Following is a mathematical proof based on [44]. Let V_w(u, v) represent visibility over a plane with constant w, and let g_w(l, m) represent the w-term in its exponential form:

g_w(l, m) = e^(−j2πw(√(1 − l² − m²) − 1))   (2.14)

Using the measurement equation 1.14, it can be shown that:

V_w(u, v) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} I_meas(l, m) g_w(l, m) e^(−j2π(ul + vm)) dl dm   (2.15)

which implies the following Fourier relationship:

V_w(u, v) = F[I_meas(l, m) g_w(l, m)]   (2.16)

Clearly g_0(l, m) = 1, and thus:

V(u, v) = F[I_meas(l, m)]   (2.17)

Substituting in equation 2.16 and applying the convolution theorem, it results that:

V_w(u, v) = V(u, v) ∗ G_w(u, v)   (2.18)

where

G_w(u, v) = F[g_w(l, m)]   (2.19)

Any visibility plane V_w(u, v) with constant w is related to the w = 0 plane by a convolution with a known function G_w(u, v). Clearly ḡ_w(l, m) = [g_w(l, m)]⁻¹ exists, implying that Ḡ_w(u, v) = F[ḡ_w(l, m)] also exists, and thus the convolution in equation 2.18 can be inverted to make V(u, v) the subject of the formula:

V(u, v) = V_w(u, v) ∗ Ḡ_w(u, v)   (2.20)

Any visibility plane with constant w can be projected onto the w = 0 plane via a convolution with the w-dependent function Ḡ_w(u, v). Cornwell et al.
[26] argue that, as a consequence of equation 2.20, any visibility point can be re-projected onto the w = 0 plane. This forms the basis of the w-projection algorithm, which builds over convolutional gridding by convolving each visibility record with a w-dependent function C_w(u, v), defined as:

C_w(u, v) = Ḡ_w(u, v) ∗ C(u, v)   (2.21)
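Since Ḡ_w(u, v) has no closed form, a common route is to sample the image-plane term g_w(l, m) on an lm-grid and Fourier transform it numerically. The following NumPy sketch follows that route; the grid size, the field-of-view half-width lm_max, and the peak normalisation are illustrative assumptions.

```python
import numpy as np

def w_kernel(w, n, lm_max):
    """Numerically evaluate a w-dependent kernel by sampling g_w(l, m)
    on an n x n lm-grid and Fourier transforming it (illustrative names)."""
    l = np.linspace(-lm_max, lm_max, n)
    ll, mm = np.meshgrid(l, l)
    r2 = ll ** 2 + mm ** 2
    # g_w(l, m) = exp(-2j*pi*w*(sqrt(1 - l^2 - m^2) - 1)) inside the unit disc
    g = np.where(r2 < 1.0,
                 np.exp(-2j * np.pi * w *
                        (np.sqrt(np.maximum(0.0, 1.0 - r2)) - 1.0)),
                 0.0)
    G = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(g)))
    return G / np.abs(G).max()

G0 = w_kernel(0.0, 128, 0.05)     # w = 0: a delta-like spike
G1 = w_kernel(500.0, 128, 0.05)   # larger w: energy spreads, support grows
```

At w = 0 the kernel collapses to a single spike (pure convolutional gridding), while for larger w the quadratic phase chirp spreads the kernel over many pixels, illustrating why the support, and hence the storage cost, grows with w.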
The Point Spread Function (defined in section 2.1.1) is calculated using w-projection as follows:

psf(l, m) = F⁻¹[ Σ_{i=0}^{n−1} δ(u − u_i, v − v_i) ∗ Ḡ_{w_i}(u, v) ε_i ]   (2.22)

psf(l, m) = F⁻¹[ Σ_{i=0}^{n−1} Ḡ_{w_i}(u − u_i, v − v_i) ε_i ]   (2.23)

Equation 2.9, which is the output of the convolutional gridding algorithm, remains valid for w-projection provided that equation 2.23 is used for psf(l, m).

Ḡ_w(u, v) is not directly solvable, and thus the convolution functions C_w(u, v) have to be solved numerically [26].¹ They need to be pre-generated and stored in an oversampled format. The gridding process will convolve with C̃_w(u, v), which is the oversampled version of C_w(u, v):

C̃_w(u, v) = {[Ḡ_w(u, v) ∗ C(u, v)] · III(u/(Δu/β), v/(Δv/β))} ∗ rect(u/(Δu/β), v/(Δv/β))   (2.24)

The aliasing and correction arguments given in section 2.1.3 still hold. Sampling in w is also required, and this too will generate some aliasing. Cornwell et al. [26] argue that aliasing can be reduced to tolerable values by scaling in √w rather than linearly. The support of C̃_w(u, v) is dependent on w and increases with increasing w [29]. This can lead to prohibitive memory requirements for the storage of C̃_w(u, v) [29, 45]. This is a known issue in w-projection, and research is ongoing to handle the problem, such as the work presented by Bannister and Cornwell [45].

It is to be noted that, based on the same principles described here, it is possible to correct the w-term in the intensity plane. Applying an inverse Fourier transform to equation 2.16, it results that:

I_meas(l, m) = F⁻¹[V_w(u, v)] / g_w(l, m)   (2.25)

The w-stacking algorithm [29] is based on equation 2.25. Visibility data is partitioned in w and gridded on separate constant-w visibility planes. Correction is applied after a Fourier transformation of each plane.

¹ The plural is used in view of the fact that there is a different convolution function for each unique value of w.
At the end, all planes are added together.

In section 1.4.3, alternatives to w-projection were discussed. These alternatives are not mutually exclusive with w-projection: hybrid solutions using w-projection are possible and have been successfully implemented. For example, facets and w-projection have been integrated together in CASA [46], where w-projection is applied over wide facets. The recently proposed w-snapshots algorithm [29] uses w-projection to project visibilities onto a plane over which the snapshots algorithm can be applied. The plane is chosen by a least-squares fit and is changed whenever the visibility reading being projected onto it deviates too much.
The basic idea of w-projection is the one currently giving the best performance. Bhatnagar [47] claims that w-projection is faster than facets by a factor of 10. Bannister and Cornwell [45] claim that w-projection is the most computationally efficient algorithm. Variants or hybrids based on w-projection can give better performance. For example, Kogan [48] shows that w-stacking might be computationally faster than w-projection, and Cornwell et al. [29] claim that the w-snapshots algorithm is faster and more scalable than w-projection.

This section summarises key aspects of NVIDIA GPU programming using CUDA. For details, reference should be made to the documents [49, 50] supplied by NVIDIA.
GPUs are parallel devices that can achieve impressive computational power by handling thousands of execution threads concurrently. Mandated by the hardware design, threads are grouped in blocks, and each block is executed by one multi-processor. Each multi-processor concurrently handles a few of these blocks, and further blocks are scheduled to run only after other blocks finish execution. Threads within a block are guaranteed to all be running when the block is scheduled on a multi-processor. For modern GPUs, the maximum number of threads per block is 1024, implying that a modern GPU can handle tens of thousands of threads concurrently.

Executable code on a GPU is referred to as a kernel, and in this thesis this term is used only for this purpose. The term thread configuration is used to describe the set-up of threads and blocks for a kernel. Many of the thread configurations mentioned in this thesis are such that a thread is set to process an element without the need to interact with other threads. In such thread configurations, it is to be assumed that blocks are set with the maximum number of threads, and that enough blocks are requested to execute at full occupancy. It is also to be assumed that there can be thread re-use, in the sense that a given thread will process a set of elements one after the other. This has the same effect as a thread being destroyed after an element is processed and re-created to handle a new element when a new block is scheduled.
Many GPU algorithms tend to be memory bound, meaning that the main limiting factor is access to memory. In the design of such algorithms, the effort is often centred on optimising memory usage. The GPU provides different types of memory. Table 2.1 lists the ones relevant to this thesis, together with some salient features.
Memory Type | Location (on/off chip) | Access | Scope                | Lifetime
Register    | On                     | R/W    | 1 thread             | Thread
Shared      | On                     | R/W    | All threads in block | Block
Global      | Off                    | R/W    | All threads and host | Host allocation
Texture     | Off                    | R      | All threads and host | Host allocation

Table 2.1: An incomplete list of the different memory types available on an NVIDIA GPU, together with some salient features. Memory types that are not relevant to this thesis are not listed.
Global memory is the only read/write memory that is persistent beyond the lifetime of a kernel, and it is also the largest in size (a few gigabytes). It also suffers from high latency. Shared memory is much faster than global memory but limited in scope and size (a few tens of kilobytes). Registers are even faster, as they can be accessed at zero extra clock cycles per instruction in most execution scenarios. They are much more limited in scope than shared memory and are a scarce resource. Shared memory and registers can be used to store temporary data that needs to be accessed quickly. For example, if a set of data is required to be read by all threads in a block, then it is general practice to load the data set into shared memory.
Textures are read-only memory structures, and in this thesis they are used for their on-chip cache, which enables efficient access to read-only data in global memory. They provide further functionalities, not used in this thesis and not reviewed here.
Literature on high-performance implementations of w-projection and convolutional gridding for radio interferometry is scarce. This contrasts with medical science research, where GPUs are a common topic of research for use in medical equipment [22, 51, 52].

Edgar et al. [53] report work on a CUDA gridder for the Murchison Widefield Array (MWA) [54]. A thread is associated with a unique point on the grid. Each thread scans through a list of visibility records, calculating the contribution (if any) that the record gives to the grid point under the thread's responsibility. Most records do not give any contribution to a given grid point, and substantial time is wasted in the scan. This waste is reduced by grouping records by their position on the grid, such that each gridder thread scans only some of the groups.

Humphreys and Cornwell [43] discuss a gridder used for benchmarking. A thread is assigned to each point of the convolution function. To avoid race conditions, a single run of the kernel only grids a few records, which do not overlap when convolved. Another benchmark study is presented by Amesfoort et al. [55]. Race conditions are evaded by allocating a private grid for each thread block. This makes the gridder unusable for radio interferometry, in view of the fact that the memory requirements of such a configuration are much higher than the global memory available in modern GPUs.

Romein [37] has recently proposed an algorithm which this thesis regards as a breakthrough on the subject. It is used in the development of the thesis' imaging tool. From here on, this algorithm will be referred to as Romein's algorithm. It exploits the trajectory behaviour of the baseline that was explained in section 1.2. Details on the algorithm are given in section 2.5.

Yatawatta [56] at the Netherlands Institute for Radio Astronomy (ASTRON) [57] developed a new imaging tool called excon. It grids visibility measurements over GPUs using w-projection and w-snapshots. GPUs are not the only devices considered for the high-performance implementation of w-projection and convolutional gridding. For example, Verbanescu et al. [58], Verbanescu [59] and Amesfoort et al. [55] consider implementations over the Cell B.E. processor, while Kestur et al. [60] report on a framework for gridding over FPGAs.
As already stated, Romein's algorithm refers to the algorithm presented in [37]. This thesis makes use of this algorithm, and hence it is reviewed in some detail here. It should be noted that chapter 5 presents new analyses of this algorithm that are not published in [37].

Records measured by a given baseline are sorted by time and presented consecutively to the algorithm. In this way, while records are being gridded one after the other, the region in the UV-grid that is updated by a convolved record moves with the baseline trajectory. The grid is split up into sub-grids of size equal to the size of the convolution function that will be used to grid. Threads are assigned to take care of one grid point in each sub-grid, as shown in Figure 2.1. Thanks to this configuration, a thread updates one and only one grid point when convolving a visibility record.

Figure 2.1: Thread configuration of Romein's algorithm for a hypothetical grid.

By virtue of the baseline trajectory behaviour, it is quite likely that subsequent records will need to update the same grid point. Updates are accumulated in a register until the grid point moves out of the region being updated. At this point, the thread commits the accumulated update to global memory and commences a new accumulation for a new grid point in a different sub-grid.

An explanation of some implementation details used in the test case presented by Romein [37] for this algorithm now follows. Only the main details that are reused in this thesis are pointed out.

All grid point updates related to a record are handled by one CUDA block. This means that the block requires a number of threads equal to the convolution function size. GPUs impose a maximum on the number of threads per block (a maximum of 1024 threads per block for the latest architectures), which is in many cases smaller than the size of a convolution function. A given thread is thus allowed to cater for more than one grid point in each sub-grid. Ideally, all the grid points under the responsibility of a thread should be handled concurrently, but due to register limitations in modern GPUs, this is not always possible. Instead, a group of records is scanned several times by the thread so as to cater for all the grid points entrusted to it. All the threads in a block will need to read the same record data several times, and following GPU best practices, the data is pre-loaded into shared memory so as to have fast access.

Different CUDA blocks grid different groups of records. This mandates the use of atomic additions to commit the accumulated grid point updates to the grid. Single-precision atomic additions are intrinsically supported by GPUs, but double-precision atomic additions are not supported.
Although a software implementation of double-precision atomic additions is possible, such an implementation is inefficient and heavily impairs the algorithm's performance. In conclusion, the algorithm works efficiently only in single precision. The implementation makes use of textures for retrieving convolution function data stored in global memory.

Romein [37] reports a maximum of 93 Giga grid point updates per second for the test implementation (the metric is explained in section 5.3.1). Gridding was done over a quad-polarised grid using a GeForce GTX 680 GPU. This result is revised in chapter 5, in view of the fact that enhancements made in this thesis give more knowledge about the algorithm.
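The accumulate-then-commit idea can be mimicked sequentially on the CPU. In the following Python sketch, all names and the toy baseline track are illustrative assumptions, and the real algorithm runs these accumulations in parallel GPU registers; the point is that time-sorting records along a slowly moving track sharply reduces the number of (atomic) global-memory commits.

```python
import numpy as np

def romein_like_grid(records, n, half):
    """Sequential illustration of Romein's accumulation idea: updates to a
    grid point are accumulated locally (the "register") and committed to the
    (simulated) global-memory grid only when the moving support region,
    of half-width `half`, leaves that point."""
    grid = np.zeros((n, n), dtype=complex)
    commits = 0
    acc = {}                      # (iv, iu) -> accumulated value ("registers")
    for iu, iv, vis in records:   # records sorted by time along the track
        active = {(iv + dy, iu + dx) for dy in range(-half, half + 1)
                                     for dx in range(-half, half + 1)}
        # commit and free accumulators whose grid point left the region
        for p in [p for p in acc if p not in active]:
            grid[p] += acc.pop(p)
            commits += 1
        for p in active:          # accumulate this record's contribution
            acc[p] = acc.get(p, 0) + vis
    for p, v in acc.items():      # flush the remaining accumulators
        grid[p] += v
        commits += 1
    return grid, commits

# Slowly moving track: consecutive records hit overlapping support regions
records = [(10 + t // 4, 20, 1.0 + 0j) for t in range(16)]
grid, commits = romein_like_grid(records, n=64, half=1)
naive = 16 * 9   # one global update per record per support point (3x3 kernel)
```

For this toy track, only a few commits are needed instead of one global update per record per support point, which is exactly the memory-traffic saving that makes the algorithm fast on GPUs.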
3. The General Array FrameWork

The General Array FrameWork (GAFW) provides an infrastructure for numerical calculations. The framework has features meant to facilitate high performance and has a simple user interface.¹ It adapts to the underlying hardware (CPUs, GPUs, FPGAs, etc.), since all control logic is handled by the framework through an engine. The framework's design promotes collaboration between scientists with basic knowledge of programming and developers specialised in high-performance computing. A layered approach allows for two distinct development areas, each suited to the respective collaborators mentioned above. The concept of this framework is based on the mathematical concepts of arrays and operators, with minimal framework-specific jargon. This makes it easily comprehensible and manageable by scientists.

In this thesis, a C++ [61] implementation has been developed that supports multi-GPU systems and forms the basis of the imaging tool that will be discussed in the next chapter. The use of the framework in the imaging tool simplified the development of the tool.

¹ In this context, the "user" is the developer of an application built over the General Array Framework.
The framework mimics the way humans tend to make a calculation. In most cases, an equation is first defined, and afterwards numerical data is entered to obtain a result. In other words, humans tend to understand mathematical relationships before doing any calculation. A similar computational scenario is presented in this framework. The application defines how a calculation is done and then makes a separate request for a calculation over a set of input data. Since calculation definitions are separate from the actual calculation, they can be reused over and over again for different input data.

A calculation is defined using a calculation tree. The calculation tree is a graph of operators that are connected together via arrays. Operators define logic that generates numerical data, while each array represents an N-dimensional ordered set of numerical data of a specific type (e.g. integers, complex numbers, etc.). Arrays are set as input and/or output to operators. Figure 3.1 gives an example of a simple tree: it defines convolution using FFTs.
Figure 3.1: An example of a calculation tree. This tree defines convolution by multiplying elements in the Fourier domain. Boxes represent operators, while arrays are represented by arrows.
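The structure of Figure 3.1 can be mimicked with a toy object graph. This is purely an illustrative sketch of the arrays/operators idea; the class names and the recursive, pull-style evaluation are assumptions, not the GAFW API.

```python
import numpy as np

class Array:
    """A node carrying data between operators; data is filled at request time."""
    def __init__(self, name, producer=None):
        self.name, self.producer, self.data = name, producer, None

class Op:
    """An operator: logic only; it produces no data until a result is requested."""
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs
        self.output = Array(fn.__name__ + "_out", producer=self)

def calculate(array):
    """Result-centric request: resolve only what this array depends on."""
    if array.data is None:
        vals = [calculate(a) for a in array.producer.inputs]
        array.data = array.producer.fn(*vals)
    return array.data

# Define convolution via FFTs, as in Figure 3.1: out = IFFT(FFT(A) * FFT(B))
A, B = Array("A"), Array("B")
fa, fb = Op(np.fft.fft, A), Op(np.fft.fft, B)
dot = Op(lambda x, y: x * y, fa.output, fb.output)
conv = Op(np.fft.ifft, dot.output)

# Supply input data only when the result is actually requested
A.data = np.array([1.0, 2.0, 3.0, 0.0])
B.data = np.array([1.0, 1.0, 0.0, 0.0])
out = calculate(conv.output).real   # circular convolution of A with B
```

The request is result-centric: only the operators that the requested result depends on are executed, which mirrors how the framework walks its calculation trees.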
For convenience, the entry point to the numerical data generated for an array is provided by a third object called the result. Each array has one and only one result object associated with it.

Result objects also provide the entry point for calculation requests. Calculation requests are result-centric rather than tree-centric, in the sense that the request is to calculate the result. Based on the defined calculation trees, the framework decides which operations (that is, execution logic described by operators) are required to calculate and obtain the requested result.
Result objects provide mechanisms to inform the framework about what to do with the generated data, for example whether the application intends to retrieve the data or not. This has a significant impact on performance, since a memory copy from GPU to host is required for retrieval.

Calculations are done asynchronously with respect to the application thread. Such behaviour is crucial to achieve high performance and maximise the usage of available resources. Once the application requests a calculation, the framework validates it and proceeds in the background. The application thread can load more data and request further calculations while the GPUs are executing code in the background.

Two final points on the general concept of the framework are the factory and object identification. A factory is used to make things as easy as possible for the user and to ensure full control of the framework over its own objects. The factory maintains all framework objects, including their creation and destruction. Framework objects are identified by a two-string structure which is referred to as an identity. A registry service provided by the framework enables the application to use objects' identities instead of C++ pointers or references when communicating with the framework.
Figure 3.2: Layers of the General Array Framework: the Application Layer; the Programming Interface Layer (PIL) (arrays, operators, results, factory, etc.); the Calculation Execution Control Layer (CECL) (made up of an engine); the Operation Execution Layer (OEL) (operator calculation logic); and the Hardware Layer (GPUs, FPGAs, CPUs, DSP boards, etc.). Each layer is aware only of the layer directly below it and gives service to the layer directly above it. The only exception to the rule is the CECL, which has direct access to the Hardware Layer.

The framework can be modelled by five stacked layers, as shown in Figure 3.2. Each layer is described hereunder:
The Application Layer: As its name implies, this layer represents the application built over the GAFW. The application constructs calculation trees, inputs numerical data into the system, requests calculations and retrieves results. Note that the application layer is unaware of the underlying hardware, as it is up to the framework to make the necessary adaptations to execute calculations on the underlying hardware. This keeps the development of the application layer simple.
The Programming Interface Layer (PIL): This layer defines all the functionality that the framework provides to the application layer. Its design is intended to offer a remarkably simple interface to the application layer. Section 3.3 discusses the programming interface layer in greater detail.
The Calculation Execution Control Layer (CECL): This layer contains all the logic to perform a calculation. It is well aware of the underlying hardware and adapts to it. As already pointed out, only GPUs are currently supported, and it adapts calculations to the number of GPUs available. The layer is made up of a multi-threaded engine providing all the GPU control logic needed to execute a calculation. It requires direct access to the underlying hardware.
The Operation Execution Layer (OEL): This layer is responsible for executing operator code on the hardware.
The Hardware Layer: This is the hardware itself. As pointed out several times, CUDA-compatible GPUs are the only supported hardware. The aim is to support other hardware, such as FPGAs, DSP boards, CPUs and clusters, in the future.
This section discusses further concepts and details of the framework as seen from the point of view of the application layer. Each subsection discusses a concept, object or service provided by the framework.
Since a priority in the design of the framework is simplicity, the handling of objects is mandated to the framework and stripped from the application. The framework delivers this service through a factory. Only one instance of the factory can exist at any moment in the application life cycle, and it is expected to exist throughout the whole lifetime of the application. Whenever the application requires the creation or destruction of framework objects, it does so by making a request to the framework's factory.

The factory simplifies the management of operators for the application layer, since operator initialisation and destruction logic can vary from one operator to another. Further details are discussed in section 3.3.4.

The use of the factory is beneficial for forward compatibility with future releases of the framework. Since all the initialisation/destruction logic is under the absolute control of the framework, any future enhancement of the framework that changes such logic can be implemented with no change to the application layer code. The application will still be able to compile against new versions of the framework.

The factory represents the framework, and its instantiation is analogous to the instantiation of the framework. Once the factory is set up, there are no other procedures to follow to initialise the framework. The application can immediately build calculation trees and subsequently request calculations.
Framework objects are identified using a two-string structure called an identity. User-defined objects can also have an identity, and the framework is well structured to support such user-defined objects. An identity is defined by two strings: a name and an object-name. The name identifies the type of object, such as "array" or "result", while the object-name identifies the object itself. In Figure 3.1, the object-names "A" and "B" are used to represent two arrays in the calculation tree. Every object is expected to have a unique object-name, which is given by the application layer. Names are given by the framework and are useful to distinguish between different operators (such as "FFT" or "Dot Multiply").
Object-names have a hierarchical naming scheme (names, in contrast, have no such scheme). A dot (.) is used to separate levels in the hierarchy. There are no rules regarding how the hierarchy is set up, but, if an object is to be registered, its parent needs to be registered beforehand. The hierarchy is essential to avoid conflicts in object-names. For example, if two unrelated trees are developed, there is the possibility that, by mistake, the same object-name is chosen for objects in the two trees. If each tree uses its own namespace, by defining its own unique branch in the hierarchy, then the issue is avoided. Note that the framework provides mechanisms to support application-defined objects working in their own namespace.

The use of identities gives the possibility of communicating with the framework about objects using object-names or names instead of C++ pointers or references. A registry service is available in the framework that keeps a mapping between object-names and their respective objects. Framework objects are automatically registered on creation. Other application-defined objects can be registered manually. This strategy relieves the application from maintaining C++ memory pointers to the objects that it needs.

An array represents the movement of data within a tree. It has an identity and is automatically registered by the factory on creation. An array is defined by its dimensions and the type of numbers it holds (e.g. integers, complex numbers, etc.). An array within a tree can be in one of three modes: input, output or bound. The modes are mutually exclusive, so an array can only be in one mode at a time.

An array is in input mode if numerical data is manually pushed in by the application layer. The array provides the entry point for inputting such data. An array can be bound to a result object.
Such a bind instructs the framework to use the last data generated for the result object as input to a subsequent calculation in which the bound array is involved. If an array is neither bound nor in input mode, then the array is in output mode and is expected to be set as an output of an operator. The application is not normally expected to define the dimensions and data type for such arrays, since operators are expected to have logic that determines such properties automatically during validation (refer to section 3.4.1).

A GAFW operator describes logic that generates numerical data. For example, an operator might describe the logic of a Fast Fourier Transform.
Operators have an identity (discussed in section 3.3.2). The identity's name is an essential property of an operator and is one of the main reasons why an identity includes a name. Different operators are regarded as different object types and must have different names. The application communicates its request for a new operator object using the name of the operator. Good, meaningful names are those that describe the execution code represented by the operator, such as "FFT". In the general case, different operators are implemented using different classes. Thanks to identities and object-oriented polymorphism, this complexity is hidden from the application, which is aware only of the base class.

An operator has three forms of input data: input arrays, pre-validator arrays and parameters. Input-less operators are legitimate.
Input arrays are an ordered list of arrays that describe the data on which the operator operates. The numerical data that each array represents is made available in GPU memory during operator execution.
Pre-validator arrays are another ordered list of arrays. Unlike the normal input arrays, their data is not presented to the operator during execution. They are only applicable during the validation phase (refer to section 3.4.1). The framework guarantees that they are validated before the operator. It is expected that the operator validation logic will need to obtain some information from such an array (such as dimensions or data type) in order to execute its own validation logic.
Parameters are a set of variables that are set up prior to calculations. They can be regarded as configuration data for the operator. For example, an operator doing zero padding would need to know how much zero padding is needed. This information can be given to the operator through parameters.

An operator has only one form of output, which is an ordered list of arrays in output mode. An operator can also request some additional allocation of memory on the GPU for its own private use while executing. This memory is referred to as a buffer.
Operators exist in the PIL and OEL layers (refer to Figure 3.2). As mentioned earlier, only the base class is presented to the application layer. The PIL, therefore, only defines the basic interactions of the operator with the application layer, while in the OEL the operator defines all validation and execution logic.

The procedure to develop an operator is simple. The operator is defined by a new C++ class that inherits from the operator base class. Two methods need to be overloaded: one providing the validation logic and another providing the kernel submission logic. In most cases, a new CUDA kernel needs to be coded. This is the only hard part of the whole procedure, since it requires skilled expertise in CUDA programming. The final step is to register the operator in the framework by its name. The registration process requires coding for class instantiation and operator object destruction.
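The development procedure just described can be sketched as follows. The base class, its two overloadable methods and the registration map are hypothetical simplifications of the GAFW interface; a real operator's `submitKernels` would launch CUDA kernels.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical operator base class: one method for validation logic,
// one for kernel submission (the two methods an operator overloads).
struct Operator {
    virtual ~Operator() = default;
    virtual bool validate(const std::vector<int>& inputDims,
                          std::vector<int>& outputDims) = 0;
    virtual void submitKernels() = 0; // CUDA kernel launches would go here
};

// Example operator: an element-wise scale. Its validation logic simply
// propagates the input dimensions to the output array.
struct ScaleOperator : Operator {
    bool validate(const std::vector<int>& in, std::vector<int>& out) override {
        out = in;           // output has the same shape as the input
        return !in.empty();
    }
    void submitKernels() override { /* kernel launch omitted in sketch */ }
};

// Registration by name: the framework stores a maker function, so the
// application requests operators by string and sees only the base class.
inline std::map<std::string, std::function<std::unique_ptr<Operator>()>>&
operatorRegistry() {
    static std::map<std::string, std::function<std::unique_ptr<Operator>()>> r;
    return r;
}
```

Registration then amounts to one line such as `operatorRegistry()["Scale"] = []{ return std::unique_ptr<Operator>(new ScaleOperator()); };`, mirroring the instantiation/destruction coding step mentioned above.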
Result objects provide the entry points for controlling and retrieving calculated data and for requesting calculations. Each array has one and only one result object associated with it, which contains all the generated numerical data related to the array. Since calculation logic is automated within the framework, the application needs to give instructions on how to handle the data. These instructions are listed and explained below:
Application requires results:
The application has to inform the framework about its intention of retrieving the calculated data. This is particularly essential for performance, because extra memory copies from GPU to host memory are required to allow the application to access the result. It also removes any unnecessary memory allocation on the host.
Data re-usability:
This defines the intention of the application to request subsequent calculations that will re-use the generated data. On such an instruction, the framework keeps a copy of the result, possibly in GPU memory, for re-use.
Overwrite:
Setting a result for overwrite instructs the framework that, prior to executing the corresponding operation on the GPU, it should initialise the output data to the last calculated result.

These instructions can be given to any result object participating in a calculation, and not only to the result object through which a calculation is requested. For example, referring to the calculation tree in Figure 3.1, if O is requested to be calculated and C is set to be a required result, then the framework will copy the result of C to the host and make it available to the application.
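The three instructions can be pictured as flags on a result object, as in the following sketch (hypothetical types and names, not the GAFW classes). The point of the "required" flag is that the GPU-to-host copy happens only when the application has declared its intent to read the data.

```cpp
#include <vector>

// Hypothetical per-result instructions mirroring the three described
// above: require the result on the host, keep it for re-use, or use it
// to initialise the next run's output.
struct ResultPolicy {
    bool required  = false; // copy the GPU result back to host memory
    bool reusable  = false; // keep a copy (ideally on the GPU) for re-use
    bool overwrite = false; // initialise output with the last result
};

struct Result {
    ResultPolicy policy;
    std::vector<float> hostCopy; // filled only when the result is required
    // Called by the framework when the GPU calculation finishes.
    void finish(const std::vector<float>& gpuData) {
        if (policy.required) hostCopy = gpuData; // skip needless copies
    }
};
```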
As their name implies, proxy results are meant to serve as proxies to result objects.
Proxy result objects behave as if they are genuine result objects (the result class is inherited). The proper result object which is proxied is configurable at any time and can be changed at will. The main use of these objects is in situations where a fixed result object needs to be presented, but needs to be changed every now and then behind the scenes.
GAFW modules are intended to contain a defined calculation in one object. They are application-defined and need to inherit from the module base class. Inputs and outputs of modules are in the form of result objects. The module must provide all the logic to request calculations from the framework.
It is advantageous to have a statistics system in a high-performance application. For example, it is helpful to have execution timings of operations executed on the GPU. The framework generates its own statistics, which are pushed to the application for further processing. A single statistic is contained in a unique object whose class inherits from a general base class. An interface is defined in such a way that statistic objects can be posted through it. The interface needs to be implemented by the application. It is also expected that posted statistics are processed in a background thread. This design allows the application to use the infrastructure for its own generated statistics.
The Calculation Execution Control Layer is made up of an engine whose function is to execute a calculation request. This section discusses the engine and how it executes a calculation over GPUs. Figure 3.3 portrays a high-level design of the engine.
Figure 3.3: High-level design of the engine. A calculation request passes through the validation logic and the snapshot-taking logic; a scheduler then dispatches work to one device manager per hardware device (GPU), and a final maintenance and statistics stage delivers the results.
Calculations are executed in three steps: validation, snapshot taking and the proper calculation. Each step is discussed in the following subsections.
The framework needs to verify that the calculation tree is valid. It also needs to fill in any missing tree data (array dimensions and data types) that can be inferred. If the framework has insufficient data, or the calculation tree is invalid, then the validation process returns an error in the form of a C++ exception. Some of the validation logic has to be supplied by the operators used in the calculation tree. Each operator has to validate itself with respect to its unique specifications, such as the number of inputs and outputs. It also has to determine the dimensions of the output arrays when these are missing or incorrect. The validation algorithm is depicted in the flow chart shown in Figure 3.4. The algorithm is split into two sub-algorithms, one for the validation of arrays and one for the validation of operators. These two sub-algorithms recur on each other.
Validating an array in output mode requires the validation of the respective operator, while validating an operator requires the validation of all its input arrays and pre-validators. In this way, the tree is traversed until non-output arrays or input-less operators are found. The algorithm then reverses back, validating all objects.
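The mutual recursion between the two sub-algorithms can be sketched as follows. This is a deliberately simplified model (two-field node types, no bound arrays, no dimension inference) meant only to show the traversal structure, not the actual GAFW validation code.

```cpp
#include <vector>

// Simplified tree nodes: an array is in output mode when it has a
// producing operator; otherwise it must carry its own dimensions.
struct Op;
struct Arr {
    Op* producer = nullptr; // non-null => array is in output mode
    bool hasDims = false;   // set for input-mode arrays
};
struct Op {
    std::vector<Arr*> inputs; // input arrays (pre-validators omitted)
};

bool validateOp(const Op& op);

// Array sub-algorithm: an output-mode array defers to its operator.
bool validateArr(const Arr& a) {
    if (a.producer) return validateOp(*a.producer); // recurse up the tree
    return a.hasDims; // an input-mode array must be fully specified
}

// Operator sub-algorithm: all inputs must validate before the operator's
// own checks (which would run here) are applied on the way back.
bool validateOp(const Op& op) {
    for (const Arr* in : op.inputs)
        if (!validateArr(*in)) return false;
    return true;
}
```

The recursion bottoms out exactly where the text says: at non-output arrays or input-less operators, after which the call stack unwinds, validating every object on the way back.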
Taking a snapshot means copying all relevant data regarding a calculation such that the proper calculation can be executed asynchronously. In this way, the proper calculation takes place in the background while the application can change trees, input new data, request other calculations or execute any other logic. To execute a calculation, the engine transforms the calculation tree into a stream of operations to run. Each element in the stream describes an operation in full, including transient data such as state, locking mechanisms and memory allocations on the GPU for input and output data. The engine relies exclusively on this data to perform a calculation.
This step computes the calculation tree over GPUs. Simplistically speaking, it is a matter of memory management and kernel submission. All tasks are executed by the engine, with the exception of kernel submission. The latter is executed by the operator on request from the engine.

The engine is implemented using a multi-threaded approach. All the work done by the engine is divided into simple tasks, each handled by a separate thread. A blocking FIFO queue is used to communicate between threads. This approach enables the framework to monitor many events concurrently without the need for looping. It avoids busy waiting, where events are continuously queried for status updates. It also helps the framework act quickly on events on which a thread blocks: the waiting thread is released immediately when the event fires.

The engine delegates the actual management of a device to a device manager. This simplifies functional support for multiple GPUs (and, in the future, other devices). A unique instance of the manager is brought up for every device supported. Each has its own task threads, as illustrated in Figure 3.5. The engine schedules an operation to be executed on a device by submitting it to the respective device manager. In systems having more than one GPU, a whole calculation request is submitted to one GPU while the next calculation request is submitted to the next GPU. This simple scheduling algorithm proved to be good enough for the imaging tool discussed in the next chapter.

GPU memory management works as follows: the allocation thread continuously allocates memory for incoming operations until the memory is full or there are no operations in the pipeline. Once the GPU memory is full, it waits until the post-execution thread frees memory. Memory is freed once the computation related to an operator is ready and the memory will not be reused for future operations.
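The blocking FIFO queue mentioned above is a standard primitive; a minimal C++ sketch (not the engine's actual implementation) looks like this. A consumer thread sleeps inside `pop()` until an item arrives, which is exactly how busy waiting is avoided and why the waiting thread is released as soon as the event fires.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking FIFO queue for inter-thread communication.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one(); // wake one thread blocked in pop()
    }
    // Blocks (without spinning) until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```

Each engine task thread would sit in a loop of `pop()`, process the operation it receives, and `push()` it onto the next task's queue.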
Memory that is not freed and is not set as input or output for any operation submitted to the device manager is managed by the memory management thread. Such memory can halt the whole calculation process, as it might not leave enough space for currently scheduled operations. In such a scenario, the device manager caches the data in host main memory (if not already done) and frees the memory on the GPU. Caching also takes place when data needs to be transferred from one GPU to another, or when de-fragmentation of memory is required.

Unfortunately, the CUDA runtime API [62] is unable to allocate memory on a GPU while a kernel is being executed. This causes the allocation thread to freeze up during kernel execution, reducing thread concurrency on the CPU. This has a direct impact on overall performance, since data transfers cannot be initiated prior to the allocation of the respective memory (GPU best practices [49] hide the time taken to transfer data from host to GPU by performing memory transfers in parallel with kernel execution). In order to ease the problem, a locking mechanism is in place that denies concurrent allocation of memory and submission of kernels. In this way, the allocation process works in batches while kernel submission is locked.

In the chapter's introduction, it was claimed that the framework is designed for collaboration. This section elaborates on this argument. Framework-related development of an application can be divided into two areas. The first area is the application layer. Development related to this area is easy and fast, with no specialised expertise required. Whoever attempts development in this area requires general knowledge of the behaviour of the calculation, with no details on how it is deployed on the GPU. This area suits remarkably well the scientist who has general expertise in programming and views programming as a means to reach scientific goals. Development of operators is the second area.
Developing an operator requires specialised expertise in high-performance parallel computing on GPUs. Much less knowledge about the overall behaviour of the calculation is required, but in-depth and detailed knowledge of the operator is a must. It is suited to a developer who has little interest in the scientific goals and much more interest in writing high-performance code. The two areas are separate, and the only commonalities are the framework itself and the functional specification of the operator. It is therefore easy to promote collaboration between the scientist and the developer, and to get the best out of the two. One notes that functional specification documents are a standard in the industrial software development world; they are the main tool with which developers communicate with their clients.
As part of the implementation, some standard operators have been developed. This section discusses the most noteworthy ones.
This operator executes the Fast Fourier Transform over arrays of any dimension. It is possible to divide an array into sub-arrays of lesser dimensions and perform a Fourier transform over them. For example, in the case of a matrix, it is possible to request a 1-dimensional Fourier transform over each column. The operator is implemented using the CUDA cuFFT library [63] provided by NVIDIA. Unfortunately, the documentation [63] does not give substantial information on the buffer space required to execute an FFT; it only states that it is between one and eight times the size of the input being transformed. To be on the safe side, the operator has been set to request a buffer eight times larger than the input.
Given a sequence \(\{a_i\}_{i=0}^{n-1}\), the all-prefix-sums operator performs an accumulated addition over the elements to return the sequence \(\{b_i\}_{i=0}^{n-1}\) defined as follows:
\[
b_i = \begin{cases} 0 & \text{if } i = 0 \\ b_{i-1} + a_{i-1} & \text{otherwise} \end{cases} \tag{3.1}
\]
This operation is also known as an exclusive scan. The implementation is based on the ideas presented in [64], which in turn build on the work of Blelloch [65]. Balanced trees are used. This implementation computes the lower levels of the balanced tree in registers, giving a significant boost to performance.

This is a generic operator (in C++ terms: a template). It provides general code for operators that handle reductions of the form:
\[
R = \bigoplus_{i=0}^{n-1} \bigotimes_{j=0}^{k-1} a_{j,i} \tag{3.2}
\]
where \(\left\{\{a_{j,i}\}_{i=0}^{n-1}\right\}_{j=0}^{k-1}\) is a sequence of sequences, each element sequence being an input to the GAFW operator; \(\bigoplus\) is a general mathematical operator that has to be associative; and \(\bigotimes\) is another general mathematical operator that does not need to be associative. A specialisation of the template needs to define the two mathematical operators and the value of \(k\). The value of \(n\) is determined from the size of the input arrays.

The operator can reduce over dimensions. For example, in the case of a matrix, it is possible to reduce each column separately. The result will be a 1-dimensional array of length equal to the number of columns of the matrix.

Reductions are simple to implement on GPUs. In this implementation, a kernel is run with a configuration of maximum threads per block and a number of blocks high enough to reach nearly 100% real occupancy. The workload is split evenly over all threads such that each thread reduces a subset of the \(n\) elements in each sequence. Results are saved in shared memory so as to run \(\bigoplus\) over all results produced by the threads in a given block. Each block's value is stored in global memory in such a way that a second kernel re-applies \(\bigoplus\) over the values calculated by each block.
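To pin down the semantics of the two operations, here are plain CPU reference versions (the GAFW operators run the same computations on the GPU). The column reduction stands in for the general \(\bigoplus/\bigotimes\) template with \(\bigoplus\) specialised to addition and \(k = 1\).

```cpp
#include <cstddef>
#include <vector>

// Exclusive scan per equation 3.1: b_0 = 0, b_i = b_{i-1} + a_{i-1}.
std::vector<int> exclusiveScan(const std::vector<int>& a) {
    std::vector<int> b(a.size());
    int running = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        b[i] = running;   // b_i holds the sum of a_0 .. a_{i-1}
        running += a[i];
    }
    return b;
}

// Reduce each column of a row-major rows x cols matrix with +,
// yielding one value per column.
std::vector<int> reduceColumns(const std::vector<int>& m, int rows, int cols) {
    std::vector<int> r(cols, 0);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            r[j] += m[i * cols + j];
    return r;
}
```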
Figure 3.4:
Flow chart of the validation algorithm. The algorithm is split into two sub-algorithms: one for validating arrays and one for validating operators. The starting point is the validation of the array associated with the result object through which the calculation has been requested. The algorithm iterates between the two sub-algorithms so that it traverses the calculation tree until finding a non-output array or an input-less operator. Once such objects are found, it reverses back while validating all objects.
Figure 3.5: A flow chart depicting the tasks applied to an operation once scheduled on the CUDA Device Manager. Each process block represents a task that is executed in a separate thread: the allocation thread allocates memory on the GPU and initiates memory transfers; the copy wait thread waits until all memory transfers are ready; the submit thread submits the operator's kernel/s to the GPU; the execution wait thread waits for kernel execution to complete; and the post-execution thread copies memory to the host and frees it where necessary. The memory management thread does not handle operations but instead administers memory allocated on the GPU. The decision block at the middle of the diagram (will memory be reused by a not-yet-scheduled operator?) is executed by the post-execution thread.

4. The Malta-imager
The Malta-imager (mt-imager) is a high-performance image synthesiser for radio interferometry. It is implemented in C++ [61] and CUDA [62], and uses the GAFW infrastructure (refer to chapter 3). It achieves unprecedented performance by means of GPUs. The multi-threaded CPU design for handling pre- and post-processing of data is also a crucial ingredient in ensuring the best performance. The image is synthesised using w-projection (refer to section 2.2). An independent multiple-Stokes dirty image is synthesised for each channel. Measurement data is expected to be available as a MeasurementSet [17] stored using the Casacore Table System [66]. Output images are stored as a 4-dimensional data cube in a FITS [67] primary image. Most of the calculations are done in single-precision mode [68–70].
The design is based on the General Array Framework (GAFW) philosophy (refer to chapter 3). The system is made up of seven autonomous components, as depicted in Figure 4.1. None of them interact directly with each other and, with a few exceptions, they are unaware of each other. Data is communicated between components using the GAFW, by means of result objects (see section 3.3.5). The main() function integrates all components together.

Each component is assigned a unique set of tasks. The Configuration Manager takes care of producing configuration information for each component, based on the local configuration and environment. The Visibility Manager loads data from a MeasurementSet, then sorts, converts and inputs the data into the GAFW infrastructure. The WImager performs gridding over a multi-polarised grid, while the Image Finaliser converts the grid to a multiple-Stokes dirty image. These two components are implemented as GAFW modules, and all their calculations are made over GPUs. Channels are processed independently and, for each channel, separate instances of the two components are set up. Numerical representations of the convolution functions \(\tilde{C}_w(u, v)\) (defined in equation 2.24) are generated by the Convolution Function Generator. The Image Manager stores output images in FITS files, while the Statistics Manager processes statistics generated by each component, including the GAFW, and reports them in various CSV (Comma-Separated Values) files.

Data is processed in chunks to exploit the parallel mechanisms available on the hardware. This is essential to ensure high performance. A GPU is by itself a parallel device which can only achieve high performance through parallel methods. Presenting the GPU with a suitably sized chunk of data ensures the best gridding performance. In the case of a multi-GPU system, by virtue of the GAFW, the imaging tool grids independent channels on different GPUs so as to achieve concurrency over GPUs.

Fast gridding on GPUs is the imaging tool's strong suit. Nevertheless, CPU-bound pre-processing and post-processing of data is a necessity since, by design, GPUs are limited. For example, GPUs cannot load data from hard disk, neither do they recognise C++ objects; nor can they save images to permanent storage. If these pre- and post-processing steps are not well handled on the CPU, then they can severely compromise performance.
Figure 4.1: High-level design of the imaging tool showing the various components and relevant data flow. The components (Configuration Manager, Visibility Manager, Convolution Function Generator, WImager, Image Finaliser, Image Manager and Statistics Manager) communicate through the GAFW; the MeasurementSet, FITS files and statistics CSV files reside on permanent storage.
The execution time of the pre- and post-processing steps is hidden by having the CPU, GPU and permanent storage I/O running in parallel. Most components perform their tasks asynchronously to each other. Since data is processed in chunks, the Visibility Manager prepares the chunks in the background through a multi-threaded mechanism. Once the first chunk is available for gridding, the WImager component (by virtue of the GAFW) grids the data on the GPU while the Visibility Manager continues its task of preparing other chunks. This achieves concurrent use of the CPU and GPUs.

Channels are processed one after the other (on multi-GPU systems, channels are processed in parallel, as many at a time as there are GPUs). Once a channel's dirty image is finalised, the Image Manager saves the image to disk while subsequent channels are being processed. This ensures concurrent use of the GPU and permanent storage. The Visibility Manager, by virtue of its multi-threaded design, ensures concurrency between CPU processing and permanent storage I/O, and exploits the multi-core infrastructure of modern CPUs.

The core of gridding is based on Romein's algorithm [37], which is enhanced, implemented and adapted for the necessities of this imaging tool. The algorithm requires that data is grouped by baseline and sorted in time. The ordering is done by the Visibility Manager.

The imaging tool supports single, dual or quad polarised data; both linear and circular polarisation are supported. The term multi-polarisation is used to describe any number of polarisations. It should be noted that the imaging tool handles each multi-polarised visibility record as one entity. Output images are converted to the I, Q, U, V Stokes format, or a subset thereof, depending on the polarisations available.

Visibility data is gridded using the natural weighting scheme (see section 2.1.2). The required variance is read from the MeasurementSet. Flagging is also supported. During pre-processing phases of the data, not handled by the imaging tool, some visibility records might be flagged for various reasons, including erroneous readings. The imaging tool does not grid any such flagged data. Note that flagging is done per polarisation, and a flag has a value of 1 when the respective polarised data is not to be gridded and 0 otherwise.

The final point in this section is about channel frequencies. Since, in the general case, the interferometer resides on the Earth, channel frequencies are normally given in the topocentric frame of reference. However, to make corrections for Doppler effects caused by the motion of the Earth, the imaging tool grids using the Local Standard of Rest Kinematic (LSRK) frame of reference. Frequency values in this frame of reference are not given directly by the MeasurementSet, so a conversion is required. This is taken care of by the Visibility Manager using the CPU. It is the only calculation done in double-precision mode [68].
Configuration Manager
Runtime configuration of the mt-imager is done via command-line arguments and configuration files. The configuration is a set of parameters, each defined by a keyword and a value. This is similar to the Common Astronomy Software Applications (CASA) [46] and the standalone lwimager tool based on the casacore libraries [71]. The keyword and its respective value are separated by the equals (=) character. For example, nx=100 defines a parameter named nx with its value set to 100. A type, such as string, boolean or integer, is attributed to each parameter value. Boolean parameters can be set to true by putting a minus (-) just before the parameter name. For example, -test and test=true are equivalent.
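To make the convention concrete, here is a simplified sketch of keyword=value parsing with command-line-over-file precedence (hypothetical code, not the actual mt-imager parser; real parsing would also attribute types to values).

```cpp
#include <map>
#include <string>
#include <vector>

// Merge configuration-file lines and command-line arguments into one
// keyword -> value map. File entries are applied first, then command-line
// pairs overwrite them, so the command line takes precedence.
std::map<std::string, std::string>
mergeConfig(const std::vector<std::string>& fileLines,
            const std::vector<std::string>& cmdArgs) {
    std::map<std::string, std::string> cfg;
    auto apply = [&cfg](const std::string& entry) {
        if (entry.empty() || entry[0] == '#') return; // blank or comment
        auto eq = entry.find('=');
        if (eq == std::string::npos && entry[0] == '-')
            cfg[entry.substr(1)] = "true";            // -test => test=true
        else if (eq != std::string::npos)
            cfg[entry.substr(0, eq)] = entry.substr(eq + 1);
    };
    for (const auto& l : fileLines) apply(l); // defaults from the file
    for (const auto& a : cmdArgs) apply(a);   // command line wins
    return cfg;
}
```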
Parameters can be set within a manually specified configuration file. In principle, all parameters can be specified as command-line arguments to mt-imager, but this makes the command quite long and subject to errors (an issue in lwimager). The use of a configuration file solves the problem. It is a simple text file where parameters are each listed on a line of their own. Empty lines are allowed, and lines beginning with the number sign (#) are treated as comments and ignored. Parameters set through the command line take precedence over those defined in the configuration file. This enables the user to have a default configuration stored in a file and partially overridden by command-line arguments.

The mt-imager uses a logging system based on the Apache log4cxx API [72]. It requires an extra configuration file with a format defined by the API. Its location is configurable through a parameter.

The Configuration Manager is the holder of all configuration logic. It is a passive component, as all its logic runs in the main thread, and it does not interact with any other components. It produces a component-specific configuration that is passed by the main thread to each component during initialisation. Due to its nature, the Configuration Manager is code-wise dependent on the other components. This dependency is one-way, since the other components are neither aware nor dependent on the Configuration Manager.

The Convolution Function Generator is entrusted with calculating the w-projection oversampled convolution functions \(\tilde{C}_w(u, v)\) (defined in equation 2.24). It is not practical to calculate the convolution functions during gridding; instead, they are numerically evaluated and presented to the WImager component in an oversampled format.
This component can generate convolution functions in two modes: normal and w-projection mode. The mode used is configurable at runtime.

In normal mode, the generator only evaluates \(\tilde{C}_w(u, v)\) at \(w = 0\), and the w-term is ignored (refer to section 1.4). The term "normal" is used by other imaging tools, such as CASA [46] and lwimager, to describe the same behaviour as in the mt-imager. In this mode, the support of the convolution function is runtime configurable. The system will zero pad or trim the function so as to produce the desired support. This feature is useful for executing performance tests, and it was pivotal in many of the experiments described in chapter 5.

In w-projection mode, \(\tilde{C}_w(u, v)\) is evaluated over various w-planes, depending on the runtime configuration. Each convolution function is trimmed to its support size. Support is evaluated after examining the generated data. As already discussed in section 2.2, the support of \(\tilde{C}_w(u, v)\) is a function of \(w\) that increases with increasing \(w\).

In both modes, the choice of the tapering function is runtime configurable. Tapering functions are implemented as GAFW operators, and the choice is defined by specifying the name of the operator. In this thesis, only one tapering function operator has been developed, called casagridsf. This implements the same prolate spheroidal function used in the casacore API [71]. It is based on work presented by Schwab [40] and has been adapted to work over GPUs.

The oversampling factor is a runtime configurable variable. Since memory is limited, it is suggested to keep it to a low value of 4, which is the value proposed by Cornwell et al. [26]. This value is hard-coded in lwimager and CASA. Zero-padding is used as an interpolation scheme.

The generator samples in \(w\). As recommended by Cornwell et al. [26], \(w\) is sampled in \(\sqrt{w}\) rather than linearly. There is no need to calculate for \(w < 0\), since the convolution functions are symmetrical around \(w = 0\).
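A short sketch of what "sampling in \(\sqrt{w}\)" means in practice: placing \(n\) planes uniformly in \(\sqrt{w}\) makes plane \(i\) sit at \(w_i = w_{\max}\,(i/(n-1))^2\), so planes are denser at small \(w\). The exact plane-placement formula used by the mt-imager is an assumption here.

```cpp
#include <vector>

// Place n w-planes between 0 and wMax, uniform in sqrt(w): the index
// fraction f = i/(n-1) is linear in sqrt(w), so w is quadratic in f.
std::vector<double> wPlanes(double wMax, int n) {
    std::vector<double> w(n);
    for (int i = 0; i < n; ++i) {
        double f = static_cast<double>(i) / (n - 1); // 0 .. 1 in sqrt(w)
        w[i] = wMax * f * f;                         // quadratic in index
    }
    return w;
}
```

For example, 5 planes up to \(w_{\max} = 100\) land at 0, 6.25, 25, 56.25 and 100 wavelengths: the spacing grows with \(w\), matching the slower growth of the support at large \(w\).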
The maximum w to consideris runtime configurable. The
Convolution Function Generator outputs three GAFW results that containall convolution functions data required for
WImager to do its job. Some simplemathematical treatment is given here to help describe the content of the outputs.When a record is gridded, only the numerical data of one convolution function thatfalls on the pixels of the UV-grid is used. Any oversampled point that does notfall on the UV-grid is not considered. Define M ( w, u, v ) as the function describingthe operation that chooses data. ( u, v, w ) are the baseline components of the recordbeing gridded expressed in number of wavelengths. The function M ( w, u, v ) returnsthe chosen data in a matrix with dimension S w × S w , where S w is the support ofthe convolution function ˜ C w ( u, v ). Define the half-width h w = ( S w + 1) / m i,j ( w, u, v ) be an element of M ( w, u, v ), where i is the 1-based index of the row,and j is the 1-based index of the column. Then: m i,j ( w, u, v ) = ˜ C w (( j − h w )∆ v − δv, ( i − h w )∆ u − δu ) (4.1)where δu and δv satisfy the simultaneous equations: − ∆ u < δu ≤ ∆ u − ∆ v < δv ≤ ∆ v δu = u + k ∆ u k ∈ Z (4.2c)62 hapter The Malta-imager δv = v + k ∆ v k ∈ Z (4.2d)Note that the image of M ( w, u, v ) is finite since ˜ C w ( u, v ) is oversampled in u and v , and sampled in w .As defined in equation 4.5, the WImager also calculates a normaliser. There aresubstantial computational savings if the summation of the real parts of the elementsof M ( w, u, v ) are pre-calculated. Let ζ ( w, u, v ) define such summation such that: ζ ( w, u, v ) = S w X i =1 S w X j =1 < ( m i,j ( w, u, v )) (4.3)where < is the real operator.The three outputs containing all data related to ˜ C w ( u, v ) are now explained. Thefirst output is referred to as the Convolution Functions Numeric Data output. Itcontains all numerical data of the oversampled ˜ C w ( u, v ). The data is laid down asfollows: The data is first ordered by convolution function in increasing w . 
Further ordering for each convolution function is such that the elements of each matrix member of the image of M(w, u, v) are coalesced in memory, stored in row-major form. This is in accordance with GPU best practices, since when a record gets gridded the selected convolution function data is accessed in parallel. All matrix members of the image of M(w, u, v) are ordered in increasing order of δu and δv.

The second output is referred to as the Convolution Function Data Position and Support output. It contains two elements for each convolution function. The first element is the function's support. The second is an index pointing to the first element of the first output that describes the convolution function. This index is vital to search for the right data to use for gridding.

The third output contains the image of ζ(w, u, v). Its layout is similar to the layout of the first output. (The image of an arbitrary function f(X) is defined as the set {f(x) : x ∈ X}.)
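The data selection performed by M(w, u, v) in equation 4.1 and the pre-computed sums ζ(w, u, v) of equation 4.3 amount to very little code. The sketch below is illustrative Python, not the GPU implementation; the function names are chosen here, and the oversampled convolution function is modelled as a Python callable:

```python
def select_cf_samples(cf, support, du, dv, delta_u, delta_v):
    """Model of M(w, u, v): pick the oversampled convolution-function
    samples that fall on UV-grid pixels, as an S_w x S_w matrix.

    cf       -- callable cf(u, v) modelling the oversampled C~_w(u, v)
    support  -- S_w, the support of the convolution function
    du, dv   -- the offsets delta-u, delta-v satisfying equations 4.2
    delta_u, delta_v -- UV-grid cell sizes
    """
    h = (support + 1) / 2.0  # half-width h_w = (S_w + 1) / 2
    # Equation 4.1: m_ij = C~_w((j - h)dv - du_frac, (i - h)du - dv_frac)
    return [[cf((j - h) * delta_v - dv, (i - h) * delta_u - du)
             for j in range(1, support + 1)]
            for i in range(1, support + 1)]


def zeta(matrix):
    """Equation 4.3: sum of the real parts of the elements of M(w, u, v)."""
    return sum(x.real for row in matrix for x in row)
```

With a constant unit convolution function and support 5, `zeta` returns 25, i.e. S_w², which illustrates why pre-computing these sums once per function is cheaper than re-summing per record.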
A CPU/GPU hybrid algorithm is used to generate the three outputs described. CPU work is done using the imaging tool's main thread. The payload on the CPU is negligible, and the thread spends most of its time waiting for some GAFW result to become available. The main thread is not available to spawn any new work while this component is calculating. This does not degrade performance, since any new work to be spawned after the generation of the convolution functions depends on the generated data. The Visibility Manager component is initialised before the generator, in such a way that convolution functions are generated by the Convolution Function Generator component at the same time that the Visibility Manager prepares data for gridding.

The three outputs are generated using Algorithm 4.1. All functions are normalised using a constant normaliser value such that C(0, 0) = 1. Support is calculated by detecting the pixel nearest to the edge that has a magnitude larger than 10−. The GPU thread configuration is set such that, for most of the steps, one thread handles one pixel.

WImager Component
The WImager component executes the w-projection algorithm (refer to section 2.2) over GPUs, through the General Array Framework. An instance of the component handles one multi-polarised UV-grid. Since the imaging tool treats each channel separately, an instance of the WImager component is created for every channel to grid.
Data: w-planes to consider
Result: The three outputs defined before
forall the w-planes do
  On GPU through the GAFW begin
    Calculate the sampled F⁻¹(C̃_w(u, v)) with zero padding;
    Apply FFT;
    Normalise the output so as to get C̃_w(u, v);
    Find the support of C̃_w(u, v);
  end
  On CPU in the main thread begin
    Wait for support info to be available;
    Manually fill up the second output;
  end
  On GPU through the GAFW begin
    Trim and re-order numerical data of C̃_w(u, v);
  end
end
On GPU through the GAFW begin
  Amalgamate all reordered C̃_w(u, v) into one sequence so as to finalise the first output;
  Calculate sums so as to generate the third output;
end

Algorithm 4.1: The algorithm used by the Convolution Function Generator component to generate the three outputs.
The component is designed to be flexible, such that it can be used in other configurations. For example, channel data could be grouped by baseline and gridded over independent instances of the WImager component; the grids would then be added up on finalisation. The component can also be used as-is for w-stacking (discussed in section 2.2), where each w-plane is gridded separately using independent instances of the component.

The WImager component is a GAFW module (refer to section 3.3.7). All inputs and outputs are GAFW result objects (see section 3.3.5). The whole algorithm is implemented as one calculation tree and is fully GPU based. It is free from CPU calculations, to ensure the asynchronous behaviour explained in section 4.1. The implemented algorithm is, from here onwards, referred to as the WImager algorithm. The calculation tree is depicted in Figure 4.2. Some GAFW operators are grouped into one block and then expanded in the subsequent Figures 4.3, 4.4 and 4.5. Arrays are denoted with keys, tabulated in Table 4.1. The table includes the mathematical reference used to represent each sequence and a reference to the section where the sequence is discussed.

The WImager algorithm is based on Romein's algorithm, reviewed in section 2.5. It is enhanced and adapted to suit the requirements of the mt-imager. One of the main enhancements made over Romein's algorithm is compression, discussed in section 4.4.3.

Records input to the WImager algorithm are expected to be grouped by baseline and sorted by time. If this requirement is not respected, the algorithm will still work, but with a heavy penalty in performance. Note that there is no restriction on the number of baselines in an input chunk of data, and the number of records per baseline can vary. Input visibility records do not need to be on the grid, and flagging is supported. The algorithm adapts to the varying support of the convolution functions: it is perfectly fine for records in one input chunk to require different-sized convolution functions to be gridded.

The WImager algorithm is conveniently split into two phases: the preparation phase and the gridding phase. The main difference between the two is the thread configuration. In the preparation phase, each multi-polarised visibility record is processed by one thread, or in some parts with a ratio of threads per record below unity. In the gridding phase, a record is processed by a block of threads; the threads-per-record ratio is in the order of thousands or millions, and is thus expensive. The gridding phase could hold all the preparation-phase logic but, for best performance, the gridder is stripped of any logic that can be implemented in the preparation phase. Note that the gridding phase is implemented in one GAFW operator, and thus all the other operators in the calculation tree shown in Figure 4.2 are part of the preparation phase.

The rest of this section explains the WImager algorithm in detail. The subject is tackled as follows: first, the nomenclature and terminology used are defined and explained (section 4.4.1). Inputs, outputs and mathematical equations describing the algorithm are given in section 4.4.2. In the subsequent sections, the gridding phase is discussed, whereby all logic that is delegated to the preparation phase is identified. The discussion on the WImager component is concluded by explaining in detail the preparation phase in section 4.4.5.
Figure 4.2: Calculation tree of the WImager algorithm
Figure 4.3: Calculation tree of the Index Creation and Filtering block defined in Figure 4.2
Figure 4.4: Calculation tree of the Plan Creation block defined in Figure 4.2
Figure 4.5: Calculation tree of the Visibility Data Handling and Compression block defined in Figure 4.2
Key         Sequence name                                         Mathematical symbol

Visibility Records Inputs (defined in section 4.4.2)
I-1         Baseline in meters                                    {(u′_i, v′_i, w′_i)}
I-2         Channel Frequency                                     {λ^{-1}_i}
I-3         Weight                                                {ε_i}_p
I-4         Flags                                                 {f_i}_p
I-5         Visibility                                            {ϕ_i}_p

Convolution Functions Input Data (defined in section 4.3.2)
C-1         Convolution Functions Numeric Data                    N/A
C-2         Convolution Functions Data Position and Support       N/A
C-3         Convolution Functions Data Sum                        N/A

Outputs (defined in section 4.4.2)
O-1         UV-Grid                                               V_{grid,p}
O-2         Normaliser                                            N_p

Binary Sequences (defined in section 4.4.5)
B-1         Gridding Indicator                                    {g^b_i}
B-2         Compression Indicator                                 {c^b_i}
B-3         Support Change Indicator                              {χ^bf_i}
B-4         Manipulated Support Change Indicator                  N/A

Scanned Sequences (defined in section 4.4.5)
SB-1        Prefix Sum of Gridding Indicator                      {g^s_i}
SB-3        Prefix Sum of Support Change Indicator                {χ^sf_i}
SB-4        Prefix Sum of Manipulated Support Change Indicator    N/A

Index Sequences (defined in section 4.4.5)
D-1         Gridding Index                                        {g^d_i}
D-2         Last Compressed Entry Index                           {h^d_i}
Trial Plan  Trial Plan                                            N/A
Plan        Plan                                                  N/A

Non-Filtered Sequences (defined in section 4.4.5)
S-1         Position                                              {(a_i, b_i, c_i, d_i)}
S-2         Convolution Data Pointer                              {z_i}
S-3         Support                                               {s_i}
S-4         Normaliser                                            {ϱ_i}_p
S-5         Take Conjugate                                        {ψ_i}

Filtered Sequences (defined in section 4.4.5)
FS-1        Position                                              {(a^f_i, b^f_i, c^f_i, d^f_i)}
FS-2        Convolution Data Pointer                              {z^f_i}
FS-3        Support                                               {s^f_i}
FS-5        Weighted and Compressed Visibilities                  {γ^f_{i,p}}

Table 4.1: Legend for Figures 4.2, 4.3, 4.4 and 4.5
Nomenclature and terminology

The various terminologies and nomenclatures used in this section are now defined and explained. As pointed out in section 4.1, visibility records are input to the component in chunks. As indicated in Figure 4.2, the record data chunk is split into different arrays. It is therefore reasonable to refer to and represent GAFW arrays related to record data as mathematical sequences.

The usual curly-bracket format is used to define mathematical sequences (for example: {q_i}). It is to be assumed that a sequence is finite and that its first element is at index 0, that is, {q_i} = {q_0, q_1, ..., q_{n−1}}. The letter n implicitly defines the length of the sequence.

As record data is processed by the WImager algorithm, some of it is removed. This process of removal is referred to as filtering. Sequences that have had records removed are referred to as filtered sequences. Notation-wise, filtered sequences are distinguished from their non-filtered counterparts by the letter f set as a superscript on the element. For example, {q^f_i} is the filtered version of the sequence {q_i}.

An element in any sequence gives data about one record. Elements in different sequences that are either all non-filtered or all filtered describe the same record only if they are at the same position.

The algorithm makes use of binary sequences. The term indicator is used to refer to elements of such sequences. The letter b is set as a superscript on the sequence's element to show that the sequence is binary. For example, {g^b_i} is a binary sequence. Note that if the binary sequence is also filtered, the letter f is retained as a superscript; for example, {g^bf_i} is a filtered binary sequence.

Sometimes a prefix sum (exclusive scan) is run over a binary sequence to produce a scanned sequence. The resulting scanned sequence is denoted by the letter s set as a superscript on the sequence's element. For example, {g^s_i} is the scanned sequence resulting from {g^b_i}. If the binary sequence is a filtered one, the superscript letter f is retained, and the resultant filtered scanned sequence is denoted by {g^sf_i}.

Index sequences are defined as sequences containing integer index elements pointing to a record. These are generated by using binary sequences as predicates.
Index sequences are denoted by using the predicate sequence, with the superscript letter d denoting that the sequence is an index sequence. For example, {g^d_i} is the index sequence generated using {g^b_i} as predicate. If the index sequence points to records in a filtered sequence, the superscript letter f is also used (for example, {g^df_i}).

To represent sequences made up of tuples, the usual notation of a comma-separated list of sub-elements enclosed in parentheses is used. For example, {(a_i, b_i, c_i, d_i)} represents a finite sequence of 4-tuple elements. Most sequences are directly linked with the input data that defines a visibility record in full.

Various sequences include polarisation-dependent data. For such sequences, the letter p distinguishes between polarisations and is set as a subscript after the curly brackets. For example, {v_i}_p contains polarisation-dependent data. When referring to an element, the notation v_{i,p} is used.

Inputs and outputs

The WImager algorithm receives visibility record data in five inputs. Another three extra inputs contain data of the convolution functions in numerical form. These extra inputs are supplied by the Convolution Function Generator and are discussed in section 4.3. There are two outputs for this algorithm: a multi-polarised UV-grid and a normaliser. Details on the inputs and outputs of the algorithm follow.

The five visibility-record-related inputs are:

1. baseline in meters as projected on the UVW axes - {(u′_i, v′_i, w′_i)}. The apostrophe indicates that values are in meters and not in number of wavelengths.
2. channel frequency (per speed of light) - {λ^{-1}_i}, where λ_i represents the channel wavelength, the inverse of the channel frequency divided by the speed of light.
3. flags - {f_i}_p
4. weight - {ε_i}_p
5. visibility - {ϕ_i}_p

The first output of the algorithm is the multi-polarised UV-grid V_{grid,p}(u, v), sampled at regular intervals Δu and Δv. The pair of equations 4.4 gives a mathematical representation of how the algorithm calculates this output. It is based on the w-projection gridding mathematical treatment given in sections 2.1.1 and 2.2.1, with some adaptations and added features.

Q_{i,p}(u, v) = ε_{i,p} (g_i + c_i) (1 − f_{i,p}) · v_{i,p} · C̃_{w′_i/λ_i}(u − u′_i/λ_i, v − v′_i/λ_i)   (4.4a)

V_{grid,p}(u, v) = Σ_{i=0}^{n−1} [ Q_{i,p}(u, v) · III(u/Δu, v/Δv) ]   (4.4b)

C̃_w(u, v) represents the oversampled w-projection convolution functions, defined in equation 2.24.

g_i is the gridding indicator. It is calculated within the algorithm and defines whether a record is to be gridded by the gridding phase. When g_i is set to 1, the respective record is gridded by the gridding phase; when set to 0, it is not. g_i is set to 0 if and only if one of the following three conditions is met: the first condition is when gridding the record requires updates to pixels outside the grid; the second condition is when all polarisations of the record are flagged (that is, ∀p f_{i,p} = 1); the last condition is when the record is compressed. (Flagging was discussed in the penultimate paragraph of section 4.1. Compression is discussed in section 4.4.3.)
When a record is compressed, the compression indicator c_i is set to 1; otherwise the compression indicator is set to 0. Note that for a given record, g_i and c_i cannot both be 1; hence (g_i + c_i) can only have values of 0 or 1. When (g_i + c_i) equates to 0, the given record has no effect on the output grid. If g_i = 0 and c_i = 1, then the record will still be gridded and will affect the output, but not through the gridding phase. This is explained in detail in section 4.4.3.

It should be noted that when a flag f_{i,p} is set to 1, (1 − f_{i,p}) = 0 and the respective polarised visibility is ignored, since Q_{i,p}(u, v) will acquire a value of 0 over the whole grid. The gridding phase will still grid the polarised visibility, but with no effect, since it will grid zeros.

The second output of the WImager is the set of polarisation-dependent normalisers N_p. These are required by the Image Finaliser component to output images in units of Jy/beam. The following equation gives a mathematical representation:

N_p = Σ_{i=0}^{n−1} ε_{i,p} (g_i + c_i) (1 − f_{i,p}) · ζ(w′_i/λ_i, u′_i/λ_i, v′_i/λ_i)   (4.5)

ζ(w, u, v) is defined in equation 4.3, and its image is given in the Convolution Function Data Sum input.
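For a single polarisation, the normaliser of equation 4.5 reduces to a masked weighted sum over the records. The following Python sketch is illustrative only (the function and argument names are chosen here); it shows how the (g_i + c_i)(1 − f_{i,p}) factor zeroes out records that are filtered or flagged:

```python
def normaliser(weights, g, c, flags, zeta_vals):
    """Illustrative model of equation 4.5 for one polarisation.

    weights   -- epsilon_i for this polarisation
    g, c      -- gridding and compression indicators (0/1, never both 1)
    flags     -- f_i for this polarisation (1 = flagged)
    zeta_vals -- zeta(w_i/lambda_i, u_i/lambda_i, v_i/lambda_i) per record
    """
    return sum(w * (gi + ci) * (1 - fi) * z
               for w, gi, ci, fi, z in zip(weights, g, c, flags, zeta_vals))
```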
Compression

Romein's algorithm exploits the behaviour of the baseline trajectory, and in the WImager algorithm this behaviour is exploited further. There is a probability that, for some consecutive records sampled on the trajectory, the change in (u, v, w) is so small that the convolution function values and the position of gridding do not change. In mathematical terms, there is a probability that for two consecutive records at positions i and i + 1 the equality shown below holds:

C̃_{w′_i/λ_i}(u − u′_i/λ_i, v − v′_i/λ_i) · III(u/Δu, v/Δv) = C̃_{w′_{i+1}/λ_{i+1}}(u − u′_{i+1}/λ_{i+1}, v − v′_{i+1}/λ_{i+1}) · III(u/Δu, v/Δv)   (4.6)

In this case, by virtue of equation 4.4, the weighted and flagged visibilities of the two entries, that is, ε_{i,p} v_{i,p} (1 − f_{i,p}) and ε_{i+1,p} v_{i+1,p} (1 − f_{i+1,p}), are summed together in the preparation phase and gridded as one record. This thesis terms the process compression, in view that some records are "compressed" into one.

The effectiveness of compression depends on many parameters: for example, on the observation integration time, the grid configuration (Δu, Δv), and the oversampling factor of the convolution functions. A short integration time causes the interferometer to sample records at a faster rate than longer integration times, which means the Earth rotates less from one sampled record to the next over a trajectory. The resulting smaller changes in (u, v, w) imply a higher probability of compression. Short baselines tend to have a shorter trajectory than long baselines: within a given time interval, sampled records over short baselines travel shorter distances on the UV-grid than over longer baselines. This leads to a higher probability of compression for shorter baselines.

The gridding phase

The gridding phase does the final gridding. It is implemented by one operator, named the gridder, and is based on Romein's algorithm.
This section discusses implementation details, together with the adaptations made to Romein's algorithm so as to suit the imaging tool. The discussion focuses on the required inputs that the preparation phase needs to generate for the gridder to be as fast as possible.
Filtering

Referring to equation 4.4, whenever g_i = 0 the record is not gridded and is ignored by the gridding phase. The calculation of g_i is delegated to the preparation phase. In order to evade any selection logic based on the value of g_i, the preparation phase filters out any records with g_i = 0. All input sequences to the gridding phase related to visibility records are thus filtered sequences.

As pointed out earlier, if a polarised visibility value is flagged, the gridder grids zeros instead of ignoring the polarised value. The gridder does not handle flags and does not have any flag data as input; the zeroing of the visibility value is done in the preparation phase. Such a strategy evades flag-related logic in the gridding phase and reduces the shared-memory usage of a record in the gridder kernels. As will be explained later on, the gridder loads chunks of records into shared memory prior to gridding them. Shared memory is a limited resource, and there is an impact on performance when few records are loaded. Not loading flags reduces the record footprint in shared memory and thus increases the number of records that can be loaded in one go.

Convolution Position sequence {(a^f_i, b^f_i, c^f_i, d^f_i)}

Part of the calculation required to find the position of the convolution function on the UV-grid adheres well to the thread configuration of the preparation phase and is delegated to it. Instead of the {(u′_i, v′_i, w′_i)} and {λ^{-1}_i} sequences, the gridding phase is presented with a 4-integer-tuple sequence {(a^f_i, b^f_i, c^f_i, d^f_i)}. The tuple, which was proposed in Romein's algorithm, defines the position of the convolution function as follows: each multi-polarised pixel on the grid is 0-based indexed by two integers (y, x), representing the v and u value of the pixel respectively. The pixel with v and u values of 0 is in the middle of the grid, and the pixel indexed as (0, 0) has the most negative v and u values. (Note that this indexing scheme is the same as the C++ indexing method of arrays.)

If the record's position on the grid is on the pixel indexed by (y_i, x_i), and the grid is l pixels long on the u-axis, then:

a_i = x_i − h_{w_i} + 1   (4.7a)
b_i = (y_i − h_{w_i} + 1) × l + a_i   (4.7b)
c_i = −a_i mod S_{w_i}   (4.7c)
d_i = −(y_i − h_{w_i} + 1) mod S_{w_i}   (4.7d)

h_w and S_w are the half-width and support of the convolution function that will be used to grid the record; they were defined in section 4.3.2.

Convolution Data Pointer sequence {z_i}

Similar to the generation of the Convolution Position sequence, the logic that chooses the numeric data in the Convolution Functions Numeric Data input is handled by the preparation phase. The gridder is presented with an index pointing to the first element of the convolution function data that will be used to grid. Note that, thanks to this sequence (the Convolution Data Pointer sequence), the gridder does not need the Convolution Functions Data Position input.

Weighted and Compressed Visibilities sequence {γ^f_{i,p}}

The calculation of the weighted and flagged visibilities (1 − f_{i,p}) ε_{i,p} v_{i,p} is done in the preparation phase, and the answer is presented in the Weighted and Compressed Visibilities sequence. Each element in the sequence includes the compressed visibilities, which are summed up with the record to be gridded.
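The position tuple of equation 4.7 reduces to a few integer operations per record. The following Python sketch is illustrative only; the function and argument names are chosen here, and an odd support is assumed so that the half-width is an integer:

```python
def position_tuple(x, y, support, grid_len):
    """Compute the (a, b, c, d) tuple of equation 4.7 for one record.

    x, y     -- 0-based pixel index of the record on the grid
    support  -- S_w of the convolution function used to grid it
    grid_len -- l, grid length in pixels along the u-axis
    """
    h = (support + 1) // 2          # half-width h_w (integer for odd S_w)
    a = x - h + 1                   # (4.7a)
    b = (y - h + 1) * grid_len + a  # (4.7b)
    c = -a % support                # (4.7c)
    d = -(y - h + 1) % support      # (4.7d)
    return a, b, c, d
```

Python's `%` operator returns a non-negative result for a positive modulus, which matches the mathematical mod used in equations 4.7c and 4.7d even when a_i is negative.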
The Plan and local chunks

Similar to Romein's algorithm, a CUDA block grids a sub-chunk of the input data, which is here referred to as the local chunk. It is first loaded into shared memory for fast access. The number of records in a local chunk is allowed to vary: whenever the support changes, the gridder needs to commit all grid-point updates to global memory and reconfigure itself, so it makes sense to constrain a local chunk to entries of the same support, forcing it to be of variable length. Note that there is a constraint on the maximum length, since shared memory is a limited resource. This limit depends on the convolution function support that the local chunk will grid, because records with small convolution functions are gridded using a different kernel than the other records. For performance reasons, the kernel gridding small convolution functions is configured with less shared memory than the other kernel.

Since local chunks are variable in length, a plan is required to avoid fetching logic during the loading into shared memory. The plan is a sequence of local chunks, each defined by the index of its first record and the common convolution function support:

Plan = {(j^df_0, s^f_0), (j^df_1, s^f_1), ..., (j^df_{n−1}, s^f_{n−1}), (j^df_n, 0)}   (4.8)

where:
Plan is the plan sequence;
j^df_i is the index of the local chunk's first record in the respective filtered sequences;
s^f_i is the 1-dimensional support of the convolution functions that the chunk will use;
n is the number of chunks.

The last element (j^df_n, 0) points to the record after the last in the filtered sequences. Note that the elements of the k-th local chunk start from the entry pointed to by j^df_k and end at (j^df_{k+1} − 1) in the filtered sequences.
Implementation details

Gridding is done using two kernels: one kernel grids convolution functions with support less than 31 × 31, while the second kernel handles larger convolution functions. Each kernel scans through the plan to see which local chunks to grid. The expense of such a scan is insignificant, as the memory consumed by plans is normally small.

Two different kernels are used because small convolution functions need a different configuration to be gridded efficiently. This issue is also pointed out by Romein [37]. The large-convolution-function kernel is configured with 1024 threads per block, and shared memory is configured to the maximum possible that lets the kernel run at 100% theoretical occupancy. Configuring the small-convolution-function kernel with 1024 threads per block does not make sense, as the convolution functions are smaller in size than 1024 pixels; a configuration of 256 threads per block is therefore used instead. This reduces the theoretical occupancy and performance, but reducing the shared memory usage (by reducing the maximum local chunk length) makes up for the loss. Note that the small-convolution-function kernel has been tuned so as to get the best performance when gridding 15 × 15 convolution functions.

Different numbers of polarisations are handled by means of C++ templates: the mentioned kernels are implemented as C++ templates, and different specialisations are created to handle single-, dual- and quad-polarisations. Convolution function data is loaded from textures, similar to the test case presented in Romein's algorithm.
The preparation phase

The main property of the preparation phase is that execution over the GPU is configured with a thread-to-records ratio near unity. As shown in Figure 4.2, the preparation phase is divided into five blocks, each made up of one or a group of GAFW operators. In this section, all the blocks and operators are discussed in detail.
The PreProcessUVW operator

The PreProcessUVW operator is the decision maker of the preparation phase. The operator decides whether a record is to be gridded or compressed, and is configured in such a way that one thread processes one record. The operator does not do any filtering of non-griddable data, and it does not handle any mathematics related to visibilities: it just instructs the preparation-phase logic which records to filter out or compress, by outputting binary sequences. The operator also prepares most of the inputs for the gridding phase; it does most of the work required to calculate the normaliser as per equation 4.5, and performs other calculations. Figure 4.2 shows all inputs and outputs of this operator. Hereunder, the operator is discussed by focusing on its outputs.
Gridding Indicator sequence {g_i}
This indicator is explained in section 4.4.2. It indicates whether the record will be handled by the gridding phase or not. The respective element in the sequence is set to 1 if the record should be processed by the gridding phase; otherwise it is set to 0. The preparation phase subsequently filters out any records whose gridding indicator is set to 0.
Position sequence {(a_i, b_i, c_i, d_i)}
This sequence is explained in the gridding phase section 4.4.4, and defines the position on the grid where the convolution function will be added when gridding the respective record.
Convolution Data Pointer sequence {z_i}
This sequence is explained in section 4.4.4. Elements of this sequence point to the first element in the Convolution Functions Numeric Data input that will be used to grid the respective record.
Compression Indicator sequence {c_i}
This is a binary sequence indicating whether a record should be compressed with its predecessor. The element is set to 1 if the respective entry is to be compressed and 0 if not. a_i (part of the Position sequence) and z_i are checked for equality with a_{i−1} and z_{i−1} respectively, so as to verify the compression equality defined in equation 4.6. The following equation gives the details:

c_i = 1 if i ≠ 0, a_i = a_{i−1}, z_i = z_{i−1} and the record is griddable; 0 otherwise   (4.9)

Note that a record is griddable if it does not require updates to any grid points outside the grid boundaries, and at least one polarised visibility value is not flagged.

Take Conjugate sequence {ψ_i}
This sequence instructs the visibility-handling logic to take the conjugate of the complex visibility rather than the inputted one. The convolution functions are only calculated for w_i ≥ 0, since symmetry exists around the plane w = 0. It is deduced from the measurement equation 1.14 that V(u, v, w) is equal to the conjugate of V(−u, −v, −w), and thus all records with w_i < 0 can be gridded by negating the baseline components to (−u_i, −v_i, −w_i) and taking the conjugate of the visibility. The value of each element in the sequence is set according to the following equation:

ψ_i = 0 if w_i ≥ 0; 1 otherwise   (4.10)

Support sequence {s_i}
This sequence stores the support of the convolution function selected for gridding the record. The elements of the sequence equate to:

s_i = S_{w_i/λ_i}   (4.11)

The data is available from the Convolution Functions Data Position and Support input.
Normaliser sequence {ϱ_i}_p
The PreProcessUVW operator does most of the calculations required for the normaliser. Elements in this sequence are set to:

ϱ_{i,p} = ε_{i,p} (g_i + c_i) (1 − f_{i,p}) · ζ(w′_i/λ_i, u′_i/λ_i, v′_i/λ_i)   (4.12)

To complete the calculation of the normaliser given in equation 4.5, this sequence is simply summed up.

Index Creation and Filtering
The Index Creation and Filtering logic filters out some of the record data generated by the PreProcessUVW operator. All records that are not to be handled by the gridding phase (that is, records with g_i = 0) are removed from the output sequences, and at the same time the Gridding Index sequence is generated. The elements of the Gridding Index sequence point to the record data, in the non-filtered sequences, that will be processed by the gridding phase. This index is required for the subsequent processing of visibility data in the preparation phase.

Figure 4.3 details the part of the calculation tree that filters the various sequences and generates the index. The Gridding Indicator sequence acts as a predicate, and an exclusive scan is run over it. The output sequence of the exclusive scan is represented by {g^s_i}:

g^s_0 = 0   (4.13)
g^s_i = g^s_{i−1} + g_{i−1} = Σ_{j=0}^{i−1} g_j   (4.14)

For any record with g_i = 1, g^s_i gives the position of the record in the filtered sequences. Algorithm 4.2 is thus applied to create the index sequence and the filtered sequences.

Data: Gridding Indicator sequence {g_i}, the exclusive scan of {g_i}: {g^s_i}, non-filtered sequences {Θ_i}
Result: Gridding Index sequence {g^d_i}, filtered sequences {Θ^f_i}
forall elements in {g_i} do in parallel
  if g_i is true then
    g^d_{g^s_i} = i;
    Θ^f_{g^s_i} = Θ_i;
  end
  else do nothing;
end

Algorithm 4.2: Algorithm of the Index Create and Filter operator

The algorithm is implemented as one kernel in the Index Create and Filter Sequences operator. The kernel is run with a thread per element of {g_i}, and does the filtering and index creation in one go. The sequences filtered are: the Position sequence, the Convolution Data Pointer sequence and the Support sequence. (Refer to section 3.6.2 for operator details.)
Note that what is here referred to as filtering is also known as stream compaction, and the algorithm described is a well-known technique in the world of parallel computing. Some good reviews of this subject are found in [64, 73].
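The exclusive scan of equations 4.13 and 4.14 together with the scatter of Algorithm 4.2 can be modelled sequentially in a few lines. On the GPU, each loop iteration of the scatter is an independent thread; the Python below is only an illustrative model of the stream-compaction technique, with names chosen here:

```python
def exclusive_scan(binary):
    """Prefix sum (exclusive scan): out[i] = sum(binary[:i])."""
    out, running = [], 0
    for b in binary:
        out.append(running)
        running += b
    return out


def create_index_and_filter(g, sequences):
    """Model of Algorithm 4.2: build the Gridding Index and filter the
    given sequences, using the gridding indicators g_i as predicate.

    g         -- gridding indicators (0/1)
    sequences -- list of equally long record-data sequences to filter
    """
    gs = exclusive_scan(g)
    n_kept = gs[-1] + g[-1] if g else 0
    index = [None] * n_kept
    filtered = [[None] * n_kept for _ in sequences]
    for i, gi in enumerate(g):          # one GPU thread per element
        if gi:
            index[gs[i]] = i            # g^d_{g^s_i} = i
            for seq, out in zip(sequences, filtered):
                out[gs[i]] = seq[i]     # scatter the kept record data
    return index, filtered
```

Because g^s_i is an exclusive scan, every kept record receives a unique destination slot, so the parallel scatter needs no synchronisation between threads.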
Plan Creation
Plan creation is depicted in Figure 4.4. The plan is a sort of index sequence as eachelement points to an element in the filtered sequences . The same technique asAlgorithm 4.2 is used to create the plan and is explained in the next paragraph.First, a binary sequence showing which elements need to be indexed in the plan isgenerated. This is done by the Value Change Detect operator. Using a thread perelement configuration, the filtered
Support
Sequence { s fi } is analysed for changes insupport. Such a change delimits local chunks . The results are saved in the SupportChange Indicator sequence { χ bfi } using the equation below: χ bfi = i = 0 or s fi = s fi − Support Change Indicator sequence and applyingAlgorithm 4.3 generates the trial plan. Algorithm 4.3 is implemented in the
Plan Create operator using the usual thread-per-element configuration.

The trial plan generated is not fully compliant with what is required by the gridding phase, since the length of a local chunk cannot exceed a defined limit dependent on the support of the convolution function (refer to section 4.4.4). The trial plan is analysed by the Plan Analyse operator, which detects which local chunks are too large. The Support Change Indicator is manipulated, and some of its elements are set to 1 so as to split large chunks into smaller ones. (The plan also contains the value of the support of the convolution function to be used to grid the local chunk.) The rest of the plan-creation process is then re-executed to generate the final plan.

Data: Filtered Support sequence {s_fi}, Support Change Indicator sequence {χ_bfi}, exclusive scan of {χ_bfi}: {χ_sfi}
Result: Plan {j_dfi, t_fi}
for all elements of {χ_bfi} do in parallel
    if χ_bfi is true then
        j_d(χ_sfi) = i;
        t_(χ_sfi) = s_fi;
    else
        do nothing;
    end
end
Algorithm 4.3: Algorithm for the Plan Create operator
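The two steps, change detection followed by scan-and-scatter, can be sketched serially as follows. Names are illustrative, and the chunk-length limiting performed by the Plan Analyse operator is omitted.

```python
from itertools import accumulate

def value_change_detect(support):
    """Support Change Indicator {chi_bfi}: true at i = 0 or when support changes."""
    return [i == 0 or support[i] != support[i - 1] for i in range(len(support))]

def plan_create(support):
    """Build the trial plan: for every local chunk, the index of its first
    record and the support of the convolution function used to grid it."""
    chi = value_change_detect(support)
    scan = [0] + list(accumulate(int(c) for c in chi))[:-1]  # exclusive scan
    n = sum(chi)
    starts, supports = [None] * n, [None] * n
    for i, changed in enumerate(chi):
        if changed:
            starts[scan[i]] = i             # first record of the chunk
            supports[scan[i]] = support[i]  # support used for the chunk
    return list(zip(starts, supports))

# Records sorted so that equal supports are contiguous:
plan = plan_create([7, 7, 7, 15, 15, 23])
# plan -> [(0, 7), (3, 15), (5, 23)]
```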
Visibility Handling Logic
The last input for the gridding phase to be generated by the preparation phase is the Weighted and Compressed Visibility sequence {γ_fi}_p. Two operators generate this input, as shown in Figure 4.5. The main calculations are done by the Compress and Weigh Visibility operator.

An index sequence called the Last Compressed Entry Index sequence {k_di} is generated by the Compress Index Generate operator. Elements in this index sequence point to the last record in the non-filtered sequences that is to be compressed with the record indexed by the Gridding Index sequence {g_di}. That is, ∀i, records from g_di to k_di are to be compressed together. If there are no records to be compressed with the record pointed to by g_di, then k_di = g_di.

Algorithm 4.4 is used to generate the Last Compressed Entry Index sequence. It works as follows: the Compress Index Generate operator inspects the non-filtered Compress Indicator sequence {c_i} using a thread configuration of one thread per element. A record that is last in a subsequence of records to be compressed together is characterised by having the subsequent record not set for compression. The operator detects the position of such records and writes their position in the Last Compressed Entry Index sequence. The position where the index element is saved in the Last Compressed Entry Index sequence is calculated by subtracting 1 from the respective element in the exclusive scan of the Gridding Indicator sequence. Copying the Gridding Index sequence on initialisation caters for those records with no subsequent records to compress.
Data: Gridding Index sequence {g_di}, Compress Indicator sequence {c_i}, exclusive scan of the Gridding Indicator sequence {g_i}: {g_si}
Result: Last Compressed Entry Index sequence {k_di}
{k_di} = {g_di};
for all elements of {c_i} do in parallel
    if c_i = 1 then
        if c_i is the last element in the sequence or c_(i+1) = 0 then
            k_d(g_si − 1) = i;
        else
            do nothing;
        end
    else
        do nothing;
    end
end
Algorithm 4.4: Algorithm of the Compress Index Generate operator
The Compress and Weigh Visibilities operator generates the Weighted and Compressed Visibility sequence {γ_fi}_p in one go, by means of the Last Compressed Entry Index sequence and the Gridding Index sequence. Assigning a thread per element in {γ_fi}_p, equations 4.16a and 4.16b are executed to generate the output:

ℜ(γ_fi,p) = Σ_{x = g_di}^{k_di} (1 − f_x,p) ε_x,p ℜ(ϕ_x,p)    (4.16a)

ℑ(γ_fi,p) = Σ_{x = g_di}^{k_di} ψ_x (1 − f_x,p) ε_x,p ℑ(ϕ_x,p)    (4.16b)

where ℜ and ℑ are operators returning the real and imaginary part, respectively, of a complex value. {ψ_i} is the Take Conjugate sequence. The sequences {f_i,p}, {ϕ_i,p} and {ε_i,p} are the flag, visibility and weight sequences respectively, and are all inputs of the WImager algorithm.
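A serial sketch of equations 4.16a and 4.16b for a single polarisation (the index p is dropped for brevity, and the names are illustrative):

```python
def compress_and_weigh(g_d, k_d, vis, weights, flags, conj_sign):
    """Serial sketch of equations 4.16a and 4.16b for one polarisation.

    For output element i, records x = g_d[i] .. k_d[i] are summed: flagged
    records (f = 1) contribute nothing, each term is weighted by eps, and
    the Take Conjugate sign psi flips the imaginary part where a conjugate
    visibility is required.
    """
    out = []
    for gi, ki in zip(g_d, k_d):
        re = sum((1 - flags[x]) * weights[x] * vis[x].real
                 for x in range(gi, ki + 1))
        im = sum(conj_sign[x] * (1 - flags[x]) * weights[x] * vis[x].imag
                 for x in range(gi, ki + 1))
        out.append(complex(re, im))
    return out

vis = [1 + 2j, 3 - 1j, 2 + 2j]
gamma = compress_and_weigh(
    g_d=[0, 2], k_d=[1, 2],           # records 0-1 compress together; record 2 alone
    vis=vis, weights=[1.0, 0.5, 2.0],
    flags=[0, 0, 1],                  # record 2 is flagged away
    conj_sign=[1, -1, 1])             # psi = -1 takes the conjugate of record 1
# gamma -> [(2.5+2.5j), 0j]
```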
Statistics logic
The WImager component creates four different statistics about the gridding of data. They are all generated from the outputs of the PreProcessUVW operator and are processed further by the Statistics Manager component. The following are the equations:

Total multi-polarised records gridded = Σ_{i=0}^{n−1} g_i    (4.17a)

Total multi-polarised records compressed = Σ_{i=0}^{n−1} c_i    (4.17b)

Total grid point updates executed by the gridder = Σ_{i=0}^{n−1} g_i × s_i²    (4.17c)

Total grid point updates saved because of compression = Σ_{i=0}^{n−1} c_i × s_i²    (4.17d)

A grid point update is defined as an update made to one point on the UV-grid. If a single polarised record is gridded with a 7 × 7 convolution function, 49 grid point updates are made. The GeneralReduction template operator is used to calculate the above summations (refer to section 3.6.3). GPUs execute the above equations efficiently, and the time consumed to generate such statistics is negligible.
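The four statistics reduce to plain summations. A sketch, assuming a record of support s produces s × s grid point updates per polarisation (consistent with the definition of a grid point update above); on the GPU the same sums are computed by the GeneralReduction template operator:

```python
def gridding_statistics(g, c, s):
    """The four gridding statistics as plain reductions.

    g: Gridding Indicator, c: Compress Indicator, s: convolution function
    support, all per multi-polarised record.
    """
    records_gridded = sum(g)
    records_compressed = sum(c)
    updates_executed = sum(gi * si * si for gi, si in zip(g, s))
    updates_saved = sum(ci * si * si for ci, si in zip(c, s))
    return records_gridded, records_compressed, updates_executed, updates_saved

stats = gridding_statistics(g=[1, 1, 0, 1], c=[0, 1, 0, 0], s=[7, 7, 7, 15])
# stats -> (3, 1, 323, 49)
```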
Visibility Manager Component
The Visibility Manager component does all the necessary work to prepare data for consumption by the WImager component. The following are the main tasks entrusted to the Visibility Manager:

1. Loading of measurement data from permanent storage.
2. Grouping of data by baseline.
3. Conversion of channel frequencies from the topocentric frame of reference to LSRK.
4. Loading of data into the General Array Framework.

The Visibility Manager component does all its work on the CPU, since its main purpose is to prepare data to be used by the GPU. It is CPU intensive, and by means of a multi-threaded approach most of the resources of a modern multi-core CPU are used. All work is done in the background, so as to ensure the asynchronous behaviour argued for in section 4.1.

Figure 4.6 depicts the way multi-threading has been designed to work for the Visibility Manager. All work done by the Visibility Manager is divided into simple tasks, and a thread is assigned to each of these tasks. Only one thread is assigned for loading data from disk. A task is executed by a thread as soon as all required data is present in memory. Note that there is also a dependency on the sorting index, which is generated once the first two columns are loaded. In most cases, the creation of the index is fast enough not to have threads waiting for it after the necessary data is loaded from disk. Sorting of frequencies only commences once the conversion is ready. Synchronisation between threads is achieved using blocking waits. In this way, no CPU resources are wasted while a thread is waiting for data to become available.
The Visibility Manager is expected to load large amounts of data (in the order of gigabytes). The loading process takes time, and this has a direct effect on performance, since the gridding process can only commence after all data is loaded in memory. The impact on performance depends on the I/O bandwidth of the storage device. It will be shown in chapter 5 that the loading phase is the main limiting factor of the mt-imager.

Having so much data to load directly implies that there is an equal amount of data to prepare. This takes time, but thanks to the multi-threaded design a substantial amount of preparation work is done in parallel with loading. This reduces the Visibility Manager's overall impact on performance.
The Visibility Manager loads data from one MeasurementSet [17] in the Casacore Table System format [66]. It has been designed and coded in such a way that it can easily support multiple MeasurementSets and different formats such as HDF5 [74]. Most of the C++ code already assumes multiple MeasurementSets, but this feature was not finalised due to time constraints.

The Visibility Manager relies on the casacore [71] ms API to load data from disk to memory. As already pointed out, only one thread is assigned to the loading process. This is because the casacore ms API is not well suited to multi-threading, and the underlying storage device might not perform well when accessed in parallel. The thread loads data column by column.

It is worthwhile to point out the "problems" that the Visibility Manager has to cope with because of the casacore ms API and MeasurementSets. Most of the data is loaded in formats which need conversion prior to loading into the
Figure 4.6: CPU thread concurrency of the Visibility Manager. The blue thread loads data from a MeasurementSet one column at a time (ANTENNA1, ANTENNA2, TIME, UVW, WEIGHT, DATA (visibility), FLAGS). The thread entrusted with creating the sorting index waits until ANTENNA1 and ANTENNA2 are loaded from the storage device, and then commences its task. The frequency conversion thread waits until the TIME column is loaded. All sorting threads wait until the data to be sorted is available, including the sorting index.
General Array Framework. For example, complex visibilities are retrieved as C++ std::complex<float> objects, and GPUs do not understand such objects: they only know about simple structures of two consecutive floats. The raw data in a MeasurementSet is normally sorted in time but rarely (in practice, never) grouped by baseline. Channel frequency is given in the topocentric frame of reference, but the imaging tool requires it in the LSRK frame of reference. The conversion varies over time, and its calculation is computationally intensive.
A tailor-made solution has been implemented to group records by baseline, with records in each group sorted in time. The casacore ms API provides sorting functionality, but it was found to be too slow for the requirements of the imaging tool. The Visibility Manager component loads data from a MeasurementSet in the same order it is saved on disk. In most cases, it is already sorted in time, and the Visibility Manager relies on this fact. If the MeasurementSet is not sorted in time, the imaging tool will still work, though potentially with lower performance.

All data sorting is done using a generated index. The generation is done by a purposely created thread that analyses the ANTENNA1 and ANTENNA2 columns to group all data by baseline. Algorithm 4.5 is used to generate the index. A number of C++ vectors equal to the number of possible baselines is set up. Each vector represents a baseline and is populated by index elements pointing to records of the baseline it represents. At the end, all the vectors are amalgamated together serially to generate the final index. Note that auto-correlations (visibility measurements using the same antenna) are not gridded, and are removed immediately. Once the sorting index is created, all other sorting threads can initiate their job. Most likely they will need to wait for the data to be sorted to become available in memory.
Data: ANTENNA1 and ANTENNA2 columns
Result: Sorting Index
begin
    Create an array of C++ vectors equal to the number of baselines;
    Define integer i := 0;
    forall the records in the MeasurementSet do
        if record's ANTENNA1 == ANTENNA2 then
            ignore;
        else
            push i into the related baseline vector;
        end
        i++;
    end
    Sorting Index = all vectors amalgamated together;
end
Algorithm 4.5: Algorithm for the creation of the sorting index

The work of sorting, converting data (except frequency conversion) and loading into the General Array Framework is handled in one go by the sorting threads. A sorting thread works over one column and prepares data chunk by chunk. It processes the whole data for a channel and then proceeds to another channel. For single-GPU systems, one sorting thread per column is assigned. In multi-GPU systems, each column of data is handled by a number of sorting threads equal to the number of GPUs available. Too many active sorting threads on the CPU can hinder performance, and thus unnecessary sorting threads need to be avoided. The thread configuration is reasonably good since the imaging tool synthesises a channel in full over one GPU before proceeding to the next. If multiple channels were synthesised together over the same GPU, but on different grids, then it is quite likely that the General Array Framework would be forced to cache out grids and re-load them when necessary. Caching is expected to occur since GPU memory is limited and likely too small to hold many grids at the same time. Caching consumes time, so it has to be avoided.
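A serial sketch of Algorithm 4.5 follows. The flat `a1 * n + a2` bucket keying is an illustrative choice; the implementation uses one C++ vector per possible baseline.

```python
def create_sorting_index(antenna1, antenna2, n_antennas):
    """Sketch of Algorithm 4.5: group record indices by baseline, preserving
    the time order in which records were loaded.

    One bucket (a C++ vector in the real implementation) per antenna pair;
    auto-correlations (ANTENNA1 == ANTENNA2) are dropped immediately.
    """
    buckets = [[] for _ in range(n_antennas * n_antennas)]
    for i, (a1, a2) in enumerate(zip(antenna1, antenna2)):
        if a1 == a2:
            continue  # auto-correlations are not gridded
        buckets[a1 * n_antennas + a2].append(i)
    # amalgamate all per-baseline vectors serially into the final index
    return [i for bucket in buckets for i in bucket]

# Records arrive time-sorted; the index groups them baseline by baseline:
index = create_sorting_index(antenna1=[0, 0, 1, 0, 0, 1],
                             antenna2=[1, 2, 2, 0, 1, 2],
                             n_antennas=3)
# index -> [0, 4, 1, 2, 5]
```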
In general, channel frequency is provided in the topocentric frame of reference, while it is desirable to grid using frequency values in the LSRK frame of reference. The conversion of frequency from the topocentric to the LSRK frame of reference is a function of time. The casacore measures [71] API is used to make the conversion. Unfortunately, the API is not fully thread-safe, implying that channels cannot be processed in parallel. Moreover, the conversion is computationally expensive. In order to obtain the best performance, the conversion of frequencies is done as follows. A single thread is assigned to do the conversions for all channels. Channel frequencies for a given record are processed together; in this way, some computation is avoided, since part of the conversion is a function of time only. As the thread loops over records and makes conversions, it checks whether the previous record has the same time of observation, and if so it does not go through the whole conversion process: instead it copies the results obtained for the previous record. Records are processed in the same natural order in which they are loaded, which is expected to be sorted in time, thus saving a lot of computation.

Once the conversion of frequencies is ready, sorting threads do their job exactly as explained in section 4.5.2.
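The time-caching idea can be sketched as follows; `topo_to_lsrk` stands in for the expensive casacore measures conversion and is modelled here as a simple multiplicative factor purely for illustration:

```python
def convert_frequencies(records, topo_to_lsrk):
    """Sketch of the single-threaded topocentric -> LSRK conversion loop.

    Because records arrive sorted in time, the result of the previous record
    is reused whenever the timestamp repeats, skipping the expensive path.
    """
    last_time, last_factor = None, None
    out = []
    for time, channel_freqs in records:
        if time != last_time:              # full conversion only on a new time
            last_factor = topo_to_lsrk(time)
            last_time = time
        out.append([f * last_factor for f in channel_freqs])
    return out

calls = []
def fake_conversion(t):
    calls.append(t)            # count how often the expensive path runs
    return 1.0 + 1e-4 * t      # hypothetical Doppler-like factor

records = [(0, [100e6, 200e6]), (0, [100e6, 200e6]), (1, [100e6, 200e6])]
converted = convert_frequencies(records, fake_conversion)
# fake_conversion ran twice (times 0 and 1), not three times
```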
Image Finaliser Component
The Image Finaliser component is a simple GAFW module that converts a multi-polarised UV-grid to a dirty multiple-Stokes intensity image. The output of this module is a finalised dirty image ready to be saved to disk by the Image Manager (refer to section 4.7). A separate instance of the Image Finaliser is set up for each channel, similar to the WImager component. The WImager's GAFW result outputs are set as inputs to the Image Finaliser component.
The Stokes parameters output depend on the polarisations available. If the grid is quad-polarised, the I, Q, U and V Stokes parameters are output. If the grid is dual-polarised or single-polarised, only the Stokes parameters that can be calculated from the available polarisations are output.

The algorithm implemented is GPU-based, using one GAFW calculation tree. The step-by-step procedure is documented in Algorithm 4.6.
Data: multi-polarised UV-grid and normaliser value for each polarisation
Result: multiple-Stokes dirty image
begin
    Step 1: Inverse FFT of each polarised grid;
    Step 2: Convert to a multiple-Stokes image and normalise;
    Step 3: Correct for the tapering function as explained in subsection 2.1.1;
    Step 4: Correct for convolution function sampling effects shown by equation 2.13;
end
Algorithm 4.6: Algorithm used for image finalisation

The implementation of Algorithm 4.6 is simple. Step 1 is done using the FFT operator discussed in section 3.6.1. Steps 2 to 4 are all element-by-element operations, and thus a thread is assigned to each element.
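Step 2 can be sketched per pixel as below. The linear-feed-to-Stokes relations used are the standard ones and are an assumption here, since this section does not list them explicitly; on the GPU, one thread handles one pixel.

```python
def to_stokes(xx, yy, xy, yx):
    """Per-pixel conversion from linear-feed polarisations to Stokes I, Q, U, V.

    Uses the standard linear-feed relations (an assumption; the thesis does
    not spell them out in this section):
        I = (XX + YY) / 2,  Q = (XX - YY) / 2,
        U = (XY + YX) / 2,  V = (XY - YX) / 2i.
    """
    i = 0.5 * (xx + yy)
    q = 0.5 * (xx - yy)
    u = 0.5 * (xy + yx)
    v = complex(0, -0.5) * (xy - yx)   # division by 2i == multiply by -0.5j
    return i, q, u, v

# A purely polarised example pixel:
stokes = to_stokes(2, 0, 1j, -1j)
# stokes -> (1.0, 1.0, 0j, (1+0j))
```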
Image Manager component
The Image Manager component is responsible for writing the resultant dirty images to a FITS [67] file whose location is configurable. All multiple-Stokes dirty images are saved in a primary 4-dimensional image cube using the world coordinate system (WCS) [75–77]. The axes, saved in ascending order, are: right ascension, declination, Stokes polarisation and channel frequency. The output is compatible with other popular imaging tools like lwimager and CASA. The CFITSIO API [78] together with casacore [71] is used to generate the axes.

As per the general design of the imaging tool, all logic is executed in the background. All work is done in one thread. The thread begins its execution by creating the FITS file, and then block-waits on GAFW result objects until data is available. Channel images are saved one by one and in order, as soon as the image data is available. This strategy ensures that most of the writing to disk of a channel image is done in parallel with the gridding of the next channel over the GPU, hiding most of the time consumed in writing images to disk. In order to avoid possible I/O contention between the Visibility Manager and the Image Manager, the latter is only initialised once the finalisation of the first channel has been requested from the GAFW. The finalisation request can only happen after loading of measurement data from disk has ended.
Statistics Manager component
The Statistics Manager is entrusted with handling all statistics generated by the imaging tool and with saving such data in CSV (comma-separated values) files.

The imaging tool generates statistics of various types. The Visibility Manager measures timings of the various activities it performs, including the time an activity started and when it ended. It uses the Boost.Chrono API [79] to do so. Other components and the main() function also measure the time taken for the execution of some of their activities. The WImager component generates the statistics listed in section 4.4.5, while the GAFW engine measures the operators' execution time over the GPU using CUDA events.

All generated statistics are processed by the component and saved in three different CSV files. The first file contains all statistics generated by the engine. The second file contains all timings generated by the imaging tool's components and the main() function. The third file contains performance data related to gridding. Note that these three files have been vital in extracting performance measurements for the results reported in chapter 5.

The first and second files are straightforward to generate: the component just adds a new row to the file once a related statistic is received. More challenging is the generation of the third file, since each row is generated from many statistics. The Statistics Manager keeps in memory any statistic that it needs for the third file and waits until it has a complete set to generate a row.

The statistics system of the imaging tool is built over the GAFW statistics system discussed in section 3.3.8. All statistics are objects whose classes inherit from a general base class defined by the GAFW. The Statistics Manager implements the interface defined by the GAFW, through which statistics are pushed to the component via a thread-safe FIFO queue. The Statistics Manager does all work in a single background thread, and the FIFO queue has block-waiting capabilities so that the thread can wait on the queue when it has no work to do.
Performance analysis
This chapter reports performance results obtained by mt-imager. The WImager algorithm (refer to section 4.4) is analysed in depth, since it is the tool's strong suit. The analysis of the implemented Romein algorithm builds up knowledge of the algorithm with data not published by Romein [37]. Other analyses focus on the overall behaviour of the imaging tool, whereby limitations are identified. A comparative study with a standard CPU-based imaging tool known as lwimager is presented. It measures the performance advantage of mt-imager over lwimager, to show that the thesis' main objective is achieved.
All experiments are done on a high-end desktop PC. Specifications are listed in Table 5.1.
Component     Specification
Motherboard   MSI Z77 MPower
CPU           Intel(R) Core(TM) i7-3770K CPU @ 3.50 GHz
GPU           MSI GeForce GTX 680
Memory        DDR3 32 GB / 1600 MHz Corsair Vengeance [4x8GB]
Storage 1     HD 3.5" SATA III 1 TB Seagate ST1000DM003, 7200 rpm, 64 MB cache
Storage 2     OCZ RevoDrive 3 X2 PCI-Express SSD, 240 GB
OS            Ubuntu 12.04.1 LTS

Table 5.1: Hardware and software specifications of the PC used for analysis
This PC was the best option within the budget available at the time of purchase. Note that a new series of GeForce cards has recently been released.

For experiments that required two GPUs, an extra VEVO GeForce GTX 670 is used. It was the best GPU available at the University at the time of testing. It performs worse than the GTX 680 but is still suitable for the analysis presented in this chapter.

The second storage device is a fast PCIe solid-state device (SSD). The operating system and the mt-imager binaries are stored on the SATA magnetic storage device, while the data sets are stored on the SSD. As will be evident from the reported results, the loading of data from storage is a limiting factor for performance, and thus the use of the SSD device has a considerable bearing on the results obtained.
Three data sets are used for analysis. Details are tabulated in Table 5.2. In all experiments, all channels are imaged. Some data is also flagged, and some records fall outside the grid; this gives these experiments a realistic scenario. Data Set 1 and Data Set 2 have been carefully selected such that the overall performance of the mt-imager in different scenarios can be analysed in depth. Data Set 2 is approximately 4× larger in size than Data Set 1, so the imaging tool requires
Data Set Number           Data Set 0    Data Set 1      Data Set 2
Telescope                 LOFAR         LOFAR           LOFAR
Integration time (s)      10            10.139          3.00024
Size on disk              N/A           1.3 Gigabytes   5.3 Gigabytes
Number of channels        1 or 16       1               16
Records per channel       2138400       6578850         6501600
Polarisations per record  1, 2, 4       4               4
Data format               C structure   MeasurementSet  MeasurementSet
Table 5.2: Data sets used for analysis

to load much more data prior to gridding. The two data sets contain nearly the same number of records per channel, but Data Set 2 has 16 channels while Data Set 1 contains only 1. This means that there are many more records to grid for Data Set 2 than for Data Set 1. In conclusion, Data Set 2 is much more difficult to image than Data Set 1.

Data Set 0 is the same data set used in the analysis presented by Romein [37]. It is used as a point of departure for the analysis, whereby an environment is created to be as similar as possible to Romein's test scenario (that is, the test case presented in [37]). Some differences do exist, since the mt-imager is implemented using different assumptions.

Since Data Set 0 is stored in a C structure, a special Visibility Manager has been developed to cater for it. Only (u, v, w) data and the channel topocentric frequencies are supplied. Time of observation is not included, denying the possibility of a frequency conversion. Auto-correlations (that is, records resulting from the correlation of the same signal from the same antenna) are flagged. The special Visibility Manager sets visibilities to unity and mimics scenarios of 1, 2 or 4 polarisations. In Romein's test scenario, channel data was interleaved and all records were gridded on one grid. This feature has been implemented in the special Visibility Manager but not in the proper Visibility Manager. Here, it is referred to as channel interleaving.
Visibility Manager . Here, it is referred to as channelinterleaving . 99 hapter Performance analysis
In this chapter, performance is measured using the different metrics discussed in the following subsections.

Gridding rate (Giga grid point updates per second)

This metric is used to describe the performance of the WImager gridding phase. The unit G shall stand for Giga grid point updates per second. A grid point update is defined as an update made to one point on the UV-grid. If a single polarised record is gridded with a 7 × 7 convolution function, 49 grid point updates are made. The execution time is measured by the General Array Framework engine using CUDA events.

The metric can be measured in two ways, to get the total rate and/or the real rate. For the total rate, compressed records are included in the metric as if they were gridded independently. For the real rate, only genuinely gridded records are considered. In either case, records not gridded for reasons other than compression are dismissed.
Record gridding rate (Mega records/second)

This metric is similar to the previous one. Instead of counting grid point updates, multi-polarised records are counted. Polarisation is not factored in, and a gridded multi-polarised record is counted as one. Similarly to the previous metric, this metric can be measured as a total rate or a real rate.
Compression ratio

This ratio is given in terms of either records or grid point updates. It defines the ratio of compressed records or grid point updates to genuinely gridded records or grid point updates. A value of 0 indicates that no records were compressed.
Preparation record rate (Mega records/second)

This metric is used to quantify the performance of the WImager preparation phase. It is the total number of records presented to the WImager component divided by the total execution time on the GPU of all preparation-phase kernels. The execution time is measured by the General Array Framework engine using CUDA events.
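The metrics of this section can be derived from the gathered statistics as in the sketch below (the input numbers are illustrative, not measured results):

```python
def gridding_metrics(records_gridded, records_compressed,
                     updates_executed, updates_saved, kernel_seconds):
    """Derive the chapter's metrics from the gathered statistics.

    The *real* rate counts only genuinely gridded work; the *total* rate
    also credits compressed records as if they had been gridded
    independently. The compression ratio is compressed over genuinely
    gridded records.
    """
    real_rate = updates_executed / kernel_seconds / 1e9           # Giga updates/s
    total_rate = (updates_executed + updates_saved) / kernel_seconds / 1e9
    record_rate = records_gridded / kernel_seconds / 1e6          # Mega records/s
    compression_ratio = records_compressed / records_gridded
    return real_rate, total_rate, record_rate, compression_ratio

m = gridding_metrics(records_gridded=2_000_000, records_compressed=500_000,
                     updates_executed=9.8e9, updates_saved=2.45e9,
                     kernel_seconds=0.5)
# m -> (19.6, 24.5, 4.0, 0.25)
```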
This section describes all the experiments reported in this chapter. Results are analysed in the next section.

Some experiments are intended to analyse the performance of the WImager component. The system is configured to work in normal mode, whereby only one w-plane is considered (that is, one convolution function is used over the whole w range). Experiments are repeated several times using differently sized convolution functions (in pixels), from 7 × 7, 15 × 15, 23 × 23, 31 × 31, 63 × 63 and 95 × 95 up to 511 × 511. (Note that performance data for convolution functions of these sizes has not been published by Romein [37].) w-projection is only enabled when examining the overall performance.

Experiments are divided into separate batches. Each batch is numbered, and each experiment is referred to using its batch number and another number, the two numbers being divided by a dot (.), such as 1.1. Unless otherwise stated, channel interleaving is disabled and compression enabled. It is also to be assumed that quad-polarisation is used when imaging Data Set 0.

All experiments presented in this chapter use square images: the intensity image pixel length and width are always equal.

The following subsections describe all experiments batch by batch.

Compression
This batch of experiments is intended to comprehend the behaviour of compression. Data Set 0 is used. Channel interleaving and compression are enabled or disabled as necessary. The special Visibility Manager outputs quad-polarised records. The image dimensions and pixel length are set to the same values as in Romein's test scenario. Table 5.3 describes each experiment. The first experiment sets the environment nearest to Romein's test scenario.

Ref No   Image dimensions (pixels)   Pixel length (arcsec)   Over-sampling factor   Channel interleaving   Compression

Table 5.3: Experiment details of batch 1
Gridding rates are reported in Figure 5.1.

Figure 5.1: Real and total gridding rate results for experiment batch 1. The numbers in the legend refer to specific experiments as tabulated in Table 5.3.
This batch of experiments is intended to study how the gridder's real performance is affected by varying the UV-grid sampling interval. This is achieved by changing the image dimensions. Details are listed in Table 5.4.

Ref No   Data Set No   Image dimensions (pixels)   Pixel length (arcsec)   Oversampling factor

Table 5.4: Experiment details of batch 2
The real gridding rates obtained from these experiments are shown in Figures 5.2, 5.3 and 5.4.

Figure 5.2: Real gridding rate results per convolution function size for experiments 2.1 to 2.3 (Data Set 0)

Figure 5.3: Real gridding rate results per convolution function size for experiments 2.4 to 2.6 (Data Set 1)

Figure 5.4: Real gridding rate results per convolution function size for experiments 2.7 to 2.9 (Data Set 2)
In this batch, the oversampling factor is varied to analyse its impact on the real gridder performance. Note that an oversampling factor of 2 is generally not enough for accurate imaging. Experiment details are given in Table 5.5.

Ref No   Data Set No   Image dimensions (pixels)   Pixel length (arcsec)   Oversampling factor

Table 5.5: Experiment details of batch 3
Figures 5.5, 5.6 and 5.7 plot the real gridding rates obtained. Figure 5.8 plots the average rate obtained from the second and third batches of experiments, together with the total variation.

Figure 5.5: Real gridding rate results per convolution function size for experiments 3.1 to 3.3 (Data Set 0), for oversampling factors 2, 4 and 8

Figure 5.6: Real gridding rate results per convolution function size for experiments 3.4 to 3.6 (Data Set 1), for oversampling factors 2, 4 and 8

Figure 5.7: Real gridding rate results per convolution function size for experiments 3.7 to 3.9 (Data Set 2), for oversampling factors 2, 4 and 8

Figure 5.8: Plot showing the average real gridding rate for experiments of batch 2 and batch 3, together with the total variation. The total variation is represented by the error bar.
The aim of this batch is to analyse the gridder's real performance vis-à-vis the number of polarisations gridded. Only Data Set 0 is used. Table 5.6 gives the details.

Ref No   Data Set No   Image dimensions (pixels)   Pixel length (arcsec)   Over-sampling factor   No of polarisations

Table 5.6: Experiment details of batch 4
The following two figures report the results obtained. Figure 5.9 plots the real gridding rate for each experiment. Figure 5.10 plots the multiple increase in record gridding rate when reducing the number of polarisations from 4 to 2 and 1. Reducing the number of polarisations reduces the grid point updates generated by each record, and thus an increase in the real record gridding rate is expected.

Figure 5.9: Real gridding rate per convolution function size for experiment batch 4

Figure 5.10: Multiple increase in real record gridding rate per convolution function size obtained against quad-polarisation imaging, for experiment batch 4
Proper w-projection is enabled for this batch, and Data Sets 1 and 2 are imaged using the two configurations described in Table 5.7.

Configuration parameter      Configuration 1   Configuration 2
Image dimensions (pixels)    1024 × 1024
Number of w-planes           200               16
Oversampling factor          4                 4

Table 5.7: Imaging configuration for experiment batch 5
These experiments are repeated using lwimager for a comparative analysis. Note that for Data Set 2 lwimager uses single precision, while for Data Set 1 it uses double precision; this is an automated feature of the tool.
The two configurations have been carefully selected to show how mt-imager performs in different scenarios. Configuration 1 is a lightweight configuration in the sense that it requires relatively inexpensive computation: convolution functions are no larger than 29 × 29 pixels, so at most 841 grid point updates are required to grid one polarisation of a record. Configuration 2 is a heavyweight configuration requiring expensive computation: the largest convolution function is 349 × 349 pixels, implying up to 121801 grid point updates to grid a single polarisation of a record (approximately 145 × more than Configuration 1). Configuration 2 also images a much larger field of view than Configuration 1, implying smaller sampling intervals in the UV-grid. This reduces the likelihood of compression being effective, making Configuration 2 more difficult for mt-imager to image.

Since the two configurations are applied over the "small" Data Set 1 and the "large" Data Set 2, the scenarios presented here combine small/large data sets with lightweight/heavyweight configurations. The most important run is the imaging of Data Set 2 using Configuration 2, since it is the most difficult and the one where high performance is most required.

Figures 5.11 to 5.16 report the results obtained. Figure 5.11 gives a timeline of the main events occurring during a run of the imaging tool. It must be kept in mind that mt-imager works in parallel, so during a given time interval it may be doing more than one activity. While data is being loaded, the Visibility Manager component prepares data chunks. After loading, the GPUs grid any ready data chunks while the Visibility Manager prepares further chunks. Convolution functions are generated on the GPU at the same time as the Visibility Manager is loading data from disk.

Figures 5.12 and 5.13 report compression-related data, to give more insight into how compression works.
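The per-record cost gap between the two configurations can be checked with a short back-of-envelope calculation (a hypothetical helper for illustration, not part of mt-imager):

```python
def max_grid_point_updates(support: int) -> int:
    """Maximum grid point updates needed to grid one polarisation of a
    record, for a square convolution function of the given support."""
    return support * support

light = max_grid_point_updates(29)    # Configuration 1: 29 x 29 pixels
heavy = max_grid_point_updates(349)   # Configuration 2: 349 x 349 pixels
print(light, heavy)                   # 841 121801
print(round(heavy / light))           # ~145x more updates per record
```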
The WImager algorithm performance is reported in Figures 5.14, 5.15 and 5.16.

Figure 5.11: Execution timelines of mt-imager for experiment batches 5 and 6 (time in seconds, per data set/configuration): loading of data, wait time to grid the first chunk, extra time for background preparation of other chunks, time to finish all calculations and save images, and exit time. The performance gain factor over lwimager is also stated.

Figure 5.12: Compression ratios obtained for experiment batch 5, in records and in grid point updates.

Figure 5.13: Average one-dimensional convolution function support for experiment batch 5, for really gridded records and for compressed records.

Figure 5.14: Gridding rate results for experiment batch 5 (Giga grid point updates/second): real rate and total rate.

Figure 5.15: Preparation and gridding phase record rate (Megarecords/second) for experiment batch 5. All records presented to the WImager are considered for the preparation phase rate; for the gridding rate, only records that are truly gridded are considered.

Figure 5.16: Ratio of WImager preparation phase execution time to gridder execution time for experiments in batch 5.
Experiments in this batch aim at analysing the scalability of mt-imager over a number of GPUs. Two experiments are defined, and in both of them Data Set 2 is imaged using Configuration 2 (defined in Table 5.7). In the first experiment mt-imager images over the GTX670 card only; in the second it images over both GPUs, that is, the GTX680 and GTX670 (refer to section 5.1 for further details). Timeline results are shown in Figure 5.11.
Compression

In the first batch of experiments (refer to Figure 5.1), switching off channel interleaving while compression is disabled (experiment 1.2) reduces the gridder performance for the large convolution functions (31 × 31 and wider) by around 10G. No change in performance resulted for smaller convolution functions. On enabling compression, the total rate for interleaved channels reaches a maximum of 280G, and for non-interleaved channels 175G. This implies a 3-fold performance increase for interleaved channels and a 2-fold increase for non-interleaved channels. The real gridding rate goes down as compression is enabled but, as argued in section 5.5.2, it settles to a steady value.

From these results, it can be deduced that the performance of Romein's algorithm without compression has some dependence on the records that would be compressed if compression were enabled. The methods implemented in this thesis tackle such "compressible" records in a way that obtains higher performance.

Unfortunately, the claimed total rates are not likely to be achieved in realistic scenarios where proper w-projection is used. In section 4.4.3 it was discussed that the probability of compression is not evenly distributed over the UV-grid but is more prominent for short baselines. Short baselines tend to have small convolution functions, and thus the highest probability of compression is for records with the least computational exigencies. This argument is well supported by the reported results of experiment batch 5. Figures 5.12 and 5.13 show that, for all the runs made using proper w-projection, the average convolution function size for compressed records is smaller than that for really gridded records. Consequently, the compression ratio in terms of grid point updates (which controls the total gridding rate) is smaller than the respective record compression ratio. This reduction reached a factor of 10 for some runs, implying that the performance gains delivered by compression can be severely degraded. This does not necessarily make compression ineffective since, as shown in Figure 5.14, Data Set 2 is gridded using Configuration 1 at a total rate of 107G.
One notes that when there are no records to compress, there is no performance loss against a scenario with compression disabled. Compression can therefore either leave performance unaffected or enhance it.

Integration time affects compression. Data Set 2 has an integration time roughly 3 × shorter than Data Set 1, and the results depicted in Figure 5.12 reveal the highly different compression ratios between Data Set 1 and Data Set 2. This is in line with the arguments given in section 4.4.3.

A final note on compression regards channel interleaving. From the results reported for the first batch of experiments, it is clear that compression occurred over channels. This shows that compression can also deliver performance in multi-frequency image synthesis, in which channels are gridded over the same grid. The developed imaging tool does not support such a feature, but it is a good point to note for the future. An advantage of compression over channels is that its probability of occurrence should be distributed evenly over the grid. It is likely to depend on the channel frequency bandwidth: the narrower the channel bandwidth, the higher the probability of compression. It is worth recalling that, for a wide field of view, the channel bandwidth has to be as narrow as possible (refer to section 1.1.2).

It is deduced from the second and third batches of results (Figures 5.2 to 5.8) that varying the oversampling factor or the UV-grid sampling interval has little effect on the gridder's real performance. Figure 5.8 gives the average rate obtained from these experiments per convolution function size and reports the total variations in performance. The biggest fluctuations happened for the 15 × 15 convolution function, for which a maximum rate variation of 16G resulted. For all the others, the maximum variation was less than 10G.

It is also deduced that the gridder is scalable in terms of convolution function size. The gridder grids convolution functions of size larger than or equal to 31 × 31 at a nearly constant rate of around 50G. There is clearly a degradation in performance for convolution functions smaller than 31 × 31; however, this does not constitute a real issue since they are relatively light in terms of computation.

Rates are nearly maintained when proper w-projection is enabled. Figure 5.14 reports that, for the heavyweight Configuration 2, the gridder rate was around 49G for the two data sets. This is slightly less than the gridding rate achieved by convolution functions larger than or equal to 31 × 31, but much higher than the gridding rate achieved for some of the small convolution functions (7 × 7). Given that w-projection uses many convolution function sizes and that the local chunk length (refer to section 4.4.4) is variable, this rate can be considered acceptable.

Performance achieved using the lightweight Configuration 1 is also acceptable, though much less than Configuration 2. Given that, in this configuration, all convolution function sizes are less than 31 × 31 pixels, such a drop is expected since these sizes tend to be gridded at a lower rate.
Results of experiment batch 4, reported in Figures 5.9 and 5.10, show that the gridder does not down-scale with the number of polarisations being gridded. Gridding dual- or single-polarised records requires half or a quarter of the grid point updates, respectively, compared with gridding quad-polarised records. If the gridding rate (in grid point updates/sec) remained constant over the number of polarisations, records would be expected to be gridded at double or quadruple the rate, respectively, when compared to quad-polarisation gridding. Instead, as reported in Figure 5.10, the record gridding rate increased only by a factor of 1.1 × for single-polarisation gridding of large convolution functions. The culprit is believed to be the retrieval of convolution function data, which is discussed in section 5.5.4.

It is claimed that the main limiting factor of the gridder is the retrieval of the convolution function numerical data from texture, which happens for each grid point update. Romein [37] argues that the main limiting factor in his scenario is atomic operations. This behaviour is not observed in the implementation presented in this thesis, since there is little change in performance when the oversampling factor is varied. A reduction in the oversampling factor results in a reduction in the number of records that are gridded without causing any atomic commits, since more records get compressed; thus, for each grid point update, the likelihood of an atomic commit is increased. If atomic operations were the main limiting factor, the real gridding rate should drastically decrease with a decreasing oversampling factor. Instead, the rate increases for large convolution functions (refer to Figures 5.5, 5.6, 5.7). For smaller convolution functions, the rate does decrease, but not to the degree that would be expected.

A similar argument can be made for the polarisation results (batch 4). Decreasing the number of polarisations reduces the atomic operations per gridded record in proportion to the decrease in the number of polarisations, yet no substantial increase in record gridding rate resulted when the number of polarisations was decreased.

The polarisation results also rule out the floating point operations required for a grid point update as the main limiting factor. The payload is constant at eight flops per grid point and independent of the number of polarisations. In view of the heavy reduction in the gridding rate expressed in grid point updates per second, it cannot be the main limiting factor.

The remaining possible culprits are the retrieval of the convolution function from texture, the gridding logic, and access to shared memory. When compression is disabled (see Figure 5.1), the real rate increases drastically (by around 30G). The payload of shared memory access and gridding logic is fixed whether compression is enabled or disabled. The only variant is the texture performance, since texture is a cache: when compression is disabled, the likelihood of a cache hit is increased, because the probability that the same numerical data is used by subsequent records is larger than zero. Hence the claim that the main limiting factor is the retrieval of convolution function data from texture is substantiated.
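For reference, the fixed eight-flop payload corresponds to one complex multiply-accumulate per grid point and polarisation (a sketch of the arithmetic only, not the actual CUDA kernel):

```python
def grid_point_update(acc, vis, conv):
    """acc += vis * conv on complex values stored as (re, im) pairs:
    four multiplies and four additions, i.e. eight flops in total."""
    ar, ai = acc
    vr, vi = vis
    cr, ci = conv
    return (ar + vr * cr - vi * ci,   # 2 multiplies, 2 additions
            ai + vr * ci + vi * cr)   # 2 multiplies, 2 additions

print(grid_point_update((0.0, 0.0), (1.0, 2.0), (3.0, 4.0)))  # (-5.0, 10.0)
```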
WImager preparation phase performance

Figure 5.15 shows that the preparation phase prepares records at a fast rate. Its impact on the overall performance of the WImager algorithm depends on the computational intensity of the gridder: the higher the computational intensity of the gridding phase, the lower the overall impact of the preparation phase. Figure 5.16 visualises this point. For the heavyweight Configuration 2, the preparation phase execution time is negligible compared to the execution time of the gridder. On the other hand, imaging with Configuration 1 resulted in a preparation phase execution time larger than the gridding time. One must bear in mind that the preparation phase's main job is to remove logic from the gridding phase; if that logic were integrated in the gridding phase, the total execution time of the WImager algorithm would increase.
Overall performance of mt-imager is reported in Figure 5.11. Results for the multi-GPU scenario are ignored in this section but discussed in section 5.5.7. Figure 5.11 reveals that mt-imager synthesised Data Set 2 over Configuration 2 (the hardest of all runs) 94 × faster than lwimager. This result shows that the main thesis objective of developing a high-performance imaging tool has been achieved.
This gain is not sustained for all runs; this is a side effect of high performance. Computation is so efficient that the loading of data from disk becomes a significant limiting factor, during which the GPU has to wait for the first chunk of data. The worst case occurred for the simplest run (Data Set 1 over Configuration 1), where only a 6.2 × gain was obtained: loading of data took most of the time in this run. (As per section 4.5, data is loaded through the casacore MS API, which can affect the data loading time; exit time is ignored in this discussion.)

It should be stated that, for the simplest run (Data Set 1 over Configuration 1), the generation of the 200 convolution functions on the GPU finished nearly at the same time as the loading of data. This implies that the generation of convolution functions might sometimes limit performance further.

One notes the remarkably small time interval, shown in red, for all runs. During this time, the Visibility Manager makes the last preparations for the first chunk of data after all data has been loaded from disk. This interval is short because the Visibility Manager did most of the preparation work while the system was still loading data. It does not work miracles, however, and for the large Data Set 2 it requires a substantial amount of extra time to prepare the other data chunks. More in-depth analysis reveals that this extra time is needed to sort and convert the visibility data (that is, the sequence defined as {φi} in Table 4.1), which has the highest memory consumption. This extra time can also impose limits on the performance of the imaging tool: the Visibility Manager might not supply data chunks at rates faster than the processing of those chunks on the GPU. A case in point is Data Set 2 imaged using Configuration 1, where most of the GPU work is done while the Visibility Manager is preparing data.

Results show that the processing time of the Visibility Manager is independent of configuration but mostly dependent on the data set being imaged. This is an expected result.

A final observation is the excessive time the imaging tool takes to exit, marked in Figure 5.11 as exit time. This interval is wasted time, since by then all images are finalised and saved to disk. For the multi-GPU scenario, it amounted to 10 seconds! Detailed analysis revealed that most of this time is consumed by the CUDA Runtime API resetting the GPU devices. It is yet unclear what the culprit is: a limitation of the CUDA Runtime API, the GPUs, or something else.
When imaging over two GPUs (refer to Figure 5.11), a performance gain of only 1. × against the GTX670 run was obtained. (Comparison is made against the GTX670 GPU because it is less powerful than the GTX680.) The rather low value is the result of the exceptionally strong performance already obtained by imaging over one GPU: performance gains from imaging over more than one GPU are limited by the time consumed loading data from disk. If the time to load data and the exit time are ignored, a speed-up of 1. × results. Thus, mt-imager is scalable over GPUs; nevertheless, the loading of data from disk limits the gains severely.
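This behaviour follows an Amdahl-style bound: the disk-bound loading time is fixed, so extra GPUs shrink only the compute fraction (illustrative timings, not measured values):

```python
def overall_speedup(fixed, compute, n_gpus):
    """Amdahl-style bound: 'fixed' time (e.g. loading data from disk)
    does not scale; 'compute' divides evenly across n_gpus."""
    return (fixed + compute) / (fixed + compute / n_gpus)

# If loading takes as long as single-GPU compute, a second GPU can
# never deliver more than ~1.33x overall, however fast the gridder is.
print(round(overall_speedup(10.0, 10.0, 2), 2))  # 1.33
```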
Table 5.8 summarises the main performance topics reported in this chapter.

Topic                           Comment
Overall performance             Nearly 100 × faster than lwimager.
Main limiting factor            Loading of data from disk.
Other limiting factor           mt-imager takes substantial time to exit.
Scalability over GPUs           mt-imager is scalable over GPUs, but most gains are hindered by the main limiting factor.
WImager gridder performance     Real gridding rate of 50 Giga grid point updates/sec for the most computationally intensive scenario.
Compression performance         Performance obtained from compression varies depending on imaging configuration and data set. A 3-fold increase in gridding performance was obtained in particular runs.
Gridder main limiting factor    Retrieval of convolution function numeric data from GPU memory.

Table 5.8: mt-imager performance summary

Chapter 6
Conclusion
In this thesis, a new high-performance image synthesis tool for radio interferometry was developed. The tool, called malta-imager or mt-imager, exploits the computational power delivered by GPUs to achieve unprecedented high performance. The backbone handling numerical calculations was generalised into a new framework, the General Array Framework (GAFW). Test cases presented in this thesis show that the imaging tool is able to synthesise images nearly 100 × faster than a common CPU-based imaging tool. This clearly shows that the main thesis objective, the development of a high-performance imaging tool, has been achieved in full.

The achievements and detailed results reported in this thesis open the door for more research and development on the imaging tool. The tool still lacks functionality necessary to make it a popular tool of choice, especially for wide-field astronomical observations, and future work has to address this. In particular, the imaging tool lacks a deconvolution process and primary beam correction handling the A-term described in equation 1.12.

The A-term is normally handled by A-projection [80]. Convolution gridding can be applied in a similar way as in the w-projection algorithm. The main difference is that the A-term is a function of polarisation, forcing a different convolution function for each polarisation. In view of the fact that the gridder is limited by the retrieval of convolution function data from memory, this method would most probably perform poorly with the thesis implementation. A better method would be A-stacking, whereby the A-term is corrected directly on the intensity plane.

Another limitation is that synthesised images are limited by GPU memory. It is indispensable to overcome this limitation, since current and next-generation telescopes require images that do not fit in GPU memory (SKA images may be as large as 10 pixels [29]). The solution is to scale over a GPU cluster. It will be advantageous if the system also scales down to one GPU while still being able to synthesise large images; such down-scaling will let mt-imager work on PCs and non-clustered independent servers.

As pointed out in chapter 3, the GAFW has been designed in such a way as to support different hardware. In this thesis, only GPUs are supported, and the natural way forward is to support CPUs and clusters of GPUs and/or CPUs. It is quite desirable that the imaging tool grid over GPU clusters or CPU clusters by virtue of the GAFW.

In the field of high performance, it is extremely beneficial to retain an open mind-set. Though this thesis shows that GPUs provide a suitable solution, it does not mean that they provide the best solution. The possibility of synthesising over CPUs should be analysed, and the use of other hardware such as FPGAs or the Intel® Xeon Phi™ [81] should also be considered. This argument is particularly significant for the future: with the continuous enhancement of current technology and new ideas coming onto the market, it will not be a surprise if solutions better than GPUs become available.

References

[1] A. A. Michelson and F. G. Pease, “Measurement of the diameter of alpha Orionis with the interferometer,”
Astrophys. J. , vol. 53, pp. 249–259, May 1921.[2] A. A. Michelson, “On the application of interference methods to astronomicalmeasurements,”
Phil. Mag. , vol. 30, pp. 1–21, 1890.[3] A. A. Michelson, “On the Application of Interference Methods to AstronomicalMeasurements,”
Astrophys. J. , vol. 51, p. 257, Jun. 1920.[4] M. P. van Haarlem, M. W. Wise, A. W. Gunst, G. Heald et al. , “LOFAR: TheLOw-Frequency ARray,”
A&A , vol. 556, p. A2, May 2013.[5] “The Square Kilometer Array,” Date accessed: 16 th et al. , “SKA1: High Level SystemDescription,” SKA Program Development Office, Tech. Rep., Feb. 2011.[8] P. Dewdney, W. Turner, N. Roddis, R. McCool et al. , “High-Level SKA SystemDescription,” SKA Program Development Office, Tech. Rep., Feb. 2010.[9] R. T. Schilizzi, P. Alexander, J. M. Cordes, P. E. Dewdney et al. , “PreliminarySpecifications for the Square Kilometre Array, SKA Memo 100,” Tech. Rep.,Dec. 2007.[10] W. E. Carter, D. S. Robertson, A. Nothnagel, G. D. Nicolson, H. Schuh, andJ. Campbell, “IRIS-S: Extending geodetic Very Long Baseline Interferometryobservations to the southern hemisphere,” J. Geophys. Res.-Sol. Ea. , vol. 93,no. B12, pp. 14 947–14 953, 1988.[11] D. S. Robertson, “Radio interferometry,”
Rev. Geophys., vol. 25, no. 5, pp. 867–870, 1987.
[12] H. Hirabayashi, H. Hirosawa, H. Kobayashi, Y. Murata et al., “Overview and Initial Results of the Very Long Baseline Interferometry Space Observatory Programme,”
Science , vol. 281, no. 5384, pp. 1825–1829, Sep. 1998.[13] M. Ryle, “The New Cambridge Radio Telescope,”
Nature , vol. 194, pp. 517–518,May 1962.[14] A. Thompson, J. Moran, and G. Swenson,
Interferometry and Synthesis inRadio Astronomy . Wiley, 2008.[15] R. Perley, “Principles of Interferometry,” in
Twelfth SynthesisImaging Workshop . NRAO, 2010, Date accessed: 16 th Essential Radio Astronomy , NRAO, Dateaccessed: 16 th et al. , Principles of Optics:Electromagnetic Theory of Propagation, Interference and Diffraction of Light .Cambridge University Press, 1999.[19] R. Bracewell, “Radio Interferometry of Discrete Sources,”
Proc. IRE , vol. 46,no. 1, pp. 97–105, 1958.[20] J. D. O’Sullivan, “A Fast Sinc Function Gridding Algorithm for FourierInversion in Computer Tomography,”
IEEE Trans. Med. Imag. , vol. 4, no. 4,pp. 200–207, 1985.[21] J. I. Jackson, C. H. Meyer, D. G. Nishimura, and A. Macovski, “Selection of aconvolution function for Fourier inversion using gridding,”
IEEE Trans. Med.Imag. , vol. 10, no. 3, pp. 473–478, Sep. 1991.[22] A. Eklund, P. Dufort, D. Forsberg, and S. LaConte, “Medical Image Processingon the GPU: Past, Present and Future,”
Med. Image Anal. , vol. 17, no. 8, pp.1073–1094, Dec. 2013.[23] J. A. Högbom, “Aperture Synthesis with a Non-Regular Distribution ofInterferometer Baselines,”
Astron. Astrophys. Suppl. Ser. , vol. 15, p. 417, Jun.1974.[24] B. G. Clark, “An efficient implementation of the algorithm ’CLEAN’,”
Astron. Astrophys., vol. 89, p. 377, Sep. 1980.
[25] J. L. Starck, E. Pantin, and F. Murtagh, “Deconvolution in Astronomy: A Review,”
Publ. Astron. Soc. Pac. , vol. 114, no. 800, pp. 1051–1069, Oct. 2002.[26] T. J. Cornwell, K. Golap, and S. Bhatnagar, “The Noncoplanar Baselines Effectin Radio Interferometry: The W-Projection Algorithm,”
IEEE J. Sel. TopicsSignal Process. , vol. 2, pp. 647–657, Nov. 2008.[27] B. Clark, “Curvature of the Sky, VLA Scientific Memorandum No 107,” Tech.Rep., 1973.[28] T. J. Cornwell and R. A. Perley, “Radio-interferometric imaging of very largefields - The problem of non-coplanar arrays,”
Astron. Astrophys. , vol. 261, pp.353–364, Jul. 1992.[29] T. J. Cornwell, M. A. Voronkov, and B. Humphreys, “Wide field imaging forthe square kilometre array,”
Proc. SPIE , vol. 8500, p. 85000L, 2012.[30] S. M. Ord, D. A. Mitchell, R. B. Wayth, L. J. Greenhill et al. , “InterferometricImaging with the 32 Element Murchison Wide-Field Array,”
Publ. Astron. Soc.Pac. , vol. 122, no. 897, pp. 1353–1366, 2010.[31] W. Brouw, “Data Processing for the Westerbork Synthesis Radio Telescope,”Ph.D. dissertation, Leiden Observatory, Leiden University, 1971.[32] T. Cornwell and G. van Diepen, “Scaling Mount Exaflop: from thepathfinders to the Square Kilometre Array,” 2009, Date accessed: 16 th th th Queue , vol. 6, no. 2, pp. 40–53, Mar. 2008.[37] J. W. Romein, “An efficient work-distribution strategy for griddingradio-telescope data on GPUs,” in
Proceedings of the 26th ACM International Conference on Supercomputing, ser. ICS ’12. New York, NY, USA: ACM, 2012, pp. 321–330.
[38] A. R. Thompson and R. N. Bracewell, “Interpolation and Fourier transformation of fringe visibilities,”
Astron. J. , vol. 79, pp. 11–24, Jan. 1974.[39] D. Briggs, “High Fidelity Deconvolution of Moderately Resolved Sources,”Ph.D. dissertation, The New Mexico Institue of Mining and Technology,Socorro, New Mexico, 1995.[40] F. R. Schwab, “Optimal Gridding, VLA Scientific Memorandum No. 132,” Tech.Rep., 1980.[41] D. Slepian and H. O. Pollak, “Prolate spheroidal wave functions, Fourieranalysis, and uncertainty–I,”
Bell Syst. Tech. J. , vol. 40, pp. 43–64, Jan. 1961.[42] H. J. Landau and H. O. Pollak, “Prolate spheroidal wave functions, Fourieranalysis, and uncertainty-II,”
Bell Syst. Tech. J. , vol. 40, pp. 64–84, Jan. 1961.[43] B. Humphreys and T. Cornwell, “Analysis of Convolutional ResamplingAlgorithm Performance, SKA Memo 132,” CSIRO Australia Telescope NationalFacility, Tech. Rep., 2011.[44] R. H. Frater and I. S. Docherty, “On the reduction of three dimensionalinterferometer measurements,”
Astron. Astrophys. , vol. 84, pp. 75–77, Apr.1980.[45] K. W. Bannister and T. J. Cornwell, “Memory-efficient w-projection with thefast Gauss transform,”
MNRAS , vol. 430, no. 3, pp. 2390–2400, 2013.[46] J. P. McMullin, B. Waters, D. Schiebel, W. Young, and K. Golap, “CASAArchitecture and Applications,” in
Astronomical Data Analysis Software andSystems XVI , ser. ASP Conf. Ser., R. A. Shaw, F. Hill, and D. J. Bell, Eds.,vol. 376, Oct. 2007, p. 127.[47] S. Bhatnagar, “Wide-Field Imaging: I,” in
Twelfth Synthesis ImagingWorkshop . NRAO, Jun. 2010, Date accessed: 16 th CUDA C Best Practices Guide , 5th ed., NVIDIA Corporation, Oct. 2012.[50]
CUDA C Programming Guide , 5th ed., NVIDIA Corporation, Oct. 2012.[51] G. Pratx and L. Xing, “GPU computing in medical physics: A review,”
Med.Phys. , vol. 38, no. 5, pp. 2685–2697, 2011.[52] T. Schiwietz, T.-C. Chang, P. Speier, and R. Westermann, “MR imagereconstruction using the GPU,” in
Proc. SPIE, vol. 6142, 2006, p. 61423T.
[53] R. G. Edgar, M. A. Clark, K. Dale, D. A. Mitchell et al., “Enabling a high throughput real time data pipeline for a large radio telescope array with GPUs,”
Comput. Phys. Commun. , vol. 181, pp. 1707–1714, Oct. 2010.[54] J. D. Bowman, I. Cairns, D. L. Kaplan, T. Murphy et al. , “Science with theMurchison Widefield Array,”
Proc. Astron. Soc. Aust. , vol. 30, p. 31, Apr. 2013.[55] A. S. van Amesfoort, A. L. Varbanescu, H. J. Sips, and R. V. vanNieuwpoort, “Evaluating multi-core platforms for HPC data-intensive kernels,”in
Proceedings of the 6th ACM conference on Computing frontiers , ser. CF ’09.ACM, 2009, pp. 207–216.[56] S. Yatawatta, Aug. 2013, (Private Communication).[57] “ASTRON,” Netherlands Institute for Radio Astronomy, Date accessed: 16 th et al. ,“Radioastronomy Image Synthesis on the Cell/B.E,” in Proceedings of the 14thinternational Euro-Par conference on Parallel Processing , ser. Euro-Par ’08.Berlin, Heidelberg: Springer-Verlag, 2008, pp. 749–762.[59] A. L. Varbanescu, “On the Effective Parallel Programming of Multi-coreProcessors,” Ph.D. dissertation, Delft University of Technology, 2010.[60] S. Kestur, K. Irick, S. Park, A. Al Maashri, V. Narayanan, and C. Chakrabarti,“An algorithm-architecture co-design framework for gridding reconstructionusing FPGAs,” in
Proceedings of the 48th Design Automation Conference , ser.DAC ’11. New York, NY, USA: ACM, 2011, pp. 585–590.[61] ISO,
ISO/IEC 14882:2011 Information technology — Programming languages— C++ . Geneva, Switzerland: International Organization forStandardization, Feb. 2012.[62]
CUDA API Reference Manual , 4th ed., NVIDIA Corporation, Feb. 2011.[63]
CUDA CUFFT Library , 5th ed., NVIDIA Corporation, Oct. 2012.[64] M. Harris, S. Sengupta, and J. D. Owens, “Parallel Prefix Sum (Scan) withCUDA,” in
Gpu Gems 3 , 1st ed., H. Nguyen, Ed. Addison-Wesley Professional,2007, ch. 39.[65] G. E. Blelloch, “Prefix Sums and Their Applications,” in
Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers Inc., 1990, pp. 35–60.
[66] G. van Diepen, “Table System File Formats, casacore Notes Series 260,” ASTRON Dwingeloo, Tech. Rep., 2010.
[67] W. D. Pence, L. Chiappetti, C. G. Page, R. A. Shaw, and E. Stobie, “Definition of the Flexible Image Transport System (FITS), version 3.0,”
A&A , vol. 524,p. A42, Dec. 2010.[68] “IEEE Standard for Floating-Point Arithmetic,”
IEEE Std 754-2008 , pp. 1–79,2008.[69] N. Whitehead and A. Fit-Florea, “Precision & Performance: Floating Point andIEEE 754 Compliance for NVIDIA GPUs,” NVIDIA Technical White Paper,Tech. Rep., 2011.[70] D. Goldberg, “What every computer scientist should know about floating-pointarithmetic,”
ACM Comput. Surv. , vol. 23, no. 1, pp. 5–48, Mar. 1991.[71] “casacore: A set of c++ core libraries derived from aips++ [homepageon the Internet],” Date accessed: 16 th August 2013. [Online]. Available:https://code.google.com/p/casacore/[72] “Apache log4cxx,” Date accessed: 16 th August 2013. [Online]. Available:http://logging.apache.org/log4cxx/[73] D. Horn, “Stream Reduction Operations for GPGPU Applications,” in
GPUGems 2 th August 2013.[75] E. W. Greisen, M. R. Calabretta, F. G. Valdes, S. L. Allen et al. ,“Representations of spectral coordinates in FITS,”
Astron. Astrophys. , vol. 446,pp. 747–771, Feb. 2006.[76] M. R. Calabretta and E. W. Greisen, “Representations of celestial coordinatesin FITS,”
Astron.Astrophys. , vol. 395, pp. 1077–1112, Dec. 2002.[77] E. W. Greisen and M. R. Calabretta, “Representations of world coordinates inFITS,”
Astron. Astrophys. , vol. 395, pp. 1061–1076, Dec. 2002.[78] W. Pence, “CFITSIO, v2.0: A New Full-Featured Data Interface,” in
Astronomical Data Analysis Software and Systems VIII , ser. ASP Conf. Ser.,D. M. Mehringer, R. L. Plante, and D. A. Roberts, Eds., vol. 172, 1999, p. 487.[79] H. Hinnant, B. Dawes, and V. J. B. Escriba,
Boost.Chrono 2.0.0 , 2012.[80] S. Bhatnagar, U. Rau, and K. Golap, “Wide-field wide-band InterferometricImaging: The WB A-Projection and Hybrid Algorithms,”
Astrophys. J., vol. 770, p. 91, Jun. 2013.
[81] “Intel® Xeon Phi™ Product Family,” Date accessed: 16th