Large-Scale Discrete Fourier Transform on TPUs
Tianjian Lu, Yi-Fan Chen, Blake Hechtman, Tao Wang, John Anderson
Google Research, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA
e-mail: {tianjianlu, yifanchen, blakehechtman, wangtao, janders}@google.com

Abstract—In this work, we present two parallel algorithms for the large-scale discrete Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two parallel algorithms correspond to two formulations of the DFT: one is based on the Kronecker product, specifically dense matrix multiplications between the input data and the Vandermonde matrix, denoted KDFT in this work; the other is based on the famous Cooley-Tukey algorithm and phase adjustment, denoted FFT in this work. Both the KDFT and FFT formulations take full advantage of TPU's strength in matrix multiplications. The KDFT formulation allows direct use of nonuniform inputs without additional steps. In the two parallel algorithms, the same strategy of data decomposition is applied to the input data. Through the data decomposition, the dense matrix multiplications in KDFT and FFT are kept local within TPU cores and can be performed completely in parallel. The communication among TPU cores is achieved through the one-shuffle scheme in both parallel algorithms, with which sending and receiving data takes place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters, minimizing the time required by the communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow. The three-dimensional complex DFT is performed on an example of dimension × × with a full TPU Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds. Scaling analysis is provided to demonstrate the high parallel efficiency of the two DFT implementations on TPUs.

Index Terms—Discrete Fourier transform, fast Fourier transform, hardware accelerator, Kronecker product, parallel computing, TensorFlow, tensor processing unit
I. INTRODUCTION
The discrete Fourier transform (DFT) is critical in many scientific and engineering applications, including time series and waveform analyses, convolution and correlation computations, solutions to partial differential equations, density functional theory in first-principles calculations, spectrum analyzers, synthetic aperture radar, computed tomography, magnetic resonance imaging, and derivatives pricing [1]–[4]. However, the computation efficiency of the DFT is often a formidable bottleneck in handling large-scale problems, due to the large data sizes and real-time processing requirements [5], [6]. In general, efforts to enhance the computation efficiency of the DFT fall into two categories: seeking fast algorithms and adapting the fast algorithms to hardware accelerators. One breakthrough in the fast-algorithm category is the Cooley-Tukey algorithm [7], also known as the fast Fourier transform (FFT), which reduces the complexity of the $N$-point DFT from $O(N^2)$ to $O(N \log N)$. The Cooley-Tukey algorithm assuming that the number of data points is a power of two is known as the Radix-2 algorithm, which was followed by the Mixed-Radix [3] and Split-Radix [8] algorithms.

In addition to the fast algorithms, the performance of hardware accelerators has been steadily driving the efficiency enhancement of DFT computation: the first implementation of the FFT algorithm was realized on the ILLIAC IV parallel computer [9], [10], and over the years, the DFT computation has been adapted to both shared-memory [11], [12] and distributed-memory architectures [13]–[17]. The advancement of hardware accelerators has enabled massive parallelization of the DFT computation. One such example is deploying the FFT computation on manycore processors [18]; another is implementing the FFT algorithm on clusters of graphics processing units (GPUs) [19]. A GPU cluster contains a number of nodes (machines), and within each node, GPUs are connected through PCIe, a high-speed serial interface. The Cooley-Tukey algorithm and its variants often require a large number of memory accesses per arithmetic operation, such that the bandwidth limitation of PCIe becomes the bottleneck of the overall performance of FFT on GPU clusters. Prior to the recent development of novel high-speed interconnects such as NVLink [20], [21], many efforts related to GPU-accelerated DFT computation were spent on minimizing the PCIe transfer time [3], [22]. It is worth mentioning that the route of algorithm-hardware co-design has also been taken with Field Programmable Gate Arrays (FPGAs) to optimize the configurations of a customized hardware accelerator for high-performance computing of the DFT [23]–[25].

The recent success of machine learning (ML), or deep learning (DL) in particular, has spurred a new wave of hardware accelerators. In many ML applications, it becomes increasingly challenging to balance the performance, cost, and energy of processors with the growth of data. Domain-specific hardware is considered a promising approach to achieve this balance [26]. One example of domain-specific hardware is Google's Tensor Processing Unit (TPU) [27]. As a reference, TPU v3 provides 420 teraflops and 128 GiB of high-bandwidth memory (HBM) [28].
Having witnessed how DFT computation benefits from the development of hardware accelerators, it is tempting to ask whether TPUs can empower large-scale DFT computation. This is plausible for the following four reasons: (1) TPU is an ML application-specific integrated circuit (ASIC) devised for neural networks (NNs); NNs require massive numbers of multiplications and additions between the data and parameters, and TPU handles these computations as matrix multiplications in a very efficient manner [29]; similarly, the DFT can also be formulated as matrix multiplications between the input data and the Vandermonde matrix; (2) TPU chips are connected directly to each other with dedicated, high-speed, and low-latency interconnects, bypassing the host CPU and any networking resources; therefore, the large-scale DFT computation can be distributed among multiple TPUs with minimal communication time and hence very high parallel efficiency; (3) the large capacity of the in-package memory of TPU makes it possible to handle large-scale DFTs efficiently; and (4) TPU is programmable with software front ends such as TensorFlow [30] and PyTorch [31], both of which make it straightforward to implement the parallel algorithms of DFT on TPUs. In fact, all four of these reasons have been verified in the high-performance Monte Carlo simulations on TPUs [32], [33].

In this work, we present two parallel algorithms for DFT on TPUs: one is based on the Kronecker product, specifically dense matrix multiplications between the input data and the Vandermonde matrix, denoted KDFT in this work; the other is based on the Cooley-Tukey algorithm and phase adjustment, denoted FFT in this work. For an $N$-point DFT, the computation complexity of KDFT is $O(N^2)$, whereas that of FFT is $O(N \log N)$. Both parallel algorithms take full advantage of TPU's strength in matrix multiplications. It is worth mentioning that KDFT takes in nonuniform input data without additional steps. The nonuniform Fourier transform has important applications in signal processing, medical imaging, numerical solutions of partial differential equations, and machine learning [34]–[37]. Both KDFT and FFT use the same strategy of data decomposition over the input data, through which the dense matrix multiplications are kept local within TPU cores and can be performed completely in parallel. The communication among TPU cores is achieved through the one-shuffle scheme in both parallel algorithms, with which sending and receiving data takes place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters, minimizing the time required by the communication among TPU cores. Scaling analysis is provided to demonstrate the high parallel efficiency of the proposed two algorithms of DFT on TPUs.

II. TPU SYSTEM ARCHITECTURE
In this section, we provide an overview of the TPU system architecture, covering both the hardware and software components.
A. Hardware architecture
Figure 1 shows one TPU board or unit: there are four TPU chips on the same board; each chip has two cores; and each core contains scalar, vector, and matrix units (MXU). Structured as a 128 × 128 systolic array, the MXU provides the bulk of the compute power of a TPU chip and handles 16 K multiply-accumulate (MAC) operations in one clock cycle. The inputs and outputs of the MXU are float32, whereas the MAC operations on the MXU are performed with bfloat16 [38]. However, one float32 number can be decomposed into multiple bfloat16 numbers, and with appropriate accumulations, a high-precision MAC operation can be achieved [39]. The implementation of both parallel algorithms in this work leverages this strategy of decomposition and accumulation and achieves float32 precision; a host-side sketch of the split is given at the end of this subsection. As shown in Fig. 1(b), each TPU core has 16 GiB of high-bandwidth memory (HBM). The large capacity of in-package memory makes it possible to solve large-scale problems in a highly efficient manner. TPU is designed as a coprocessor on the I/O bus: each board shown in Fig. 1(a) is paired with one host server consisting of CPU, RAM, and hard disk; the TPU executes the instructions sent from the CPU on the host server through PCIe.

Fig. 1: (a) TPU v3 has four chips on the same board and (b) each chip contains two cores.

Figure 2 shows a TPU v3 Pod in a data center, in which a total of 2048 cores are connected to each other. In a Pod configuration, TPU chips are connected through dedicated high-speed interconnects of very low latency. The interconnect topology is a two-dimensional (2D) toroidal mesh, with each chip connected to its four nearest neighbors such that communication takes place in four directions. These interconnects bypass the CPU networking resources and go from chip to chip directly. In our implementations, we have further optimized the communication strategy to take advantage of the TPU interconnect topology.

Fig. 2: TPU v3 Pod in a data center.

Fig. 3: A computation graph of TensorFlow operations.
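The decomposition-and-accumulation strategy can be illustrated with a short host-side sketch. This is a minimal illustration of the split described in [39], not the TPU code path; the function name is hypothetical. Each float32 operand is rounded to bfloat16 plus a bfloat16 residual, and three partial matrix products are accumulated (the residual-times-residual term is dropped as negligible at float32 resolution):

```python
import tensorflow as tf

def matmul_bf16_3pass(a, b):
  """Approximate a float32 matmul via bfloat16 splits (illustrative sketch)."""
  # Round each operand to bfloat16 precision ("hi") and keep the residual ("lo").
  a_hi = tf.cast(tf.cast(a, tf.bfloat16), tf.float32)
  a_lo = tf.cast(tf.cast(a - a_hi, tf.bfloat16), tf.float32)
  b_hi = tf.cast(tf.cast(b, tf.bfloat16), tf.float32)
  b_lo = tf.cast(tf.cast(b - b_hi, tf.bfloat16), tf.float32)
  # Accumulate the three dominant cross terms; lo*lo is below float32 resolution.
  return tf.matmul(a_hi, b_hi) + tf.matmul(a_hi, b_lo) + tf.matmul(a_lo, b_hi)
```

On TPU, each of the three partial products maps onto the MXU's bfloat16 MAC pipeline while the accumulation stays in float32.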
B. Software architecture
TensorFlow is used to program TPUs in this work. A TensorFlow client converts the TensorFlow operations into a computational graph; a sample computation graph performing addition and convolution operations is shown in Fig. 3. The TensorFlow client sends the graph to a TensorFlow server. The TensorFlow server partitions the computational graph into portions that run on TPU and CPU, respectively. If multiple TPUs are to be employed, the graph is marked for replication. The TensorFlow server then compiles the subgraph that runs on TPUs into an HLO program and invokes the accelerated linear algebra (XLA) compiler. The XLA compiler takes in the HLO program and converts it into an LLO program, which is effectively the assembly code for TPUs. Both the generation and compilation of the computational graph occur on the host server. The compiled LLO code is loaded onto TPUs for execution from the host server through PCIe.

The memory usage of a TPU is determined at compile time. Because both the hardware structure of the MXU and the memory subsystem on a TPU core prefer certain shapes for a tensor variable involved in an operation, the XLA compiler performs data layout transformations so that the hardware can process the operation efficiently. If a tensor variable does not align with the preferred shape, the XLA compiler pads one dimension with zeros to make it a multiple of eight and the other dimension to a multiple of 128. Zero padding under-utilizes the TPU core and leads to sub-optimal performance, which should be taken into account in the implementation of the parallel algorithms on TPUs; a small utilization estimate follows.
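As a back-of-the-envelope illustration of the padding effect, the helper below (names hypothetical; the 8-by-128 preferred tile follows the description above) estimates how much of the compute is spent on padded zeros:

```python
import math

def padded_shape(rows, cols):
    # XLA pads one dimension to a multiple of 8 and the other to a
    # multiple of 128; padding wastes MXU cycles on zeros.
    return (math.ceil(rows / 8) * 8, math.ceil(cols / 128) * 128)

def utilization(rows, cols):
    pr, pc = padded_shape(rows, cols)
    return (rows * cols) / (pr * pc)

# Example: utilization(1000, 100) ~= 0.78, i.e. roughly 22% of the
# MXU work would be spent on padded zeros for this tensor shape.
```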
III. DFT FORMULATIONS
In this section, we provide the detailed formulations of both KDFT and FFT.
A. KDFT Formulation
The KDFT formulation starts from the general form of the DFT, which is defined as

$$X_k \triangleq X(z_k) = \sum_{n=0}^{N-1} x_n z_k^{-n}, \qquad (1)$$

where $\triangleq$ denotes "defined to be", $x_n$ represents the input, and $\{z_k\}_{k=0}^{N-1}$ are $N$ distinctly and arbitrarily sampled points on the $z$-plane. Equation (1) can be rewritten in the matrix form

$$\{X\} = [V]\{x\}, \qquad (2)$$

where $\{X\} = \left(X(z_0), X(z_1), \cdots, X(z_{N-1})\right)^T$, $\{x\} = (x_0, x_1, \cdots, x_{N-1})^T$, and

$$[V] = \begin{bmatrix} 1 & z_0^{-1} & \cdots & z_0^{-(N-1)} \\ 1 & z_1^{-1} & \cdots & z_1^{-(N-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{N-1}^{-1} & \cdots & z_{N-1}^{-(N-1)} \end{bmatrix}. \qquad (3)$$

Note that $[V]$ is the Vandermonde matrix of dimension $N \times N$. Computing the inverse DFT is equivalent to solving the linear system in Equation (2). In the situation when $\{z_k\}_{k=0}^{N-1}$ are uniformly sampled on the $z$-plane, the Vandermonde matrix $[V]$ becomes unitary and contains the roots of unity.

The general form of a 2D DFT can be written as

$$X(z_{1,k_1}, z_{2,k_2}) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1, n_2)\, z_{1,k_1}^{-n_1} z_{2,k_2}^{-n_2}, \qquad (4)$$

where $[x]$ has dimension $N_1 \times N_2$ and $\{(z_{1,k}, z_{2,k})\}_{k=0}^{N_1 N_2 - 1}$ represents the set of distinctly and arbitrarily sampled points in the $(z_1, z_2)$ space. It is worth mentioning that the sampling with $(z_{1,k}, z_{2,k})$ has to ensure the existence of the inverse DFT. If the sampling is performed on rectangular grids, Equation (4) can be rewritten in the matrix form

$$[X] = [V_1]\,[x]\,[V_2]^T, \qquad (5)$$

where

$$[X] = \begin{bmatrix} X(z_{1,0}, z_{2,0}) & \cdots & X(z_{1,0}, z_{2,N_2-1}) \\ \vdots & \ddots & \vdots \\ X(z_{1,N_1-1}, z_{2,0}) & \cdots & X(z_{1,N_1-1}, z_{2,N_2-1}) \end{bmatrix}, \qquad (6)$$

$$[x] = \begin{bmatrix} x(0,0) & \cdots & x(0,N_2-1) \\ \vdots & \ddots & \vdots \\ x(N_1-1,0) & \cdots & x(N_1-1,N_2-1) \end{bmatrix}, \qquad (7)$$

$$[V_1] = \begin{bmatrix} 1 & z_{1,0}^{-1} & \cdots & z_{1,0}^{-(N_1-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{1,N_1-1}^{-1} & \cdots & z_{1,N_1-1}^{-(N_1-1)} \end{bmatrix}, \qquad (8)$$

and

$$[V_2] = \begin{bmatrix} 1 & z_{2,0}^{-1} & \cdots & z_{2,0}^{-(N_2-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{2,N_2-1}^{-1} & \cdots & z_{2,N_2-1}^{-(N_2-1)} \end{bmatrix}. \qquad (9)$$

Note that $[V_1]$ and $[V_2]$ are Vandermonde matrices of dimensions $N_1 \times N_1$ and $N_2 \times N_2$, respectively. One can further lump $[x]$ into a vector such that Equation (5) takes the same matrix form as Equation (2), in which $[V]$ for the 2D DFT is expressed as

$$[V] = [V_1] \otimes [V_2], \qquad (10)$$

where $\otimes$ denotes the Kronecker product [40].

Similarly, the three-dimensional (3D) DFT defined by

$$X(z_{1,k_1}, z_{2,k_2}, z_{3,k_3}) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \sum_{n_3=0}^{N_3-1} x(n_1, n_2, n_3)\, z_{1,k_1}^{-n_1} z_{2,k_2}^{-n_2} z_{3,k_3}^{-n_3} \qquad (11)$$

can be rewritten in the matrix form

$$\{X\} = \left([V_1] \otimes [V_2] \otimes [V_3]\right)\{x\}, \qquad (12)$$

where

$$[V_j] = \begin{bmatrix} 1 & z_{j,0}^{-1} & \cdots & z_{j,0}^{-(N_j-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{j,N_j-1}^{-1} & \cdots & z_{j,N_j-1}^{-(N_j-1)} \end{bmatrix}, \quad j \in \{1, 2, 3\}. \qquad (13)$$

For the 3D DFT defined in Equation (12), the sampling is performed on rectangular grids in the $(z_1, z_2, z_3)$ space, and the combined Vandermonde matrix $[V] = [V_1] \otimes [V_2] \otimes [V_3]$ has dimension $N_1 N_2 N_3 \times N_1 N_2 N_3$. It can be seen that the Kronecker product bridges the gap between the matrix and tensor operations, through which the contraction between a rank-2 tensor and a rank-3 tensor in the 3D DFT can be formulated as matrix multiplications. The KDFT formulation can be easily extended to higher dimensions. A numerical check of Equation (5) on a small example follows.
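As a concrete check of Equation (5), the short host-side script below builds the uniform-sampling Vandermonde matrices (which are then the standard DFT matrices) and compares the two-sided matrix product against NumPy's reference transform. The helper name and the use of NumPy are illustrative choices, not part of the TPU implementation:

```python
import numpy as np

def vandermonde(N):
    # Uniform sampling z_k = exp(j*2*pi*k/N) gives entries z_k^{-n}.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    return np.exp(-2j * np.pi * k * n / N)

N1, N2 = 8, 16
x = np.random.randn(N1, N2) + 1j * np.random.randn(N1, N2)

V1, V2 = vandermonde(N1), vandermonde(N2)
X = V1 @ x @ V2.T  # Equation (5): [X] = [V1][x][V2]^T

assert np.allclose(X, np.fft.fft2(x))  # matches the reference 2D DFT
```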
B. FFT Formulation

The FFT formulation starts with

$$X_k \triangleq \sum_{n=0}^{N-1} x_n e^{-j\frac{2\pi nk}{N}}, \qquad (14)$$

in which $x_n$ represents the input data and the frequency sampling has to be uniform. The global index $n$ in Equation (14) can be expressed as

$$n = Pl + \beta, \qquad (15)$$

where $l = 0, 1, \cdots, \frac{N}{P}-1$ and $\beta = 0, 1, \cdots, P-1$, with $P$ the number of cores along the transform dimension. With Equation (15), Equation (14) can be rewritten as

$$X_k = \sum_{n=0}^{N-1} x(Pl+\beta)\, e^{-j\frac{2\pi (Pl+\beta) k}{N}} \qquad (16)$$

$$= \sum_{\beta=0}^{P-1} e^{-j\frac{2\pi \beta k}{N}} \sum_{l=0}^{\frac{N}{P}-1} x(Pl+\beta)\, e^{-j\frac{2\pi l k}{N/P}}. \qquad (17)$$

In Equation (17),

$$\widetilde{X}_k = \sum_{l=0}^{\frac{N}{P}-1} x(Pl+\beta)\, e^{-j\frac{2\pi l k}{N/P}} \qquad (18)$$

is computed with the famous Cooley-Tukey algorithm locally on individual cores. Prior to the local transform, the gathering of the input among the cores is required, which arises from the global indexing in Equation (15). After the local transform, the phase adjustment needs to be applied, which is formulated as matrix multiplications similar to those in Equation (2). Higher-dimensional FFTs such as 2D and 3D can be achieved by repeating this one-dimensional (1D) scheme along the corresponding dimensions. A small numerical check of Equation (17) follows.
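To make the decomposition concrete, the following host-side NumPy sketch verifies Equation (17) on a toy size: $P$ local Cooley-Tukey transforms of the residue sequences, followed by the phase adjustment and a summation over $\beta$, reproduce the full $N$-point transform. The sizes and names are illustrative:

```python
import numpy as np

N, P = 16, 4  # N-point transform, P simulated "cores"
x = np.random.randn(N) + 1j * np.random.randn(N)

# Local transforms, Eq. (18): the beta-th core holds x[P*l + beta].
x_tilde = np.stack([np.fft.fft(x[beta::P]) for beta in range(P)])

# Phase adjustment and summation over cores, Eq. (17). The local spectrum
# is periodic in k with period N/P, hence the k % (N // P) indexing.
k = np.arange(N)
X = sum(np.exp(-2j * np.pi * beta * k / N) * x_tilde[beta, k % (N // P)]
        for beta in range(P))

assert np.allclose(X, np.fft.fft(x))
```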
IV. IMPLEMENTATION OF THE PARALLEL ALGORITHMS

In this section, we provide details on implementing both KDFT and FFT on TPUs, including the data decomposition and the one-shuffle scheme.
A. Data decomposition
The data decomposition applied to the input data localizes the matrix multiplications on individual cores, which is critical for achieving high parallel efficiency on TPUs. For a 3D DFT, the data decomposition is applied to the input data along all three dimensions. The decomposition is based on a TPU computation shape $(P_1, P_2, P_3)$, where $P_1$, $P_2$, and $P_3$ denote the numbers of TPU cores along the first, second, and third dimensions, respectively. Given the TPU computation shape $(P_1, P_2, P_3)$ and input data of dimension $N_1 \times N_2 \times N_3$, each TPU core contains a data block of dimension $\frac{N_1}{P_1} \times \frac{N_2}{P_2} \times \frac{N_3}{P_3}$, as shown in Fig. 4(b). The data decomposition is also applied to the Vandermonde matrix, along the frequency dimension. As shown in Fig. 4(a), each core holds a slice of the Vandermonde matrix of dimension $\frac{N_i}{P_i} \times N_i$, $i = 1, 2, 3$. As also shown in Fig. 4, each core is assigned an index $p_i$ along each dimension, with $p_i = 0, 1, \cdots, P_i - 1$ and $i = 1, 2, 3$. With the proposed data decomposition, the dense matrix multiplications of both KDFT and FFT are kept local within individual TPU cores and performed completely in parallel. A small helper sketch of the resulting per-core shapes follows.
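A minimal sketch of the bookkeeping implied by this decomposition; the function name and the example shape are hypothetical, and the code only computes the per-core block and Vandermonde-slice dimensions described above:

```python
def local_shapes(data_dims, mesh_shape):
    """Per-core shapes under the (P1, P2, P3) decomposition (sketch)."""
    assert all(n % p == 0 for n, p in zip(data_dims, mesh_shape))
    # Each core holds an (N1/P1, N2/P2, N3/P3) block of the input and,
    # along each dimension i, an (Ni/Pi, Ni) slice of the Vandermonde matrix.
    block = tuple(n // p for n, p in zip(data_dims, mesh_shape))
    v_slices = tuple((n // p, n) for n, p in zip(data_dims, mesh_shape))
    return block, v_slices

# Example: a 2048^3 input on a (4, 4, 32) computation shape (512 cores)
# -> block (512, 512, 64); V slices ((512, 2048), (512, 2048), (64, 2048)).
```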
B. One-shuffle scheme

The one-shuffle scheme described in Algorithm 1 is used by both KDFT and FFT; we use KDFT to illustrate it. There are two major operations in KDFT: the tensor contraction between the Vandermonde matrix and the input data, and the communication among TPU cores to send and receive data. The tensor contraction is based on einsum, and the communication among TPU cores uses collective_permute. After one operation of tensor contraction, the block of the input data initially assigned to a TPU core is shuffled once with its neighboring core. This one-time shuffle takes place along the same direction on the interconnect network. As shown in Fig. 4(b), the DFT along the third dimension requires shuffling the blocks of the input data among the cores that are grouped by the third dimension of the computation shape, $P_3$. In FFT, the one-shuffle scheme is used for applying the phase adjustment, in which case the Vandermonde matrix in Fig. 4(a) contains the phase-shift information.

With the one-shuffle scheme, sending and receiving data take place simultaneously between two neighboring cores and along the same direction on the 2D toroidal network. The one-shuffle scheme best utilizes the topology of the interconnect and hence its bandwidth. Together with the high-speed and low-latency nature of the interconnects, the one-shuffle scheme requires minimal communication time.
Fig. 4: Through the data decomposition with the TPU computation shape $(P_1, P_2, P_3)$, each TPU core contains (a) the Vandermonde matrix of dimension $\frac{N_i}{P_i} \times N_i$, $i = 1, 2, 3$, and (b) the block of input data of dimension $\frac{N_1}{P_1} \times \frac{N_2}{P_2} \times \frac{N_3}{P_3}$ for a 3D DFT. The core index is denoted by $p_i = 0, 1, \cdots, P_i - 1$, $i = 1, 2, 3$. The Fourier transform along the third dimension requires shuffling the blocks of the input data among the cores that are grouped by the third dimension of the computation shape, $P_3$.
C. Implementation of the parallel algorithm based on KDFT

Figure 5 illustrates the one-shuffle scheme in the parallel algorithm based on KDFT with a 3D example. We use $c_0$, $c_1$, and $c_2$ to denote three adjacent TPU cores, the operations on which are highlighted with three different colors in Fig. 5.

Algorithm 1 The one-shuffle scheme

  function ONE_SHUFFLE(v, x, core_idx, num_cores, src_tgt_pairs)
      iteration_idx ← 0
      slice_idx ← core_idx
      x_out ← einsum(v[slice_idx], x)
      slice_idx ← mod(slice_idx + 1, num_cores)
      while iteration_idx < num_cores − 1 do
          x ← collective_permute(x, src_tgt_pairs)
          x_out ← x_out + einsum(v[slice_idx], x)
          slice_idx ← mod(slice_idx + 1, num_cores)
          iteration_idx ← iteration_idx + 1
      return x_out

After the data decomposition, core $c_0$ contains a block of the input data $x_0$ and a slice of the Vandermonde matrix $[V_{00}, V_{01}, V_{02}]$, core $c_1$ contains $x_1$ and $[V_{10}, V_{11}, V_{12}]$, and core $c_2$ contains $x_2$ and $[V_{20}, V_{21}, V_{22}]$. Note that the subscripts appearing in a block of the input data $x_{p_1,p_2,p_3}$ are core indices; for simplicity, we keep only the core index along the dimension being shuffled, since the indices along the other dimensions are the same across cores $c_0$, $c_1$, and $c_2$. With three einsum and two collective_permute operations, one obtains the partial DFT written as $V_{00} \cdot x_0 + V_{01} \cdot x_1 + V_{02} \cdot x_2$ on core $c_0$, where $\cdot$ represents the tensor contraction. The steps taken by the partial DFT computation along one dimension are as follows (a TensorFlow sketch of Algorithm 1 is given after the list):

1. apply einsum to compute $V_{00} \cdot x_0$ on core $c_0$, $V_{11} \cdot x_1$ on core $c_1$, and $V_{22} \cdot x_2$ on core $c_2$, as shown in Fig. 5(a);
2. apply collective_permute to shuffle the input between neighboring TPU cores with source-target pairs $(c_0, c_2)$, $(c_1, c_0)$, and $(c_2, c_1)$ in the form of (source, target), such that core $c_0$ contains $x_1$, core $c_1$ contains $x_2$, and core $c_2$ contains $x_0$, as shown in Fig. 5(b);
3. apply einsum to compute $V_{01} \cdot x_1$ on core $c_0$, $V_{12} \cdot x_2$ on core $c_1$, and $V_{20} \cdot x_0$ on core $c_2$, and add the results from step 1;
4. apply collective_permute with the same source-target pairs, after which core $c_0$ contains $x_2$, core $c_1$ contains $x_0$, and core $c_2$ contains $x_1$, as shown in Fig. 5(c);
5. apply einsum to compute $V_{02} \cdot x_2$ on core $c_0$, $V_{10} \cdot x_0$ on core $c_1$, and $V_{21} \cdot x_1$ on core $c_2$, and add the results from step 3.

Fig. 5: The one-shuffle scheme in the parallel algorithm based on KDFT, illustrated with a 3D example. We use $c_0$, $c_1$, and $c_2$ to denote three adjacent cores, whose operations are highlighted with blue, yellow, and green, respectively. The data decomposition results in the block of the input data $x_0$ and the slice of the Vandermonde matrix $[V_{00}, V_{01}, V_{02}]$ on core $c_0$, $x_1$ and $[V_{10}, V_{11}, V_{12}]$ on core $c_1$, and $x_2$ and $[V_{20}, V_{21}, V_{22}]$ on core $c_2$. The steps involved in the one-shuffle scheme are: (a) computing $V_{00} \cdot x_0$ on core $c_0$, $V_{11} \cdot x_1$ on core $c_1$, and $V_{22} \cdot x_2$ on core $c_2$, with $\cdot$ representing the operation of tensor contraction; (b) collectively permuting the inputs between two neighboring cores such that $x_1$ is on core $c_0$, $x_2$ on core $c_1$, and $x_0$ on core $c_2$, and computing $V_{01} \cdot x_1$ on core $c_0$, $V_{12} \cdot x_2$ on core $c_1$, and $V_{20} \cdot x_0$ on core $c_2$; (c) collectively permuting the inputs such that $x_2$ is on core $c_0$, $x_0$ on core $c_1$, and $x_1$ on core $c_2$, and computing $V_{02} \cdot x_2$ on core $c_0$, $V_{10} \cdot x_0$ on core $c_1$, and $V_{21} \cdot x_1$ on core $c_2$. The collective_permute operation shuffles the input between neighboring TPU cores with source-target pairs $(c_0, c_2)$, $(c_1, c_0)$, and $(c_2, c_1)$ in the form of (source, target).
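Algorithm 1 maps directly onto TensorFlow ops. Below is a minimal sketch, assuming the Vandermonde slice v resident on each core is pre-split into num_cores chunks along the frequency dimension, and that the einsum subscripts 'kn,nij->kij' (contraction along the first axis of the local data block) match the chosen layout; both are illustrative assumptions, as is the use of tf.raw_ops.CollectivePermute for the shuffle inside a TPU-replicated computation:

```python
import tensorflow as tf

def one_shuffle(v, x, core_idx, num_cores, src_tgt_pairs):
  """Sketch of Algorithm 1: partial DFT along one dimension on one core.

  v: list of num_cores Vandermonde chunks resident on this core.
  x: this core's block of the input data.
  src_tgt_pairs: [(source, target), ...] pairs for the neighbor shuffle.
  """
  slice_idx = core_idx
  # Local contraction between the resident Vandermonde chunk and data block.
  x_out = tf.einsum('kn,nij->kij', v[slice_idx], x)
  slice_idx = (slice_idx + 1) % num_cores
  for _ in range(num_cores - 1):
    # One shuffle: every core sends its block to one neighbor and receives
    # from the other, all along the same direction on the torus.
    x = tf.raw_ops.CollectivePermute(input=x, source_target_pairs=src_tgt_pairs)
    x_out += tf.einsum('kn,nij->kij', v[slice_idx], x)
    slice_idx = (slice_idx + 1) % num_cores
  return x_out
```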
D. Implementation of the parallel algorithm based on FFT

Figure 6 illustrates the four steps of a 1D FFT on TPUs: the data decomposition, the gathering of the input, the local transform, and the phase adjustment. Figure 6(a) shows the input assigned to individual cores after the data decomposition. In order to achieve an in-order transform, meaning that the ordering of the obtained results in the transform domain remains the same as that of the input, a local re-ordering of the input is required prior to the transform, which can be achieved through einsum. The re-ordering operation is local within individual TPU cores. The gathering of the input, as shown in Fig. 6(b), is implemented with all_to_all. After the gathering, the Cooley-Tukey-algorithm-based transform is performed locally on individual cores, implemented with tf.signal.fft, as shown in Fig. 6(c). At last, the locally obtained transform results are summed over all the cores with the phase adjustment, which is achieved through the one-shuffle scheme; a per-core sketch of steps (c) and (d) follows. It can be seen that in the 1D FFT on TPUs, communication is required in gathering the input and in applying the phase adjustment. Higher-dimensional FFTs such as 2D and 3D can be achieved by repeating the 1D FFT along the corresponding dimensions.
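The per-core work of steps (c) and (d) can be sketched as follows; here the twiddle factors are precomputed on the host, and the summation over cores (performed on device via the one-shuffle scheme) is simulated with a plain Python sum. The names and the host-side simulation are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

def local_fft_contribution(x_beta, beta, n_total):
  """One core's phase-adjusted partial spectrum, Eqs. (17)-(18) (sketch).

  x_beta: this core's gathered residue sequence x[P*l + beta], complex64.
  """
  x_tilde = tf.signal.fft(x_beta)  # local Cooley-Tukey transform, Eq. (18)
  k = np.arange(n_total)
  twiddle = np.exp(-2j * np.pi * beta * k / n_total)  # phase adjustment
  # The local spectrum is periodic in k with period n_total // P.
  periodic = tf.gather(x_tilde, k % int(x_beta.shape[0]))
  return tf.constant(twiddle, dtype=tf.complex64) * periodic

# Host-side check against the full transform:
N, P = 16, 4
x = tf.complex(tf.random.normal([N]), tf.random.normal([N]))
X = sum(local_fft_contribution(x[b::P], b, N) for b in range(P))
assert np.allclose(X.numpy(), np.fft.fft(x.numpy()), atol=1e-3)
```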
V. PARALLEL EFFICIENCY ANALYSIS
In this section, both the strong and weak scaling analyses are provided to demonstrate the efficiency of the proposed two DFT parallel algorithms on TPUs. For the strong scaling analysis, the problem size is kept the same as proportionally more TPU cores are employed. For the weak scaling analysis, the number of TPU cores remains the same as the problem size increases. The TPU profiling tool [28], which provides information on the utilization of the hardware and the efficiency of individual operations at the program level, is used to analyze the performance of DFT on TPUs. A screenshot of the trace viewer from the TPU profiling tool is shown in Fig. 7. With the profiling tool, one can break down the operations at the HLO level, which is quite helpful in identifying the bottlenecks of the parallel efficiency and making improvements to the algorithm designs.
Fig. 6: Four steps for a 1D FFT on TPUs: (a) the data decomposition, (b) the gathering of the input, (c) the transform performed locally on individual cores, and (d) the phase adjustment through the one-shuffle scheme. The pairs of indices in (d) represent the source-target pairs used by collective_permute.
A. Strong scaling analysis of 2D KDFT
Figure 8 shows the computation time of the 2D KDFT with up to 128 TPU cores on an example of dimension × . It can be seen from Fig. 8 that a close-to-linear scaling of the computation time with respect to the number of TPU cores is achieved. As a reference, the ideal computation time under linear scaling is provided in Fig. 8, defined by

$$\text{ideal time} = \frac{2\,T_2}{N_{\text{core}}}, \qquad (19)$$

where $T_2$ denotes the total computation time with two TPU cores and $N_{\text{core}}$ is the total number of TPU cores being used. As mentioned in the parallel implementation, the total computation time consists of two parts: the time of the matrix multiplications, specifically einsum, and the communication time of sending and receiving the blocks of input data across TPU cores. It can be seen from Fig. 8 that the time of the matrix multiplications scales linearly with respect to the total number of TPU cores. This is because the matrix multiplications are kept completely local within individual cores. The computation time of the 2D KDFT scales approximately linearly with respect to the number of TPU cores, with the gap between the actual and the ideal computation time arising from the communication among TPU cores.

B. Strong scaling analysis of 3D KDFT
The parallel efficiency of the 3D KDFT is demonstrated through an example of dimension × × . Similar to the 2D case, the problem size is fixed as proportionally more TPU cores are employed. The total computation time is depicted in Fig. 9. As a reference, the ideal computation time under linear scaling is provided in Fig. 9, defined by

$$\text{ideal time} = \frac{32\,T_{32}}{N_{\text{core}}}, \qquad (20)$$

where $T_{32}$ denotes the total computation time with 32 TPU cores. It can be seen from Fig. 9 that the computation time scales approximately linearly with respect to the number of TPU cores.

The gap between the actual and the ideal computation time in the 3D case also results from the communication among TPU cores. As mentioned in the parallel implementation, the data decomposition is applied to the input data along all three dimensions with a TPU computation shape. The computation shape in this example has the form $(4, 4, n)$, with four TPU cores along the first dimension, four along the second, and $n$ along the third. It is the number of TPU cores along the third dimension that varies in Fig. 9; for example, the computation shapes $(4, 4, 16)$ and $(4, 4, 32)$ correspond to 256 and 512 TPU cores, respectively. As the number of TPU cores along the third dimension doubles, the size of the input data contained on each core is reduced by half. As a result, the computation time associated with a single collective_permute or einsum operation is also reduced by half, as shown in Fig. 10. However, as more cores are used, the total number of collective_permute operations increases: the Fourier transform along the third dimension requires a total of 31 collective_permute operations in the case of 512 TPU cores, i.e., with the TPU computation shape $(4, 4, 32)$, whereas only 15 collective_permute operations are required in the case of 256 TPU cores, i.e., with the shape $(4, 4, 16)$. Thus, even though the time associated with one single collective_permute operation decreases when more TPU cores are used, the total communication time for the DFT along the third dimension does not change. This explains the gap between the total and the ideal computation time in Fig. 9.

Fig. 7: A screenshot of the trace viewer from the TPU profiling tool.
Fig. 8: The computation time of the 2D KDFT with up to 128 TPU cores on an example of dimension × .
Fig. 9: The computation time of the 3D KDFT with up to 2048 TPU cores on an example of dimension × × .
Fig. 10: The computation time of one single operation of collective_permute and einsum, respectively, in the 3D KDFT along the third dimension with respect to the number of TPU cores.
Fig. 11: The computation time of the 3D FFT with up to 256 TPU cores on an example of dimension × × .

Fig. 12: The breakdown of the computation time with the TPU profiling tool for the 3D FFT with 256 TPU cores on an example of dimension × × .

TABLE I: Computation time of the 3D KDFT and FFT on a full TPU Pod with 2048 TPU cores.

No. | Problem size | KDFT time (seconds) | FFT time (seconds)

C. Strong scaling analysis of 3D FFT
The parallel efficiency of the 3D FFT is demonstrated through an example of dimension × × . The problem size is fixed as proportionally more TPU cores are employed. The total computation time is depicted in Fig. 11. As a reference, the ideal computation time under linear scaling is provided in Fig. 11, defined by

$$\text{ideal time} = \frac{16\,T_{16}}{N_{\text{core}}}, \qquad (21)$$

where $T_{16}$ denotes the total computation time with 16 TPU cores. As shown in Fig. 11, close-to-linear scaling is observed in the 3D FFT computation on TPUs with up to 128 cores; the example is considered small for more than 128 cores. The breakdown of the total computation time from the TPU profiling tool is provided in Fig. 12. It can be seen that the communication time, consisting of both all_to_all and collective_permute, starts to dominate the total computation time.

D. 3D KDFT and FFT on a Full TPU Pod
In addition to the strong scaling analysis, the computation time of a few 3D KDFT and FFT examples on a full TPU Pod with 2048 cores is provided in Table I. As the problem size increases from × × to × × , the computation time of KDFT increases 9.7 times. Similarly, the computation time of KDFT increases 11.8 times when the problem size increases from × × to × × . As a comparison, the computation time of FFT scales up approximately eight times as the problem size increases by eight times. For the two problems of sizes × × and × × , the total computation time of KDFT is about the same as that of FFT. Given the difference in computation complexity between KDFT ($O(N^2)$) and FFT ($O(N \log N)$), this demonstrates TPU's strength in matrix multiplications.

VI. CONCLUSION AND DISCUSSION
In this work, we proposed and implemented two parallel algorithms of DFT on TPUs, namely KDFT and FFT. The formulation of KDFT is based on the Kronecker product; the formulation of FFT is based on the Cooley-Tukey algorithm and the phase adjustment. Both formulations take full advantage of TPU's strength in matrix multiplications. The implementation is in TensorFlow. Through implementing the two parallel algorithms, TPU, the domain-specific hardware accelerator for machine learning applications, is employed in the parallel computing of large-scale DFT. The data decomposition adopted in both parallel algorithms enables the localization of the dense matrix multiplications on individual TPU cores, which can be performed completely in parallel. As for the communication among TPU cores, the one-shuffle scheme is designed based on the TPU interconnect topology, with which sending and receiving data take place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme requires minimal communication time among TPU cores and achieves very high parallel efficiency. Scaling analysis is provided to understand the parallel efficiency of the proposed DFT algorithms on TPUs.

With the demonstrated computation efficiency, the large-scale DFT on TPUs opens an array of opportunities for scientific computing. One possible future work is to integrate the DFT on TPUs with medical image reconstruction, where the nonuniform Fourier transform is extensively used. Another is to extend the two proposed algorithms into a framework that addresses large-scale DFTs of higher dimensions. Finally, the precision of the matrix multiplications in this work can be improved from float32 to float64.
VII. ACKNOWLEDGMENT
We would like to thank David Majnemer, Reid Tatge, Dehao Chen, Yusef Shafi, Damien Pierce, James Lottes, Matthias Ihme, and Rene Salmon for valuable discussions and helpful comments, which have greatly improved the paper.
REFERENCES

[1] R. N. Bracewell, The Fourier Transform and Its Applications. McGraw-Hill, New York, 1986.
[2] A. Grama, V. Kumar, A. Gupta, and G. Karypis, Introduction to Parallel Computing. Pearson Education, 2003.
[3] D. Takahashi, Fast Fourier Transform Algorithms for Parallel Computers. Springer, 2019.
[4] R. Cont, Frontiers in Quantitative Finance: Volatility and Credit Risk Modeling. John Wiley & Sons, 2009, vol. 519.
[5] G. B. Giannakis, F. Bach, R. Cendrillon, M. Mahoney, and J. Neville, "Signal processing for big data [from the guest editors]," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 15–16, 2014.
[6] E. Olshannikova, A. Ometov, Y. Koucheryavy, and T. Olsson, "Visualizing big data with augmented and virtual reality: challenges and research agenda," Journal of Big Data, vol. 2, no. 1, p. 22, 2015.
[7] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[8] P. Duhamel and H. Hollmann, "Split radix FFT algorithm," Electronics Letters, vol. 20, no. 1, pp. 14–16, 1984.
[9] G. Ackins, "Fast Fourier transform via ILLIAC IV," ILLIAC IV Document, no. 146, 1968.
[10] J. E. Stevens Jr. et al., "A fast Fourier transform subroutine for ILLIAC IV," CAC Document, no. 17, 1971.
[11] P. N. Swarztrauber, "Multiprocessor FFTs," Parallel Computing, vol. 5, no. 1-2, pp. 197–210, 1987.
[12] D. H. Bailey, "FFTs in external or hierarchical memory," The Journal of Supercomputing, vol. 4, no. 1, pp. 23–35, 1990.
[13] A. Gupta and V. Kumar, "The scalability of FFT on parallel computers," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 8, pp. 922–932, 1993.
[14] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005.
[15] D. Pekurovsky, "P3DFFT: A framework for parallel computations of Fourier transforms in three dimensions," SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C192–C209, 2012.
[16] D. Takahashi, "An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core processors," in International Conference on Parallel Processing and Applied Mathematics. Springer, 2009, pp. 606–614.
[17] D. T. Popovici, M. D. Schatz, F. Franchetti, and T. M. Low, "A flexible framework for parallel multi-dimensional DFTs," arXiv preprint arXiv:1904.10119, 2019.
[20] IEEE Micro, vol. 37, no. 2, pp. 7–17, Mar. 2017.
[21] A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. Tallent, and K. Barker, "Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect," arXiv preprint arXiv:1903.04611, 2019.
[22] Y. Chen, X. Cui, and H. Mei, "Large-scale FFT on GPU clusters," in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010, pp. 315–324.
[23] S. K. Nag and H. K. Verma, "Method for parallel-efficient configuring an FPGA for large FFTs and other vector rotation computations," US Patent 6,021,423, Feb. 1, 2000.
[24] M. Garrido, M. Acevedo, A. Ehliar, and O. Gustafsson, "Challenging the limits of FFT performance on FPGAs," IEEE, 2014, pp. 172–175.
[25] C.-L. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional DFT IP generator for FPGA platforms," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 4, pp. 755–764, 2010.
[26] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. Gonzalez et al., "A Berkeley view of systems challenges for AI," arXiv preprint arXiv:1712.05855, 2017.
[27] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," IEEE, 2017, pp. 1–12.
[28] Cloud TPUs. [Online]. Available: https://cloud.google.com/tpu/
[29] N. Jouppi. (2017) Quantifying the performance of the TPU, our first machine learning chip. [Online]. Available: https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip
[30] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[31] N. Ketkar, "Introduction to PyTorch," in Deep Learning with Python. Springer, 2017, pp. 195–208.
[32] K. Yang, Y.-F. Chen, G. Roumpos, C. Colby, and J. Anderson, "High performance Monte Carlo simulation of Ising model on TPU clusters," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. ACM, 2019, pp. 83:1–83:15. [Online]. Available: http://doi.acm.org/10.1145/3295500.3356149
[33] F. Belletti, D. King, K. Yang, R. Nelet, Y. Shafi, Y.-F. Chen, and J. Anderson, "Tensor processing units for financial Monte Carlo," arXiv preprint arXiv:1906.02818, 2019.
[34] S. Bagchi and S. K. Mitra, "The nonuniform discrete Fourier transform and its applications in filter design. I. 1-D," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 6, pp. 422–433, 1996.
[35] ——, "The nonuniform discrete Fourier transform and its applications in filter design. II. 2-D," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 6, pp. 434–444, 1996.
[36] J.-Y. Lee and L. Greengard, "The type 3 nonuniform FFT and its applications," Journal of Computational Physics, vol. 206, no. 1, pp. 1–5, 2005.
[37] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[38] Using bfloat16 with TensorFlow models. [Online]. Available: https://cloud.google.com/tpu/docs/bfloat16
[39] G. Henry, P. T. P. Tang, and A. Heinecke, "Leveraging the bfloat16 artificial intelligence datatype for higher-precision computations," arXiv preprint arXiv:1904.06376, 2019.
[40] P. A. Regalia and M. K. Sanjit, "Kronecker products, unitary matrices and signal processing applications," SIAM Review, vol. 31, no. 4, pp. 586–613, 1989.