A new set of efficient SMP-parallel 2D Fourier subroutines
Highlights
• New set of efficient SMP-parallel 2D DFTs
• Free software license
• Significant performance boost for large array sizes
Alexander O. Korotkevich a,b

a Department of Mathematics & Statistics, The University of New Mexico, MSC01 1115, 1 University of New Mexico, Albuquerque, New Mexico, 87131-0001, USA
b L. D. Landau Institute for Theoretical Physics, 2 Kosygin Str., Moscow, 119334, Russian Federation
Abstract

An extensive set of tests on different platforms indicated that there is a performance drop in the current de facto standard software library for the Discrete Fourier Transform (DFT) for large 2D array sizes (larger than 16384 × 16384). A new set of SMP-parallel 2D DFT subroutines, built on 1D routines from the same library combined with a parallel cache-friendly block transposition, avoids this degradation and is published under a free software license.

Keywords:
FFT, GPL, 2D DFTs
1. Introduction
Since the 1990s, the de facto standard for scientific high performance computations involving the discrete Fourier transform (DFT) has been the FFTW library [1]. The author has used it throughout his research career, from the early 2000s, when it was still version 2. This library provides a self-tuned, perfected implementation of the fast Fourier transform algorithm, which was initially introduced for broad use in [2]. From the very beginning of his usage of FFTW, the author
Preprint submitted to Journal of Computational Physics, August 18, 2020

was fascinated by the elegant design of the software and its convenient application programming interface (API). The author used FFTW in many research codes (e.g. [3, 4, 5, 6, 7, 8, 9, 10, 11]). Some of them involved 2D FFTs, but only for very modest array sizes. For quite a long time, a computational workstation with many central processing unit (CPU) cores was prohibitively expensive, while distributed memory computations for these array sizes were utterly useless due to communication cost. Fortunately, the situation has changed drastically. Now SMP configurations with even several tens of CPU cores are readily available at a manageable cost. The available amount of memory has also risen to hundreds of gibibytes. This made some long standing problems of physics and applied mathematics approachable. For example, turbulence simulations require a wide range of scales, in other words large wavenumber arrays, as in the recent work [12]. In order to check some theoretical predictions for propagation of laser radiation in a turbulent atmosphere, as in paper [13], one needs a 2D array of size at least (!) 43000 × 43000. Taking into account the previously described availability of systems with a high number of CPU cores, use of a shared memory program architecture makes algorithms both simpler to implement and more efficient, as it allows one to avoid communication between parallel parts of the program thanks to direct access to the whole program memory by any part (thread) of the code. FFTW has had multithreaded parallel functions for all kinds of DFTs during practically its entire existence. At the same time, when performance tests were done, the author was disappointed to notice a drastic drop of performance of multithreaded DFTs with growth of the array sizes (see Figures 4-6 below). It should be specified that the author was mostly interested in 2D DFTs.
While parallelization of a 2D DFT looks obvious and one would expect linear or near-linear growth of performance with the number of parallel threads, the problem is not that simple for small transformation sizes, as the inevitable overhead of thread initialization and synchronization can eliminate all the gain if the amount of work for every parallel part is not large enough. But for large 2D arrays, tens of thousands of points in every direction, the situation was supposed to be pretty simple and a speedup from parallelization looked inevitable. On the contrary, the parallel FFTW performance degradation was increasing with the size of the arrays (as demonstrated later in the paper). A remedy for the situation was found, and a nearly linear speedup in the number of used CPU cores was achieved. A set of 2D DFTs (out-of-place, double precision complex numbers) is published [14] under a free software license (GNU GPL v3) for use by the community and/or incorporation into existing software.

2. Proposed solution

The author decided that most probably the architecture of FFTW is tuned for efficient memory handling for relatively small transformations. This is why it was decided to use 1D FFTW subroutines and efficient CPU cache memory handling techniques in order to try to improve the situation. For one of his previous codes [8] the author developed a parallel version of block matrix transposition, which allows one to avoid unnecessary penalties for cache misses (which occur when accessed memory locations are not consecutive, resulting in access to parts of memory (RAM) not mapped into the CPU's cache and causing latency delays).
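The block transposition idea can be sketched in a few lines of C. The function below is an illustrative serial sketch, not the library's actual code: the tile size BLK plays the role of the library's BLOCK_SIDE_SIZE parameter, and the function name and signature are hypothetical.

```c
#include <stddef.h>

/* Illustrative cache-blocked out-of-place transposition of an nx-by-ny
 * row-major matrix.  Both the source and the destination are touched in
 * small BLK-by-BLK tiles that fit into cache, so every cache line loaded
 * from RAM is fully reused before being evicted. */
#define BLK 64

void transpose_blocked(const double *in, double *out, size_t nx, size_t ny)
{
    for (size_t ib = 0; ib < nx; ib += BLK)
        for (size_t jb = 0; jb < ny; jb += BLK) {
            size_t imax = (ib + BLK < nx) ? ib + BLK : nx;
            size_t jmax = (jb + BLK < ny) ? jb + BLK : ny;
            for (size_t i = ib; i < imax; i++)
                for (size_t j = jb; j < jmax; j++)
                    out[j * nx + i] = in[i * ny + j];
        }
}
```

A naive two-loop transposition touches the destination with stride nx, so for large matrices nearly every store misses the cache; the tiled order above keeps both strides short.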
As a result, the 2D DFT was implemented in the following way: first, parallel 1D DFTs (using 1D DFTs from FFTW) are computed in the direction of the array which goes "along the memory" (the C programming language uses the so-called "row major" matrix format, meaning the matrix is stored in memory row after row, unlike FORTRAN's "column major" storage, so we first compute 1D DFTs along rows); then the array is transposed, so that the other dimension lies along the memory (in C, columns become rows); now parallel 1D DFTs are computed along the new fast dimension (in C this corresponds to 1D DFTs along rows of the transposed matrix, i.e. columns of the original one); then the array is transposed again. The transposition operations look unnecessary, as one could just perform 1D DFTs along the second dimension without them, but in such a case the distance in memory between successive elements of the corresponding 1D array would be large (for large arrays) and nearly every operation with these elements would result in a CPU "cache miss", meaning the CPU would wait for data from RAM instead of number crunching. As mentioned above, we avoid unnecessary "cache misses" by using block array transposition. For large array sizes (when the array does not fit into CPU cache) the effect of efficient memory handling can easily result in an order of magnitude speedup of the transposition operation in comparison with the naive version. Block transposition is a pretty standard technique demonstrating the importance of thoughtful use of memory, especially for large array sizes (some discussion of even more advanced memory handling techniques used internally in FFTW can be found, for example, in [15]). A simple set of subroutines for organization and synchronization of a pool of threads was used [16] (included with the new library). Because the author created this set of 2D DFT subroutines specifically for his own research tasks, only complex-to-complex out-of-place DFTs were implemented.
Other types of DFTs can be implemented on demand if new research needs arise. The published code was not specifically optimized for speed or memory usage, with the exception of the memory block size for the transposition subroutine, which was tested on several architectures to provide a slight performance gain with respect to some generic value (only two different values: for AMD ® and Intel ® CPUs). The application programming interface (API) was made as close to FFTW's as possible. It is described in detail in Appendix A, together with a complete description of the benchmark program and the proposed performance test procedure.
3. Performance of the code
All systems which participated in the benchmark are briefly described in Appendix B. In all tests version 3.3.8 (the latest at the moment of writing) of FFTW was used. Configuration of the library for different CPUs and machines is described in Appendix C. For evaluation of performance the same methodology as described for the official FFTW benchmark utility [17] was used: for complex transforms the number of operations is estimated as 5 N log2 N (here N is the total number of harmonics/points); time (up to nanoseconds) is measured for a series of DFTs which takes at least several (usually tens or hundreds of) seconds to compute; then this procedure is repeated 10 times and the smallest time (meaning the best performance), expressed in microseconds, is used to divide the estimated number of operations in order to get performance in MFLOPS (Mega-FLOPs, or millions of floating point operations per second). Two series of tests were performed: fixed array size with variable number of parallel threads, and fixed number of threads (usually the number of threads which gave the best performance for FFTW in the previous test) with variable size of the array, in order to determine at what value the performance drop happens. In the first series a double precision complex array of size 32768 × 32768 was used in all cases. For one system (System 2b) the full test was completed twice with several days between runs to avoid influence of system processes etc. The best results of all runs were used for evaluation. Benchmark results are presented in Figures 1-3.
Figure 1: Array size 32768 × 32768, performance dependence as a function of the number of parallel threads. Red: new library, green: FFTW. System 1 was equipped with two Intel ® Xeon ® CPU E5-2680 v3 @ 2.50GHz (Haswell), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores.

As one can see, for this size of the array the new library gives a performance boost on all but one system (the exception is System 3). See Table 1 for a summary of results. The performance improvement for most systems ranges from 38% to 94%.

MFLOPS          Sys 1   Sys 2a  Sys 2b  Sys 3   Sys 4   Sys 5
FFTW            42109   17640   24241   44918    9078   15961
new library     59175   34292   42130   38628   14986   21993
boost             41%     94%     74%    -14%     65%     38%
best/core/GHz     986    1021     675     501     407     529

Table 1: Comparison of best performances for a problem of size 32768 × 32768.
Figure 2: Array size 32768 × 32768, performance dependence as a function of the number of parallel threads. Red: new library, green: FFTW. (Left panel) System 2a was equipped with two Intel ® Xeon ® Silver 4110 CPU @ 2.10GHz (Skylake), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores. System 2b was equipped with two Intel ® Xeon ® Gold 6126 CPU @ 2.60GHz (Skylake), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores. (Right panel) System 3 was equipped with two Intel ® Xeon ® Gold 6242 CPU @ 2.80GHz (Cascade Lake), which results in 2 × 16 CPU cores, 2 × 32 HyperThreading cores.
Figure 3: Array size 32768 × 32768, performance dependence as a function of the number of parallel threads. Red: new library, green: FFTW. (Left panel) System 4 was equipped with two AMD ® Opteron ™ Processor 6276 @ 2.3GHz (Bulldozer), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores. (Right panel) System 5 was equipped with two Intel ® Xeon ® CPU E5-2670 0 @ 2.60GHz (Sandy Bridge EP), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores.
The only system demonstrating different behavior is System 3. Although the architectures of Systems 2a,b and System 3 are very close, the dependence changed drastically after some number of parallel threads. As one can see, until the number of threads is close to 16, everything looks pretty similar to the other systems. Some ideas why this can be so are given at the end of Appendix B. One can also notice that 16 is exactly the number of real CPU cores on one of the two processors. As a result, the absence of speedup after adding more parallel threads can be attributed to some problem of communication between the two CPUs in the system. The source(s) of this problem can range from architecture flaws and/or hardware configuration to configuration of the Linux kernel or even motherboard (BIOS) software. The reason there is no noticeable influence of this issue on FFTW's performance could be its lower usage of memory throughput, due to its slightly lower performance, which does not create this kind of issue in the kernel and/or CPUs. In summary, the behavior of System 3 is still a mystery to the author. Even on this system, if we limit the number of parallel threads to 16 (one real CPU), the new library demonstrates an improvement in performance.
In order to test the performance of the two libraries under consideration as a function of the size of an array, the author performed another series of tests. Now the number of parallel threads was fixed and only the size of the problem was changing. Once again, only out-of-place double precision complex DFTs were tested. The number of threads was determined from the tests described in the previous Subsection 3.1; namely, the author used the number of parallel threads which gave the highest (or close to that) performance for the FFTW library on every architecture. All arrays were square and the problem side length started from 4096, growing in multiples of 1024. The same testing program included in the library package was used. Results of the tests are presented in Figures 4-6. As can be seen from the measurements, for every architecture the size of the array at which FFTW experiences a dramatic drop in performance is slightly different. At the same time, the performance of the new library depends relatively weakly on the size of the problem. The observed oscillations are the consequence of the different efficiency of 1D FFTW DFTs for different problem sizes ("out of the box" FFTW is most efficient for sizes which can be represented as products of integer (natural) powers of a few lowest prime
Figure 4: Performance dependence as a function of array side length N (full 2D-array size is N × N). The number of threads was chosen as the one giving the best performance to FFTW in the previous series of tests. Red: new library, green: FFTW. System 1 was equipped with two Intel ® Xeon ® CPU E5-2680 v3 @ 2.50GHz (Haswell), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores.
Figure 5: Fixed number of threads (giving the best performance to FFTW in the previous series of tests), performance dependence as a function of the size (length of the side) of the square array. Red: new library, green: FFTW. (Left panel) System 2a was equipped with two Intel ® Xeon ® Silver 4110 CPU @ 2.10GHz (Skylake), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores; number of threads is 32. System 2b was equipped with two Intel ® Xeon ® Gold 6126 CPU @ 2.60GHz (Skylake), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores; number of threads is 48. (Right panel) System 3 was equipped with two Intel ® Xeon ® Gold 6242 CPU @ 2.80GHz (Cascade Lake), which results in 2 × 16 CPU cores, 2 × 32 HyperThreading cores; number of threads is 33.
Figure 6: Fixed number of threads (giving the best performance to FFTW in the previous series of tests), performance dependence as a function of the size (length of the side) of the square array. Red: new library, green: FFTW. (Left panel) System 4 was equipped with two AMD ® Opteron ™ Processor 6276 @ 2.3GHz (Bulldozer), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores; number of threads is 32. (Right panel) System 5 was equipped with two Intel ® Xeon ® CPU E5-2670 0 @ 2.60GHz (Sandy Bridge EP), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores; number of threads is 32.

numbers 2, 3, 5, and 7, although there is a possibility to make FFTW as efficient for larger primes as well [18]). As can be noted, the dips and peaks of the performance of the new library coincide perfectly with those of FFTW, which is inevitable since the same 1D DFT subroutines are used. See Table 2 for a summary of results. In all cases there is a significant 25-50% degradation of FFTW performance when the size of the array exceeds a threshold close to 16384 × 16384.

            Sys 1   Sys 2a  Sys 2b  Sys 3   Sys 4   Sys 5
Perf. drop    35%     40%     45%     25%     30%   40-50%

Table 2: Drop of FFTW performance as the problem size S × S exceeds the threshold (S is an integer (natural) multiple of 1024, starting from 4 × 1024; the threshold is close to 16384 × 16384 and slightly different for every system).

One can once again note the different behavior of System 3 with respect to the other systems. The first drop in performance, by 25%, happens as the problem size exceeds 16384 × 16384.
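The "out of the box" efficiency pattern mentioned above concerns sizes that factor entirely into the primes 2, 3, 5, and 7. A small helper (illustrative, not part of the library) makes the criterion concrete:

```c
/* Returns 1 if n factors entirely into the primes 2, 3, 5 and 7 -- the
 * sizes for which stock FFTW is fastest -- and 0 otherwise. */
int is_fftw_friendly(unsigned long n)
{
    static const unsigned long primes[] = {2, 3, 5, 7};
    if (n == 0)
        return 0;
    for (int i = 0; i < 4; i++)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}
```

For example, 32768 = 2^15 and 6144 = 2^11 * 3 pass the test, while a side length containing a large prime factor does not, which explains the dips in Figures 4-6.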
4. Conclusion
The significant degradation of performance of the current de facto standard DFT library when using multiple parallel threads on SMP systems motivated the author to perform an extensive set of tests on different platforms. The results of these tests indicated that there is a performance drop for array sizes exceeding some threshold (slightly different for different CPUs and architectures). The author implemented a relatively simple algorithm as a library, using 1D DFT subroutines from FFTW together with a parallel block transposition subroutine. The new library demonstrated a significant (35% and more on most systems) performance improvement for large arrays. The presence of only one AMD ® CPU in the list of tested systems was not intentional. AMD ® Opteron ™ CPUs of the Bulldozer family (and similar) are infamous for their dreadful performance. Newer CPUs of the same company appeared a few years ago. The author tried to contact AMD ® through different means of communication and requested remote access to any SMP machine for 24 hours. The author's request was declined; it was proposed to contact vendors. Unfortunately, during the time of the tests no high performance computing vendors offered servers equipped with recent AMD ® CPUs.

The current version of the library provides only 2D out-of-place complex-to-complex DFT subroutines. It comes with some limitations and is not optimized for memory consumption. Some straightforward ways of improvement include different pools of threads for transposition and 1D DFTs, elimination of the current limitations on the dimensions of the arrays (currently they have to be multiples of a relatively small number, like 64), and optimization of memory usage. Further development will depend on the author's future research needs as well as demand from potential users. The author would prefer the new library to be included as a possible engine for large arrays in existing libraries and will try to work in this direction. Another option would be internal tuning of the existing libraries, perhaps based on the tests presented in this paper, in order to eliminate the existing significant performance degradation for large arrays. Either of these two variants would make the life of users (including the author) way easier, as there would be no need to choose different libraries for different problem sizes.
Acknowledgments
The author has performed this work for the project supported by the Simons Collaboration on Wave Turbulence and is grateful for this support. Some of the systems used for development and testing (System 1 and System 4) were funded by NSF grant OCE 1131791. The author would like to express his gratitude to A. O. Prokofiev, who explained block transposition to him in the early 2000s. The author would also like to thank the developers of FFTW [1] for this beautiful and free example of software engineering.
Appendix A. Library description
Appendix A.1. Current library packaging
As currently there is only one subroutine, for out-of-place 2D complex-to-complex DFT, the author left the library as mere C-files, which have to be compiled into object files and linked to the user's program in the usual way. As the level of interest in the library is not yet clear, the author would prefer to invest time into creation of auto-configuration and installation scripts only if requested by other users. The new library depends on FFTW v3. The author strongly recommends compiling a local version of that library in the user's directory, configured to use system specific SIMD instructions as explained in Appendix C, as the performance gain can be substantial. After that, the Makefile has to be edited in order to provide paths to header files and library files. Currently, static linking is used. If the user prefers dynamic linking, the option -static in LDFLAGS has to be erased. The usual command make clean all will compile the library (object files) and the benchmark program.

Appendix A.2. Library API
The library API was intentionally made as close to FFTW's as possible. In the description we suppose that the reader is familiar with FFTW's API, a detailed description of which can be found at FFTW's web-site [1]. First of all, one needs to initialize the threads (here we initialize 32 threads):

unsigned long int threads_number = 32;
my_fft_init (threads_number);
Here threads_number is the number of threads to be initialized in the pool of threads, i.e. the maximum number of threads available for computations. This value has to be greater than or equal to zero. Then the usual creation of a DFT plan has to be performed:

my_fft_plan plan_fwd;
plan_fwd = my_fft_plan_dft_2d (in, out, out2, NX, NY, +1,
                               FFTW_EXHAUSTIVE, threads_number);
Here in and out are input and output arrays, similarly to FFTW's API. The array out2 is a scratch array of the same size, which is necessary in the present version as the array transposition subroutine does not support in-place operation. This limitation can be easily eliminated in a future version. All other parameters are passed directly to the 1D DFT API of the FFTW library. Here threads_number is the actual number of threads to be used for the DFT (it has to be less than or equal to the number of threads previously specified in my_fft_init). If it is equal to zero, the nonparallel (serial) version of the subroutine is used. Important note:
Currently the library supports only sizes where both NX and NY are multiples of the variable BLOCK_SIDE_SIZE defined in the file my_fft_lib.h. This variable defines the side of the square block used for transposition. This limitation can be eliminated if needed. Even in its present form it hardly poses any practical issue, as the array sizes for which this library becomes beneficial are usually 16384 × 16384 and larger; changing NX and NY to the nearest multiple of even 64 (the optimal value for all tested Intel ® CPUs; for AMD ® it is equal to 16) is not a serious problem. In order to perform the planned DFT one needs to call the subroutine my_fft_execute with the previously created plan:

my_fft_execute (plan_fwd);

If the user needs to free the resources allocated for a previously created plan, one needs to call the following subroutine:

my_fft_destroy_plan (plan_fwd);
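Rounding a desired dimension up to the required multiple is a one-liner; the helper below is illustrative and not part of the library API.

```c
/* Round n up to the nearest multiple of the transposition block side
 * (BLOCK_SIDE_SIZE: 64 on the tested Intel CPUs, 16 on AMD). */
unsigned long round_up_to_block(unsigned long n, unsigned long block)
{
    return ((n + block - 1) / block) * block;
}
```

For instance, the 43000 × 43000 array mentioned in the introduction would be padded to 43008 × 43008 with block = 64, an overhead well under 0.1% per dimension.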
In order for the function definitions to be included into the *.c file one needs to include my_fft_lib.h:

#include "my_fft_lib.h"

Here it is supposed that the file my_fft_lib.h is located in the same directory. If you would like to use the FFTW concept of wisdom (information about previously computed optimal plans, which significantly accelerates creation of new plans), just load the wisdom from a file using the standard FFTW subroutine. It should be noted that the new library uses only
1D serial DFTs, and parallel and serial wisdom are not compatible. Please load only the usual (nonparallel) wisdom, not the one produced as a result of plan creation for parallel DFTs.
Appendix A.3. Benchmark program
The simple benchmark program my_fft_test is provided together with the library files. It performs a performance test for a specified array size using both the new library's and FFTW's subroutines. Only out-of-place complex-to-complex 2D DFTs are compared. The result is reported in MFLOPS. Here is an example of usage of the program:

$ ./my_fft_test
Usage is the following:
./my_fft_test NX NY num_of_repeats number_of_threads [output_file]

$ ./my_fft_test 8192 8192 10 8
NX = 8192, NY = 8192, num_reps = 10, threads_number=8
8 Threads initialized.
Wisdom successfully imported.
Exporting wisdom to file...
Time for 10 Fourier transforms using my_fft is 1.09193526530e+01 seconds.
......... (here will be output for 10 runs)
Time for 10 Fourier transforms using my_fft is 1.11323441190e+01 seconds.
MFLOPS = 8062.030694
Using FFTW_MEASURE option for FFTW plan. Usually quite fast.
Time for 10 Fourier transforms using FFTW is 9.55567851900e+00 seconds.
......... (here will be output for 10 runs)
Time for 10 Fourier transforms using FFTW is 9.55813624100e+00 seconds.
MFLOPS = 9256.224182
Normed L_infty norm: ||Mismatch between my_fft and fftw||_infty/(NX*NY) = 5.218373960228812e-17
The input parameters are: sizes of the array in the X and Y dimensions, number of repeated DFTs for every time measurement (for good results it should give at least a few seconds per measurement; several tens of seconds is even better), number of threads to use, and an optional output file name. If the output file name is provided, the result is written to the file (a line is appended to the end of the file) as seven columns: X-size, Y-size, number of threads used, performance in MFLOPS for the new library, performance in MFLOPS for FFTW, best measured time for the new library, best measured time for FFTW. Shell scripts for testing are also provided. They will be discussed in the next subsection together with the methodology of testing. Important note 1:
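The seven-column output file is convenient for post-processing with standard tools. The snippet below is an illustrative sketch (the file name and the numbers in the sample line are made up, not real benchmark data):

```shell
# One seven-column line per run, as described above:
# X-size Y-size threads new-MFLOPS FFTW-MFLOPS new-time FFTW-time
printf '20480 20480 8 12345.6 10000.1 1.20 1.48\n' > thr_test_sample.dat

# Print threads versus performance for both libraries (columns 3-5).
awk '{ printf "%s threads: new=%s MFLOPS, FFTW=%s MFLOPS\n", $3, $4, $5 }' \
    thr_test_sample.dat
# prints: 8 threads: new=12345.6 MFLOPS, FFTW=10000.1 MFLOPS
```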
The author used the FFTW_MEASURE option for creation of FFTW parallel plans for large arrays, although the FFTW benchmarking manual [17] recommends FFTW_PATIENT. The reason is the extremely long time necessary for creation of plans with that option. For example, on System 1 creation of a plan for problem size 32768 × 32768 with the FFTW_PATIENT option took around 30 hours. No significant advantage for large array sizes was obtained. On System 2b creation of a plan for problem size 32768 × 32768 with the FFTW_PATIENT option took more than 24 hours. Again, no significant performance boost for large array sizes was noticed. If one would like to try it, the variable FFTW_PATIENT_PLAN has to be defined in the file my_fft_test.c. Important note 2:
As mentioned above, 10 time measurement runs are performed and the best (smallest) time is used for evaluation of performance. The number of runs can be changed by redefining the variable RUNS_NUMBER in the file my_fft_test.c. The program tries to load FFTW wisdom from the file fftw.wisdom located in the same directory. If the file is found, the program reads the wisdom and, after creation of a plan for the new library, writes the cumulative wisdom to the same file. If the program cannot find the fftw.wisdom file, it creates plans without it and saves the accumulated wisdom to the fftw.wisdom file if possible. Unfortunately, as already mentioned above, in FFTW v3 wisdom information for serial and parallel planning subroutines is incompatible. Even more, if the serial planning subroutine was used in the program, parallel wisdom cannot be used in the same program. If this situation changes in future releases, all necessary parts of the code can be activated by defining the variables READ_THR_WISDOM_FOR_FFTW and SAVE_THR_WISDOM_FOR_FFTW in the file my_fft_test.c. For parallel wisdom the program will try to use the file fftw.thr.wisdom located in the same directory.
Appendix A.4. Installation and testing methodology
When the benchmark program is compiled and working properly, one can try to fine tune the only changeable parameter in the library, namely the variable BLOCK_SIDE_SIZE in the file my_fft_lib.h, which determines the side length of the square block used for transposition. In the author's experience, the optimal values are 64 for Intel ® CPUs and 16 for AMD ® CPUs (the author had access only to relatively old AMD ® CPUs). The author recommends using powers of 2. Of course, after every change of this value, the library and test program have to be recompiled.

After that one can start scanning the system for the dependence of performance on the number of used threads with a fixed array size. Initially, it is good to determine the largest array which fits into the system's RAM. Due to the limitation mentioned above the author recommends using multiples of 2^10 = 1024. Let us say that the maximum size of the array is 20480 × 20480 (20 × 1024 on a side) and the system has 16 logical CPU cores (usually this means 8 real CPU cores and 8 more HyperThreading cores). A simple shell script thr_test.sh is provided with the library. This is how it can be tuned for the situation described above:

for i in $(seq 1 16); do
    if [ $i -lt 8 ]; then
        ./my_fft_test 20480 20480 5 $i 20k.User_thr_test.dat
    else
        ./my_fft_test 20480 20480 10 $i 20k.User_thr_test.dat
    fi
done

Here we use 5 DFTs per time measurement for small numbers of parallel threads and 10 for the presumably faster configurations with larger numbers of parallel threads. The output file format is described above in detail.

After completing this scan, one should find the number of threads which gives the best performance for FFTW. This number of threads will be used to determine the dependence of performance on the size of the problem when the number of parallel threads is fixed. For example, let us suppose that the best performance is achieved by FFTW using 15 threads. Then one can slightly modify the simple script size_test.sh provided with the library in the following way:
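A minimal sketch of such a size scan is given below. This is an illustration, not the shipped size_test.sh verbatim: the output file name, the 10 repeats, and the echo (kept so the sweep can be inspected before running; drop it to actually launch the benchmark) are assumptions.

```shell
#!/bin/sh
# Scan square arrays with side lengths s*1024 for s = 4..20,
# at the 15 threads chosen in the example above.
for s in $(seq 4 20); do
    side=$((s * 1024))
    echo ./my_fft_test "$side" "$side" 10 15 size_test.dat
done
```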
Here we scan over integer (natural) multiples of 1024 as the side length of the square array. The size changes from 4 × 1024 up to the maximum size determined earlier, at the fixed number of threads.

Appendix B. Tested systems
All systems were working under different flavors of GNU/Linux distributions, ranging from Ubuntu and Debian to Gentoo. The GNU Compiler Collection (gcc) was used as the C compiler in all cases. Here are short hardware descriptions for every system.

• System 1
CPU: 2 × Intel ® Xeon ® CPU E5-2680 v3 @ 2.50GHz (Haswell), 2x12 CPU cores, 2x24 HT cores
RAM: 128 GiB

• System 2a
CPU: 2 × Intel ® Xeon ® Silver 4110 CPU @ 2.10GHz (Skylake), 2x8 CPU cores, 2x16 HT cores
RAM: 64 GiB

• System 2b
CPU: 2 × Intel ® Xeon ® Gold 6126 CPU @ 2.60GHz (Skylake), 2x12 CPU cores, 2x24 HT cores
RAM: 64 GiB

• System 3
CPU: 2 × Intel ® Xeon ® Gold 6242 CPU @ 2.80GHz (Cascade Lake), 2x16 CPU cores, 2x32 HT cores
RAM: 384 GiB

• System 4
CPU: 2 × AMD ® Opteron ™ Processor 6276 @ 2.3GHz (Bulldozer), 2x8 CPU cores, 2x16 HT cores
RAM: 64 GiB

• System 5
CPU: 2 × Intel ® Xeon ® CPU E5-2670 0 @ 2.60GHz (Sandy Bridge EP), 2x8 CPU cores, 2x16 HT cores
RAM: 128 GiB

It should be noted that all systems except System 3 produced very consistent results in all tests. System 3 behaved very inconsistently, with significantly different results between benchmarking runs (each run took about 24 hours; the difference could be as high as 10%). This is the newest CPU among the tested systems and it was released after the recent cache-memory-associated vulnerabilities reported for Intel ® CPUs. Perhaps temporary and/or urgent patches to the cache memory system and the Linux kernel were the reason for such behavior. The author has no other explanation.
Appendix C. Configuration of FFTW
For most of the systems (Systems 1-3) which participated in the benchmarking, the following line was used during configuration of the FFTW library:

$ ./configure --enable-threads --disable-fortran --with-gnu-ld --enable-avx2 --enable-fma --enable-sse2

Of course, if you need the FORTRAN interface, please remove --disable-fortran from the line. As Systems 4-5 do not support AVX2 SIMD (single instruction multiple data) instructions, a slightly different configuration line was used:

$ ./configure --enable-threads --disable-fortran --with-gnu-ld --enable-avx --enable-fma --enable-sse2

It should be noted that although Systems 2a,b-3 support the AVX512 SIMD instruction set, it is not recommended to use the configuration option --enable-avx512, as it results in a drastic drop in performance. Comparison of different configurations can be performed using the bench program supplied with the FFTW library. The following command line was used for final testing:

./bench -onthreads=24 -opatient -s oc32768x32768

Initial testing was done on smaller 4096 × 4096 arrays. The author does not recommend supplying any additional CFLAGS during compilation; in his experience it resulted only in a performance drop. The default compilation flags are well tuned and (nearly?) optimal.
References

[1] M. Frigo, S. G. Johnson, The design and implementation of FFTW3, Proc. IEEE 93 (2) (2005) 216-231. URL http://fftw.org
[2] J. W. Cooley, J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comput. 19 (90) (1965) 297-301. doi:10.2307/2003354.
[3] A. I. Dyachenko, A. O. Korotkevich, V. E. Zakharov, Decay of the monochromatic capillary wave, JETP Lett. 77 (9) (2003) 477-481. arXiv:physics/0308100.
[4] A. I. Dyachenko, A. O. Korotkevich, V. E. Zakharov, Weak turbulence of gravity waves, JETP Lett. 77 (10) (2003) 546-550. arXiv:physics/0308101.
[5] A. I. Dyachenko, A. O. Korotkevich, V. E. Zakharov, Weak turbulent Kolmogorov spectrum for surface gravity waves, Phys. Rev. Lett. 92 (13) (2004) 134501. arXiv:physics/0308099.
[6] A. O. Korotkevich, Simultaneous numerical simulation of direct and inverse cascades in wave turbulence, Phys. Rev. Lett. 101 (7) (2008) 074504. arXiv:0805.0445.
[7] A. O. Korotkevich, A. I. Dyachenko, V. E. Zakharov, Numerical simulation of surface waves instability on a homogeneous grid, Physica D 321-322 (2016) 51-66. arXiv:1212.2225.
[8] A. O. Korotkevich, P. M. Lushnikov, Proof-of-concept implementation of the massively parallel algorithm for simulation of dispersion-managed WDM optical fiber systems, Optics Lett. 36 (10) (2011) 1851-1853. arXiv:1101.0086.
[9] A. O. Korotkevich, P. M. Lushnikov, H. A. Rose, Beyond the random phase approximation: Stimulated Brillouin backscatter for finite laser coherence times, Phys. of Plasmas 22 (2015) 012107. arXiv:1105.2094.
[10] S. A. Dyachenko, P. M. Lushnikov, A. O. Korotkevich, The complex singularity of a Stokes wave, JETP Letters 98 (11) (2013) 767-771. arXiv:1311.1882, doi:10.7868/S0370274X13230070.
[11] S. A. Dyachenko, P. M. Lushnikov, A. O. Korotkevich, Branch cuts of Stokes wave on deep water. Part I: Numerical solution and Padé approximation, Studies in Applied Mathematics 137 (4) (2016) 419-472. arXiv:1507.02784, doi:10.1111/sapm.12128.
[12] G. Falkovich, N. Vladimirova, Cascades in nonlocal turbulence, Phys. Rev. E 91 (2015) 041201. doi:10.1103/PhysRevE.91.041201. URL https://link.aps.org/doi/10.1103/PhysRevE.91.041201
[13] I. Kolokolov, V. Lebedev, P. M. Lushnikov, Statistical properties of a laser beam propagating in a turbulent medium, Phys. Rev. E 101 (2020) 042137. doi:10.1103/PhysRevE.101.042137. URL https://link.aps.org/doi/10.1103/PhysRevE.101.042137
[14] A. O. Korotkevich, Set of SMP-parallel subroutines for 2D out of place complex to complex DFTs (2019). URL https://github.com/w5kao/my_fft_lib
[15] M. Frigo, C. E. Leiserson, H. Prokop, S. Ramachandran, Cache-oblivious algorithms, Proc. IEEE Symp. on Foundations of Computer Science (FOCS) (1999) 285-297. URL http://supertech.csail.mit.edu/papers/FrigoLePr99.pdf
[16] A. O. Korotkevich, Library for initialization and synchronization of a pool of threads (2009). URL https://github.com/w5kao/my_thr_lib
[17] M. Frigo, S. G. Johnson, FFT benchmark methodology (2011).
[18] M. Frigo, S. G. Johnson, Introduction to FFTW v3 (2011).