A new set of efficient SMP-parallel 2D Fourier subroutines
Highlights
• New set of efficient SMP-parallel 2D DFTs
• Free software license
• Significant performance boost for large array sizes
Alexander O. Korotkevich a,b

a Department of Mathematics & Statistics, The University of New Mexico, MSC01 1115, 1 University of New Mexico, Albuquerque, New Mexico, 87131-0001, USA
b L. D. Landau Institute for Theoretical Physics, 2 Kosygin Str., Moscow, 119334, Russian Federation
Abstract

An extensive set of tests on different platforms indicated that there is a performance drop in the current de facto standard software library for the Discrete Fourier Transform (DFT) for large 2D array sizes (larger than 16384 × 16384). A new set of SMP-parallel 2D DFT subroutines, built on 1D routines from the same library combined with a parallel cache-friendly block transposition, avoids this degradation and is published under a free software license.

Keywords:
FFT, GPL, 2D DFTs
1. Introduction
Since the 1990s, the de facto standard for scientific high performance computations involving the discrete Fourier transform (DFT) has been the FFTW library [1]. The author has used it throughout his research career, from the early 2000s, when it was still version 2. This library provides a self-tuned, perfected implementation of the fast Fourier transform algorithm, which was initially introduced for broad use in [2]. From the very beginning of his usage of FFTW, the author
Preprint submitted to Journal of Computational Physics, August 18, 2020

was fascinated by the elegant design of the software and its convenient application programming interface (API). The author used FFTW in many research codes (e.g. [3, 4, 5, 6, 7, 8, 9, 10, 11]). Some of them involved 2D FFTs, but only for very modest array sizes. For quite a long time, a computational workstation with many central processing unit (CPU) cores was prohibitively expensive, while distributed memory computations for these array sizes were utterly useless due to communication cost. Fortunately, the situation has changed drastically. Now SMP configurations with even several tens of CPU cores are readily available at a manageable cost. The available amount of memory has also risen to hundreds of gibibytes. This made some long standing problems of physics and applied mathematics approachable. For example, turbulence simulations require a wide range of scales, in other words large wavenumber arrays, as in the recent work [12]. In order to check some theoretical predictions for propagation of laser radiation in a turbulent atmosphere, as in paper [13], one needs a 2D array of size at least (!) 43000 × 43000. Taking into account the previously described availability of systems with a high number of CPU cores, use of a shared memory program architecture makes algorithms both simpler to implement and more efficient, as it allows one to avoid communication between parallel parts of the program thanks to direct access to the whole program memory by any part (thread) of the code. FFTW has had multithreaded parallel functions for all kinds of DFTs during practically its entire existence. At the same time, when performance tests were done, the author was disappointed to notice a drastic drop of performance of multithreaded DFTs with growth of the array sizes (see Figures 4-6 below). It should be specified that the author was mostly interested in 2D DFTs.
While parallelization of a 2D DFT looks obvious and one would expect linear or near-linear growth of performance with the number of parallel threads, the problem is not that simple for small transformation sizes, as the inevitable overhead of thread initialization and synchronization can eliminate all the gain if the amount of work for every parallel part is not large enough. But for large 2D arrays, tens of thousands of points in every direction, the situation was supposed to be pretty simple and a speedup from parallelization looked inevitable. On the contrary, the parallel FFTW performance degradation was increasing with the size of the arrays (as demonstrated later in the paper). A remedy for the situation was found, and a nearly linear speedup in the number of used CPU cores was achieved. A set of 2D DFTs (out-of-place, double precision complex numbers) is published [14] under a free software license (GNU GPL v3) for use by the community and/or incorporation into existing software.

2. Proposed solution

The author decided that most probably the architecture of FFTW is tuned for efficient memory handling for relatively small transformations. This is why it was decided to use 1D FFTW subroutines and efficient CPU cache memory handling techniques in order to try to improve the situation. For one of his previous codes [8] the author developed a parallel version of block matrix transposition, which allows one to avoid unnecessary penalties for cache misses (which occur when accessed memory locations are not consecutive, resulting in access to parts of memory (RAM) not mapped into the CPU's cache and causing latency delays).
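The block transposition idea can be sketched in a few lines of C. The function below is an illustrative serial sketch, not the library's actual code: the tile size BLK plays the role of the library's BLOCK_SIDE_SIZE parameter, and the function name and signature are hypothetical.

```c
#include <stddef.h>

/* Illustrative cache-blocked out-of-place transposition of an nx-by-ny
 * row-major matrix.  Both the source and the destination are touched in
 * small BLK-by-BLK tiles that fit into cache, so every cache line loaded
 * from RAM is fully reused before being evicted. */
#define BLK 64

void transpose_blocked(const double *in, double *out, size_t nx, size_t ny)
{
    for (size_t ib = 0; ib < nx; ib += BLK)
        for (size_t jb = 0; jb < ny; jb += BLK) {
            size_t imax = (ib + BLK < nx) ? ib + BLK : nx;
            size_t jmax = (jb + BLK < ny) ? jb + BLK : ny;
            for (size_t i = ib; i < imax; i++)
                for (size_t j = jb; j < jmax; j++)
                    out[j * nx + i] = in[i * ny + j];
        }
}
```

A naive two-loop transposition touches the destination with stride nx, so for large matrices nearly every store misses the cache; the tiled order above keeps both strides short.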
As a result, the 2D DFT was implemented in the following way: first, parallel 1D DFTs (using 1D DFTs from FFTW) are computed in the direction of the array which goes "along the memory" (the C programming language uses the so-called "row major" matrix format, meaning the matrix is stored in memory row after row, unlike FORTRAN's "column major" storage, so we first compute 1D DFTs along rows); then the array is transposed, so that the other dimension lies along the memory (in C, columns become rows); now parallel 1D DFTs are computed along the new fast dimension (in C this corresponds to 1D DFTs along rows of the transposed matrix, i.e. columns of the original one); then the array is transposed again. The transposition operations look unnecessary, as one could just perform 1D DFTs along the second dimension without them, but in such a case the distance in memory between successive elements of the corresponding 1D array would be large (for large arrays) and nearly every operation with these elements would result in a CPU "cache miss", meaning the CPU would wait for data from RAM instead of number crunching. As mentioned above, we avoid unnecessary "cache misses" by using block array transposition. For large array sizes (when the array does not fit into CPU cache) the effect of efficient memory handling can easily result in an order of magnitude speedup of the transposition operation in comparison with the naive version. Block transposition is a pretty standard technique demonstrating the importance of thoughtful use of memory, especially for large array sizes (some discussion of even more advanced memory handling techniques used internally in FFTW can be found, for example, in [15]). A simple set of subroutines for organization and synchronization of a pool of threads was used [16] (included with the new library). Because the author created this set of 2D DFT subroutines specifically for his own research tasks, only complex-to-complex out-of-place DFTs were implemented.
Other types of DFTs can be implemented on demand if new research needs arise. The published code was not specifically optimized for speed or memory usage, with the exception of the memory block size for the transposition subroutine, which was tested on several architectures to provide a slight performance gain with respect to some generic value (only two different values: for AMD ® and Intel ® CPUs). The application programming interface (API) was made as close to FFTW's as possible. It is described in detail in Appendix A, together with a complete description of the benchmark program and the proposed performance test procedure.
3. Performance of the code
All systems which participated in the benchmark are briefly described in Appendix B. In all tests version 3.3.8 (the latest at the moment of writing) of FFTW was used. Configuration of the library for different CPUs and machines is described in Appendix C. For evaluation of performance the same methodology as described for the official FFTW benchmark utility [17] was used: for complex transforms the number of operations is estimated as 5 N log2 N (here N is the total number of harmonics/points); time (up to nanoseconds) is measured for a series of DFTs which takes at least several (usually tens or hundreds of) seconds to compute; then this procedure is repeated 10 times and the smallest time (meaning the best performance), expressed in microseconds, is used to divide the estimated number of operations in order to get performance in MFLOPS (Mega-FLOPs, or millions of floating point operations per second). Two series of tests were performed: fixed array size with variable number of parallel threads, and fixed number of threads (usually the number of threads which gave the best performance for FFTW in the previous test) with variable size of the array, in order to determine at what value the performance drop happens. In the first series a double precision complex array of size 32768 × 32768 was used in all cases. For one system (System 2b) the full test was completed twice with several days between runs to avoid influence of system processes etc. The best results of all runs were used for evaluation. Benchmark results are presented in Figures 1-3.
Figure 1: Array size 32768 × 32768, performance dependence as a function of the number of parallel threads. Red: new library, green: FFTW. System 1 was equipped with two Intel ® Xeon ® CPU E5-2680 v3 @ 2.50GHz (Haswell), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores.

As one can see, for this size of the array the new library gives a performance boost on all but one system (the exception is System 3). See Table 1 for a summary of results. The performance improvement for most systems ranges from 38% to 94%.

MFLOPS          Sys 1   Sys 2a  Sys 2b  Sys 3   Sys 4   Sys 5
FFTW            42109   17640   24241   44918    9078   15961
new library     59175   34292   42130   38628   14986   21993
boost             41%     94%     74%    -14%     65%     38%
best/core/GHz     986    1021     675     501     407     529

Table 1: Comparison of best performances for a problem of size 32768 × 32768.
Figure 2: Array size 32768 × 32768, performance dependence as a function of the number of parallel threads. Red: new library, green: FFTW. (Left panel) System 2a was equipped with two Intel ® Xeon ® Silver 4110 CPU @ 2.10GHz (Skylake), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores. System 2b was equipped with two Intel ® Xeon ® Gold 6126 CPU @ 2.60GHz (Skylake), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores. (Right panel) System 3 was equipped with two Intel ® Xeon ® Gold 6242 CPU @ 2.80GHz (Cascade Lake), which results in 2 × 16 CPU cores, 2 × 32 HyperThreading cores.
Figure 3: Array size 32768 × 32768, performance dependence as a function of the number of parallel threads. Red: new library, green: FFTW. (Left panel) System 4 was equipped with two AMD ® Opteron ™ Processor 6276 @ 2.3GHz (Bulldozer), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores. (Right panel) System 5 was equipped with two Intel ® Xeon ® CPU E5-2670 0 @ 2.60GHz (Sandy Bridge EP), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores.
The only system demonstrating different behavior is System 3. Although the architectures of Systems 2a,b and System 3 are very close, the dependence changed drastically after some number of parallel threads. As one can see, until the number of threads is close to 16, everything looks pretty similar to the other systems. Some ideas why this can be so are given at the end of Appendix B. One can also notice that 16 is exactly the number of real CPU cores on one of the two processors. As a result, the absence of speedup after adding more parallel threads can be attributed to some problem of communication between the two CPUs in the system. The source(s) of this problem can range from architecture flaws and/or hardware configuration to configuration of the Linux kernel or even motherboard (BIOS) software. The reason there is no noticeable influence of this issue on FFTW's performance could be its lower usage of memory throughput, due to its slightly lower performance, which does not create this kind of issue in the kernel and/or CPUs. In summary, the behavior of System 3 is still a mystery to the author. Even on this system, if we limit the number of parallel threads to 16 (one real CPU), the new library demonstrates an improvement in performance.
In order to test the performance of the two libraries under consideration as a function of the size of an array, the author performed another series of tests. Now the number of parallel threads was fixed and only the size of the problem was changing. Once again, only out-of-place double precision complex DFTs were tested. The number of threads was determined from the tests described in the previous Subsection 3.1; namely, the author used the number of parallel threads which gave the highest (or close to that) performance for the FFTW library on every architecture. All arrays were square and the problem side length started from 4096, growing in multiples of 1024. The same testing program included in the library package was used. Results of the tests are presented in Figures 4-6. As can be seen from the measurements, for every architecture the size of the array at which FFTW experiences a dramatic drop in performance is slightly different. At the same time, the performance of the new library depends relatively weakly on the size of the problem. The observed oscillations are the consequence of the different efficiency of 1D FFTW DFTs for different problem sizes ("out of the box" FFTW is most efficient for sizes which can be represented as products of integer (natural) powers of a few lowest prime
Figure 4: Performance dependence as a function of array side length N (full 2D-array size is N × N). The number of threads was chosen as the one giving the best performance to FFTW in the previous series of tests. Red: new library, green: FFTW. System 1 was equipped with two Intel ® Xeon ® CPU E5-2680 v3 @ 2.50GHz (Haswell), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores.
Figure 5: Fixed number of threads (giving the best performance to FFTW in the previous series of tests), performance dependence as a function of the size (length of the side) of the square array. Red: new library, green: FFTW. (Left panel) System 2a was equipped with two Intel ® Xeon ® Silver 4110 CPU @ 2.10GHz (Skylake), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores; number of threads is 32. System 2b was equipped with two Intel ® Xeon ® Gold 6126 CPU @ 2.60GHz (Skylake), which results in 2 × 12 CPU cores, 2 × 24 HyperThreading cores; number of threads is 48. (Right panel) System 3 was equipped with two Intel ® Xeon ® Gold 6242 CPU @ 2.80GHz (Cascade Lake), which results in 2 × 16 CPU cores, 2 × 32 HyperThreading cores; number of threads is 33.
Figure 6: Fixed number of threads (giving the best performance to FFTW in the previous series of tests), performance dependence as a function of the size (length of the side) of the square array. Red: new library, green: FFTW. (Left panel) System 4 was equipped with two AMD ® Opteron ™ Processor 6276 @ 2.3GHz (Bulldozer), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores; number of threads is 32. (Right panel) System 5 was equipped with two Intel ® Xeon ® CPU E5-2670 0 @ 2.60GHz (Sandy Bridge EP), which results in 2 × 8 CPU cores, 2 × 16 HyperThreading cores; number of threads is 32.

numbers 2, 3, 5, and 7, although there is a possibility to make FFTW as efficient for larger primes as well [18]). As can be noted, the dips and peaks of the performance of the new library coincide perfectly with those of FFTW, which is inevitable since the same 1D DFT subroutines are used. See Table 2 for a summary of results. In all cases there is a significant 25-50% degradation of FFTW performance when the size of the array exceeds a threshold close to 16384 × 16384.

            Sys 1   Sys 2a  Sys 2b  Sys 3   Sys 4   Sys 5
Perf. drop    35%     40%     45%     25%     30%   40-50%

Table 2: Drop of FFTW performance as the problem size S × S exceeds the threshold (S is an integer (natural) multiple of 1024, starting from 4 × 1024; the threshold is close to 16384 × 16384 and slightly different for every system).

One can once again note the different behavior of System 3 with respect to the other systems. The first drop in performance, by 25%, happens as the problem size exceeds 16384 × 16384.
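The "out of the box" efficiency pattern mentioned above concerns sizes that factor entirely into the primes 2, 3, 5, and 7. A small helper (illustrative, not part of the library) makes the criterion concrete:

```c
/* Returns 1 if n factors entirely into the primes 2, 3, 5 and 7 -- the
 * sizes for which stock FFTW is fastest -- and 0 otherwise. */
int is_fftw_friendly(unsigned long n)
{
    static const unsigned long primes[] = {2, 3, 5, 7};
    if (n == 0)
        return 0;
    for (int i = 0; i < 4; i++)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}
```

For example, 32768 = 2^15 and 6144 = 2^11 * 3 pass the test, while a side length containing a large prime factor does not, which explains the dips in Figures 4-6.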
4. Conclusion
The significant degradation of performance of the current de facto standard DFT library when using multiple parallel threads on SMP systems motivated the author to perform an extensive set of tests on different platforms. The results of these tests indicated that there is a performance drop for array sizes exceeding some threshold (slightly different for different CPUs and architectures). The author implemented a relatively simple algorithm as a library, using 1D DFT subroutines from FFTW together with a parallel block transposition subroutine. The new library demonstrated a significant (35% and more on most systems) performance improvement for large arrays. The presence of only one AMD ® CPU in the list of tested systems was not intentional. AMD ® Opteron ™ CPUs of the Bulldozer family (and similar) are infamous for their dreadful performance. Newer CPUs of the same company appeared a few years ago. The author tried to contact AMD ® through different means of communication and requested remote access to any SMP machine for 24 hours. The author's request was declined; it was proposed to contact vendors. Unfortunately, during the time of the tests no high performance computing vendors offered servers equipped with recent AMD ® CPUs.

The current version of the library provides only 2D out-of-place complex-to-complex DFT subroutines. It comes with some limitations and is not optimized for memory consumption. Some straightforward ways of improvement include different pools of threads for transposition and 1D DFTs, elimination of the current limitations on the dimensions of the arrays (currently they have to be multiples of a relatively small number, like 64), and optimization of memory usage. Further development will depend on the author's future research needs as well as demand from potential users. The author would prefer the new library to be included as a possible engine for large arrays in existing libraries and will try to work in this direction. Another option would be internal tuning of the existing libraries, perhaps based on the tests presented in this paper, in order to eliminate the existing significant performance degradation for large arrays. Either of these two variants would make the life of users (including the author) way easier, as there would be no need to choose different libraries for different problem sizes.
Acknowledgments
The author has performed this work for the project supported by the Simons Collaboration on Wave Turbulence and is grateful for this support. Some of the systems used for development and testing (System 1 and System 4) were funded by NSF grant OCE 1131791. The author would like to express his gratitude to A. O. Prokofiev, who explained block transposition to him in the early 2000s. The author would also like to thank the developers of FFTW [1] for this beautiful and free example of software engineering.
Appendix A. Library description
Appendix A.1. Current library packaging
As currently there is only one subroutine, for out-of-place 2D complex-to-complex DFT, the author left the library as mere C-files, which have to be compiled into object files and linked to the user's program in the usual way. As the level of interest in the library is not yet clear, the author would prefer to invest time into creation of auto-configuration and installation scripts only if requested by other users. The new library depends on FFTW v3. The author strongly recommends compiling a local version of that library in the user's directory, configured to use system specific SIMD instructions as explained in Appendix C, as the performance gain can be substantial. After that, the Makefile has to be edited in order to provide paths to header files and library files. Currently, static linking is used. If the user prefers dynamic linking, the option -static in LDFLAGS has to be erased. The usual command make clean all will compile the library (object files) and the benchmark program.

Appendix A.2. Library API
The library API was intentionally made as close to FFTW's as possible. In the description we suppose that the reader is familiar with FFTW's API, a detailed description of which can be found at FFTW's web-site [1]. First of all, one needs to initialize the threads (here we initialize 32 threads):

unsigned long int threads_number = 32;
my_fft_init (threads_number);
Here threads_number is the number of threads to be initialized in the pool of threads, i.e. the maximum number of threads available for computations. This value has to be greater than or equal to zero. Then the usual creation of a DFT plan has to be performed:

my_fft_plan plan_fwd;
plan_fwd = my_fft_plan_dft_2d (in, out, out2, NX, NY, +1,
                               FFTW_EXHAUSTIVE, threads_number);
Here in and out are input and output arrays, similarly to FFTW's API. The array out2 is a scratch array of the same size, which is necessary in the present version as the array transposition subroutine does not support in-place operation. This limitation can be easily eliminated in a future version. All other parameters are passed directly to the 1D DFT API of the FFTW library. Here threads_number is the actual number of threads to be used for the DFT (it has to be less than or equal to the number of threads previously specified in my_fft_init). If it is equal to zero, the nonparallel (serial) version of the subroutine is used. Important note:
Currently the library supports only sizes where both NX and NY are multiples of the variable BLOCK_SIDE_SIZE defined in the file my_fft_lib.h. This variable defines the side of the square block used for transposition. This limitation can be eliminated if needed. Even in its present form it hardly poses any practical issue, as the array sizes for which this library becomes beneficial are usually 16384 × 16384 and larger; changing NX and NY to the nearest multiple of even 64 (the optimal value for all tested Intel ® CPUs; for AMD ® it is equal to 16) is not a serious problem. In order to perform the planned DFT one needs to call the subroutine my_fft_execute with the previously created plan:

my_fft_execute (plan_fwd);

If the user needs to free the resources allocated for a previously created plan, one needs to call the following subroutine:

my_fft_destroy_plan (plan_fwd);
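Rounding a desired dimension up to the required multiple is a one-liner; the helper below is illustrative and not part of the library API.

```c
/* Round n up to the nearest multiple of the transposition block side
 * (BLOCK_SIDE_SIZE: 64 on the tested Intel CPUs, 16 on AMD). */
unsigned long round_up_to_block(unsigned long n, unsigned long block)
{
    return ((n + block - 1) / block) * block;
}
```

For instance, the 43000 × 43000 array mentioned in the introduction would be padded to 43008 × 43008 with block = 64, an overhead well under 0.1% per dimension.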
In order for the function definitions to be included into the *.c file one needs to include my_fft_lib.h:

#include "my_fft_lib.h"

Here it is supposed that the file my_fft_lib.h is located in the same directory. If you would like to use the FFTW concept of wisdom (information about previously computed optimal plans, which significantly accelerates creation of new plans), just load the wisdom from a file using the standard FFTW subroutine. It should be noted that the new library uses only
1D serial DFTs, and parallel and serial wisdom are not compatible. Please load only the usual (nonparallel) wisdom, not the one produced as a result of plan creation for parallel DFTs.
Appendix A.3. Benchmark program
The simple benchmark program my_fft_test is provided together with the library files. It performs a performance test for a specified array size using both the new library's and FFTW's subroutines. Only out-of-place complex-to-complex 2D DFTs are compared. The result is reported in MFLOPS. Here is an example of usage of the program:

$ ./my_fft_test
Usage is the following:
./my_fft_test NX NY num_of_repeats number_of_threads [output_file]

$ ./my_fft_test 8192 8192 10 8
NX = 8192, NY = 8192, num_reps = 10, threads_number=8
8 Threads initialized.
Wisdom successfully imported.
Exporting wisdom to file...
Time for 10 Fourier transforms using my_fft is 1.09193526530e+01 seconds.
......... (here will be output for 10 runs)
Time for 10 Fourier transforms using my_fft is 1.11323441190e+01 seconds.
MFLOPS = 8062.030694
Using FFTW_MEASURE option for FFTW plan. Usually quite fast.
Time for 10 Fourier transforms using FFTW is 9.55567851900e+00 seconds.
......... (here will be output for 10 runs)
Time for 10 Fourier transforms using FFTW is 9.55813624100e+00 seconds.
MFLOPS = 9256.224182
Normed L_infty norm: ||Mismatch between my_fft and fftw||_infty/(NX*NY) = 5.218373960228812e-17
The input parameters are: sizes of the array in the X and Y dimensions, number of repeated DFTs for every time measurement (for good results it should give at least a few seconds per measurement; several tens of seconds is even better), number of threads to use, and an optional output file name. If the output file name is provided, the result is written to the file (a line is appended to the end of the file) as seven columns: X-size, Y-size, number of threads used, performance in MFLOPS for the new library, performance in MFLOPS for FFTW, best measured time for the new library, best measured time for FFTW. Shell scripts for testing are also provided. They will be discussed in the next subsection together with the methodology of testing. Important note 1:
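The seven-column output file is convenient for post-processing with standard tools. The snippet below is an illustrative sketch (the file name and the numbers in the sample line are made up, not real benchmark data):

```shell
# One seven-column line per run, as described above:
# X-size Y-size threads new-MFLOPS FFTW-MFLOPS new-time FFTW-time
printf '20480 20480 8 12345.6 10000.1 1.20 1.48\n' > thr_test_sample.dat

# Print threads versus performance for both libraries (columns 3-5).
awk '{ printf "%s threads: new=%s MFLOPS, FFTW=%s MFLOPS\n", $3, $4, $5 }' \
    thr_test_sample.dat
# prints: 8 threads: new=12345.6 MFLOPS, FFTW=10000.1 MFLOPS
```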
The author used the FFTW_MEASURE option for creation of FFTW parallel plans for large arrays, although the FFTW benchmarking manual [17] recommends FFTW_PATIENT. The reason is the extremely long time necessary for creation of plans with that option. For example, on System 1 creation of a plan for problem size 32768 × 32768 with the FFTW_PATIENT option took around 30 hours. No significant advantage for large array sizes was obtained. On System 2b creation of a plan for problem size 32768 × 32768 with the FFTW_PATIENT option took more than 24 hours. Again, no significant performance boost for large array sizes was noticed. If one would like to try it, the variable FFTW_PATIENT_PLAN has to be defined in the file my_fft_test.c. Important note 2:
As mentioned above, 10 time measurement runs are performed and the best (smallest) time is used for evaluation of performance. The number of runs can be changed by redefining the variable RUNS_NUMBER in the file my_fft_test.c. The program tries to load FFTW wisdom from the file fftw.wisdom located in the same directory. If the file is found, the program reads the wisdom and, after creation of a plan for the new library, writes the cumulative wisdom to the same file. If the program cannot find the fftw.wisdom file, it creates plans without it and saves the accumulated wisdom to the fftw.wisdom file if possible. Unfortunately, as already mentioned above, in FFTW v3 wisdom information for serial and parallel planning subroutines is incompatible. Even more, if the serial planning subroutine was used in the program, parallel wisdom cannot be used in the same program. If this situation changes in future releases, all necessary parts of the code can be activated by defining the variables READ_THR_WISDOM_FOR_FFTW and SAVE_THR_WISDOM_FOR_FFTW in the file my_fft_test.c. For parallel wisdom the program will try to use the file fftw.thr.wisdom located in the same directory.
Appendix A.4. Installation and testing methodology
When the benchmark program is compiled and working properly, one can try to fine tune the only changeable parameter in the library, namely the variable BLOCK_SIDE_SIZE in the file my_fft_lib.h, which determines the side length of the square block used for transposition. In the author's experience, the optimal values are 64 for Intel ® CPUs and 16 for AMD ® CPUs (the author had access only to relatively old AMD ® CPUs). The author recommends using powers of 2. Of course, after every change of this value, the library and test program have to be recompiled.

After that one can start scanning the system for the dependence of performance on the number of used threads with a fixed array size. Initially, it is good to determine the largest array which fits into the system's RAM. Due to the limitation mentioned above the author recommends using multiples of 2^10 = 1024. Let us say that the maximum size of the array is 20480 × 20480 (20 × 1024 on a side) and the system has 16 logical CPU cores (usually this means 8 real CPU cores and 8 more HyperThreading cores). A simple shell script thr_test.sh is provided with the library. This is how it can be tuned for the situation described above:

for i in $(seq 1 16); do
    if [ $i -lt 8 ]; then
        ./my_fft_test 20480 20480 5 $i 20k.User_thr_test.dat
    else
        ./my_fft_test 20480 20480 10 $i 20k.User_thr_test.dat
    fi
done

Here we use 5 DFTs per time measurement for small numbers of parallel threads and 10 for the presumably faster configurations with larger numbers of parallel threads. The output file format is described above in detail.

After completing this scan, one should find the number of threads which gives the best performance for FFTW. This number of threads will be used to determine the dependence of performance on the size of the problem when the number of parallel threads is fixed. For example, let us suppose that the best performance is achieved by FFTW using 15 threads. Then one can slightly modify the simple script size_test.sh provided with the library in the following way:
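A minimal sketch of such a size scan is given below. This is an illustration, not the shipped size_test.sh verbatim: the output file name, the 10 repeats, and the echo (kept so the sweep can be inspected before running; drop it to actually launch the benchmark) are assumptions.

```shell
#!/bin/sh
# Scan square arrays with side lengths s*1024 for s = 4..20,
# at the 15 threads chosen in the example above.
for s in $(seq 4 20); do
    side=$((s * 1024))
    echo ./my_fft_test "$side" "$side" 10 15 size_test.dat
done
```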
Here we scan over integer (natural) multiples of 1024 as the side length of the square array. The size changes from 4 × 1024 up to the maximum size determined earlier, at the fixed number of threads.

Appendix B. Tested systems
All systems were working under different flavors of GNU/Linux distributions, ranging from Ubuntu and Debian to Gentoo. The GNU Compiler Collection (gcc) was used as the C compiler in all cases. Here are short hardware descriptions for every system.

• System 1
CPU: 2 × Intel ® Xeon ® CPU E5-2680 v3 @ 2.50GHz (Haswell), 2x12 CPU cores, 2x24 HT cores
RAM: 128 GiB

• System 2a
CPU: 2 × Intel ® Xeon ® Silver 4110 CPU @ 2.10GHz (Skylake), 2x8 CPU cores, 2x16 HT cores
RAM: 64 GiB

• System 2b
CPU: 2 × Intel ® Xeon ® Gold 6126 CPU @ 2.60GHz (Skylake), 2x12 CPU cores, 2x24 HT cores
RAM: 64 GiB

• System 3
CPU: 2 × Intel ® Xeon ® Gold 6242 CPU @ 2.80GHz (Cascade Lake), 2x16 CPU cores, 2x32 HT cores
RAM: 384 GiB

• System 4
CPU: 2 × AMD ® Opteron ™ Processor 6276 @ 2.3GHz (Bulldozer), 2x8 CPU cores, 2x16 HT cores
RAM: 64 GiB

• System 5
CPU: 2 × Intel ® Xeon ® CPU E5-2670 0 @ 2.60GHz (Sandy Bridge EP), 2x8 CPU cores, 2x16 HT cores
RAM: 128 GiB

It should be noted that all systems except System 3 produced very consistent results in all tests. System 3 behaved very inconsistently, with significantly different results between benchmarking runs (each run took about 24 hours; the difference could be as high as 10%). This is the newest CPU among the tested systems and it was released after the recent cache-memory-associated vulnerabilities reported for Intel ® CPUs. Perhaps temporary and/or urgent patches to the cache memory system and the Linux kernel were the reason for such behavior. The author has no other explanation.
Appendix C. Configuration of FFTW
For most of the systems (Systems 1-3) which participated in the benchmarking, the following line was used during configuration of the FFTW library:

$ ./configure --enable-threads --disable-fortran --with-gnu-ld --enable-avx2 --enable-fma --enable-sse2

Of course, if you need the FORTRAN interface, please remove --disable-fortran from the line. As Systems 4-5 do not support AVX2 SIMD (single instruction multiple data) instructions, a slightly different configuration line was used:

$ ./configure --enable-threads --disable-fortran --with-gnu-ld --enable-avx --enable-fma --enable-sse2

It should be noted that although Systems 2a,b-3 support the AVX512 SIMD instruction set, it is not recommended to use the configuration option --enable-avx512, as it results in a drastic drop in performance. Comparison of different configurations can be performed using the bench program supplied with the FFTW library. The following command line was used for final testing:

./bench -onthreads=24 -opatient -s oc32768x32768

Initial testing was done on smaller 4096 × 4096 arrays. The author does not recommend supplying any additional CFLAGS during compilation; in his experience it resulted only in a performance drop. The default compilation flags are well tuned and (nearly?) optimal.
References

[1] M. Frigo, S. G. Johnson, The design and implementation of FFTW3, Proc. IEEE 93 (2) (2005) 216-231. URL http://fftw.org
[2] J. W. Cooley, J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comput. 19 (90) (1965) 297-301. doi:10.2307/2003354.
[3] A. I. Dyachenko, A. O. Korotkevich, V. E. Zakharov, Decay of the monochromatic capillary wave, JETP Lett. 77 (9) (2003) 477-481. arXiv:physics/0308100.
[4] A. I. Dyachenko, A. O. Korotkevich, V. E. Zakharov, Weak turbulence of gravity waves, JETP Lett. 77 (10) (2003) 546-550. arXiv:physics/0308101.
[5] A. I. Dyachenko, A. O. Korotkevich, V. E. Zakharov, Weak turbulent Kolmogorov spectrum for surface gravity waves, Phys. Rev. Lett. 92 (13) (2004) 134501. arXiv:physics/0308099.
[6] A. O. Korotkevich, Simultaneous numerical simulation of direct and inverse cascades in wave turbulence, Phys. Rev. Lett. 101 (7) (2008) 074504. arXiv:0805.0445.
[7] A. O. Korotkevich, A. I. Dyachenko, V. E. Zakharov, Numerical simulation of surface waves instability on a homogeneous grid, Physica D 321-322 (2016) 51-66. arXiv:1212.2225.
[8] A. O. Korotkevich, P. M. Lushnikov, Proof-of-concept implementation of the massively parallel algorithm for simulation of dispersion-managed WDM optical fiber systems, Optics Lett. 36 (10) (2011) 1851-1853. arXiv:1101.0086.
[9] A. O. Korotkevich, P. M. Lushnikov, H. A. Rose, Beyond the random phase approximation: Stimulated Brillouin backscatter for finite laser coherence times, Phys. of Plasmas 22 (2015) 012107. arXiv:1105.2094.
[10] S. A. Dyachenko, P. M. Lushnikov, A. O. Korotkevich, The complex singularity of a Stokes wave, JETP Letters 98 (11) (2013) 767-771. arXiv:1311.1882, doi:10.7868/S0370274X13230070.
[11] S. A. Dyachenko, P. M. Lushnikov, A. O. Korotkevich, Branch cuts of Stokes wave on deep water. Part I: Numerical solution and Padé approximation, Studies in Applied Mathematics 137 (4) (2016) 419-472. arXiv:1507.02784, doi:10.1111/sapm.12128.
[12] G. Falkovich, N. Vladimirova, Cascades in nonlocal turbulence, Phys. Rev. E 91 (2015) 041201. doi:10.1103/PhysRevE.91.041201. URL https://link.aps.org/doi/10.1103/PhysRevE.91.041201
[13] I. Kolokolov, V. Lebedev, P. M. Lushnikov, Statistical properties of a laser beam propagating in a turbulent medium, Phys. Rev. E 101 (2020) 042137. doi:10.1103/PhysRevE.101.042137. URL https://link.aps.org/doi/10.1103/PhysRevE.101.042137
[14] A. O. Korotkevich, Set of SMP-parallel subroutines for 2D out of place complex to complex DFTs (2019). URL https://github.com/w5kao/my_fft_lib
[15] M. Frigo, C. E. Leiserson, H. Prokop, S. Ramachandran, Cache-oblivious algorithms, Proc. IEEE Symp. on Foundations of Computer Science (FOCS) (1999) 285-297. URL http://supertech.csail.mit.edu/papers/FrigoLePr99.pdf
[16] A. O. Korotkevich, Library for initialization and synchronization of a pool of threads (2009). URL https://github.com/w5kao/my_thr_lib
[17] M. Frigo, S. G. Johnson, FFT benchmark methodology (2011).
[18] M. Frigo, S. G. Johnson, Introduction to FFTW v3 (2011).