Comparing Python, Go, and C++ on the N-Queens Problem

Abstract

Python is currently the dominant language in the field of Machine Learning but is often criticized for being slow to perform certain tasks. In this report, we use the well-known N-queens puzzle as a benchmark to show that, once compiled using the Numba compiler, it becomes competitive with C++ and Go in terms of execution speed while still allowing for very fast prototyping. This is true of both sequential and parallel programs. In most cases that arise in an academic environment, it therefore makes sense to develop in ordinary Python, identify computational bottlenecks, and use Numba to remove them.


Pascal Fua, Krzysztof Lis
Computer Vision Laboratory, EPFL
January 14, 2020

Figure 1: Three solutions of the puzzle on an 8 × 8 board and one on a larger board.

Python is currently the dominant language in the field of Machine Learning and gives easy access to powerful Deep Learning packages such as TensorFlow and PyTorch. However, it is known to be slow at some operations, such as loops, which are not always easy to vectorize away. In such situations, one might consider switching to another language, such as C++ or the more recent Go language, whose similarity to Python makes it a potentially attractive replacement. In this note, we argue that this is only rarely necessary because the Numba Python compiler [8] delivers performance close to that of C++ while preserving the compactness and ease of development that make Python such a powerful prototyping tool. Furthermore, it is easy to use. Once a set of Python functions has been identified as computationally intensive, one simply adds Numba decorators before their definitions to instruct Python to compile them while leaving the rest of the code largely unchanged.

To demonstrate this, we use the well-known N-queens puzzle [1] as a benchmark. It involves placing N chess queens on an N × N chessboard so that no two queens threaten each other. Fig. 1 depicts three solutions on a standard 8 × 8 board and one on a larger one. We will focus on finding the number of solutions as a function of N. This is easy for small values of N but quickly becomes computationally intractable for larger ones because the complexity of our algorithm is exponential with respect to N. There is no known formula for the exact number of solutions, and the largest value of N for which the answer has been computed to date is reported in [12].

All our benchmarking code is available online at https://github.com/cvlab-epfl/n-queens-benchmark. We welcome comments and suggestions for potential improvements.

Sequential Processing

We started from the recursive algorithm described in [15] to compute a single solution of the 8-queens problem and translated it to Python. Our implementation relies on the fact that for two queens at board locations $(i_1, j_1)$ and $(i_2, j_2)$ not to be in conflict with each other, they must not be on the same row, on the same column, or on the same diagonal. The first two conditions mean that $i_1 \neq i_2$ and $j_1 \neq j_2$. The third holds if $i_1 + j_1 \neq i_2 + j_2$ and $i_1 - j_1 \neq i_2 - j_2$. In other words, the two diagonals going through any $(i, j)$ location are completely characterized by $i + j$ and $i - j$.

To exploit this, the function allQueensRec of Tab. 1 allocates boolean arrays col, dg1, and dg2 to keep track of which columns and diagonals are still available to place a new queen on an n × n board. It then invokes the recursive function allQueensRecursive. At recursion level i, for each j, it adds a queen at location (i, j) if it is available, marks the column j and the diagonals i + j and i − j as unavailable for additional queens, and calls itself for row i + 1. The recursion ends in one of two ways. Either i reaches n, meaning that all rows have been successfully filled, or no more queens can be added. In the first case, a counter is incremented. In the second case, nothing happens. In both cases, the program backtracks, undoes its earlier marking, and continues until all solutions have been found. This process could be sped up by exploiting the symmetries of the N-queens problem. However, this is not required for benchmarking purposes and we chose not to do it to keep the code simple. We also chose to use the C++ and Go naming convention for functions and variables, that is, we use allQueensRec instead of the more typical all_queens_rec, so that we can use the same names in all versions of the code we present.

In Fig. 2(a), the red curve depicts the computation time on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system.
In the top part of the figure, we plot the wall-clock time as a function of the board size n using a standard scale. In the bottom part of the figure, we plot the same computation times using a log scale instead, which results in an almost straight curve. This serves as a sanity check because the computational complexity grows exponentially with n.

As allQueensRecursive performs loops, our vanilla Python implementation is inefficient. To remedy this, we used the Numba Python compiler [8] as shown in Tab. 2. The code is almost unchanged except for the addition of a couple of Numba decorators. Yet, as depicted by the green curve in Fig. 2(a), these minor modifications deliver a 33-fold increase in average computing speed of allQueensNmb over allQueensRec.

The Numba decorator njit() that appears in Tab. 2 is short for jit(nopython=True). It ensures that, if the code compiles without errors, it will not invoke the Python interpreter while running and will therefore be fast. Additionally, we could have used jit(nopython=True, nogil=True) to instruct Numba to release the Python Global Interpreter Lock [3] while executing the function, thus allowing several instances to run simultaneously on threads of a single process, something that standard Python code cannot do because of the aforementioned lock. This does not have any significant impact on performance in a sequential execution scenario.

To further assess how effective the Python/Numba combination is, we rewrote the code in Go and C++, as shown in Tabs. 3 and 4. The short variable declarations make the Go code very similar to the Python code while being statically typed. The C++ code is slightly more verbose and one must remember to deallocate what has been allocated because there is no garbage collector. As shown in Tab. 5, this can be remedied by using more sophisticated containers such as the standard vectors of C++ that are automatically deallocated at the end of the scope of their definition.
Note that we used vector<uint8_t> as the type for our boolean arrays instead of the apparently more natural vector<bool>. We did this because the latter packs the bits densely and, since memory can only be addressed down to whole bytes, has to perform binary arithmetic to extract the requested bit for each access. In other words, it reduces memory usage at the expense of increased computation. As we are interested in speed, it is therefore more effective to explicitly use bytes (uint8_t) for our purpose. Nevertheless, we have verified that even when using byte vectors, the implementation of Tab. 5 incurs a small, but noticeable, performance decrease with respect to that of Tab. 4 and we therefore chose to stick with the latter, even though it is less elegant. In short, unlike Go, C++ gives the programmer great freedom to carry out tasks in many different ways but it takes a lot of expertise to exploit it effectively and to avoid the many lurking pitfalls.

For example, unlike Python and Go, C++ does not automatically check that one does not write beyond the bounds of arrays. As a result, the buggy code of Tab. 6 runs but returns nonsensical values. We unintentionally made this mistake while translating the code from Python and, even though this is a short program, it took us a while to spot it. Of course, we could have used a tool such as valgrind, which would have detected the error, but this is far less convenient than being given a runtime warning.
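
To make the contrast with Python concrete, here is a minimal sketch, assuming plain Python lists stand in for the boolean arrays, of the same sizing mistake as in Tab. 6. The function name count_with_short_diagonals is hypothetical and the block is not part of the benchmark code. The point is only that an out-of-range index raises IndexError instead of silently corrupting memory. Negative indices wrap around in Python, so not every indexing mistake is caught this way.

```python
def count_with_short_diagonals(n):
    """Recursion of Tab. 1, but with diagonal arrays of size n instead of 2*n."""
    col = [True] * n
    dg1 = [True] * n  # too short: i + j can reach 2*n - 2
    dg2 = [True] * n  # too short: i - j + n can reach 2*n - 1

    def rec(i):
        if i == n:
            return 1
        nsol = 0
        for j in range(n):
            if col[j] and dg1[i + j] and dg2[i - j + n]:
                col[j] = dg1[i + j] = dg2[i - j + n] = False
                nsol += rec(i + 1)
                col[j] = dg1[i + j] = dg2[i - j + n] = True
        return nsol

    return rec(0)


# The interpreter reports the out-of-bounds access as soon as it happens.
try:
    count_with_short_diagonals(8)
    outcome = "no error"
except IndexError as err:
    outcome = f"IndexError: {err}"
print(outcome)
```
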
By default, Numba does not perform bounds checks, but they can be enabled using the decorator njit(boundscheck=True), which can be useful while debugging.

    import numpy as np

    def allQueensRec(n):
        # All columns and diagonals are initially available
        col = np.ones(n, dtype=bool)
        dg1 = np.ones(2*n, dtype=bool)
        dg2 = np.ones(2*n, dtype=bool)
        return allQueensRecursive(n, 0, col, dg1, dg2)

    def allQueensRecursive(n, i, col, dg1, dg2):
        if n == i:
            return 1
        nsol = 0
        for j in range(n):
            if col[j] and dg1[i+j] and dg2[i-j+n]:
                # Place a queen at (i, j) and recurse on the next row
                col[j] = False
                dg1[i+j] = False
                dg2[i-j+n] = False
                nsol += allQueensRecursive(n, i+1, col, dg1, dg2)
                # Backtrack
                col[j] = True
                dg1[i+j] = True
                dg2[i-j+n] = True
        return nsol

Table 1: Vanilla Python code.

The cyan and purple curves of Fig. 2(a) depict the corresponding runtimes. The Numba, Go, and C++ curves are almost superposed. Closer examination of the raw numbers given in Tab. 15 in the appendix shows that C++ wins. Go is somewhat slower and Numba slower still. Numba is slower mostly for low values of n, which suggests that the algorithm itself runs just as fast but that calling the Numba function from Python involves an overhead. While the observed differences are statistically significant based on the variances of the different runs, in our daily research practice, they are rarely large enough to justify giving up the development speed that Python provides and contending with potential bugs such as the one discussed above.

However, there are optimizations that require the low-level control that C++ or Go can provide. For example, in all versions of the code presented here, the memory for the col, dg1, and dg2 arrays is allocated dynamically on the heap. The array sizes are decided at runtime and this code could in theory handle the N-queens problem for any value of n. However, the computational cost is exponential and large values of n are wildly impractical. If we accept limiting ourselves to n <= 32, we can use fixed-size arrays allocated on the stack by declaring them as var col [32]bool in Go or std::array<bool, 32> in C++.
Unlike in the case discussed above, using bool instead of uint8_t has no adverse effect. We have checked that the C++ code modified in this manner delivers a larger gain over Numba than the earlier heap-allocated version. Potential explanations are that putting the arrays on the stack works better for the CPU cache or that the optimizer has an easier time reasoning about fixed-size stack arrays. In any event, this shows that C++ and Go, being closer to the hardware, may be useful to fine-tune code under some circumstances.

Go can therefore be considered a promising alternative to both Python and C++ because its run-time checks make bugs such as the one of Tab. 6 easy to detect and correct. Furthermore, it is almost as concise as Python and a little faster than Numba. However, some of its design features make it unwieldy in the prototyping role. For example, insisting that all variables and packages declared in a file be used makes sense for production code but is unhelpful when groping for a solution to a research problem: commenting out a particular line of code can require many modifications in the file, which are unnecessary until a final solution has been found. Similarly, not providing a full-fledged class system can be understood as a way to discourage the writing of hard-to-maintain spaghetti code, which is commendable in production mode but unnecessarily rigid in prototyping mode.

Figure 2: Run times in seconds as a function of the board size (number of queens). Linear scale at the top and log scale at the bottom. (a) Sequential: Python, Numba, Go, C++, Lisp, and Julia. (b) Parallel: Numba, Para, Pool, Go, and C++.
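
The symmetry speed-up mentioned in the description of the algorithm is not used in the benchmarks, but the idea can be sketched in a few lines. The following is a minimal illustration in plain Python, assuming lists instead of NumPy arrays. The function names count_from_first_column, count_all, and count_all_mirrored are hypothetical and do not appear in the benchmark code. The sketch relies on the fact that reflecting a solution left to right yields another solution, so only half of the first-row placements need to be explored, plus the middle column when n is odd.

```python
def count_from_first_column(n, j0):
    """Count solutions whose first-row queen sits in column j0."""
    col = [True] * n
    dg1 = [True] * (2 * n)
    dg2 = [True] * (2 * n)
    col[j0] = dg1[j0] = dg2[n - j0] = False

    def rec(i):
        if i == n:
            return 1
        nsol = 0
        for j in range(n):
            if col[j] and dg1[i + j] and dg2[i - j + n]:
                col[j] = dg1[i + j] = dg2[i - j + n] = False
                nsol += rec(i + 1)
                col[j] = dg1[i + j] = dg2[i - j + n] = True
        return nsol

    return rec(1)


def count_all(n):
    """Reference count: try every first-row column."""
    return sum(count_from_first_column(n, j) for j in range(n))


def count_all_mirrored(n):
    """Explore only the left half of the first row and double the result."""
    total = 2 * sum(count_from_first_column(n, j) for j in range(n // 2))
    if n % 2 == 1:
        # The middle column is its own mirror image and is counted once.
        total += count_from_first_column(n, n // 2)
    return total


for n in range(1, 9):
    print(n, count_all(n), count_all_mirrored(n))
```

Both functions should print the same count for every n, with the mirrored version performing roughly half of the first-row explorations.
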

Nearly every modern computer, including the one we used, has a multicore CPU and we can speed things up by running independent parts of the computation simultaneously on separate cores. In Go and C++, this can be done using multiple threads. Standard Python cannot do this due to the Global Interpreter Lock (GIL) [3] that we have already encountered in the previous section. Fortunately, there are several workarounds and we explored two of them:

1. Using Numba's automatic parallelization [10]. Numba implements the ability to run loops in parallel as in Open Multi-Processing (OpenMP). The loop's body is scheduled in separate threads and the system automatically takes care of data privatization and reduction.

2. Using a pool of processes. The pool distributes the computation into separate processes and tasks are sent to the available processors using FIFO scheduling. Each process has its own interpreter and GIL, so they do not interfere. The price to pay is that objects need to be serialized and sent to the processes.

To test these two approaches, we parallelized allQueensRec in a simple way. As shown in Tab. 7, we defined a new function allQueensCol that puts a queen in column j of the first row and then invokes the function allQueensNumba defined in Tab. 2 starting at the second row instead of the first, as in allQueensRec. Summing the results for all possible values of j yields the same result by performing n independent computations. In Tab. 8, we integrate this code into two functions that spread the tasks across separate cores: allQueensPara uses the first method described above while allQueensPool uses the second. We will refer to them as para and pool respectively. Numba can take parallelization even further and produce functions that exploit the GPU. However, we did not explore this aspect in this study because our problem is not conducive to GPU processing.

In Fig. 2(b), we compare runtimes of the sequential Numba-compiled code of the previous section with our two parallelized versions. As before, the sequential code is depicted by the green curve while the two parallel versions are depicted by the red and blue curves, labeled para and pool respectively.
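
The decomposition into n independent first-row tasks can be sketched without Numba. The block below is an illustration under stated assumptions rather than the code of Tabs. 7 and 8: it uses plain Python lists and, to stay self-contained, the standard-library ThreadPoolExecutor in place of a process pool. Because of the GIL, this thread-based variant shows how the work is split and recombined but should not be expected to run faster than a sequential loop. The names count_col and count_split are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def count_col(n, j0):
    """Count solutions with the first-row queen fixed in column j0."""
    # Each task owns its arrays, so tasks do not share mutable state.
    col = [True] * n
    dg1 = [True] * (2 * n)
    dg2 = [True] * (2 * n)
    col[j0] = dg1[j0] = dg2[n - j0] = False

    def rec(i):
        if i == n:
            return 1
        nsol = 0
        for j in range(n):
            if col[j] and dg1[i + j] and dg2[i - j + n]:
                col[j] = dg1[i + j] = dg2[i - j + n] = False
                nsol += rec(i + 1)
                col[j] = dg1[i + j] = dg2[i - j + n] = True
        return nsol

    return rec(1)


def count_split(n, workers=4):
    """Dispatch one task per first-row column and sum the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(partial(count_col, n), range(n)))
    return sum(partial_counts)


print(count_split(8))
```

Swapping the thread pool for a process pool keeps the same structure, at the price of serializing the arguments and results of each task.
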
para clearly delivers a significant improvement. However, for smaller values of n, we noted that para does not always fully use the 8 cores of our machine, which impacts its performance. For values of n up to 13, the overhead involved in spawning new processes dominates the computational cost of pool and makes it uncompetitive. However, for n > 13, this overhead becomes negligible with respect to the rest of the computation and pool starts dominating, albeit only by a small margin for large values of n, as can be seen in Tab. 15.

    import numpy as np
    from numba import njit

    @njit()
    def allQueensNmb(n, i=0, col=None, dg1=None, dg2=None):
        col = np.ones(n, dtype=np.bool_)
        dg1 = np.ones(2*n, dtype=np.bool_)
        dg2 = np.ones(2*n, dtype=np.bool_)
        return allQueensNumba(n, 0, col, dg1, dg2)

    @njit()
    def allQueensNumba(n, i, col, dg1, dg2):
        if n == i:
            return 1
        nsol = 0
        for j in range(n):
            if col[j] and dg1[i+j] and dg2[i-j+n]:
                col[j] = False
                dg1[i+j] = False
                dg2[i-j+n] = False
                nsol += allQueensNumba(n, i+1, col, dg1, dg2)
                col[j] = True
                dg1[i+j] = True
                dg2[i-j+n] = True
        return nsol

Table 2: The Python code of Tab. 1 slightly modified to force Numba compilation.

To again compare against Go and C++, we rewrote the code in these two languages using their built-in multi-threading capabilities, as shown in Tabs. 9 and 10. Note that we used the std::async template function to make the C++ version compact. The corresponding performance measurements are depicted by the cyan and purple curves of Fig. 2(b). As before, for small values of n, para and pool are uncompetitive because the initial overhead is too large. However, for larger values of n they catch up and eventually do better than Go and almost as well as C++. In short, there are corner cases in which it might pay to switch from Python to C++ or Go but it is not clear how pervasive they are in our research practice.

In the two previous sections, we have argued that Numba is a powerful tool to painlessly compile potentially slow Python code so that it runs almost as fast as Go and C++.
However, it also has limitations: only a subset of Python [11] and NumPy [9] features are available inside compiled functions. Numba has a compilation mode that generates code able to handle all values as Python objects and uses the Python C API to perform all operations on such objects. Unfortunately, relying on this mode results in almost no performance gain over non-compiled code. This is why we used the njit() decorator in all our examples. It yields much faster code but requires that the native types of all values in the function can be inferred, which is not necessarily true in standard Python. Otherwise, the compilation fails.

    func allQueensRec(n int) int {
        // Allocate arrays
        col := make([]bool, n, n)
        dg1 := make([]bool, 2*n, 2*n)
        dg2 := make([]bool, 2*n, 2*n)
        // All columns and diagonals are initially available
        for i := 0; i < n; i++ {
            col[i] = true
        }
        for i := 0; i < 2*n; i++ {
            dg1[i] = true
            dg2[i] = true
        }
        // Perform the recursive computation and return the results
        return allQueensRecursive(n, 0, col, dg1, dg2)
    }

    // The arrays are received as slices, matching the dynamically
    // allocated arrays created in allQueensRec
    func allQueensRecursive(n int, i int, col []bool, dg1 []bool, dg2 []bool) int {
        if n == i {
            return 1
        }
        nsol := 0
        for j := 0; j < n; j++ {
            if col[j] && dg1[i+j] && dg2[i-j+n] {
                col[j] = false
                dg1[i+j] = false
                dg2[i-j+n] = false
                nsol += allQueensRecursive(n, i+1, col, dg1, dg2)
                col[j] = true
                dg1[i+j] = true
                dg2[i-j+n] = true
            }
        }
        return nsol
    }

Table 3: Go version of the Python code of Tab. 1.

In practice, this imposes an additional workload on the programmer, who has to figure out what parts of the code are computationally expensive, encapsulate them in separate functions, and make sure that these functions can be compiled using the no-python mode that njit() enforces. This is probably why there are ongoing efforts to optimize whole Python programs, such as PyPy [13]. Unfortunately, the results are not always compatible with libraries utilizing the C API, such as those routinely used in the field of scientific computing.
As discussed in the appendix, Julia is a potential alternative to Python/Numba that supports both high-performance scientific computing and fast prototyping, is compiled, and could eventually address this problem.
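
The type-inference requirement can be made concrete with a small, Numba-free sketch. The two functions below are hypothetical examples and not part of the benchmark code. The first returns an integer or a string depending on its argument, which is legal Python but gives the return value no single native type, so we would expect no-python compilation of such a function to fail. The second always returns an integer and is the kind of code that njit() can type. Numba is deliberately not imported here, so the statement about compilation is an expectation and not something the block verifies.

```python
def type_unstable(flag):
    # The return type depends on the value of the argument: int or str.
    if flag:
        return 1
    return "one"


def type_stable(flag):
    # Always returns an int, so a single native return type can be inferred.
    if flag:
        return 1
    return 0


for f in (type_unstable, type_stable):
    kinds = {type(f(True)).__name__, type(f(False)).__name__}
    print(f.__name__, sorted(kinds))
```

In practice, making a bottleneck function compilable often amounts to this kind of rewrite: every variable and return value keeps one type on all code paths.
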

As Computer Vision and Machine Learning researchers, we primarily need a language that allows us to test and refine ideas quickly while giving us access to as many mathematical, image processing, and machine learning libraries as possible. The latter spares us the need to reinvent the wheel every time we want to try something new. Maintainability and the ability to work in large teams are secondary considerations, as our code often stops evolving once the PhD student or post-doctoral researcher who wrote it leaves our lab. Before that happens, we typically make it publicly available to demonstrate that the ideas we published in conferences and journals truly work and, in the end, that is often its main function.

Python fits that bill perfectly at the cost of being slow when performing operations such as loops. Fortunately, as we showed in this report, this shortcoming can be largely overcome by using the Numba compiler, which delivers performance comparable to that of C++, which itself tends to be faster than Go. This suggests that a perfectly valid workflow is to first write and debug a program in ordinary Python, identify the computational bottlenecks, and use Numba to eliminate them. This will work most of the time. In the rare cases where it does not, we can rewrite the relevant section of the code in C++ and call it from Python, which can be achieved using Cython [2] or pybind11 [4].
Interestingly, this approach harkens back to the standard way one used to work in the much older Common Lisp language, as discussed in the appendix.

    #include <cstring>  // memset

    int allQueensRecursive(int n, int i, bool *col, bool *dg1, bool *dg2) {
        if (n == i) {
            return 1;
        }
        int nsol = 0;
        for (int j = 0; j < n; j++) {
            if (col[j] && dg1[i+j] && dg2[i-j+n]) {
                col[j] = false;
                dg1[i+j] = false;
                dg2[i-j+n] = false;
                nsol += allQueensRecursive(n, i+1, col, dg1, dg2);
                col[j] = true;
                dg1[i+j] = true;
                dg2[i-j+n] = true;
            }
        }
        return nsol;
    }

    int allQueensRec(int n) {
        // Allocate dynamic memory on the heap
        bool *col = new bool[n];
        bool *dg1 = new bool[2*n];
        bool *dg2 = new bool[2*n];
        // All columns and diagonals are initially available
        memset((void *)col, 1, n*sizeof(bool));
        memset((void *)dg1, 1, 2*n*sizeof(bool));
        memset((void *)dg2, 1, 2*n*sizeof(bool));
        // Perform the recursive computation
        int nsol = allQueensRecursive(n, 0, col, dg1, dg2);
        // No garbage collector, must deallocate to prevent memory leaks
        delete[] col;
        delete[] dg1;
        delete[] dg2;
        return nsol;
    }

Table 4: C++ version of the Python code of Tab. 1. To initialize the arrays, we could have used loops as in the Go code of Tab. 3. Instead we used the lower-level function memset, which performs the same task without explicit loops and can therefore be expected to be faster.

    #include <cstdint>
    #include <vector>
    using namespace std;

    typedef vector<uint8_t> BoolArray;  // Use uint8_t instead of bool to boost efficiency

    int allQueensRecursive(int n, int i, BoolArray& col, BoolArray& dg1, BoolArray& dg2) {
        ........
    }

    int allQueensRec(int n) {
        BoolArray col(n, true);
        BoolArray dg1(2*n, true);
        BoolArray dg2(2*n, true);
        int nsol = allQueensRecursive(n, 0, col, dg1, dg2);
        return nsol;
    }

Table 5: Using C++ vectors makes it unnecessary to explicitly free them. The signature of allQueensRecursive has been modified slightly so that the arrays are passed by reference instead of by value and are therefore not copied.
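
The trade-off between densely packed bits and one byte per flag, which motivates the uint8_t choice in Tab. 5, can be illustrated in Python. This is only a sketch of the two storage layouts, not of how any particular C++ standard library implements vector<bool>. The helper names set_bit and get_bit are hypothetical. With one byte per flag, an access is a plain index; with packed bits, every access needs a shift and a mask.

```python
def set_bit(packed, k, value):
    """Set flag k in a bytearray that stores 8 flags per byte."""
    byte, offset = k >> 3, k & 7
    if value:
        packed[byte] |= 1 << offset
    else:
        packed[byte] &= ~(1 << offset) & 0xFF


def get_bit(packed, k):
    """Read flag k: locate the byte, shift, and mask."""
    return ((packed[k >> 3] >> (k & 7)) & 1) == 1


n_flags = 20
byte_flags = bytearray([1] * n_flags)                    # one byte per flag
packed_flags = bytearray([0xFF] * ((n_flags + 7) // 8))  # 8 flags per byte

# Clear flag 13 in both layouts.
byte_flags[13] = 0
set_bit(packed_flags, 13, False)

print(len(byte_flags), len(packed_flags))
print(bool(byte_flags[13]), get_bit(packed_flags, 13))
print(bool(byte_flags[12]), get_bit(packed_flags, 12))
```

The packed layout uses about one eighth of the memory, while the byte layout avoids the extra arithmetic on each access.
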
    int allQueensRec(int n) {
        // dg1 and dg2 are of size n instead of 2n
        bool *col = new bool[n];
        bool *dg1 = new bool[n];
        bool *dg2 = new bool[n];
        memset((void *)col, 1, n*sizeof(bool));
        memset((void *)dg1, 1, n*sizeof(bool));
        memset((void *)dg2, 1, n*sizeof(bool));
        ........
    }

Table 6: Buggy version of the C++ code of Tab. 4. It runs but returns nonsensical results.

    @njit()
    def allQueensCol(n, j):
        col = np.ones(n, dtype=np.bool_)
        dg1 = np.ones(2*n, dtype=np.bool_)
        dg2 = np.ones(2*n, dtype=np.bool_)
        # Put a queen in column j of the first row
        col[j] = False
        dg1[j] = False
        dg2[n-j] = False
        # Continue from the second row
        return allQueensNumba(n, 1, col, dg1, dg2)

    if __name__ == "__main__":
        nsol = 0
        for j in range(8):
            nsol += allQueensCol(8, j)

Table 7: The Python code of Tab. 2 rewritten to perform n independent computations.

    from functools import partial
    from numba import njit, prange

    @njit(parallel=True)
    def allQueensPara(n):
        nsol = 0
        for j in prange(n):
            nsol += allQueensCol(n, j)
        return nsol

    def allQueensPool(n, np=None):
        # Pool_proc is the process-pool class; its import is not shown in the table
        with Pool_proc() as pool:
            nsols = pool.map(partial(poolWorker, n), range(n))
        return sum(nsols)

    def poolWorker(n, j):
        return allQueensCol(n, j)

Table 8: Two different ways to invoke the function allQueensCol of Tab. 7 so that the computation is split into n tasks potentially running on different cores.
Note how compact this code is.

    func allQueensPara(nd int) int {
        // Create the structure that will be used to synchronize
        var wg sync.WaitGroup
        wg.Add(nd)
        // Explicitly allow go to run on 8 cores
        runtime.GOMAXPROCS(8)
        sols := make([]int, nd)
        f := func(wg *sync.WaitGroup, n int, j int) {
            sols[j] = allQueensCol(n, j) // Result for a queen in column j of first row
            wg.Done()                    // Flag the thread as complete
        }
        for j := 0; j < nd; j++ {
            go f(&wg, nd, j) // Launch a new thread for each computation
        }
        wg.Wait() // Wait for all threads to be completed
        nsol := sols[0] // Sum the individual results
        for j := 1; j < nd; j++ {
            nsol += sols[j]
        }
        return nsol
    }

    func allQueensCol(n int, j int) int {
        col := make([]bool, n, n)
        dg1 := make([]bool, 2*n, 2*n)
        dg2 := make([]bool, 2*n, 2*n)
        for i := 0; i < n; i++ {
            col[i] = true
        }
        for i := 0; i < 2*n; i++ {
            dg1[i] = true
            dg2[i] = true
        }
        col[j] = false
        dg1[j] = false
        dg2[n-j] = false
        return allQueensRecursive(n, 1, col, dg1, dg2)
    }

Table 9: Go version of the parallel Python code of Tabs. 7 and 8.

    #include <cstring>  // memset
    #include <future>   // std::async, std::future
    #include <vector>

    int allQueensCol(int n, int j) {
        bool *col = new bool[n];
        bool *dg1 = new bool[2*n];
        bool *dg2 = new bool[2*n];
        memset((void *)col, 1, n*sizeof(bool));
        memset((void *)dg1, 1, 2*n*sizeof(bool));
        memset((void *)dg2, 1, 2*n*sizeof(bool));
        col[j] = false;
        dg1[j] = false;
        dg2[n-j] = false;
        int ncol = allQueensRecursive(n, 1, col, dg1, dg2);
        // Memory obtained with new[] must be released with delete[]
        delete[] col;
        delete[] dg1;
        delete[] dg2;
        return ncol;
    }

    int allQueensPara(int nd) {
        std::vector<std::future<int>> running_tasks;
        // Start one task per column
        for (int col = 0; col < nd; col++) {
            running_tasks.push_back(
                std::async(std::launch::async, [=]() { return allQueensCol(nd, col); })
            );
        }
        // Wait for results
        int nsol_sum = 0;
        for (auto& f : running_tasks) {
            nsol_sum += f.get();
        }
        return nsol_sum;
    }

Table 10: C++ version of the parallel Python code of Tabs. 7 and 8. The async template function makes the code very compact.

Appendix

A Other Languages

In his book [15], N. Wirth proposed the N-queens algorithm implemented in Pascal and shown in Tab. 11. It is specific to the case n = 8 and is designed to stop when the first solution is found. It then returns the corresponding configuration. As shown in Tab. 12, this can be done even more concisely in Prolog, a logic programming language of the same era as Pascal. What makes Prolog particularly interesting is that, of all the languages discussed in this report, it is the only one that forces a truly different approach to programming. It is well suited to performing the kind of systematic exploration and backtracking that solving the N-queens problem requires and is still used to solve specific tasks that involve rule-based logical queries, such as searching databases.

As not all implementations of the Pascal standard support dynamic arrays, extending the program of Tab. 11, which is specific to the n = 8 case, to the general case would require manually allocating memory using pointers, much as in C. This would not be necessary in the even older Common Lisp language, as shown in Tab. 13. Although among the most ancient languages still in regular use, Lisp offers many of the same amenities as Python, that is, dynamically allocated arrays, sophisticated loop structures, and garbage collection among others. It can be used either in interpreted or compiled form, much like Python without and with Numba. When invoking the compiler, the optional type declarations help it generate faster code. A standard approach to developing in Lisp is therefore to prototype quickly without the declarations and then add them as needed to speed up the code. Interestingly, Python is now moving in that direction with its support for type hints but does not yet enforce that variable and argument values match their declared types at runtime. They are only intended to be used by third-party tools such as type checkers, IDEs, and linters [14].

As can be seen in Fig. 2(a), because it is compiled, Lisp does not perform so badly compared to the more recent languages we discussed in this report. The even newer Julia [6] language can be understood as being related to it in that it supports rapid prototyping by allowing interactive execution while being compiled. Type declarations are available but optional and the code is very Python-like, as can be seen in Tab. 14. Like Numba, Julia uses LLVM [7] to perform low-level optimizations and produce efficient native binary code. Unlike in C++ or Go, there is no explicit compilation step and yet it delivers performance that is almost on par with that of the other compiled languages we discussed, as can be seen in Fig. 2(a) and Tab. 15.

This makes Julia a potentially attractive alternative to Python/Numba. Unfortunately, there are some significant obstacles to its adoption. First, it is still a new language. Some important features remain experimental and the number of third-party libraries is limited, whereas Python gives access to a wealth of powerful libraries, such as the deep learning ones that have become absolutely central to our research activities. Furthermore, it features some design choices that differentiate it from currently popular languages [5], such as 1-indexed arrays and multiple dispatch instead of classes. Whether or not these choices are wise, they make it harder to switch from an established language like Python to Julia.

B Raw Data

The performance numbers we used to produce the plots of Fig. 2 are given in Tab. 15. For both the sequential and parallel versions of the code and for each value of n, the time for the fastest implementation appears in red and the one for the second best in blue. These numbers were obtained on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system. For all versions of the code, we ran 20 trials for the smallest values of n, 10 for intermediate values, and 3 for n = 18, and computed the mean and variance in each case. We have rerun all these benchmarks on an Intel Xeon X5690 CPU running Ubuntu 18.04 and the overall ranking of the implementations was unchanged.

References

[1] J. Bell and B. Stevens. A survey of known results and research areas for n-queens. Discrete Mathematics, 309(1):1–31, 2009.
[2] Cython: C-extensions for Python. https://cython.org/
[3] Global interpreter lock. https://en.wikipedia.org/wiki/Global_interpreter_lock
[4] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 – seamless operability between C++11 and Python, 2017. https://github.com/pybind/pybind11
[5] Julia: Noteworthy Differences from other Languages. https://docs.julialang.org/en/v1/manual/noteworthy-differences/
[6] The Julia Programming Language. https://julialang.org/
[7] The LLVM Compiler Infrastructure Project. https://llvm.org/
[8] Numba: A High Performance Python Compiler. https://numba.pydata.org/
[9] Numba: Supported NumPy Features. https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
[10] Numba: Automatic parallelization with @jit. https://numba.pydata.org/numba-doc/dev/user/parallel.html
[11] Numba: Supported Python Features. https://numba.pydata.org/numba-doc/dev/reference/pysupported.html
[12] T. Preußer, B. Nägel, and R. Spallek. Putting queens in carry chains. 2009.
[13] PyPy: a fast, compliant alternative implementation of the Python language. https://pypy.org/
[14] Support for Type Hints in Python. https://docs.python.org/3/library/typing.html
[15] Niklaus Wirth. Algorithms + Data Structures = Programs. Prentice-Hall, 1976.

    program eightqueen1(output);
    var i : integer; q : boolean;
        a : array[ 1 ..  8] of boolean;
        b : array[ 2 .. 16] of boolean;
        c : array[-7 ..  7] of boolean;
        x : array[ 1 ..  8] of integer;

    procedure try(i : integer; var q : boolean);
      var j : integer;
      begin
        j := 0;
        repeat
          j := j + 1; q := false;
          if a[j] and b[i + j] and c[i - j] then
            begin
              x[i] := j;
              a[j] := false; b[i + j] := false; c[i - j] := false;
              if i < 8 then
                begin
                  try(i + 1, q);
                  if not q then
                    begin
                      a[j] := true; b[i + j] := true; c[i - j] := true;
                    end
                end
              else q := true
            end
        until q or (j = 8);
      end;

    begin
      for i := 1 to 8 do a[i] := true;
      for i := 2 to 16 do b[i] := true;
      for i := -7 to 7 do c[i] := true;
      try(1, q);
      if q then
        for i := 1 to 8 do write(x[i]:4);
      writeln
    end.

Table 11: Pascal program by Niklaus Wirth in 1976. It finds one solution to the eight queens problem.

    /* Use clpfd package to loop through all configurations
       until a feasible one is found */
    :- use_module(library(clpfd)).

    n_queens(N, Qs) :-
        length(Qs, N),
        Qs ins 1..N,
        safe_queens(Qs).

    /* Predicate is true if the configuration is feasible */
    safe_queens([]).
    safe_queens([Q|Qs]) :- safe_queens(Qs, Q, 1), safe_queens(Qs).

    safe_queens([], _, _).
    safe_queens([Q|Qs], Q0, D0) :-
        Q0 #\= Q,
        abs(Q0 - Q) #\= D0,
        D1 #= D0 + 1,
        safe_queens(Qs, Q0, D1).

    /* Example */
    ?- n_queens(8, Qs), labeling([ff], Qs).
    Qs = [1, 5, 8, 6, 3, 7, 2, 4] ;
    Qs = [1, 6, 8, 3, 7, 4, 2, 5] .

Table 12: N-queens in Prolog.

    (defun allQueensRec (n)
      (declare (type fixnum n))
      (let ((col (make-array n :initial-element t :element-type 'boolean))
            (dg1 (make-array (* 2 n) :initial-element t :element-type 'boolean))
            (dg2 (make-array (* 2 n) :initial-element t :element-type 'boolean)))
        (declare (type (array boolean 1) col dg1 dg2))
        (allQueensRecursive n 0 col dg1 dg2)))

    (defun allQueensRecursive (n i col dg1 dg2)
      ;; Optional declarations. Some compilers exploit them to speed up the code
      (declare (type (array boolean 1) col dg1 dg2))
      (declare (type fixnum n i))
      (if (= i n)
          1
          (let ((nsol 0))
            (declare (type fixnum nsol))
            (loop for j from 0 below n
                  when (and (aref col j)
                            (aref dg1 (+ i j))
                            (aref dg2 (- (+ i n) j)))
                    do (setf (aref col j) nil
                             (aref dg1 (+ i j)) nil
                             (aref dg2 (- (+ i n) j)) nil)
                       (incf nsol (allQueensRecursive n (+ i 1) col dg1 dg2))
                       (setf (aref col j) t
                             (aref dg1 (+ i j)) t
                             (aref dg2 (- (+ i n) j)) t))
            nsol)))

Table 13: Common Lisp version of the Python code of Tab. 1.

    function allQueensRecursive(n, i, col, dg1, dg2)
        if n == i
            return 1
        end
        nsol = 0
        for j = 0:n-1
            # Julia arrays are 1-indexed, hence the +1 offsets
            if (col[j+1] && dg1[i+j+1] && dg2[i-j+n+1])
                col[j+1] = false
                dg1[i+j+1] = false
                dg2[i-j+n+1] = false
                nsol += allQueensRecursive(n, i+1, col, dg1, dg2)
                col[j+1] = true
                dg1[i+j+1] = true
                dg2[i-j+n+1] = true
            end
        end
        return nsol
    end

    function allQueensRec(n)
        col = ones(Bool, n)
        dg1 = ones(Bool, 2*n)
        dg2 = ones(Bool, 2*n)
        return allQueensRecursive(n, 0, col, dg1, dg2)
    end

Table 14: Julia version of the code of Tab. 1.
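
As a small illustration of the remark in Appendix A that Python type hints are not enforced at runtime, consider the sketch below. The function name twice is hypothetical. The annotations declare integers, yet calling the function with a string succeeds, because the interpreter records the hints without checking them. Catching the mismatch is left to external tools such as type checkers.

```python
def twice(n: int) -> int:
    """Annotated as int -> int, but the annotations are not checked at runtime."""
    return n * 2


print(twice(21))              # an int, as the hints suggest
print(twice("ab"))            # a str is accepted as well; no error is raised
print(twice.__annotations__)  # the hints are stored as metadata only
```

This differs from the Lisp declarations discussed above, which a compiler may exploit to generate faster code.
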

Table 15: Benchmarking results in seconds per trial for the sequential and parallel implementations.