Comparing Python, Go, and C++ on the N-Queens Problem
Pascal Fua, Krzysztof Lis
Computer Vision Laboratory, EPFL
January 14, 2020
Abstract
Python is currently the dominant language in the field of Machine Learning but is often criticized for being slow to perform certain tasks. In this report, we use the well-known N-queens puzzle [1] as a benchmark to show that, once compiled using the Numba compiler, it becomes competitive with C++ and Go in terms of execution speed while still allowing for very fast prototyping. This is true of both sequential and parallel programs. In most cases that arise in an academic environment, it therefore makes sense to develop in ordinary Python, identify computational bottlenecks, and use Numba to remove them.

Figure 1: Three solutions of the puzzle on an 8 × 8 board and one on a larger board.

Python is currently the dominant language in the field of Machine Learning and gives easy access to powerful Deep Learning packages such as TensorFlow and PyTorch. However, it is known to be slow to perform some operations such as loops, which are not always easy to vectorize away. In such situations, one might consider switching to another language, such as C++ or the more recent Go language, whose similarity to Python makes it a potentially attractive replacement. In this note, we will argue that this may be necessary only rarely because the Numba Python compiler [8] delivers performance close to that of C++ while preserving the compactness and ease of development that make Python such a powerful prototyping tool. Furthermore, it is easy to use. Once a set of Python functions has been identified as computationally intensive, one simply adds Numba decorators before their definitions to instruct Python to compile them while leaving the rest of the code largely unchanged.

To demonstrate this, we use the well-known N-queens puzzle [1] as a benchmark. It involves placing N chess queens on an N × N chessboard so that no two queens threaten each other. Fig. 1 depicts three solutions on a standard 8 × 8 board and one on a larger one. We will focus on finding the number of solutions as a function of N.
This is easy for small values of N but quickly becomes computationally intractable for larger ones because the complexity of our algorithm is exponential with respect to N. There is no known formula for the exact number of solutions, and the count has so far been computed only up to a limited value of N [12].

All our benchmarking code is available online (https://github.com/cvlab-epfl/n-queens-benchmark). We welcome comments and suggestions for potential improvements.

Sequential Processing
We started from the recursive algorithm described in [15] to compute one single solution of the 8-queens problem and translated it to Python. Our implementation relies on the fact that for two queens at board locations $(i_1, j_1)$ and $(i_2, j_2)$ not to be in conflict with each other, they must not be on the same row, on the same column, or on the same diagonal. The first two conditions mean that $i_1 \neq i_2$ and $j_1 \neq j_2$. The third holds if $i_1 + j_1 \neq i_2 + j_2$ and $i_1 - j_1 \neq i_2 - j_2$. In other words, the two diagonals going through any $(i, j)$ location are completely characterized by $i + j$ and $i - j$.

To exploit this, the function allQueensRec of Tab. 1 allocates boolean arrays col, dg1, and dg2 to keep track of which columns and diagonals are still available to place a new queen on an n × n board. It then invokes the recursive function allQueensRecursive. At recursion level i, for each j, it adds a queen at location (i, j) if it is available, marks the column j and the diagonals i + j and i − j as unavailable for additional queens, and calls itself for row i + 1. The recursion ends in one of two ways. Either i reaches n, meaning that all rows have been successfully filled, or no more queens can be added. In the first case, a counter is incremented. In the second case, nothing happens. In both cases, the program backtracks, undoes its earlier marking, and continues until all solutions have been found. This process could be sped up by exploiting the symmetries of the N-queens problem. However, this is not required for benchmarking purposes and we chose not to do it to keep the code simple. We also chose to use the C++ and Go naming convention for functions and variables, that is, we use allQueensRec instead of the more typical all_queens_rec, so that we can use the same names in all versions of the code we present.

In Fig. 2(a), the red curve depicts the computation time on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system.
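For reference, the recursion described in this section can be sketched in plain Python as follows. This is an illustrative sketch written for this passage rather than the authors' exact listing: it assumes ordinary Python lists for col, dg1, and dg2, and it shifts the second diagonal index by n so that it never becomes negative.

```python
def allQueensRec(n):
    """Count all solutions of the n-queens puzzle on an n x n board."""
    # True means "still available". Lists are an assumption of this sketch.
    col = [True] * n
    dg1 = [True] * (2 * n)  # indexed by i + j
    dg2 = [True] * (2 * n)  # indexed by i - j + n, which is always >= 1
    return allQueensRecursive(n, 0, col, dg1, dg2)


def allQueensRecursive(n, i, col, dg1, dg2):
    """Count the ways to fill rows i..n-1 given the squares already taken."""
    if i == n:
        return 1  # every row holds a queen: one more solution
    nsol = 0
    for j in range(n):
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            # Place a queen at (i, j) and mark its column and diagonals.
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += allQueensRecursive(n, i + 1, col, dg1, dg2)
            # Backtrack: undo the marking before trying the next column.
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


if __name__ == "__main__":
    for size in range(1, 9):
        print(size, allQueensRec(size))
```

Because the three lists are shared by all recursion levels, each level has to restore exactly what it marked, which is the backtracking step described in the text.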
In the top part of the figure, we plot the wall-clock time as a function of the board size n using a standard scale. In the bottom part of the figure, we plot the same computation times using a log scale instead, which results in an almost straight curve. This serves as a sanity check because the computational complexity grows exponentially with n.

As allQueensRecursive performs loops, our vanilla Python implementation is inefficient. To remedy this, we used the Numba Python compiler [8] as shown in Tab. 2. The code is almost unchanged except for the addition of a couple of Numba decorators. Yet, as depicted by the green curve in Fig. 2(a), these minor modifications deliver a 33-fold increase in average computing speed of allQueensNmb over allQueensRec.

The Numba decorator njit() that appears in Tab. 2 is short for jit(nopython=True). It ensures that if the code compiles without errors, it will not invoke the Python interpreter while running and will therefore be fast. Additionally, we could have used jit(nopython=True, nogil=True) to instruct Numba to release the Python Global Interpreter Lock [3] while executing the function, thus allowing several instances to run simultaneously on threads of a single process, something that standard Python code cannot do because of the aforementioned lock. This does not have any significant impact on performance in a sequential execution scenario.

To further assess how effective the Python/Numba combination is, we rewrote the code in Go and C++, as shown in Tabs. 3 and 4. The short variable declarations make the Go code very similar to the Python code while being statically typed. The C++ code is slightly more verbose and one must remember to deallocate what has been allocated because there is no garbage collector. As shown in Fig. 5, this can be remedied by using more sophisticated containers such as the standard vectors of C++ that are automatically deallocated at the end of the scope of their definition. Note that we used vector
Nearly every modern computer, including the one we used, has a multicore CPU and we can speed things up by running independent parts of the computation simultaneously on separate cores. In Go and C++, this can be done using multiple threads. Standard Python cannot do this due to the Global Interpreter Lock (GIL) [3] that we have already encountered in the previous section. Fortunately, there are several workarounds and we explored two of them:

1. Using Numba's automatic parallelization [10]. Numba implements the ability to run loops in parallel as in Open Multi-Processing (OpenMP). The loop's body is scheduled in separate threads and the system automatically takes care of data privatization and reduction.

2. Using a pool of processes. The pool distributes the computation into separate processes and tasks are sent to the available processors using FIFO scheduling. Each process has its own interpreter and GIL, so they do not interfere. The price to pay is that objects need to be serialized and sent to the processes.

To test these two approaches, we parallelized allQueensRec in a simple way. As shown in Tab. 7, we defined a new function allQueensCol that puts a queen in column j of the first row and then invokes the function allQueensNumba defined in Tab. 2 starting at the second row instead of the first, as in allQueensRec. Summing the results for all possible values of j yields the same result by performing n independent computations. In Tab. 8, we integrate this code into two functions that spread the tasks across separate cores: allQueensPara uses the first method described above while allQueensPool uses the second. We will refer to them as para and pool respectively. Numba can take parallelization even further and produce functions that exploit the GPU. However, we did not explore this aspect in this study because our problem is not conducive to GPU processing.

In Fig. 2(b), we compare runtimes of the sequential Numba-compiled code of the previous section with our two parallelized versions. As before, the sequential code is depicted by the green curve while the two parallel versions are depicted by the red and blue curves, labeled para and pool respectively. para clearly delivers a significant improvement. However, for smaller values of n, we noted that para does not always fully use the 8 cores of our machine, which impacts its performance. For values of n up to 13, the overhead involved in spawning new processes dominates the computational cost of pool and makes it uncompetitive. However, beyond that point, this overhead becomes negligible with respect to the rest of the computation and pool starts dominating, albeit only by a small margin for large values of n, as can be seen in Tab. 15.

```python
import numpy as np
from numba import njit

@njit()
def allQueensNmb(n,i=0,col=None,dg1=None,dg2=None):
    # All columns and diagonals are initially available
    col = np.ones(n,dtype=np.bool_)
    dg1 = np.ones(2*n,dtype=np.bool_)
    dg2 = np.ones(2*n,dtype=np.bool_)
    return allQueensNumba(n,0,col,dg1,dg2)

@njit()
def allQueensNumba(n,i,col,dg1,dg2):
    if n == i :
        return 1
    nsol=0
    for j in range(n):
        if (col[j] and dg1[i+j] and dg2[i-j+n]):
            # Place a queen at (i,j), recurse, then undo the marking
            col[j] = False
            dg1[i+j] = False
            dg2[i-j+n] = False
            nsol+=allQueensNumba(n,i+1,col,dg1,dg2)
            col[j] = True
            dg1[i+j] = True
            dg2[i-j+n] = True
    return nsol
```

Table 2: The Python code of Tab. 1 slightly modified to force Numba compilation.

To again compare against Go and C++, we rewrote the code in these two languages using their built-in multi-threading capabilities, as shown in Tabs. 9 and 10. Note that we used the function template std::async to make the C++ version compact. The corresponding performance measurements are depicted by the cyan and purple curves of Fig. 2(b). As before, for small values of n, para and pool are uncompetitive because the initial overhead is too large. However, for larger values of n they catch up and eventually do better than Go and almost as well as C++. In short, there are corner cases in which it might pay to switch from Python to C++ or Go but it is not clear how pervasive they are in our research practice.

In the two previous sections, we have argued that Numba is a powerful tool to painlessly compile potentially slow Python code so that it runs almost as fast as Go and C++.
However, it also has limitations: only a subset of Python [11] and NumPy [9] features are available inside compiled functions. Numba has a compilation mode that generates code able to handle all values as Python objects and uses the Python C API to perform all operations on such objects. Unfortunately, relying on this mode results in almost no performance gain over non-compiled code. This is why we used the njit() decorator in all our examples. It yields much faster code but requires that the native types of all values in the function can be inferred, which is not necessarily true in standard Python. Otherwise, the compilation fails.

```go
func allQueensRec(n int) int {
	// Allocate slices
	col := make([]bool, n, n)
	dg1 := make([]bool, 2*n, 2*n)
	dg2 := make([]bool, 2*n, 2*n)
	// All columns and diagonals are initially available
	for i := 0; i < n; i++ {
		col[i] = true
	}
	for i := 0; i < 2*n; i++ {
		dg1[i] = true
		dg2[i] = true
	}
	// Perform the recursive computation and return the results
	return allQueensRecursive(n, 0, col, dg1, dg2)
}

// The slices share their backing arrays with the caller, so the marks set
// here are visible at every recursion level.
func allQueensRecursive(n int, i int, col []bool, dg1 []bool, dg2 []bool) int {
	if n == i {
		return 1
	}
	nsol := 0
	for j := 0; j < n; j++ {
		if col[j] && dg1[i+j] && dg2[i-j+n] {
			col[j] = false
			dg1[i+j] = false
			dg2[i-j+n] = false
			nsol += allQueensRecursive(n, i+1, col, dg1, dg2)
			col[j] = true
			dg1[i+j] = true
			dg2[i-j+n] = true
		}
	}
	return nsol
}
```

Table 3: Go version of the Python code of Tab. 1.

In practice, this imposes an additional workload on the programmer, who has to figure out what parts of the code are computationally expensive, encapsulate them in separate functions, and make sure that these functions can be compiled using the no-python mode that njit() enforces. This is probably why there are ongoing efforts to optimize whole Python programs such as PyPy [13]. Unfortunately, the results are not always compatible with libraries utilizing the C API, such as those routinely used in the field of scientific computing. As discussed in the appendix, Julia is a potential alternative to Python/Numba that supports both high-performance scientific computing and fast prototyping, is compiled, and could eventually address this problem.
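The step of figuring out which parts of the code are computationally expensive can be approached with Python's built-in profiler. The sketch below is only an illustration under stated assumptions: findHotspot is a hypothetical helper name, the N-queens counter is a plain-Python stand-in for the code being profiled, and the helper reads the stats dictionary of the pstats.Stats object, whose values hold each function's own time as their third entry.

```python
import cProfile
import pstats


def allQueensRecursive(n, i, col, dg1, dg2):
    """Count the ways to fill rows i..n-1 of an n x n board."""
    if i == n:
        return 1
    nsol = 0
    for j in range(n):
        if col[j] and dg1[i + j] and dg2[i - j + n]:
            col[j] = dg1[i + j] = dg2[i - j + n] = False
            nsol += allQueensRecursive(n, i + 1, col, dg1, dg2)
            col[j] = dg1[i + j] = dg2[i - j + n] = True
    return nsol


def allQueensRec(n):
    """Plain-Python counter: set up the bookkeeping lists and start at row 0."""
    return allQueensRecursive(n, 0, [True] * n, [True] * (2 * n), [True] * (2 * n))


def findHotspot(func, *args):
    """Hypothetical helper: profile one call of func(*args).

    Returns the call's result and the name of the function in which the
    profiler recorded the most time, not counting time spent in callees.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    # Keys are (file, line, function name); entry 2 of each value is own time.
    key = max(stats.stats, key=lambda k: stats.stats[k][2])
    return result, key[2]


nsol, hotspot = findHotspot(allQueensRec, 8)
print(f"{nsol} solutions; most time spent in {hotspot}")
```

The function reported this way is the natural candidate to isolate and decorate with njit(), provided that it only uses features that Numba can compile.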
As Computer Vision and Machine Learning researchers, we primarily need a language that allows us to test and refine ideas quickly while giving us access to as many mathematical, image processing, and machine learning libraries as possible. The latter spares us the need to reinvent the wheel every time we want to try something new. Maintainability and the ability to work in large teams are secondary considerations as our code often stops evolving once the PhD student or post-doctoral researcher who wrote it leaves our lab. Before that happens, we typically make it publicly available to demonstrate that the ideas we published in conferences and journals truly work and, in the end, that is often its main function.

Python fits that bill perfectly at the cost of being slow when performing operations such as loops. Fortunately, as we showed in this report, this shortcoming can be largely overcome by using the Numba compiler, which delivers performance comparable to that of C++; C++ itself tends to be faster than Go. This suggests that a perfectly valid workflow is to first write and debug a program in ordinary Python, identify the computational bottlenecks, and use Numba to eliminate them. This will work most of the time. In the rare cases where it does not, we can rewrite the relevant section of the code in C++ and call it from Python, which can be achieved using Cython [2] or pybind11 [4]. Interestingly, this approach harks back to the standard way one used to work in the much older Common Lisp language, as discussed in the appendix.

```cpp
#include <cstring>  // memset

int allQueensRecursive(int n, int i, bool *col, bool *dg1, bool *dg2) {
    if (n == i) {
        return 1;
    }
    int nsol = 0;
    for (int j = 0; j < n; j++) {
        if (col[j] && dg1[i+j] && dg2[i-j+n]) {
            col[j] = false;
            dg1[i+j] = false;
            dg2[i-j+n] = false;
            nsol += allQueensRecursive(n, i+1, col, dg1, dg2);
            col[j] = true;
            dg1[i+j] = true;
            dg2[i-j+n] = true;
        }
    }
    return nsol;
}

int allQueensRec(int n) {
    // Allocate dynamic memory on the heap
    bool *col = new bool[n];
    bool *dg1 = new bool[2*n];
    bool *dg2 = new bool[2*n];
    // All columns and diagonals are initially available
    memset((void *)col, 1, n*sizeof(bool));
    memset((void *)dg1, 1, 2*n*sizeof(bool));
    memset((void *)dg2, 1, 2*n*sizeof(bool));
    // Perform the recursive computation
    int nsol = allQueensRecursive(n, 0, col, dg1, dg2);
    // No garbage collector, must deallocate to prevent memory leaks
    delete[] col;
    delete[] dg1;
    delete[] dg2;
    return nsol;
}
```

Table 4: C++ version of the Python code of Tab. 1. To initialize the arrays, we could have used loops as in the Go code of Tab. 3. Instead we used the lower-level function memset, which performs the same task without explicit loops and can therefore be expected to be faster.

typedef vector
In his book [15], N. Wirth proposed the N-queens algorithm implemented in Pascal and shown in Tab. 11. It is specific to the case n = 8 and is designed to stop when the first solution is found. It then returns the corresponding configuration. As shown in Tab. 12, this can be done even more concisely in Prolog, a logic programming language of the same era as Pascal. What makes Prolog particularly interesting is that, of all the languages discussed in this report, it is the only one that forces a truly different approach to programming. It is well suited to performing the kind of systematic exploration and backtracking that solving the N-queens problem requires and is still used to solve specific tasks that involve rule-based logical queries such as searching databases.

As not all implementations of the Pascal standard support dynamic arrays, extending the program of Tab. 11, which is specific to the n = 8 case, to the general case would require manually allocating memory using pointers, much as in C. This would not be necessary in the even older Common Lisp language, as shown in Tab. 13. Although among the oldest languages still in regular use, Lisp offers many of the same amenities as Python, that is, dynamically allocated arrays, sophisticated loop structures, and garbage collection among others. It can be used either in interpreted or compiled form, much like Python without and with Numba. When invoking the compiler, the optional type declarations help it generate faster code. A standard approach to developing in Lisp is therefore to prototype quickly without the declarations and then add them as needed to speed up the code. Interestingly, Python is now moving in that direction with its support for type hints but does not yet enforce that variable and argument values match their declared types at runtime. They are only intended to be used by third-party tools such as type checkers, IDEs, and linters [14].

As can be seen in Fig. 2(a), because it is compiled, Lisp does not perform so badly compared to the more recent languages we discussed in this report. The even newer Julia [6] language can be understood as being related to it in that it supports rapid prototyping by allowing interactive execution while being compiled. Type declarations are available but optional and the code is very Python-like, as can be seen in Tab. 14. Like Numba, Julia uses LLVM [7] to perform low-level optimizations and produce efficient native binary code. Unlike in C++ or Go, there is no explicit compilation step and yet it delivers performance that is almost on par with that of the other compiled languages we discussed, as can be seen in Fig. 2(a) and Tab. 15.

This makes Julia a potentially attractive alternative to Python/Numba. Unfortunately, there are some significant obstacles to its adoption. First, it is still a new language. Some important features remain experimental and the number of third-party libraries is limited, whereas Python gives access to a wealth of powerful libraries, such as the deep learning ones that have become absolutely central to our research activities. Furthermore, it features some design choices that differentiate it from currently popular languages [5], such as 1-indexed arrays and multiple dispatch instead of classes. Whether or not these choices are wise, they make it harder to switch from an established language like Python to Julia.

B Raw Data
The performance numbers we used to produce the plots of Fig. 2 are given in Tab. 15. For both the sequential and parallel versions of the code and for each value of n, the time for the fastest implementation appears in red and the one for the second best in blue. These numbers were obtained on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system. For all versions of the code, we ran 20 trials for the smallest board sizes, 10 for intermediate ones, and 3 for n = 18, and computed the mean and variance in each case. We have rerun all these benchmarks on an Intel Xeon X5690 CPU running Ubuntu 18.04 and the overall ranking of the implementations was unchanged.

References

[1] J. Bell and B. Stevens. A survey of known results and research areas for n-queens.
Discrete Mathematics, 309(1):1–31, 2009.
[2] Cython: C-extensions for Python. https://cython.org/
[3] Global interpreter lock. https://en.wikipedia.org/wiki/Global_interpreter_lock
[4] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 – seamless operability between C++11 and Python, 2017. https://github.com/pybind/pybind11
[5] Julia: Noteworthy Differences from other Languages. https://docs.julialang.org/en/v1/manual/noteworthy-differences/
[6] The Julia Programming Language. https://julialang.org/
[7] The LLVM Compiler Infrastructure Project. https://llvm.org/
[8] Numba: A High Performance Python Compiler. https://numba.pydata.org/
[9] Numba: Supported NumPy Features. https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
[10] Numba: Automatic parallelization with @jit. https://numba.pydata.org/numba-doc/dev/user/parallel.html
[11] Numba: Supported Python Features. https://numba.pydata.org/numba-doc/dev/reference/pysupported.html
[12] T. Preußer, B. Nägel, and R. Spallek. Putting queens in carry chains. 2009.
[13] PyPy: a fast, compliant alternative implementation of the Python language. https://pypy.org/
[14] Support for Type Hints in Python. https://docs.python.org/3/library/typing.html
[15] Niklaus Wirth et al. Algorithms + Data Structures = Programs. Prentice-Hall, 1976.

```pascal
program eightqueen1(output);
var
  i : integer;
  q : boolean;
  a : array[1..8] of boolean;
  b : array[2..16] of boolean;
  c : array[-7..7] of boolean;
  x : array[1..8] of integer;

procedure try(i : integer; var q : boolean);
var
  j : integer;
begin
  j := 0;
  repeat
    j := j + 1;
    q := false;
    if a[j] and b[i + j] and c[i - j] then
    begin
      x[i] := j;
      a[j] := false;
      b[i + j] := false;
      c[i - j] := false;
      if i < 8 then
      begin
        try(i + 1, q);
        if not q then
        begin
          a[j] := true;
          b[i + j] := true;
          c[i - j] := true;
        end
      end
      else
        q := true
    end
  until q or (j = 8);
end;

begin
  for i := 1 to 8 do a[i] := true;
  for i := 2 to 16 do b[i] := true;
  for i := -7 to 7 do c[i] := true;
  try(1, q);
  if q then
    for i := 1 to 8 do write(x[i]:4);
  writeln
end.
```

Table 11: Pascal program by Niklaus Wirth in 1976. It finds one solution to the eight queens problem.

```prolog
:- use_module(library(clpfd)).

/* Use clpfd package to loop through all configurations until a feasible one is found */
n_queens(N, Qs) :-
    length(Qs, N),
    Qs ins 1..N,
    safe_queens(Qs).

/* Predicate is true if the configuration is feasible */
safe_queens([]).
safe_queens([Q|Qs]) :-
    safe_queens(Qs, Q, 1),
    safe_queens(Qs).

safe_queens([], _, _).
safe_queens([Q|Qs], Q0, D0) :-
    Q0 #\= Q,
    abs(Q0 - Q) #\= D0,
    D1 #= D0 + 1,
    safe_queens(Qs, Q0, D1).

/* Example */
?- n_queens(8, Qs), labeling([ff], Qs).
Qs = [1, 5, 8, 6, 3, 7, 2, 4] ;
Qs = [1, 6, 8, 3, 7, 4, 2, 5] .
```

```lisp
(defun allQueensRec (n)
  (declare (type fixnum n))
  (let ((col (make-array n :initial-element t :element-type 'boolean))
        (dg1 (make-array (* 2 n) :initial-element t :element-type 'boolean))
        (dg2 (make-array (* 2 n) :initial-element t :element-type 'boolean)))
    (declare (type (array boolean 1) col dg1 dg2))
    (allQueensRecursive n 0 col dg1 dg2)))

(defun allQueensRecursive (n i col dg1 dg2)
  ;; Optional declarations. Some compilers exploit them to speed up the code
  (declare (type (array boolean 1) col dg1 dg2))
  (declare (type fixnum n i))
  (if (= i n)
      1
      (let ((nsol 0))
        (declare (type fixnum nsol))
        (loop for j from 0 below n
              when (and (aref col j) (aref dg1 (+ i j)) (aref dg2 (- (+ i n) j)))
              do (setf (aref col j) nil
                       (aref dg1 (+ i j)) nil
                       (aref dg2 (- (+ i n) j)) nil)
                 (incf nsol (allQueensRecursive n (+ i 1) col dg1 dg2))
                 (setf (aref col j) t
                       (aref dg1 (+ i j)) t
                       (aref dg2 (- (+ i n) j)) t))
        nsol)))
```

Table 13: Common Lisp version of the Python code of Tab. 1.

```julia
function allQueensRecursive(n, i, col, dg1, dg2)
    if n == i
        return 1
    end
    nsol = 0
    for j = 0:n-1
        # Julia arrays are 1-indexed, hence the +1 offsets
        if (col[j+1] && dg1[i+j+1] && dg2[i-j+n+1])
            col[j+1] = false
            dg1[i+j+1] = false
            dg2[i-j+n+1] = false
            nsol += allQueensRecursive(n,i+1,col,dg1,dg2)
            col[j+1] = true
            dg1[i+j+1] = true
            dg2[i-j+n+1] = true
        end
    end
    return nsol
end

function allQueensRec(n)
    col = ones(Bool, n)
    dg1 = ones(Bool, 2*n)
    dg2 = ones(Bool, 2*n)
    return allQueensRecursive(n,0,col,dg1,dg2)
end
```

Table 14: Julia version of the code of Tab. 1.
Table 15: Benchmarking results in seconds per trial, reported in two panels: Sequential and Parallel.
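The measurement protocol described in this appendix, several trials per board size followed by the mean and variance of the timings, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions rather than the authors' benchmarking script: timeTrials is a hypothetical helper name, the workload is a plain-Python counter, and the sample variance is used because the text does not say which estimator was chosen.

```python
import statistics
import time


def allQueensRec(n):
    """Plain-Python N-queens counter, used here only as a workload."""
    col, dg1, dg2 = [True] * n, [True] * (2 * n), [True] * (2 * n)

    def place(i):
        if i == n:
            return 1
        nsol = 0
        for j in range(n):
            if col[j] and dg1[i + j] and dg2[i - j + n]:
                col[j] = dg1[i + j] = dg2[i - j + n] = False
                nsol += place(i + 1)
                col[j] = dg1[i + j] = dg2[i - j + n] = True
        return nsol

    return place(0)


def timeTrials(func, n, trials):
    """Hypothetical helper: time `trials` calls of func(n).

    Returns the mean and the sample variance of the wall-clock durations,
    in seconds. At least two trials are needed for the variance.
    """
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        func(n)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.variance(durations)


mean, variance = timeTrials(allQueensRec, 6, trials=3)
print(f"n=6: mean {mean:.6f} s, variance {variance:.3e} s^2")
```

When timing a Numba-compiled function in this way, one would probably call it once before the loop so that the one-off compilation time is not counted as a trial.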