Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers
Jun Doi
IBM Quantum, IBM Research - Tokyo
Tokyo, Japan
[email protected]

Hiroshi Horii
IBM Quantum, IBM Research - Tokyo
Tokyo, Japan
[email protected]
Abstract—Classical computers require large memory resources and computational power to simulate quantum circuits with a large number of qubits. Even supercomputers that can store huge amounts of data face a scalability issue in parallel quantum computing simulations because of the latency of data movement between distributed memory spaces. Here, we apply a cache blocking technique that inserts swap gates into quantum circuits to decrease data movement. We implemented this technique in the open source simulation framework Qiskit Aer. We evaluated our simulator on GPU clusters and observed good scalability.
Index Terms—Quantum Computing Simulation, Cache blocking, parallel computing, GPGPU, GPU, Thrust, MPI
I. INTRODUCTION
Quantum computers with hundreds of qubits are becoming realistic in the near future, and existing quantum computers are available for the development and evaluation of quantum applications [1]–[3]. Some of these quantum computers are available as cloud services, but their computing resources are still limited or access to them is restricted. Moreover, the number of qubits will remain limited and quantum gate operations will remain noisy for the next few decades. Therefore, quantum computing simulations running on classical computers are as necessary for developing and debugging new quantum applications as actual quantum hardware is. However, quantum computing simulators on classical computers require huge amounts of memory and computational resources. An n-qubit state vector requires an array of length 2^n to be stored on a classical computer; for example, a 50-qubit simulation needs 16 petabytes of storage for double precision floating point numbers. This means that simulating large numbers of qubits on a classical computer is a form of supercomputing. In this study, we focus on accelerating quantum computing simulations by using GPUs on hybrid distributed parallel computers. Universal quantum computing simulators [4]–[6] are available for developing quantum applications with smaller numbers of qubits on classical computers, even on desktop or laptop personal computers. To simulate rather more qubits, parallel simulators [7]–[11] must store the quantum state in the huge distributed memory of parallel-processing computers. Parallel computing has the advantage of accelerating simulations that require a huge amount of computational power. Graphics processing units (GPUs) are particularly useful in this respect: their thousands of hardware threads update probability amplitudes in parallel, and their high bandwidth memory (HBM) reduces bottlenecks in loading and storing the amplitudes.
Some quantum computing simulators [11]–[14] support the use of GPUs to accelerate simulations. We focus on optimizing a state vector simulator, which is simple and stable for simulating both noisy and noiseless quantum computations on classical parallel computers. There are some approaches to decreasing the memory usage of a quantum computing simulation on classical computers. For example, data compression specialized for quantum computing simulation was introduced in [15], [16] to simulate noisy quantum computers with many qubits on supercomputers. However, lossy data compression and the resulting loss of fidelity limit the quantum algorithms that can be simulated. The use of tensor networks [17] is another approach that simulates noiseless quantum computers while storing the quantum state as tensor matrices. Tensor-network-based simulators can reduce memory for specific applications in which qubits interact weakly, but not for general applications in which qubits interact strongly. We believe that a state vector simulator will remain the most popular for developing and debugging quantum algorithms even though it requires huge memory to directly store 2^n probability amplitudes. Users can simply increase the number of available qubits with additional computing nodes and data storage. Even when a system can provide enough memory, inefficient memory usage in state vector simulators is a critical problem when simulating many qubits; the communication overheads on hybrid parallel computers become inhibitors to scaling simulation performance. To parallelize a state vector simulator on a distributed memory space, probability amplitudes of the quantum state must be exchanged across different memory spaces.
In general, network bandwidth between different memory spaces is much lower than the memory bandwidth of a CPU or GPU. Minimizing data exchanges across memory spaces is the key to scaling simulation performance. Here, we propose a new technique to decrease the number of data exchanges across distributed memory spaces by combining quantum circuit optimization and parallel optimization. Qubit reordering or remapping are circuit optimizations used to map circuits onto the topology of quantum devices.
Qubit reordering can also be used to decrease the data exchanges caused by gates on higher qubits. The technique we developed decreases data exchanges by moving all the gates onto a smaller number of qubits by inserting noiseless swap gates. The operation of the optimized circuits resembles cache blocking on a classical computer, and the concept of the optimization is similar, because we block the gates on the qubits that can be accessed faster. Section II overviews quantum computing and state vector simulators. In Section III, we describe how to parallelize state vector simulators on distributed computers, and in Section IV, we describe the circuit optimization and the cache blocking implementation of a parallel state vector simulator. In Section V, we discuss performance evaluations on a hybrid parallel computer accelerated by GPUs. We conclude the paper in Section VI with mention of future work.

II. PARALLEL STATE VECTOR SIMULATION
A. State Vector Simulation Overview
We focus on simulating universal quantum computers based on quantum circuits that consist of one-qubit rotation gates and two-qubit CNOT gates. These gates are known to be universal; i.e., any quantum circuit (realizing some quantum algorithm) can be constructed from CNOT and one-qubit rotation gates. Quantum circuits run quantum computing programs by applying quantum gates to qubits. A qubit has two basis states, |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T. While a classical bit stores 0 or 1 exclusively, a qubit allows a superposition of the two basis states, |ψ⟩ = a_0|0⟩ + a_1|1⟩, where the probability amplitudes a_0 and a_1 are complex numbers satisfying |a_0|^2 + |a_1|^2 = 1. When a qubit is measured, an outcome of 0 or 1 is obtained with probability |a_0|^2 or |a_1|^2, respectively. This means two complex numbers must be stored in the simulation to track the quantum state of a one-qubit register. The quantum state of an n-qubit register is a linear superposition of 2^n basis states, each of the form of the tensor product of n qubits. For example, the basis state of the digit '2' (10 in binary) can be represented as |2⟩ = |1⟩ ⊗ |0⟩ = |10⟩ = (0, 0, 1, 0)^T, where T denotes the transpose of a vector. Thus, the quantum state of the n-qubit register can be written as |ψ⟩ = Σ_{i=0}^{2^n − 1} a_i |i⟩. Note that each state |i⟩ has its own complex probability amplitude a_i. Overall, simulating quantum circuits requires that these 2^n complex numbers be stored to enable tracking of the evolution of the quantum state of an n-qubit register. We denote these complex numbers as 'qreg.' Quantum gates transform the quantum state of the n-qubit register by rotating its complex vector. We focus on the gate set defined by the OpenQASM specification [18]. The set consists of an arbitrary single-qubit (rotation) gate (a u3 gate) and a two-qubit gate (controlled-NOT, CNOT, or CX).
The u3 gates are rotations of the one-qubit state and are mathematically defined as

u3(θ, φ, λ) = [ cos(θ/2)          −e^{iλ} sin(θ/2)
                e^{iφ} sin(θ/2)   e^{i(φ+λ)} cos(θ/2) ].  (1)

When it is applied to the k-th qubit of {q_{n−1} · · · q_0}, a u3 gate transforms the probability amplitudes of the basis states |x_{n−1} · · · x_{k+1} 0 x_{k−1} · · · x_0⟩ and |x_{n−1} · · · x_{k+1} 1 x_{k−1} · · · x_0⟩ for each x_i in {0, 1} by a linear transformation of their previous probability amplitudes. This implies that applying a single-qubit u3 gate can change all 2^n complex numbers stored for tracking the evolution of the quantum state. However, the changes can be computed in parallel, as each of the new probability amplitudes depends only on its own value and on the value of the probability amplitude whose basis state differs at the k-th position.

The CNOT gates are applied to two qubits: a control and a target qubit. If the control qubit is in state |1⟩, the CNOT gate flips the target qubit. If the control qubit is in state |0⟩, the CNOT gate does not change the target qubit. If we have two qubits and take the higher bit as the control qubit and the other as the target qubit, the CNOT gate is mathematically defined as

CNOT = [ 1 0 0 0
         0 1 0 0
         0 0 0 1
         0 0 1 0 ].  (2)

When applied to the c-th and t-th qubits (control and target, respectively), the CNOT gate swaps the probability amplitude of |x_{n−1} · · · (x_c = 1) · · · (x_t = 0) · · · x_0⟩ with that of |x_{n−1} · · · (x_c = 1) · · · (x_t = 1) · · · x_0⟩. The swaps, which affect half of the probability amplitudes of the quantum state, can also be performed in parallel.

On a classical computer, the state vector is stored as an array of probability amplitudes expressed as floating point complex numbers. Fig. 1 shows an example of the state vector array of a 4-qubit quantum circuit stored in the memory of a classical computer.
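These update rules can be sketched in plain C++ (a CPU-only illustration for exposition, not the paper's Thrust implementation; all names are ours):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Apply a u3 gate to qubit k of an n-qubit state vector.
// Each amplitude pairs with the one whose index differs only in bit k.
void apply_u3(std::vector<cplx>& state, int k,
              double theta, double phi, double lam) {
    const cplx m00 = std::cos(theta / 2);
    const cplx m01 = -std::exp(cplx(0, lam)) * std::sin(theta / 2);
    const cplx m10 = std::exp(cplx(0, phi)) * std::sin(theta / 2);
    const cplx m11 = std::exp(cplx(0, phi + lam)) * std::cos(theta / 2);
    const std::size_t mask = std::size_t(1) << k;
    for (std::size_t i = 0; i < state.size(); ++i) {
        if (i & mask) continue;           // visit each pair once (bit k = 0)
        const std::size_t j = i ^ mask;   // partner index, bit k = 1
        const cplx a0 = state[i], a1 = state[j];
        state[i] = m00 * a0 + m01 * a1;
        state[j] = m10 * a0 + m11 * a1;
    }
}

// Apply a CNOT gate: swap the pair of amplitudes whose control bit c is 1
// and that differ only in the target bit t.
void apply_cnot(std::vector<cplx>& state, int c, int t) {
    const std::size_t cm = std::size_t(1) << c;
    const std::size_t tm = std::size_t(1) << t;
    for (std::size_t i = 0; i < state.size(); ++i) {
        if ((i & cm) && !(i & tm))        // control = 1, target = 0
            std::swap(state[i], state[i | tm]);
    }
}
```

For instance, applying u3(π, 0, π) to qubit 0 implements an X gate, and a subsequent CNOT with control 0 and target 1 maps |00⟩ to |11⟩.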
If we store each complex number as two double precision floating point numbers, an n-qubit state vector requires 16 × 2^n bytes of memory on a classical computer. Quantum gate operations can be simulated on a classical computer by multiplying a matrix with pairs of probability amplitudes stored in the state vector array. Fig. 2 shows an example of a u3 gate that rotates the 2nd qubit (zero-based index) of a 4-qubit state vector stored in the memory of a classical computer. A 2 × 2 matrix is multiplied by pairs of probability amplitudes, which are selected by applying an XOR operation to their addresses; e.g., the binary indices 0011 and 0111 become a pair through 0011 ⊕ 0100 = 0111, where the mask 0100 corresponds to qubit k = 2. The u3 gate constructs pairs from all of the probability amplitudes in the state vector, meaning that all the elements in the state vector are accessed and updated to simulate a gate operation. This requires several bytes of memory traffic per floating point operation in double precision, so simulating a gate operation on a classical computer is a memory-bandwidth bound problem.

Fig. 1. Example of a 4-qubit state vector array on a classical computer. The array has 2^4 = 16 complex numbers to store the probability amplitudes.
Fig. 2. A u3 gate applied to the 2nd qubit (zero-based index). Pairs of probability amplitudes whose distance is 2^2 = 4 are selected and rotated by multiplying them with a 2 × 2 matrix.

B. Chunk-based Parallelization
In a parallel computing environment, the distributed processes on nodes and GPUs have their own memory spaces. A process can only perform calculations on data allocated in its memory space. To maximize parallelism, data should in general not be frequently exchanged across spaces. On the other hand, saving memory is very important for storing 2^n probability amplitudes. Once their size exceeds the capacity of a memory space, some amplitudes are spilled out to slow disk storage. We divide the state vector array into sub-arrays called chunks and allocate them to distributed memory spaces without duplication. Fig. 3 shows an example in which an n-qubit state vector is divided over four distributed memory spaces with four chunks of 2^{n−2} probability amplitudes. Gate operations on qubits 0 to n − 3 are performed within a process, and data exchange is necessary for operations on qubits n − 2 and n − 1. Fig. 4 shows a naive way to do this; it has a buffer to receive a copy of the probability amplitudes from remote memory. This implementation is very simple; however, it requires double the memory space to hold a copy of the remote memory and hence is not memory efficient. This hinders us from increasing the number of qubits to be simulated.

Fig. 3. Distributing an n-qubit state vector to 4 distributed memory spaces. The number in each box shows the beginning index of the probability amplitudes in each sub-array.

Fig. 4. Naive implementation of probability amplitude exchange between distributed memory spaces. In this example, we apply a gate to qubit k = n − 1. This implementation requires twice the memory space.

We use a chunk as the unit of data exchange.
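Under this chunk scheme, how a gate moves data can be decided from its target qubit alone. A minimal sketch (our own names; assuming m = 2^d memory spaces and evenly distributed chunks, so the top d chunk-index bits select the memory space):

```cpp
#include <cassert>
#include <cstdint>

// How a gate on target qubit k interacts with the chunk layout:
// n qubits in total, nc qubits per chunk, m = 2^d distributed memory spaces.
enum class GateScope {
    InsideChunk,   // k < nc: no data movement at all
    InsideSpace,   // nc <= k < n - d: pairs chunks in the same memory space
    AcrossSpaces   // k >= n - d: chunks must be exchanged between spaces
};

GateScope classify(int k, int n, int nc, int d) {
    if (k < nc) return GateScope::InsideChunk;
    if (k < n - d) return GateScope::InsideSpace;
    return GateScope::AcrossSpaces;
}

// With chunks distributed evenly, the memory space owning a chunk is
// given by the top d bits of its chunk index.
int owner_space(std::uint64_t chunk_index, int n, int nc, int d) {
    return int(chunk_index >> (n - nc - d));
}
```

For example, with n = 30 qubits, nc = 20 chunk qubits, and m = 4 memory spaces (d = 2), gates on qubits 0–19 stay inside a chunk, gates on qubits 20–27 pair chunks within one memory space, and only gates on qubits 28–29 force an exchange across memory spaces.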
Double memory space would be necessary for buffering if we divided the state vector into large chunks as shown in Fig. 4. Therefore, we divide the state vector into small chunks of nc qubits each and exchange them in a pipeline to minimize the space needed for buffering. Fig. 5 shows an example that uses only one buffer to exchange chunks.

Fig. 5. Dividing a state vector into small chunks and exchanging probability amplitudes by chunks. Only one additional chunk per memory space is needed to perform the probability amplitude exchange.

In summary, a gate operation on chunks over m memory spaces has three types based on its target qubit k: if k is lower than nc, the gate can be processed independently within each chunk; if k is lower than n − log2(m), referring to another chunk in the same memory space is necessary; otherwise, chunks must be exchanged across memory spaces. By minimizing the gate operations on qubits greater than or equal to n − log2(m), data exchanges across memory spaces are reduced.

III. IMPLEMENTATION OF STATE VECTOR SIMULATOR
A. Qiskit Aer Overview
Qiskit Aer [14] is an open source framework for quantum circuit simulation, and we implemented a parallel quantum computing simulator for high-performance computers as a backend of Qiskit Aer. Qiskit Aer is one of the components of Qiskit [4], which provides seamless access to quantum hardware and simulators together with libraries that help the development of quantum software. In addition to ideal (noiseless) simulation, Qiskit Aer supports a noise model for noisy simulation, which allows users to develop quantum algorithms for realistic noisy environments. The frontend of Qiskit Aer translates such Qiskit-specific APIs into the basic matrix operations described in Section II for the backends. Depending on the characteristics and usage of their quantum circuits, users select one of the backends from the state vector, unitary matrix, density matrix, stabilizer state, superoperator, and matrix product state (MPS) simulators. Our backend extends the state vector simulator with GPU acceleration and support for distributed environments.
B. Accelerating Simulation by GPUs
Qiskit Aer provides a state vector simulator accelerated by GPUs, which is the basis of our backend for GPUs and MPI. Since Qiskit Aer is written in C++14 and the standard template library (STL), its GPU code uses the Thrust library [19], a template class library optimized for GPU acceleration. Thrust provides readability and productivity in addition to performance; its templates are more readable than CUDA [20] and can be compiled to GPU, OpenMP, and Intel Threading Building Blocks (TBB) [21] binaries for various host computers. The Thrust library is based on a vector class that resembles the std::vector class of the STL. The library includes GPU-ready optimized operations on vectors such as copying, transforming, sorting, and reduction. The API supports simple lambda expressions that work in both CPU and GPU environments. Listing 1 is an example of a customized kernel in Thrust that calculates y = a × x + y in double precision. Inside thrust::transform, the Thrust runtime automatically loads and stores the elements of the two vectors x_dev and y_dev while evaluating the user-defined binary operator daxpy(a) to update the corresponding elements of y_dev.

struct daxpy {
    double a;
    daxpy(double a_in) { a = a_in; }
    // user defined binary operation
    __host__ __device__
    double operator()(double& x, double& y) {
        return a * x + y;
    }
};
thrust::host_vector<double> x_dev(n), y_dev(n);
// ...
thrust::transform(x_dev.begin(), x_dev.end(),
                  y_dev.begin(), y_dev.begin(), daxpy(a));

Listing 1. Customized Thrust kernel calculating y = a × x + y.

Implementing quantum gate operations is more complex than Listing 1. Unlike thrust::transform with a binary operator such as daxpy(a), two elements of a vector are accessed and updated non-sequentially. Thus, to load and store amplitudes at computed addresses, the gate kernels are implemented as functors that receive a raw pointer to the state vector.
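A gate kernel follows the same functor pattern as the daxpy example, but each invocation computes a pair of non-sequential addresses from a pair index. The following is a CPU-only sketch of such a u3 functor (our illustration with hypothetical names, not the Qiskit Aer source; the real kernels carry __host__ __device__ and run under Thrust's parallel dispatch):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Functor in the style of Listing 1: invoked once per pair of amplitudes.
struct u3_gate {
    cplx m00, m01, m10, m11;  // the 2x2 u3 matrix
    cplx* data = nullptr;     // raw pointer to the state vector
    std::size_t mask = 1;     // bit of the target qubit k

    void operator()(std::size_t p) const {
        // Expand pair number p into the two partner addresses by inserting
        // a 0 (then a 1) at bit position k of p.
        std::size_t lo = ((p & ~(mask - 1)) << 1) | (p & (mask - 1));
        std::size_t hi = lo | mask;
        cplx a0 = data[lo], a1 = data[hi];
        data[lo] = m00 * a0 + m01 * a1;
        data[hi] = m10 * a0 + m11 * a1;
    }
};

// Plain loop standing in for thrust::for_each over the 2^(n-1) pair indices.
void apply(std::vector<cplx>& state, u3_gate g) {
    g.data = state.data();
    for (std::size_t p = 0; p < state.size() / 2; ++p) g(p);
}
```

With mask = 0100 (qubit k = 2), pair number 3 expands to the addresses 0011 and 0111, matching the pairing shown in Fig. 2.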
IV. CACHE BLOCKING

A. Transpiling Circuit for Cache Blocking
Since the performance of the state vector simulator is bounded by memory bandwidth, reducing data movement is essential to shorten simulation time. In hybrid parallel computing, two types of memory spaces exist: CPU and GPU. We therefore list five types of data movement to be minimized for simulator performance:
• Loading data from the memory of a CPU
• Copying data from the memory of a CPU to the memory of a GPU
• Loading data from the memory of a GPU
• Copying data from one GPU to another GPU
• Copying data to other nodes
Cache blocking is a well-known technique in high-performance computing research for avoiding repeated fetches of data from main memory into CPU caches. We extend this technique to avoid the frequent data movements listed above by reusing data stored in local fast memory as long as possible. If consecutive gates operate on qubits that are smaller than the number of chunk qubits nc, no data movement occurs while simulating them. We transform (transpile) a circuit so that it has many long sequences of such consecutive gates. If some qubits are not updated by any gates, we can use bit reordering, a transpilation technique that remaps gates onto different qubits.
Fig. 7. Memory hierarchy on a classical computer. Cache memory is placed near the CPU core; it is very fast, but its size is limited. In this example, only one chunk can be stored in the cache, so if we fetch a chunk from main memory, the oldest chunk has to be swapped out.
Fig. 8. Qubits of a quantum circuit viewed as a cache on the state vector simulator. Gates on the smaller qubits (smaller than nc) can be calculated without moving data, so this memory is very fast, like a cache memory on a classical computer. A swap gate can be used to fetch and swap out a pair of qubits.

The qubits inside a chunk thus behave like data cached in the hierarchy of Fig. 7. The qubits in excess of nc can be assigned to slower memory, and by applying a swap gate to a pair of qubits, one qubit is fetched into the cache while the other qubit is swapped out of the cache at the same time (Fig. 8). Then, gate operations on the fetched qubit can be performed in the cache. Before simulating a given quantum circuit, we apply a transpiler that moves all the gates inside a chunk by inserting swap gates. Fig. 9 shows the circuit transpiled from the input circuit shown in Fig. 6.

Fig. 9. Example of a cache-blocked circuit. Four swap gates are added to move all the gates onto qubits lower than nc; now all the gates can be performed without referring to probability amplitudes across chunks.

The circuit consists of a sequence of gates. First, we collect the gates that can be executed in a chunk by reordering them. The execution order of two gates can be swapped if the gates are applied to different qubits, because such gates can be executed independently on a real quantum computer. However, the gates of a multi-qubit gate such as CNOT cannot be reordered if they are laid on the same qubits. Moreover, all the qubits used in a multi-qubit gate must be placed in a chunk at the same time. Thus, multi-qubit gates are used to choose which qubits are stored in a chunk.

put sequence of gates in queue REMAINING
while REMAINING is not empty
    choose QB_CHUNK[nc-1] qubits to store in chunk
    put noiseless swap_chunk gates in queue OUTPUT
    clear QB_BLOCKED[nc-1]
    for gates in REMAINING
        if gate's qubits are in QB_CHUNK and not QB_BLOCKED is set
            put gate in queue OUTPUT
        else
            if multi-qubit gate and some qubits are in QB_CHUNK
                for gate qubits
                    set QB_BLOCKED
            put gate in queue REMAINING_NEXT
    copy REMAINING_NEXT to REMAINING

Listing 3. Transpilation for cache blocking.

Listing 3 describes the algorithm of the cache blocking transpiler. First, it searches for the multi-qubit gates in the sequence of gates and chooses at most nc qubits that are used in these multi-qubit gates. Then, noiseless swap gates are inserted to put the chosen qubits in a chunk. These are special swap gates, named chunk_swap, that are different from the usual swap gate. Next, the gates are reordered by outputting those that can be executed in a chunk. If a multi-qubit gate is laid across chunks, it prevents the remaining gates on the same qubits from being executed; we set these qubits as blocked qubits, and gates with blocked qubits are kept in the queue. We repeat this procedure until the queue of remaining gates is empty. We also insert special gates to specify a section of gates to be blocked in a chunk.
These begin_blocking and end_blocking gates are inserted to delimit a section, and the gates in the section can be simulated independently in each chunk. The algorithm we propose in this section is a heuristic and not an optimal solution, because we apply the cache blocking transpilation on the fly before the actual simulation, so it must not take a long time to find the best transpilation. If we could transpile the circuit outside of the simulation, more elaborate algorithms could further decrease the number of chunk swaps.

B. Data Structure of Parallel Qiskit Aer
As described in Section III, Qiskit Aer has several simulation methods, and these methods are defined in classes holding a quantum register that stores the quantum state. One execution of a quantum circuit, called a "shot", is controlled by the
State class, which has one qubit register class for the corresponding simulation method, as shown in Fig. 10.
Fig. 10. Data structure of the State class in serial Qiskit Aer. A State has one qubit register class that holds the quantum state for the corresponding simulation method.
Fig. 11. Extension of the State class to multiple chunks for parallel Qiskit Aer. A chunk is defined on the basis of the usual qubit register classes.
To parallelize a simulation, we need multiple chunks in a state class; hence, we extend the State class to a State_chunk class that holds multiple chunks in an array, as shown in Fig. 11. Each chunk can be implemented on the basis of the existing qubit register classes with small changes, because we do not have to communicate between chunks when simulating gates other than the swap gates inserted by the cache blocking transpiler. Each qubit register class does not have to know where it is in the whole n-qubit state; it only has to be simulated as if the state had nc qubits.

C. Parallel Implementation
The State_chunk class manages multiple chunks and controls the simulation by handling a sequence of gates. There are 2^{n−nc} chunks in total to simulate an n-qubit circuit, and if we have p processes, we distribute the chunks over all of these processes. We assign continuous chunk indices to the chunks, so the beginning address of the probability amplitudes of each chunk is given by chunk_index × 2^{nc}. To initialize the state to |0...0⟩, we only initialize the chunk with index 0 to |0...0⟩ and set all other chunks to zero.

Listing 4 describes how we simulate with multiple chunks in the State_chunk class. We scan for the three special gates begin_blocking, end_blocking, and chunk_swap inserted by the cache blocking transpiler.

for gate in sequence of gates
    if gate is begin_blocking
        for all chunks
            fetch chunk on GPU if stored on CPU
            for gate in blocking section
                apply gate
            swap out chunk to CPU
    else if gate is chunk_swap
        for all pairs of chunks
            apply swap
    else
        for all chunks
            apply gate

Listing 4. Updating multiple chunks.

The sequence of gates placed between begin_blocking and end_blocking can be simulated independently in each chunk, so we can parallelize the loop over chunks by mapping it to multiple GPUs or to multiple threads on a CPU. We use the memory of the GPUs as a cache to simulate a sequence of gates in the blocking section when some of the chunks are stored in the memory of a CPU. Since the performance of gate simulation is bounded by memory bandwidth, the high memory bandwidth of GPUs is an advantage. However, the bandwidth between the CPU and the GPUs is much narrower than the bandwidth of the CPU memory, so the performance is much worse than that of a CPU if a chunk is transferred to simulate only one gate on a GPU. By using the cache blocking technique, we can exploit the GPU's high memory bandwidth if there are enough gates in the blocking section.

A chunk_swap gate is applied to a pair of chunks. As in the case of the usual swap gate, the chunk_swap gate takes two qubits sq0 and sq1 (where sq0 < sq1) to be swapped, and one or both of the target qubits are larger than or equal to the number of chunk qubits nc. Fig. 12 shows the two cases of chunk pairing: (a) is when sq0 < nc; here, a pair of chunks is selected and half of the amplitudes are swapped between the chunks. (b) is when sq0 ≥ nc; here, four chunks are selected and all the amplitudes of the two chunks chunk 1 and chunk 2 are swapped. In both cases, we swap amplitudes between two chunks, so we select two chunks by referring to sq0 and sq1 and apply chunk_swap. Also, if the selected chunks are on different processes, we have to exchange the chunks by using MPI communication.

Fig. 12. Two cases of chunk swap applications: (a) sq0 < nc: half of the amplitudes in the chunks are swapped; (b) sq0 ≥ nc (where sq0 < sq1): two chunks out of four combined chunks are swapped. The latter case only occurs when we reorder qubits for the output.
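The pairing logic of the two cases can be sketched as follows (our illustration with our own names; the paired chunks then exchange half or all of their amplitudes, via MPI if they live on different processes):

```cpp
#include <cassert>
#include <cstdint>

// Partner chunk for a chunk_swap of qubits sq0 < sq1 with nc chunk qubits.
// Returns the partner chunk index, or the chunk itself if it is untouched.
std::uint64_t chunk_swap_partner(std::uint64_t chunk,
                                 int sq0, int sq1, int nc) {
    std::uint64_t hi = std::uint64_t(1) << (sq1 - nc);
    if (sq0 < nc) {
        // Case (a): sq0 is inside the chunk. Every chunk pairs with the one
        // differing in bit sq1; half of each chunk's amplitudes are swapped.
        return chunk ^ hi;
    }
    // Case (b): both qubits are chunk-index bits. Only the chunks whose
    // (sq1, sq0) bits are 01 or 10 are swapped, and they swap entirely.
    std::uint64_t lo = std::uint64_t(1) << (sq0 - nc);
    bool b0 = (chunk & lo) != 0, b1 = (chunk & hi) != 0;
    return (b0 != b1) ? (chunk ^ lo ^ hi) : chunk;
}
```

With nc = 10, swapping qubits 10 and 11 pairs chunk 1 with chunk 2 of every group of four chunks, exactly the situation drawn in Fig. 12(b), while chunks 0 and 3 are untouched.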
Fig. 13. Chunk allocation in a process with multiple GPUs. If we cannot store all the chunks in the memory of one GPU, we distribute the chunks to multiple GPUs. If some of the chunks cannot be stored in the GPUs, we store them in the memory of the CPU.
We send a chunk to the other process that holds the paired chunk and receive a chunk from that same process. We replace the amplitudes in the chunk with those of the received chunk, but we do not return the chunk to the pairing process. We added a chunk_swap function to the qubit register classes that takes a reference to the pairing qubit register class as a parameter to swap amplitudes between the chunks selected in the State_chunk class. The implementation of this function depends on the simulation method; the paragraphs below describe the implementation of the chunk_swap function for the state vector simulator.

One chunk of 2^{nc} amplitudes is stored in the qubit register class of the state vector simulator. If there are GPUs installed on the computer, we try to store the chunks in the memory of one of the GPUs. If the chunks cannot be stored in the memory of one GPU, we try to store them in two or more GPUs. If the memories of all the GPUs become full, we store the remaining chunks in the main memory of the CPU, as shown in Fig. 13. To swap chunks stored in different memory spaces, i.e., chunks on different GPUs or chunks on a GPU and chunks on a CPU, we copy chunks into a buffer so that the amplitudes can be swapped within the same memory space. As a result, we have to allocate buffer chunks before storing chunks in memory. Since the raw data of a chunk depends on the simulation method, the qubit register classes prepare the chunk data to be sent to the pairing process by using MPI_Send. A buffer to receive a chunk from the pairing process by using MPI_Recv is also prepared by the qubit register classes. For the state vector simulator, a chunk itself is sent to the pairing process, and a buffer allocated in the memory space is used as the receive buffer.

V. PERFORMANCE EVALUATION
A. Computer Environment
We compared the simulation times of a state vector simulator implementing the cache blocking technique with those of the simulator without the technique on a GPU cluster (see Table I) [22], [23]. Each node of the cluster had six GPUs, and each GPU had 16GB of memory, so we could simulate up to 29 qubits on a single GPU in double precision. The size of the main memory was 512GB, on which we could simulate up to 34 qubits. Moreover, by storing chunks in the memories of both the GPUs and the CPU as shown in Fig. 13, we could simulate up to 35 qubits on a single node.

TABLE I
COMPUTER ENVIRONMENTS USED IN THE EVALUATION OF THE STATE VECTOR SIMULATOR

Cluster node: IBM Power System AC922
CPU: POWER9
Number of sockets per node: 2
Number of cores per socket: 21
CPU memory size: 512GB
GPU: NVIDIA Tesla V100
Number of GPUs per node: 6
GPU memory size: 16GB
CPU - GPU interconnect: NVLink2
Interconnect: Infiniband EDR
OS: Red Hat Enterprise Linux Server 7.6
Compiler: GCC 8.3.0
CUDA Toolkit: CUDA 10.1
MPI: IBM Spectrum MPI 10.3.1

The base implementation without the cache blocking technique is described in [11]; in it, chunks are exchanged whenever the target qubit of a gate is larger than or equal to the chunk qubit count. The base implementation was implemented as the set of a
State and qubit register classes shown in Fig. 10, so parallelization was managed inside the qubit register class. We implemented the kernels of the gates by using Thrust in both implementations. For the evaluation, we prepared two quantum circuits. The first is Quantum Volume [24], a randomly generated quantum circuit that uses all the given qubits, with all the qubits entangled by CNOT gates. The Quantum Volume circuit itself does not have a mathematical meaning but is useful for evaluating quantum systems. The other circuit is the Quantum Fourier Transform (QFT), one of the most important quantum algorithms, which occurs in many quantum applications. As in Quantum Volume, there are entanglements over all the qubits, but the other calculations are multiplications of diagonal unitary matrices. On the state vector simulator, diagonal matrix multiplications can be calculated independently on each probability amplitude, so the data exchange between chunks in QFT is much less than in Quantum Volume.
B. Single Node Performance Comparison
Fig. 14 and Fig. 15 show the measured simulation times of the Quantum Volume and QFT circuits on a single node of an IBM Power System AC922 (Table I) equipped with six NVIDIA Tesla V100 GPUs. They also show the time on a CPU for comparison. While the simulation time on the CPU grows smoothly, we can see steps in the simulation times of the GPU implementations (baseline and w/cache blocking). The first step, from 25 to 29 qubits, shows the simulation times on one GPU; there is no data exchange between GPUs, so the simulation time with cache blocking is not plotted because it is similar to the baseline. For 28 and 29 qubits in Fig. 15, we also plotted the simulation time using multiple GPUs with cache blocking. We obtained better performance than with one GPU even though chunks have to be exchanged between GPUs; because few data exchanges occur, the calculation time saved by parallelizing over multiple GPUs outweighs the overhead of data exchange. The second step, from 30 to 32 qubits, shows the simulation times when using six GPUs. Here, the baseline performance for Quantum Volume deteriorates because of data exchange overheads; with cache blocking, on the other hand, we can decrease the overheads of data exchange, and we measured better performance. The performance of QFT in the second step behaves differently: because less data exchange occurs in QFT, the performance of the baseline does not deteriorate, so cache blocking offers only a small advantage here. The third step, from 33 to 35 qubits, shows the simulation times when chunks are stored in the memories of the six GPUs and the CPU.

Fig. 14. Single node simulation time of Quantum Volume (depth=10) measured on an IBM Power System AC922 with 6 GPUs. Only one GPU is used for ≤ 29 qubits, and all data fits in the memory of six GPUs for ≤ 32 qubits. Both GPU and CPU memory are used for ≥ 33 qubits.
Since the CPU has 512 GB of memory, we can simulate up to 34 qubits on it, but by storing chunks on both the GPUs and the CPU, we can simulate 35 qubits. The cache blocking technique reduces the data exchange overhead, offering a speedup over the baseline of about 20% for Quantum Volume and 80% for QFT. In the third step, the baseline implementation calculates chunks on the CPU when they are stored in CPU memory; the cache blocking implementation instead transfers chunks to the GPU and calculates the sequence of gates there. This use of the GPU also leads to a performance gain.

C. Scalability Comparison
Next, we evaluated our cache blocking implementation on up to sixteen parallel nodes of an IBM Power System AC922. Fig. 16 and Fig. 17 show the measured simulation times of two Quantum Volume circuits.

Fig. 15. Single node simulation time of QFT measured on IBM Power System AC922 with 6 GPUs. Only one GPU is used for ≤ 29 qubits, and all data fits in the memory of the six GPUs for ≤ 32 qubits. Both GPU and CPU memory are used for ≥ 33 qubits.

Fig. 16. Simulation time of 30-qubit Quantum Volume (depth=10) on IBM Power System AC922 (strong scaling). For 1 node, we used all 6 GPUs, but we used only 1 GPU per node for the larger node counts.

Fig. 16 shows a comparison of 30-qubit Quantum Volume calculated on two to sixteen nodes with only one GPU per node. Because the required calculation is relatively small, the simulation time is dominated by data communication between nodes. When we add nodes, the simulation time becomes worse in both implementations, but the cache blocking implementation shows a smaller performance degradation. Fig. 17 compares the times of a 34-qubit Quantum Volume, for which chunks are stored on both the GPUs and the CPU in the one- and two-node cases. Because this circuit involves enough calculation, the effect of cache blocking is relatively larger, and the simulation time with cache blocking is significantly shorter than that of the baseline.

Fig. 17. Simulation time of 34-qubit Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs per node (strong scaling). We used 6 GPUs per node in all cases; for 1 node and 2 nodes, chunks were also stored in the memory of the CPU.

Fig. 18.
Simulation time of Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs and 2^30 amplitudes per node (weak scaling), i.e., 30 qubits for 1 node, 31 qubits for 2 nodes, and so on.

We also tested weak scalability by fixing the per-node data size. Fig. 18 and Fig. 19 show weak scaling on up to sixteen parallel nodes of an IBM Power System AC922. In Fig. 18, the per-node array size is fixed at 2^30 amplitudes and six GPUs are used. This size is relatively small, so the performance is significantly affected by the data transfer time between nodes; the baseline implementation shows a large scalability degradation, whereas the cache blocking implementation scales better. In Fig. 19, there are 2^33 amplitudes per node, which requires both the GPUs and the CPU to store. In this case as well, the cache blocking implementation shows very good scalability.

VI. SUMMARY
We proposed optimization techniques for parallel quantum computing simulations. We decreased data exchanges by moving all the gates of the circuits to lower qubits by inserting noiseless swap gates. This resembles cache blocking on classical computers. Also, by combining chunk-based parallelization with cache blocking, we found that we can parallelize simulators with small changes, because we can reuse the serial simulation code except for the swap gates between chunks. This contributes to the productivity of the simulation software.

Fig. 19. Simulation time of Quantum Volume (depth=10) on IBM Power System AC922 with 6 GPUs and 2^33 amplitudes per node (weak scaling), i.e., 33 qubits for 1 node, 34 qubits for 2 nodes, and so on.

We implemented a parallel state vector simulator in the open source quantum computing simulation framework Qiskit Aer. We used the Thrust library instead of writing CUDA kernel code to accelerate simulation on the GPUs. Since Qiskit Aer is written in C++ and based on the STL, Thrust extends Qiskit Aer to support GPUs naturally, which contributes to productivity and portability. We also parallelized the state vector simulator with MPI for hybrid parallel computers.

By applying the parallelization and cache blocking techniques described in this paper, we improved the scalability and simulation performance on a hybrid parallel computer. We improved both strong and weak scaling compared with previous implementations without cache blocking. The main contribution to the improvement is the decrease in data exchanges between chunks, especially those on different nodes of a distributed parallel computer. Moreover, by storing chunks in both the memory of the GPUs and the memory of the CPU, we can use the GPU memory as a cache to accelerate simulations using the chunks stored on the CPU.

We are planning to make this implementation public in the Qiskit Aer package.
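The swap-based blocking summarized above can be illustrated with a small state-vector example. This is a numpy sketch of the idea only; the function names and structure are ours, not Qiskit Aer's implementation:

```python
import numpy as np

# Sketch of swap-based cache blocking. Amplitude index bits are qubits;
# a 1-qubit gate on qubit q < chunk_qubits only pairs amplitudes within
# one chunk of 2**chunk_qubits amplitudes. For a gate on a higher qubit,
# we first swap that qubit down below the chunk boundary, apply the gate
# locally, then swap back, so the swaps are the only cross-chunk motion.

def apply_1q_gate(state, n, q, mat):
    """Apply a 2x2 matrix `mat` to qubit q of an n-qubit state vector."""
    psi = state.reshape(2 ** (n - q - 1), 2, 2 ** q)  # middle axis = bit q
    state[:] = np.einsum('ab,ibj->iaj', mat, psi).reshape(-1)

def apply_swap(state, n, q0, q1):
    """Noiseless swap of qubits q0 and q1: a pure relabeling of amplitudes."""
    psi = state.reshape([2] * n)                  # axis k holds bit n-1-k
    psi = np.swapaxes(psi, n - 1 - q0, n - 1 - q1)
    state[:] = psi.reshape(-1)

n, chunk_qubits = 5, 3
X = np.array([[0, 1], [1, 0]], dtype=complex)     # Pauli-X gate
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                                    # |00000>

# X on qubit 4 lies outside the chunk: bring it down to qubit 0 first.
apply_swap(state, n, 4, 0)
apply_1q_gate(state, n, 0, X)
apply_swap(state, n, 4, 0)

assert abs(state[16] - 1.0) < 1e-12               # |10000>, as if X hit qubit 4
```

In the distributed simulator, only swaps whose qubits cross the chunk boundary move data between chunks (and hence between GPUs or nodes); the run of gates that follows executes entirely inside each chunk, exactly as in cache blocking on classical CPUs.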
We are also planning to parallelize the rest of the simulation methods in Qiskit Aer and to evaluate them on hybrid parallel computers. To increase the number of qubits that can be simulated, we can add slower storage, such as NVMe, to the node, or network storage connected to the parallel cluster. The cache blocking technique can also improve performance here, by fetching chunks from the slower storage to faster local storage, performing the operations there, and using blocking to decrease data movements from the slower storage.

REFERENCES

[1] A. Aspuru-Guzik, A. D. Dutoi, P. J. Love, and M. Head-Gordon, "Simulated quantum computation of molecular energies,"
Science, vol. 309, no. 5741, pp. 1704–1707, 2005.
[2] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, "Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets," Nature, vol. 549, no. 7671, p. 242, 2017.
[3] P. O'Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding et al., "Scalable quantum simulation of molecular energies," Physical Review X, vol. 6, no. 3, p. 031007, 2016.
[4] Qiskit. (2017, May) Qiskit: An open-source framework for quantum computing. [Online]. Available: https://qiskit.org/
[5] M. Corporation. (2018, May) The q
International Conference on Cloud Computing and Security. Springer, 2017, pp. 85–97.
[8] T. Häner and D. S. Steiger, "0.5 petabyte simulation of a 45-qubit quantum circuit," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2017, p. 33.
[9] T. Jones, A. Brown, I. Bush, and S. Benjamin, "QuEST and high performance simulation of quantum computers," arXiv preprint arXiv:1802.08032, 2018.
[10] M. Smelyanskiy, N. P. Sawaya, and A. Aspuru-Guzik, "qHiPSTER: the quantum high performance software testing environment," arXiv preprint arXiv:1601.07195, 2016.
[11] J. Doi, H. Takahashi, R. Raymond, T. Imamichi, and H. Horii, "Quantum computing simulator on a heterogenous hpc system," in Proceedings of the 16th ACM International Conference on Computing Frontiers, ser. CF '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 85–93. [Online]. Available: https://doi.org/10.1145/3310273.3323053
[12] T. Yan. (2019) Development of fast cpu and gpu based quantum computer simulator for cirq. [Online]. Available: https://t.co/biMOsSh6fk?amp=1
[13] T. Jones, A. Brown, I. Bush, and S. Benjamin, "QuEST and high performance simulation of quantum computers," Scientific Reports, vol. 9, 2019.
[14] Qiskit. (2019, Jan.) Qiskit Aer: A high performance simulator framework for quantum circuits. [Online]. Available: https://qiskit.org/aer
[15] H. De Raedt, D. Willsch, M. Nocon, N. Yoshioka, N. Ito, S. Yuan, and K. Michielsen, "Massively parallel quantum computer simulator, eleven years later," Computer Physics Communications, vol. 237, Nov. 2018.
[16] X.-C. Wu, S. Di, E. M. Dasgupta, F. Cappello, H. Finkel, Y. Alexeev, and F. T. Chong, "Full-state quantum circuit simulation by using data compression," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3295500.3356155
[17] B. Villalonga, D. Lyakh, S. Boixo, H. Neven, T. Humble, R. Biswas, E. Rieffel, A. Ho, and S. Mandrà, "Establishing the quantum supremacy frontier with a 281 pflop/s simulation," Quantum Science and Technology, Mar. 2020.
[18] A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, "Open quantum assembly language," arXiv preprint arXiv:1707.03429, 2017.
[19] (2019, Jul.) Thrust - parallel algorithms library. [Online]. Available: https://thrust.github.io
[20] N. Corporation. (2007, Jul.) CUDA toolkit. [Online]. Available: https://developer.nvidia.com/cuda-toolkit
[21] I. Corporation. Intel Threading Building Blocks. [Online]. Available: https://software.intel.com/en-us/tbb
[22] S. Vetter, R. Nohria, and G. Santos, "IBM Power System AC922 technical overview and introduction," An IBM Redpaper publication.
[23] Phys. Rev. A, vol. 100, p. 032328, Sep. 2019. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.100.032328