Cryo-RALib -- a modular library for accelerating alignment in cryo-EM
Szu-Chi Chung, Cheng-Yu Hung, Huei-Lun Siao, Hung-Yi Wu, Wei-Hau Chang, I-Ping Tu
CCRYO-RALIB - A MODULAR LIBRARY FOR ACCELERATING ALIGNMENT IN CRYO-EM
Szu-Chi Chung (cid:63) , Cheng-Yu Hung (cid:63) , Huei-Lun Siao (cid:63) , Hung-Yi Wu (cid:63) , Wei-Hau Chang † , I-Ping Tu (cid:63)(cid:63) Institute of Statistical Science, Academia Sinica. † Institute of Chemistry, Academia Sinica.
ABSTRACT
Thanks to automated cryo-EM and GPU-accelerated process-ing, single-particle cryo-EM has become a rapid structuredetermination method that permits capture of dynamicalstructures of molecules in solution, which has been recentlydemonstrated by the determination of COVID-19 spike pro-tein in March, shortly after its breakout in late January 2020.This rapidity is critical for vaccine development in responseto emerging pandemic. This explains why a 2D classifica-tion approach based on multi-reference alignment (MRA)is not as popular as the Bayesian-based approach despitethat the former has advantage in differentiating subtle struc-tural variations under low signal-to-noise ratio (SNR). Thisis perhaps because that MRA is a time-consuming processand a modular GPU-acceleration package for MRA is stilllacking. Here, we introduced a library called
Cryo-RALib that contains GPU-accelerated modular routines for acceler-ating MRA-based classification algorithms. In addition, weconnect the cryo-EM image analysis with the python datascience stack so as to make it easier for users to performdata analysis and visualization. Benchmarking on the Tai-Wan Computing Cloud (TWCC) container shows that ourimplementation can accelerate the computation by one orderof magnitude. The library has been made publicly availableat https://github.com/phonchi/Cryo-RAlib.
Index Terms — Computational biology, cryo-EM, GPUacceleration, multiple reference alignment
1. INTRODUCTION
In contrast to X-ray crystallography, cryo-EM is a methodthat is amenable to the structural determination of proteinsin non-crystalline state. With the instrument automation andadvances in algorithms, single particle cryo-EM has becomea mainstream tool to solve 3D structures of molecules at near-atomic resolution. It is noted that as the automated data col-lection is becoming mature, the equipment that is up-running24/7 can generate 1000 to 4000 micrographs per day. There-fore, the GPU hardware has been invoked to accelerate thecryo-EM workflow in different stages to meet the demand forprocessing a large volume of data. The GPU is now usedin various steps in data processing including movie align- ment [1], contrast transfer function estimation [2], identifica-tion of particles within micrographs [3], Bayesian 2D classifi-cation algorithm like RELION [4] or cryoSPARC [5] and 3Drefinement algorithms [4,5]. Even with GPU acceleration, thepopular RELION 2D classification usually takes several daysto finish the classification task. Evidently, image classifica-tion has become the bottleneck in the workflow. We and oth-ers notice that the computational complexity of the Bayesianapproach is much higher than the one based on multireferencealignment (MRA). The complexity is at least in the order of O ( nL log L ) compared with O ( nL ) of MRA, where n isthe sample size and L is the pixel number for one directionof the particle image, and t is the translational shift searchrange [6]. Besides, to further extend to atomic resolution,one must carefully deal with heterogeneity within the imagedata. When such is concerned, an MRA-based method hasadvantage because it can better differentiate subtle structuredifferences under low SNR compared with the Bayesian ap-proach [7, 8].
2. PREVIOUS WORKS
2D and 3D classification algorithms based on the MRA orBayesian approach are standard steps in cryo-EM workflow[6]. The 2D algorithm is depicted as follows. We use 2D tosimplify the illustration while the extension to 3D is straight-forward. First, K initial 2D average images are randomly gen-erated or provided by the user. Second, the particle images arecompared with those K initials as references by tuning param-eters of shifts and in-plane rotations. The reference that max-imizes the cross-correlation (CC) is to be recorded for eachimage. The new 2D averages can then be updated with thealigned images. After multiple iterations of refinement on pa-rameters, the class assignment of each image along with itsshift and rotation parameters to a particular 2D class is usu-ally unambiguous. The popular Bayesian approach like RE-LION [4] is slightly different in that it uses fuzzy assignmentsfor both alignment parameters and classes instead of the bestone. The 2D averages are then obtained through weighted av-erages over all possible orientations and classes. Although theBayesian approach is very powerful in 3D refinement [4], thealgorithms based on MRA have been shown to provide betterperformance in 2D classification. For instance, CL2D [9] can a r X i v : . [ q - b i o . Q M ] J a n educe the influence of noise and avoid the unbalance classphenomenon, ISAC [10] is an algorithm that ensures robustclasses to be output by repeated tests, while Prime [11] can es-cape the local minimum by adopting stochastic hill-climbing.There are many implementation strategies for the alignmentstep in MRA [12]. Among these implementations, re-sampleto polar coordinate (RPC) has been shown to have the bestperformance under low SNR so that we use it in this work.It is noted that most of the works do not exploit GPU in theMRA step. However, since the calculation of similarities isindependent of particles, 2D reference and orientation, it ispossible to calculate them in parallel using GPU. Buffer
Rank0
GPUMemoryGPU
HDF5 /BDB
PCIe
CPUCPUMemory Buffer
Rank n
GPUMemoryGPU PCIeCPUCPUMemory Buffer
Rank n+1
PCIeCPUCPUMemory BufferPCIeCPUCPUMemory BufferPCIeCPUCPUMemory
Rank n+2 Rank N
Fig. 1 : The architecture is a summary of the coding frame-work underlying the GPU-accelerated ISAC [13]. We employthis architecture to implement MRA.Fortunately, RPC has been GPU-accelerated in reference-free alignment (RFA) of GPU ISAC [13], whose code hasbeen released in July 2020. We summarize the architecture ofthe coding framework underlying the GPU-accelerated ISACin Fig. 1. This architecture is implemented using messagepassing interface (MPI) and Nvidia CUDA and the process-ing flow is described as follows. First, the particle imagesare read in parallel and immediately pre-processed by all N MPI processes. Second, every process is sent to the first n MPI processes that contain GPU resources through MPI.Third, the first n MPI processes perform the alignment al-gorithm. Finally, the first n MPI processes send the databack to all N MPI processes and all the processes write themetadata and the transformed images to the disk. Noticethat the main task in RPC procedure is to compute the ro-tation cross-correlation function. The function is defined as c ( φ ) = (cid:82) r r (cid:82) π x ( r, θ ) y ( r, θ + φ ) | r | dθ dr . To speed up thehost-to-device memory transfer, the authors in [13] employthe texture memory to store the image as well as its metadata,and a floating array in global memory is used to store the im-ages after polar conversion, Fast Fourier Transform (FFT),Inverse FFT (IFFT) or alignment. The results of the CC com-putation are stored in a table which holds the CC informationof all (mirrored) images with all reference images togetherwith all shifts. Each row in the table stores all CC informa-tion for one image.
3. THE MRA FRAMEWORK
In this work, we exploit the parallelism and architecture men-tioned in Section 2. Our framework is built upon the general-purpose cryo-EM processing library called EMAN2 [14] andthe MRA is based on GPU ISAC [13]. The architecture anddata layout of MRA we use is from the GPU-accelerated RFAin [13]. We describe the acceleration of MRA in this sectionand provide a simple link between cryo-EM image process-ing and the python data science stack to enable rapid devel-opment of GPU-accelerated operations which is described inSection 4. The programs are integrated into a library called
Cryo-RALib which is accessible to the cryo-EM community.
Algorithm 1:
GPU Accelerated MRA
Input:
Set of particle images X Set of reference images Y Output:
Set of class averages Y (cid:48) repeat for each batch do Transfer X and Y in this batch from hostmemory into texture memory in device. do in parallel Convert y i to polar coordinates and applyFFT to obtain y (cid:48) i . for each shift do do in parallel Convert x i to polar coordinates withgiven shift and apply FFT to obtain x (cid:48) i . Compute CC between x (cid:48) i and y (cid:48) i and store CC into the table. Apply inverse FFT to CC table. do in parallel Find the largest CC value for each imageusing reduction. Find the correspondingalignment parameters and apply them toeach image. Compute new Y (cid:48) from aligned images. Transfer averages and parameters to host. until convergence ; Return Y (cid:48) and parameters.Our GPU implementation follows the framework of CPUimplementation from EMAN2, as shown in Algorithm 1.In other words, we provide a GPU version of MRA fromEMAN2. To do so, we made use of existing libraries withoptimized CUDA primitives, the polar coordinate conversion,FFT, IFFT and applying alignment parameters are from GPUISAC [13]. Our CC computation of F F T ( x ) (cid:48) × F F T ( y ) inMRA is similar to the one used in RFA of [13]. The differ-ence is that each block, which processes one particle image,now computes its CC values with all the given referencemages instead of single reference. This is done by addinganother layer of parallelism for the reference images in they-dimension of GPU grid.
11 3 8 1 5
Block-width Stride
11 7 8 12 7 6 12
Block-width Stride
11 18 22 25 14
13 19 (a)(b)
Fig. 2 : The reduction strategy. (a) Initialize the shared mem-ory. (b) General reduction on the shared memory.
HDF5/BDB
EMAN2/GPU ISAC
CuPy
Numpy Jupyter Notebook/PyData Stack
Developer AP IMPI/C++/Cuda
Custom Array
Wrapper
Pandas
Fig. 3 : The software stack of our framework. The shadedblocks represent the interfaces with other packages.The parallel reduction scheme we employed to find themaximum value of each row in the table is shown in Fig. 2.Here, we are not deciding how many threads to launch basedon the data size since the length of each row is very large andcan easily exceed the maximum number of threads allowedin a block. Instead, we launch a fixed number of threads andlet each thread loop through memory, computing partial re-duction operations on each element to initialize the memory.Finally, the standard parallel reduction is employed to reducethe initialized memory to find the largest element and its in-dex. It is noted that our approach also enables coalesced ac-cess of the memory within each block during the initializationwhich can improve the memory access speed. After findingthe index for each image, the corresponding alignment pa-rameters are calculated for each image in parallel with GPUkernel. Finally, the new class averages are calculated throughCuPy [15] using our custom class depicted in next section. tx t y (a) class (b) t-SNE 1 t - S N E (c) t-SNE 1 t - S N E (d) Fig. 4 : Histogram of (a) x and y shift, (b) class assignments.t-SNE plot of (c) all particles before alignment, (d) after align-ment with each class labeled in different colors. From (c),(d),we can perceive that the particles are better aligned after MRAas it groups nearest neighbors into the same class.
4. PROPOSED SOFTWARE STACK
The software stack of our library is shown in Fig. 3. Popu-lar computational cryo-EM packages are built around C andC++ [4, 14]. On the other hand, the recent trend of pack-ages designed for data analysis and data visualization is tobuild upon Python libraries. However, there is no convenientway to perform cryo-EM data analysis and visualization inthe Python environment. In our framework, the functionalityof EMAN2 and the CUDA code is made available throughseveral well-defined interfaces. We used CuPy to implementthe required interoperability layer. To compute the averagesand calculate the statistics of the transformed images, we im-plement a custom class as the GPU array interface. Thisclass encapsulates the cuda array interface which is cre-ated for interoperability between different GPU arrays in var-ious python projects. With this class, we can easily accessthe GPU buffer of the transformed images and calculate themin Python to reduce the need to transfer data between hostand device as much as possible. In the Python environment,we use NumPy [16] and CuPy array to represent the particleimages; it can be transformed into other popular computer vi-sion or machine learning libraries for analysis [17]. Finally,the metadata is represented as Pandas [18] dataframe for ex-ploratory data analysis. The images and metadata are storedinto either HDF5 or Berkeley DB (BDB) files. Developer oruser can use the Jupyter notebook interface with our API asa pipeline for common data analysis. Several notebooks arencluded in the library to show that basic image operationslike rotation and shift can be easily accelerated and the dataanalysis/visualization can be performed on the platform.The processing pipeline of our framework is illustrated asfollows. First, the data source (cryo-EM images and metadatastores in HDF5, BDB or text file) are to be read in through thePython wrapper in parallel. Second, the user performs pre-process or exploratory data analysis in the Jupyter notebook.Third, the alignment and clustering are performed by callingthe MRA script. Finally, the transformed images and meta-data can be read into a notebook for further analysis or visu-alization. Fig. 4 shows an example of exploratory data analy-sis. With the plotted 1D and 2D histogram of shifts and classassignments from MRA procedure are plotted, we can thenanalyze the performance on particle picking or diagnose theproblems like preferred orientations or the attraction by largeclusters [9]. In addition, we use 2SDR [19] and t-SNE [20]to plot the 2D embedding of all particles as shown in Fig. 4(c),(d). We can then get an idea about the alignment progressand the performance of 2D classification algorithm.
20 30 40 50 60 70 80 90 100 number of references e l a p s e d t i m e ( s ) (a) number of shift e l a p s e d t i m e ( s ) (b) number of GPU e l a p s e d t i m e ( s ) (c) number of GPU e l a p s e d t i m e ( s ) (d) Fig. 5 : (a) Computation time of MRA for a different num-ber of references. (b) Computation time of RFA for differentsearch ranges of shift. (c) The scaling behavior of MRA. (d)The scaling behavior of RFA.
5. THE EXPERIMENT RESULTS
Comparisons between different implementations are con-ducted on TaiWan Computing Cloud (TWCC). The programswere running in the TWCC c.super instance, with 4 coresof Xeon Gold 6154 processors and an Nvidia Tesla V100GPU. The benchmark dataset of Ribosome 80s [21] from RELION is used with height and width both down-sampledto 90 pixels. We first compared the CPU implementation ofMRA from EMAN2 as shown in Fig. 5(a). The search rangein the x and y direction is set to 3 and the radius of particlesis 36 pixels. We can perceive that the speedup is × to × with different reference numbers. We then comparedthe computation time of CPU implementation of RFA fromEMAN2 as shown in Fig. 5(b). The search step is set to 1and the speedup is . × to . × with different 2D shifts. Wealso compared with the RFA in GPU ISAC and the speedupis about . × . Fig. 5(a) presents the performance of ourimplementation increase that is gained over multiple-GPUon the c.4xsuper instance with 8 GPUs. It can be observedthat our implementation scales well from 1 to 4 GPUs andprovides near linear speedup compared to single GPU. Onthe other hand, when the number of GPU is greater than 4the speedup is saturated because the time is dominated by thedata transfer.
6. CONCLUSION AND FUTURE WORKS
With continuing enhancement of processing algorithms, cryo-EM has become a progressively powerful and efficient tech-nique to solve structures of molecules. Currently, the perfor-mance of the popular GPU-accelerated Bayesian approachfor 2D Classification in cryo-EM still poses a bottleneck inthe processing as the computation time and the classificationquality are sub-optimal. As a result, MRA-based approachesrepresent an option as it can better reveal structural variationsat the 2D level. To the best of our knowledge, until nowMRA-based methods are mostly implemented with CPU-supported algorithms and thereby time-consuming. In thiswork, a modular GPU-accelerated alignment library termed
Cryo-RALib is introduced. The library contains a set of align-ment routines that can be used to accelerate 2D classificationalgorithms. We also provide several well-defined interfacesto connect the cryo-EM image analysis with the python datascience stack to help users to perform analysis and visual-ization more easily. Our benchmarking results on TWCCshow this implementation can accelerate the alignment pro-cedure by one order of magnitude. In the future, it would beinteresting to expand the library to accommodate other cryo-EM packages into this framework, and to provide a friendlyenvironment for users and developers as well.
Acknowledgement
The authors gratefully acknowledge Ryan Jeng from Nvidiaand the NCHC GPU Hackathon for the help on our researchproject. The authors also acknowledge the open-sourceprojects that we built upon, especially the EMAN2 and GPUISAC 2.3.2. Finally, the authors acknowledge the insightfulsuggestions from Dr. Stefan Raunser. . REFERENCES [1] Shawn Q Zheng, Eugene Palovcak, Jean-Paul Armache,Kliment A Verba, Yifan Cheng, and David A Agard,“Motioncor2: anisotropic correction of beam-inducedmotion for improved cryo-electron microscopy,”
NatureMethods , vol. 14, no. 4, pp. 331–332, 2017.[2] Kai Zhang, “Gctf: Real-time ctf determination and cor-rection,”
Journal of structural biology , vol. 193, no. 1,pp. 1–12, 2016.[3] Thorsten Wagner, Felipe Merino, Markus Stabrin,Toshio Moriya, Claudia Antoni, Amir Apelbaum, Phi-line Hagel, Oleg Sitsel, Tobias Raisch, Daniel Prum-baum, et al., “Sphire-cryolo is a fast and accurate fullyautomated particle picker for cryo-em,”
Communica-tions Biology , vol. 2, no. 1, pp. 1–13, 2019.[4] Sjors HW Scheres, “Relion: implementation of aBayesian approach to cryo-em structure determination,”
Journal of Structural Biology , vol. 180, no. 3, pp. 519–530, 2012.[5] Ali Punjani, John L Rubinstein, David J Fleet, and Mar-cus A Brubaker, “cryosparc: algorithms for rapid un-supervised cryo-em structure determination,”
NatureMethods , vol. 14, no. 3, pp. 290–296, 2017.[6] Amit Singer and Fred J Sigworth, “Computationalmethods for single-particle electron cryomicroscopy,”
Annual Review of Biomedical Data Science , vol. 3, pp.163–190, 2020.[7] Jiayi Wu, Yong-Bei Ma, Charles Congdon, Bevin Brett,Shuobing Chen, Yaofang Xu, Qi Ouyang, and YoudongMao, “Massively parallel unsupervised single-particlecryo-em data clustering via statistical manifold learn-ing,”
PLoS ONE , vol. 12, no. 8, pp. e0182130, 2017.[8] S.C. Chung, H.H. Lin, P.Y. Niu, S H. Huang, I.P. Tu,and W.H. Chang, “Pre-pro is a fast pre-processor forsingle-particle cryo-em by enhancing 2d classification,”
Communications Biology , vol. 3, no. 1, pp. 1–12, 2020.[9] C.O. Sorzano, J.R. Bilbao-Castro, Y. Shkolnisky, M. Al-corlo, R. Melero, G. Caffarena-Fern´andez, M. Li, G. Xu,R. Marabini, and J.M. Carazo, “A clustering approach tomultireference alignment of single-particle projectionsin electron microscopy,”
Journal of Structural Biology ,vol. 171, no. 2, pp. 197–206, 2010.[10] Zhengfan Yang, Jia Fang, Johnathan Chittuluru, Fran-cisco J Asturias, and Pawel A Penczek, “Iterative sta-ble alignment and clustering of 2d transmission electronmicroscope images,”
Structure , vol. 20, no. 2, pp. 237–247, 2012. [11] Cyril F Reboul, Frederic Bonnet, Dominika Elmlund,and Hans Elmlund, “A stochastic hill climbing approachfor simultaneous 2d alignment and clustering of cryo-genic electron microscopy images,”
Structure , vol. 24,no. 6, pp. 988–996, 2016.[12] Laurent Joyeux and Pawel A Penczek, “Efficiency of 2dalignment methods,”
Ultramicroscopy , vol. 92, no. 2,pp. 33–46, 2002.[13] Fabian Schoenfeld, “GPU ISAC (2.3.2),” http://sphire.mpg.de/wiki/doku.php?id=gpu_isac/ , 2020.[14] Guang Tang, Liwei Peng, Philip R Baldwin, Deepin-der S Mann, Wen Jiang, Ian Rees, and Steven J Ludtke,“Eman2: an extensible image processing suite for elec-tron microscopy,”
Journal of Structural Biology , vol.157, no. 1, pp. 38–46, 2007.[15] ROYUD Nishino and Shohei Hido Crissman Loomis,“Cupy: A numpy-compatible library for nvidia gpu cal-culations,” , p. 151, 2017.[16] Charles R Harris, K Jarrod Millman, St´efan J van derWalt, Ralf Gommers, Pauli Virtanen, David Cour-napeau, Eric Wieser, Julian Taylor, Sebastian Berg,Nathaniel J Smith, et al., “Array programming withnumpy,”
Nature , vol. 585, no. 7825, pp. 357–362, 2020.[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duches-nay, “Scikit-learn: Machine learning in Python,”
Jour-nal of Machine Learning Research , vol. 12, pp. 2825–2830, 2011.[18] The pandas development team, “pandas-dev/pandas:Pandas,” Feb. 2020.[19] S.C. Chung, S.H. Wang, P.Y. Niu, S.Y. Huang, W.H.Chang, and I.P. Tu, “Two-stage dimension reductionfor noisy high-dimensional images and application tocryo-em,”
Annals of Mathematical Sciences and Appli-cations , vol. 5, no. 2, pp. 283–316, 2020.[20] Laurens van der Maaten and Geoffrey Hinton, “Visu-alizing data using t-sne,”
Journal of machine learningresearch , vol. 9, no. Nov, pp. 2579–2605, 2008.[21] Wilson Wong, Xiao-chen Bai, Alan Brown, Israel S Fer-nandez, Eric Hanssen, Melanie Condron, Yan Hong Tan,Jake Baum, and Sjors HW Scheres, “Cryo-em structureof the plasmodium falciparum 80s ribosome bound tothe anti-protozoan drug emetine,”