A $4,000 Workstation for Mammalian Genome Assembly with Long Reads
Hikoyu Suzuki and Norichika Ogata*
Nihon BioData Corporation, 3-2-1 Sakado, Takatsu-ku, Kawasaki, Kanagawa 213-0012, Japan
Abstract.
Long-read sequencing has enabled the de novo assembly of several mammalian genomes, but at a high computing cost. Here, we demonstrate de novo assembly of a mammalian genome from long reads on an efficient and inexpensive workstation.
Keywords: Single molecule real-time sequencing, Genome assembly, Computing cost, Inexpensive, Workstation
1. Introduction
Long reads derived from single molecule real-time (SMRT) sequencing provide useful information for high-quality genome assembly. Because their sequencing error rate is quite high, long reads were initially used in combination with accurate Illumina reads (hybrid assembly). Later, however, the success of non-hybrid assembly methods was reported [1-3]. These non-hybrid methods yield high-quality genome assemblies, but their computing costs are high. Recently, non-hybrid assembly methods have been applied to the characterization of mammalian genomes [4-7]. These studies presumably consumed abundant computing resources. For example, “an NFS-based computing clusters” was used [6]. Other studies acknowledged “the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai” [4] or “used the computational resources of the Biowulf system at the National Institutes of Health” [7]. It was also said that “Google used 405,000 CPU hours so [sic] assemble a human genome from PacBio data” [8]. Today, 405,000 CPU hours on a cloud computing service would cost over 50,000 USD. Clearly, inexpensive long-read sequencing technologies, together with assembly methods that do not require clustered computing resources, would allow individual laboratories to obtain high-quality genome assemblies. Canu [9], the latest assembler for long reads, was expected to require far fewer computing resources. Here, we built an inexpensive workstation and demonstrated de novo assembly of a mammalian genome with Canu.
2. Materials and Methods
We built a dual-processor workstation from the following parts (Table 1): two CPUs, Xeon E5-2620 v4 (8 cores, 2.1 GHz, 20 MB cache, 45,996 JPY, Intel, Santa Clara, CA); a motherboard, X10DAi (SSI-EEB, dual LGA2011-v3, Intel C612 chipset, 62,531 JPY, Super Micro Computer Inc., San Jose, CA); a video card, GF-GTX750TI-LE2GHD (640 cores, 1020 MHz, 2 GB, GDDR5, 9,826 JPY, Kuroutoshikou, Japan); eight RAM modules, KVR21R15D4K4/128 (32 GB, DDR4, 2133 MHz, ECC, 24,980 JPY, Kingston Technology, Fountain Valley, CA); an HDD, MD04ACA200 (3.5 inch, 2 TB, 7200 rpm, 128 MB cache, SATA 6 Gbps, 7,150 JPY, TOSHIBA, Tokyo, Japan); two HDDs, MD04ACA400 (3.5 inch, 4 TB, 7200 rpm, 128 MB cache, SATA 6 Gbps, 12,798 JPY, TOSHIBA, Tokyo, Japan); an SSD, MX300 (2.5 inch, 525 GB, SATA 6 Gbps, 16,598 JPY, Micron Technology, Inc., Boise, ID); two CPU coolers, SST-AR08 (90 mm fan, LGA2011-v3, 3,800 JPY, SilverStone Technology Co., Ltd, New Taipei, Taiwan); CPU grease, MX-4/4g (4 g, 886 JPY, ZAWARD CORPORATION, Tokyo, Japan); a power supply, SST-ST75F-P (ATX 750 W PLUS SILVER, 13,000 JPY, SilverStone Technology Co., Ltd, New Taipei, Taiwan); and a case, SST-GD07B/B (SSI-EEB, 120 mm side fan, 120 mm bottom fan, 22,846 JPY, SilverStone Technology Co., Ltd, New Taipei, Taiwan). The total price of the workstation parts was approximately 4,000 USD (457,869 JPY, 17 March 2017). The operating system was Ubuntu 14.04 LTS, and Canu v1.4 was installed.
* Corresponding author. E-mail: [email protected]
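The stated total can be cross-checked against the listed part prices. The sketch below is only a sanity check, and it assumes that each quoted price is a per-unit price and that the quantities are those named in the text (two CPUs, eight RAM modules, two 4 TB HDDs, two coolers, one of everything else). Under that reading the parts sum to 457,865 JPY, within 4 JPY of the stated 457,869 JPY; the text does not say where the small difference comes from.

```python
# Parts as (description, quantity, unit price in JPY), transcribed from the text.
# Assumption: every quoted price is a per-unit price.
PARTS = [
    ("CPU Xeon E5-2620 v4", 2, 45_996),
    ("Motherboard X10DAi", 1, 62_531),
    ("Video card GF-GTX750TI-LE2GHD", 1, 9_826),
    ("RAM KVR21R15D4K4/128", 8, 24_980),
    ("HDD MD04ACA200 (2 TB)", 1, 7_150),
    ("HDD MD04ACA400 (4 TB)", 2, 12_798),
    ("SSD MX300 (525 GB)", 1, 16_598),
    ("CPU cooler SST-AR08", 2, 3_800),
    ("CPU grease MX-4/4g", 1, 886),
    ("Power supply SST-ST75F-P", 1, 13_000),
    ("Case SST-GD07B/B", 1, 22_846),
]

STATED_TOTAL_JPY = 457_869  # total reported in the text


def total_price(parts):
    """Sum quantity * unit price over all parts."""
    return sum(quantity * unit_price for _, quantity, unit_price in parts)


if __name__ == "__main__":
    computed = total_price(PARTS)
    print(f"computed: {computed} JPY")
    print(f"stated:   {STATED_TOTAL_JPY} JPY")
    print(f"difference: {STATED_TOTAL_JPY - computed} JPY")
```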
Long reads covering a mammalian genome approximately 78-fold were obtained from a PacBio RS II. The median read length was 6 kb and the mean read length was 7 kb. The Shannon entropy of the read lengths was 24.4 shannons.
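The read-set statistics above can be computed from a list of read lengths, as in the following minimal sketch. The toy lengths are invented for illustration and are not the study's data. The entropy function uses one possible definition, in which each read is weighted by its share of the total bases; this is an assumption, since the text does not state the formula that was used.

```python
import math
from statistics import mean, median

# Toy read lengths in bases (illustrative only, not real data).
TOY_LENGTHS = [1000, 3000, 6000, 6000, 9000, 17000]


def coverage(read_lengths, genome_size):
    """Fold coverage: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


def length_weighted_entropy(read_lengths):
    """Shannon entropy in bits (shannons) with p_i = L_i / sum(L).

    Assumed definition; the paper does not give its formula.
    """
    total = sum(read_lengths)
    return -sum((n / total) * math.log2(n / total)
                for n in read_lengths if n > 0)


if __name__ == "__main__":
    print("median length:", median(TOY_LENGTHS))
    print("mean length:", mean(TOY_LENGTHS))
    # 14 kb is a made-up genome size chosen to suit the toy reads.
    print("coverage:", coverage(TOY_LENGTHS, 14_000))
    print("entropy (bits):", length_weighted_entropy(TOY_LENGTHS))
```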
We executed Canu on our workstation with the optional parameters ‘minReadLength=1000 minOverlapLength=750 genomeSize=2500m maxThreads=16 -pacbio-raw’ and default values for all other parameters. We performed assemblies using the whole sequencing data (78-fold) and a subset of the sequencing data (34-fold).
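A command line of roughly the following shape would carry the parameters listed above. This is a sketch, not the command the authors ran: the assembly prefix, output directory, and read file name are hypothetical placeholders, while `-p` and `-d` are Canu's options for the output prefix and output directory.

```python
import shlex


def build_canu_command(prefix, out_dir, reads, genome_size="2500m",
                       min_read_length=1000, min_overlap_length=750,
                       max_threads=16):
    """Return the Canu argument list for a raw PacBio assembly.

    Parameter values default to those quoted in the text; prefix, out_dir,
    and reads are supplied by the caller.
    """
    return [
        "canu",
        "-p", prefix,
        "-d", out_dir,
        f"genomeSize={genome_size}",
        f"minReadLength={min_read_length}",
        f"minOverlapLength={min_overlap_length}",
        f"maxThreads={max_threads}",
        "-pacbio-raw", *reads,
    ]


if __name__ == "__main__":
    # Hypothetical names: "asm", "canu_out", and "reads.fastq.gz".
    cmd = build_canu_command("asm", "canu_out", ["reads.fastq.gz"])
    print(shlex.join(cmd))
    # To launch the assembly: subprocess.run(cmd, check=True)
```

Building the argument list separately from launching it makes it easy to log the exact command alongside a run that may take weeks.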
To evaluate the genome assembly, we compared it with two other draft genomes. One was assembled by ALLPATHS-LG [10] using only Illumina reads. The other was produced by closing gaps in the Illumina-based assembly with PBJelly [11] using the whole PacBio read set (hybrid assembly). The maximum length, N10 to N90, total number, and total length of contigs/scaffolds were expressed as values relative to those of the hybrid assembly built from both Illumina and PacBio reads.
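The evaluation metrics can be sketched as follows, assuming the usual definition of Nx: the length L such that contigs or scaffolds of length at least L contain at least x percent of the total assembly length. The function names and the toy lengths are illustrative, and this is not the authors' evaluation script. Mean length is included because the Results also report it.

```python
def nx(lengths, x):
    """Return Nx for a list of contig/scaffold lengths (x in percent)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


def summarize(lengths):
    """Collect maximum, mean, count, total length, and N10 to N90."""
    stats = {
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "total_number": len(lengths),
        "total_length": sum(lengths),
    }
    for x in range(10, 100, 10):
        stats[f"N{x}"] = nx(lengths, x)
    return stats


def relative_to(stats, baseline):
    """Express each metric as a ratio to the baseline assembly's metric."""
    return {key: stats[key] / baseline[key] for key in stats}


if __name__ == "__main__":
    hybrid = [80, 70, 50, 40, 30, 20, 10]   # toy baseline assembly
    candidate = [150, 90, 40, 20]           # toy assembly to compare
    for key, value in relative_to(summarize(candidate), summarize(hybrid)).items():
        print(f"{key}: {value:.2f}")
```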
3. Results and Discussion
Canu was executed with 16 parallel threads. The assembly using the whole sequencing data (78-fold) finished in 29 days, and the assembly using the partial sequencing data (34-fold) finished in 22 days.
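For a rough sense of scale, the wall-clock times can be converted into thread-hours and set against the 405,000 CPU hours quoted in the Introduction. The sketch assumes that all 16 threads were busy for the entire run, so the result is an upper bound; thread-hours on this workstation and CPU hours on another system are also not strictly comparable units.

```python
THREADS = 16
RUN_DAYS = {"78-fold": 29, "34-fold": 22}  # wall-clock days reported in the text
QUOTED_CPU_HOURS = 405_000                 # figure quoted in the Introduction [8]


def thread_hours(days, threads):
    """Upper bound on compute time: every thread busy for the whole run."""
    return days * 24 * threads


if __name__ == "__main__":
    for label, days in RUN_DAYS.items():
        hours = thread_hours(days, THREADS)
        ratio = QUOTED_CPU_HOURS / hours
        print(f"{label}: at most {hours} thread-hours "
              f"(quoted figure is {ratio:.1f} times larger)")
```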
The assembly from the whole sequencing data (78-fold) was better than the Illumina-based draft genome. Interestingly, it was also better than the hybrid assembly, particularly in maximum length, N10 to N30, mean length, and total number of contigs (Figure 1). This result might reflect limitations of the Illumina-based assembly method. The partial sequencing data (34-fold) were not sufficient to obtain a high-quality mammalian genome assembly. Thus, the quality of the assembly was remarkably improved simply by adding reads.
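Figure 1. Comparison of assemblies. Relative values of maximum length, N10 to N90, mean length, total number, and total length for the Illumina, Illumina+PacBio, PacBio (78-fold), and PacBio (34-fold) assemblies.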
4. Conclusion
In this study, we obtained a high-quality mammalian genome assembly with Canu, executed on an inexpensive workstation, within a month. Our demonstration shows that clustered computing systems are not necessarily required even for mammalian genome assembly; however, the computing cost still needs to be reduced. Further reduction of the computing cost of assembly is expected in the near future.
5. References
1. Chin CS, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 2013;10(6):563-9. PMID: 23644548.
2. Chin CS, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 2016;13(12):1050-1054. PMID: 27749838.
3. Miyamoto M, et al. Performance comparison of second- and third-generation sequencers using a bacterial genome with two chromosomes. BMC Genomics 2014;15(1):699. PMID: 25142801.
4. Pendleton M, et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 2015;12(8):780-6. PMID: 26121404.
5. Gordon D, et al. Long-read sequence assembly of the gorilla genome. Science 2016;352(6281):aae0344. PMID: 27034376.
6. Shi L, et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun 2016;7:12065. PMID: 27356984.
7. Bickhart DM, et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 2017. PMID: 28263316.
8. https://twitter.com/mikaelhuss/status/557165295207714816 Cited 18 March 2017.
9. Koren S, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017;(0). PMID: 28298431.
10. Gnerre S, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 2011;108(4):1513-8. PMID: 21187386.
11. English AC, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 2012;7(11):e47768. PMID: 23185243.
Table 1. Parts list of the dual-processor workstation