Genome Variant Calling with a Deep Averaging Network
Nikolai Yakovenko, Avantika Lal, Johnny Israeli, Bryan Catanzaro
NVIDIA
Abstract
Variant calling, the problem of estimating whether a position in a DNA sequence differs from a reference sequence, given noisy, redundant, overlapping short sequences that cover that position, is fundamental to genomics. We propose a deep averaging network designed specifically for variant calling. Our model takes into account the independence of each short input read sequence by transforming individual reads through a series of convolutional layers, limiting the communication between individual reads to averaging and concatenating operations. Training and testing on the precisionFDA Truth Challenge (pFDA), we match the state of the art overall F1 score of 99.89. Genome datasets exhibit extreme skew between easy examples and those on the decision boundary. We take advantage of this property to converge models at 5x the speed of standard epoch-based training by skipping easy examples during training. To facilitate future work, we release our code, trained models and pre-processed public domain datasets (https://github.com/clara-genomics/DL4VC).

Introduction

Genome variant calling is an important problem in computational biology. Distinguishing between candidate variants and the reference genome forms a core input into most downstream genomic studies. The uses range from cancer risk prediction to ancestry studies. A typical human genome contains 3.4 million known short variants (less than 50 base pairs) in trusted regions alone. Small changes in DNA can have large impacts on biological traits. Even one SNP (single nucleotide polymorphism) can have a decisive effect on a downstream classification. Thus, in order for a variant calling system to be useful, it must provide recall and precision of over 99%.

Introduced on the precisionFDA (pFDA) Truth Challenge, DeepVariant [16] demonstrated that deep neural networks can be competitive with traditional variant calling methods. More recent DeepVariant versions have outpaced state of the art non-deep-learning variant calling tools such as GATK (Genome Analysis Toolkit) [11] and Sentieon [3] on several human genome benchmarks. They also showed that their network adapts to new modalities such as instrument changes, given enough high quality training data [1, 6].

However, DeepVariant adapts the Inception network [17], which was designed for image classification. Training and inference therefore require transforming the genomic input data into 300x300 pixel RGB images. This motivates investigation into whether a deep learning model designed directly for variant calling could do better.

We propose a custom architecture for variant calling. This model transforms individual reads through a series of convolutional layers, and limits the communication between reads to averaging and concatenating. Training and testing on pFDA, we match the state of the art overall F1 score. Genome datasets exhibit an extreme skew between easy examples and those on the decision boundary. We take advantage of this property to converge models at 5x the speed of standard epoch-based training.

Human genome
The human genome consists of 3.2 billion base pairs (each base is one of adenine (A), cytosine (C), guanine (G), and thymine (T)), split across 23 chromosomes. Individuals differ from a "reference human genome" in approximately 1/1000th of those locations. These 3-4 million differences are known as variants, of which there are three major types:

• SNP (single nucleotide polymorphism) – a single base replacement. Denoted A -> T
• Insertion – one or more bases are added at a reference location. Denoted A -> ATT
• Deletion – one or more bases are removed at a location. Denoted ATT -> A

Insertions and deletions are referred to jointly as "Indels." Within a human genome, SNPs outnumber Indels approximately 10 to 1. Indels are more difficult than SNPs to classify properly, and thus classification accuracy on SNPs and Indels is usually reported separately. A human genome is present in two copies, with one copy inherited from each parent. Thus SNP and Indel variants are sub-classified into two types:

• Homozygous – the same variant occurs in both copies of the genome.
• Heterozygous – a variant occurs in one copy but not the other.

Approximately two thirds of variants in a human genome are heterozygous. There are also "multi-allele" variants, where a different variant occurs on each strand of the DNA at a given reference location. Multi-allelic variant sites are rare but not insignificant. There are approximately 30,000 such locations out of 3-4 million variants, about 1% of the data. See Table 1 for an example of a complex multi-allelic site.

There are several versions of the reference human genome. The precisionFDA Truth Challenge is based on the hs37d5 standard, while most recent work is done with the updated hg38 version of the reference.
Single read alignments
Sequencing a human genome starts with collecting short "reads" of sequenced DNA fragments. These reads are typically less than 300 bases, depending on the sequencing machine used (most short read sequencing takes place on Illumina machines: HiSeq, older, and NovaSeq, post 2017). These single reads are aligned to the reference genome using partial string-matching algorithms [8]. This alignment process works reasonably well in most locations of the genome, although string matching can lead to indeterminate results within long repeat regions of the genome [8].

Calling variants
Variant calling is the process of calling variants – creating the diff between the reference genome and a newly sequenced genome – based on information from a "pileup" of aligned short reads. This process typically proceeds at a high level as follows. First, we align the short reads to the reference genome. Second, we generate candidate variants – a high recall, low precision set of (almost all) possible variants. Third, we score variant probabilities based on local information around the variant in question, such as a reads pileup as demonstrated in Figure 1. This work focuses on the third step: we use traditional techniques to align reads and generate candidate variants.

The presence of even a single variant can lead to the diagnosis of an inherited disease, thus the aim is to classify all variants with a high degree of accuracy. Human genomes are usually sequenced with enough coverage depth to allow variant calling algorithms to achieve 99% precision and recall overall (in trusted regions, ignoring structural variants). Accuracy is much lower for challenging regions such as long Indels and repeat regions.
Table 1: Example of a complex multi-allelic site.

Genome  Chrom  Location  Ref  Variant   Depth  Allele Frequency  Truth
HG002   6      51564718  A    AGT       33     0.121212          False
HG002   6      51564718  A    AGTGC     33     0.030303          False
HG002   6      51564718  A    AGTGT     33     0.151515          True
HG002   6      51564718  A    AGTGTGT   33     0.303030          True
GATK [11], the most widely used variant calling tool, uses a combination of logistic regression, hidden Markov models, and naive Bayes, combined with hand-crafted features to remove likely false positives.

DeepVariant [16] demonstrated that a deep neural network trained with gradient descent could produce variant calls competitive with statistically based state of the art methods. The DeepVariant method involves converting aligned sequence reads for each candidate variant region into an RGB image, along with additional read information such as the base quality scores. This image is fed into the Inception image classification convolutional neural network, predicting a softmax over three classes for each candidate variant: {no variant (false positive), heterozygous variant, homozygous variant}.

After the pFDA result, DeepVariant significantly improved their model, for both SNPs and Indels (see Table 4), by training on 10x additional human genomes. This demonstrates that the deep neural network approach benefits from additional training data, and would likely outpace statistically driven and hand-tuned approaches to variant calling, given enough quality training examples. Although additional data helps, that data is not public, and so for reproducibility we focus in this work on approaches trained on the pFDA dataset.

The DeepVariant method has since been applied to variant calling for non-human genomes [21, 2], as well as to the output of other sequencing machines such as the Illumina NovaSeq [1] and other technologies, such as PacBio Circular Consensus Sequencing [6].
We propose a new deep neural network for variant calling. The task is one of counting and comparing single reads to form a consensus, in this case for the likelihood of a heterozygous or homozygous variant. The individual reads are more like a sequence of letters or symbols than an image. Yet recent attempts to represent variant calling as sequences and not images have not been competitive with DeepVariant, or with other state of the art methods [10, 19].

Our goal is simple: design a deep learning network that processes individual reads independently, unlike DeepVariant's 2-dimensional convolutional operations. Information between different reads must ultimately be shared to produce the result. Our aim was to do so in a small number of simple operations, specifically as average pooling across all reads in a pileup. By doing so, we take advantage of the structure of the data: since the reads are each produced independently, we hypothesize that a neural network that processes the reads independently more accurately reflects the structure of the problem.

Table 2: Example read encoding for variant proposal A -> ATT.

Read Bases        G   A   T   T   C   G   A   -   -   C
Reference Bases   G   A   -   -   C   G   A   -   -   C
Base Quality      70  60  50  45  50  60  50  65  35  55
Strand Direction  1   1   1   1   1   2   2   2   2   2
Reference Mask    0   0   0   0   0   0   1   1   1   0
Variant Mask      0   1   1   1   0   0   0   0   0   0
Var Length Mask   0   1   1   1   0   0   1   1   1   0
The precisionFDA Truth Challenge, sponsored by the FDA in 2016, is a competition on genomics data. Teams compete to predict variants on a genome dataset for HG002 (human genome 002 from Genome in a Bottle – GIAB), with training provided for HG001 (human genome 001, also from GIAB). Both training and test set BAMs are built from reads from an Illumina HiSeq 2500 machine, downsampled to 50x coverage. Within high confidence trust regions (known for HG001, unknown but similar for HG002) there are approximately 3.4 million true variants.

Teams were measured on their accuracy (F1 score) for predicting variants on SNPs and Indels, with prizes awarded for the highest precision, recall and F1 for SNP and Indel variants. The top results are reported on the precisionFDA website (https://precision.fda.gov/challenges/truth), and reproduced in Table 4.

Teams are expected to predict the zygosity of variants, as well as joining any multi-allele sites. While accuracies for zygosity and multi-allele calls are not reported for the challenge, predicting either of these categories wrong results in multiple errors. We reproduced precisionFDA results for SNPs and Indels by running the hap.py program [7] on the HG002 variant calls.

DeepVariant reports their pFDA results (which won the top prize for SNP F1), as well as a "live GitHub" version of DeepVariant, which gets the top F1 for SNPs and Indels on pFDA. This updated model was trained with 10x new HG001 datasets, demonstrating that DeepVariant's model generalized better with more training data.

Similarly, we trained our model with the pFDA training data, and with additional genome datasets, also HiSeq-sequenced HG001, drawn from a public dataset [18]. We demonstrate that our model also improves on the pFDA challenge when training with three additional HG001 datasets.

Figure 2: Network layout.
Our model transforms individual reads through a series of 1-dimensional convolutions, pools the final layer outputs across all reads, and outputs final predictions through a fully connected neural network, as illustrated in Figure 2.
Encoding individual reads
The input consists of a pileup of aligned reads, such as in Fig. 1, and a variant candidate proposal. In addition, the network takes in base quality scores, strand direction for each read, and masks representing the reference and the variant proposal, as shown in Table 2. The reads and reference bases are expanded into a learned multi-dimensional embedding, similar to learned embeddings for a deep language model [12]. We also add sinusoidal positional embeddings to each dimension of the learned base embeddings, as introduced in [20].
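As a rough PyTorch sketch of this encoding (our own illustration, not the released implementation), the snippet below embeds read and reference bases into 20 dimensions each, adds fixed sinusoidal positional encodings as in [20], and concatenates the base quality, strand and mask channels, giving the 45 input channels per position listed in Table 3. The class and argument names are assumptions.

```python
import math
import torch
import torch.nn as nn

class ReadEncoder(nn.Module):
    """Sketch of the per-base input encoding: learned base embeddings plus fixed
    sinusoidal positional embeddings, concatenated with per-base scalar features.
    Dimensions follow Table 3; class and argument names are illustrative."""

    def __init__(self, vocab_size=6, embed_dim=20, read_len=201):
        super().__init__()
        # vocabulary: A, C, G, T, gap '-', and padding (an assumption)
        self.base_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # fixed sinusoidal positional encoding, as in Vaswani et al. [20]
        pe = torch.zeros(read_len, embed_dim)
        pos = torch.arange(read_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, embed_dim, 2).float()
                        * (-math.log(10000.0) / embed_dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, read_bases, ref_bases, qual, strand, masks):
        # read_bases, ref_bases: (num_reads, read_len) integer-coded bases
        # qual, strand:          (num_reads, read_len) per-base scalar features
        # masks:                 (num_reads, read_len, 3) reference/variant/length masks
        read_emb = self.base_embed(read_bases) + self.pe       # (R, L, 20)
        ref_emb = self.base_embed(ref_bases) + self.pe         # (R, L, 20)
        scalars = torch.stack([qual, strand], dim=-1).float()  # (R, L, 2)
        # 20 + 20 + 2 + 3 = 45 channels per base position
        return torch.cat([read_emb, ref_emb, scalars, masks.float()], dim=-1)
```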
One-dimensional convolutional layers
We transform the individual reads through a series of convolutional layers, with small one-dimensional convolutional filters, not sharing information between single reads.
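A minimal sketch of such a per-read convolutional stack is below, assuming the layer count, channel width, batch normalization, ReLU and dilation settings from Table 3; the kernel size is an assumption. Reads are folded into the batch dimension, so no operation mixes information across reads.

```python
import torch.nn as nn

class PerReadConvStack(nn.Module):
    """Illustrative per-read 1-D convolution stack: 7 layers, 128 channels,
    batch norm, ReLU, dilation 2 after the first layer, and residual
    connections on the last layers (Table 3); kernel size is an assumption."""

    def __init__(self, in_channels=45, channels=128, num_layers=7, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(num_layers):
            dilation = 1 if i == 0 else 2
            padding = (kernel_size - 1) // 2 * dilation   # preserve sequence length
            layers.append(nn.Sequential(
                nn.Conv1d(in_channels if i == 0 else channels, channels,
                          kernel_size, padding=padding, dilation=dilation),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        # x: (batch * num_reads, in_channels, read_len); each read is treated as
        # an independent item, so no convolution shares information across reads.
        for i, layer in enumerate(self.layers):
            out = layer(x)
            # residual connections on layers 5-7 (indices 4-6), per Table 3
            x = out + x if i >= 4 else out
        return x
```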
Final layer pooling
We combine disparate single reads by performing mean pooling and max pooling operations across all locations and channels. The mean and max pool outputs are then flattened, and input to a fully connected network. This network, similar to the DAN (Deep Averaging Network) [5], has the additional property of ignoring read order in the pileup, since all operations are performed at the individual read level, and then the final outputs are averaged across all reads.
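A sketch of this order-invariant pooling step (our own illustration): per-read feature maps are reduced by mean and max pooling over the read dimension and flattened. With 128 channels over a 201-position window, the two pooled maps contribute 2 x 128 x 201 = 51,456 of the 73,856 features fed to the fully connected network (Table 3); the rest come from the highway layers described next.

```python
import torch

def pool_across_reads(read_features):
    """Combine per-read convolution outputs into a single, read-order-invariant
    feature vector (a sketch of the deep-averaging step, not the released code).

    read_features: (num_reads, channels, read_len) tensor of per-read outputs."""
    mean_pool = read_features.mean(dim=0)        # (channels, read_len)
    max_pool = read_features.max(dim=0).values   # (channels, read_len)
    # flatten and concatenate: 2 * channels * read_len features
    return torch.cat([mean_pool.flatten(), max_pool.flatten()])
```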
Highway layers
Passing all read level information through seven convolutional layers followed by a wide pooling layer may not be efficient. We concatenate a small amount of information, for every read, directly to the final fully connected network. Details in Table 3.
Fully connected network for variant candidate classification
We connect the concatenated outputs of the pooling layers and the highway layers to a fully connected network, including dropout and ReLU activation layers after every fully connected layer. The final output is a softmax prediction over {no variant, heterozygous variant, homozygous variant}.
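Putting these pieces together, a hedged sketch of the classification head might look like the following: pooled features and flattened highway features are concatenated and passed through fully connected layers with ReLU and dropout (hidden sizes and dropout rate taken from Table 3), ending in three logits. Names are assumptions.

```python
import torch
import torch.nn as nn

class VariantClassifierHead(nn.Module):
    """Sketch of the fully connected classification head; input and hidden
    dimensions follow Table 3 (73,856 -> 1025 -> 256 -> 3), names are ours."""

    def __init__(self, in_dim=73856, hidden=(1025, 256), num_classes=3, dropout=0.1):
        super().__init__()
        dims = (in_dim,) + tuple(hidden)
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
        self.fcn = nn.Sequential(*blocks)
        self.out = nn.Linear(dims[-1], num_classes)

    def forward(self, pooled_features, highway_features):
        # pooled_features:  flattened mean/max pools across all reads
        # highway_features: flattened low-dimensional per-read summaries
        x = torch.cat([pooled_features, highway_features], dim=-1)
        # logits over {no variant, heterozygous, homozygous variant}
        return self.out(self.fcn(x))
```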
Additional Early Pooling layer
Notably, information between disparate reads is not shared until the final layer pooling. To allow read comparison computation to take place in the convolutional layers instead of in the fully connected network, we insert a second mean pooling layer after the second convolutional layer.

Table 3: Network architecture and training configuration.

Category     Parameters                  Values
Pileup       maximum single reads        100
             read length                 201 (100 to left and right)
Input        embedding dimensions        20
             total input dimensions      100x201x45
Conv layers  number of layers            7
             residual layers             5, 6, 7
             output channels             128
             activation                  ReLU
             batch normalization [4]     true
             dilation [22]               2, except first layer
Pooling      mean pool, max pool         final layer
             early mean pool             after layer 2
Highway      reduce dim to 32 channels   every layer
             final highway output        100x32 per layer
FCN          input dimensions            73856
             layers                      1025, 256
             activation                  ReLU
             dropout                     0.1
Training     optimizer                   ADAM
             learning rate               0.0002
             focal loss                  γ = 0.
             label smoothing             ε = 0.
             easy example window         ε
             easy examples skip rate     0.85

Although the main focus of this work is the variant calling neural network, we describe the other components necessary for this network to function as a complete variant calling system.
Candidate generation
The goal of candidate generation is to produce a high recall set of candidate variants that will then be scored by our variant calling network. We use a simple heuristic to generate candidates. First, we count any mapped reads that disagree with the reference at any location in the trusted regions. Then, we create a variant candidate at any location, as long as the allele frequency (the percentage of reads matching the variant candidate) is above a threshold we set for high recall.

We use thresholds of 0.05 for SNP candidates, and 0.02 for Indel candidates. For a 50x coverage dataset, this means that we accept all possible Indel variants as candidates, but we exclude very low frequency SNP candidates from our candidate dataset.

On the pFDA HG002 test set, this produces 13.4 million SNP candidates and 1.22 million Indel candidates. Our candidate generator has 99.995% recall for SNP variants and 99.48% recall for Indel variants on the HG002 test set. (Our candidate generator misses 124 SNPs, 316 insertions and 1,433 deletions within the HG002 trusted region. Since our neural network scores candidates but does not propose them, our overall accuracy depends on good candidate generation, and we believe a more sophisticated candidate generation procedure would further improve accuracy. In other words, candidate generation bounds the accuracy of our model, as shown in Table 4.)
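A minimal sketch of this allele-frequency heuristic follows, using the thresholds above; the input format and function name are our own illustration.

```python
from collections import Counter

def generate_candidates(pileup_alleles, snp_threshold=0.05, indel_threshold=0.02):
    """High-recall candidate generation sketch: emit any non-reference allele
    whose frequency in the pileup clears the SNP or Indel threshold.

    pileup_alleles: dict mapping (chrom, pos, ref) -> list of observed alleles,
    one entry per mapped read covering that position (an assumed format)."""
    candidates = []
    for (chrom, pos, ref), alleles in pileup_alleles.items():
        depth = len(alleles)
        for allele, count in Counter(alleles).items():
            if allele == ref:
                continue                        # matches the reference, not a variant
            freq = count / depth
            is_indel = len(allele) != len(ref)  # insertion or deletion candidate
            threshold = indel_threshold if is_indel else snp_threshold
            if freq >= threshold:
                candidates.append((chrom, pos, ref, allele, depth, freq))
    return candidates

# Example: a 33x site with several alternate alleles (allele counts chosen to
# reproduce the frequencies in Table 1)
pileup = {("6", 51564718, "A"):
          ["A"] * 13 + ["AGT"] * 4 + ["AGTGT"] * 5 + ["AGTGTGT"] * 10 + ["AGTGC"]}
print(generate_candidates(pileup))
```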
Thresholding

Our model produces softmax outputs {no variant, heterozygous, homozygous variant} for each candidate. To produce actual variant calls, we need to threshold both variant truth and zygosity.
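A minimal sketch of how such a two-part decision could be applied to the softmax output; the threshold values below are placeholders, not the tuned operating points used for our reported results.

```python
def call_variant(p_no_variant, p_het, p_hom, variant_threshold=0.5, hom_threshold=0.5):
    """Turn the 3-way softmax into a call (illustrative thresholds only).

    Returns None (no call), 'heterozygous', or 'homozygous'."""
    p_variant = p_het + p_hom                 # probability the site is a true variant
    if p_variant < variant_threshold:
        return None                           # filtered out as a likely false positive
    # zygosity decided by comparing the two variant classes
    return "homozygous" if p_hom / p_variant >= hom_threshold else "heterozygous"

print(call_variant(0.05, 0.15, 0.80))  # -> 'homozygous'
```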
Multi-allele inference
We take a naive approach to multi-allele training and inference. All alleles are trained and inferred as independent examples. We merge the top two alleles, unless the top allele is homozygous and the second allele is below a 0.95 variant probability. Following this simple rule, we classify multi-allelic sites on the pFDA test set with 0.98154 F1.
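A sketch of this merging rule is below; the input format and the example probabilities are our own illustration, and only the 0.95 threshold comes from the text.

```python
def merge_multiallelic(site_calls, second_allele_threshold=0.95):
    """Merge per-allele calls at one site: keep the top two alleles, unless the
    top allele is homozygous and the second allele's variant probability is
    below the threshold (a sketch of the rule described above).

    site_calls: list of (allele, p_variant, zygosity) tuples, where
    p_variant = P(het) + P(hom)."""
    ranked = sorted(site_calls, key=lambda c: c[1], reverse=True)
    top = ranked[0]
    if len(ranked) == 1:
        return [top]
    second = ranked[1]
    if top[2] == "homozygous" and second[1] < second_allele_threshold:
        return [top]                 # a confident homozygous call stands alone
    return [top, second]             # otherwise report the two best alleles

# Hypothetical per-allele calls for the Table 1 site; probabilities are made up.
calls = [("AGTGTGT", 0.99, "heterozygous"), ("AGTGT", 0.97, "heterozygous"),
         ("AGT", 0.20, "heterozygous")]
print(merge_multiallelic(calls))
```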
The genome variant calling dataset is heavily skewed, not just by label frequency, but by the difficulty of the training examples. We train our model with label smoothing [15] to avoid saturating the softmax outputs. We also found that focal loss [9] helps convergence. Focal loss reduces the loss weight on well-classified examples, increasing gradient contributions from mis-classified examples.

After one epoch of training, 99.08% of training examples have been classified correctly, to within a fixed multiple of ε of the true label, where ε is the label smoothing value (after two epochs, easy examples grow to 99.70%). With focal loss, we are already driving the loss weight on those examples to zero, thus it would save us a lot of training time just to skip those examples. We are not aware of similar techniques of active data downsampling for skewed supervised learning tasks, although similar techniques are widely used in reinforcement learning [13].

Our pFDA training starts with 14,656,643 training candidate examples, 4 epochs of training, no decay and a global batch size of 300. We trained our model in the PyTorch [14] framework, on a single NVIDIA Tesla V100 GPU. Additional training details are listed in Table 3.

Table 4: pFDA Truth Challenge results and results with supplementary training data.

                            Type     F1        Recall    Precision  TP         FN     FP
rpoplin-dv42 (DeepVariant)  Overall  0.998597  0.998275  0.998919   3,393,136  5,864  3,671
                            Indels   0.989802  0.987883  0.991728   340,370    4,175  2,839
                            SNPs     0.999587  0.999447
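Returning to the training procedure described above, the following is a minimal sketch of the loss and the easy-example skipping. The focal-loss γ, label-smoothing ε and skipping-window values are placeholders, since the exact values were not recoverable; only the 0.85 skip rate comes from Table 3.

```python
import torch
import torch.nn.functional as F

def focal_label_smoothing_loss(logits, targets, gamma=2.0, smoothing=0.05):
    """Focal loss [9] applied to label-smoothed [15] targets. gamma and
    smoothing are placeholder values, not the ones used in the paper."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, smoothing / (num_classes - 1))
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    ce = -(smooth * log_probs).sum(dim=-1)            # smoothed cross-entropy
    p_true = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_true) ** gamma * ce).mean()      # down-weight easy examples

def keep_example(p_true_class, smoothed_target, window=0.05, skip_rate=0.85):
    """Easy-example skipping after the first epoch: an example already predicted
    within a small window of its smoothed target label is dropped from the next
    epoch with probability skip_rate (0.85, per Table 3); window is a placeholder."""
    is_easy = abs(p_true_class - smoothed_target) < window
    return (not is_easy) or torch.rand(()).item() > skip_rate
```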
When training on the pFDA HG001 dataset, our model achieves a better overall F1 than DeepVariant "pFDA," which was limited to the pFDA dataset. Our model matches the best overall F1 when combining SNPs and Indels, and it would have won a prize for the best Indel recall, according to the rules of the pFDA Truth Challenge.

When trained with three additional HiSeq HG001 datasets, our model achieves a better result on SNPs, and also on Indels, than any submission to the pFDA challenge. This result also closes the gap between our pFDA submissions and the DeepVariant v0.4 result, which was trained with 10x additional datasets. (There is a discrepancy between the DeepVariant v0.4 results [16] and the official pFDA results: an additional 15,000 Indels have been added to the HG002 evaluation, despite reference to v3.2.2 of the GIAB Truth Set.)

Thus we demonstrate that our model, like DeepVariant, benefits from more training data, even when that data is a different run of a similar sequencing machine, on the same underlying genome. These models generalize better to the pFDA HG002 genome, suggesting the gains are not simply overfit to the HG001 Truth Set.

The details of our neural network architecture are described in Table 3. In Table 5, we present some ablation studies, from reducing the number of layers, to removing network components such as the highway layers. Specifically, we notice that the model generalizes less well when the highway layers are removed. When the pooling layer dimensions are reduced from 128 to 64 or 32 channels, this greatly reduces the model's parameter count, but also increases the test loss and decreases test accuracy. However, a smaller highway dimension appears optimal for a smaller pooling channel output, suggesting that these parameters must be kept in balance.

We also notice that down-sampling easy examples after the first epoch appears to improve test accuracy, as well as save 5x in training time.

Lastly, we notice that test results are slightly unstable. This effect is greatly diminished when training pFDA with additional datasets. This not only improves generalization as shown in Table 4, but the model also appears to be more stable when increasing the number of difficult examples by training on several genomic datasets.

Table 5: Ablation studies, when changing training configurations from Table 3. Since every variant is important, notice the difference in the FN and FP counts, as well as overall F1 scores.
                          TP         FN     FP     F1        Recall    Precision
(baseline)                3,394,460  4,172  3,138  0.998924  0.999076  0.998772
no early pool             3,394,352  4,264  3,199  0.998902  0.999058  0.998745
0.01 label smoothing      3,394,098  4,522  2,984  0.998895  0.999122  0.998669
0.1 label smoothing       3,392,884  5,759  4,925  0.998428  0.998551  0.998306
no Strand, no Q-Scores    3,392,780  5,858  4,369  0.998495  0.998714  0.998276
no HW layers              3,393,121  5,513  4,297  0.998557  0.998735  0.998378
128 -> 64 final channels  3,394,159  4,457  3,204  0.998873  0.999057  0.998689
128 -> 32 final channels  3,393,643  4,991  3,335  0.998775  0.999018  0.998531
6 conv layers             3,393,907  4,710  3,142  0.998845  0.999075  0.998614
5 conv layers             3,394,147  4,486  2,861  0.998919  0.999158  0.998680
4 conv layers             3,393,100  5,514  3,414  0.998686  0.998995  0.998378
no focal loss             3,393,591  5,041  3,521  0.998740  0.998964  0.998517
Conclusion

We presented a deep neural network for genome variant calling. Our work shows that it is possible to solve the variant calling problem with an individual read-sequence level model, without any two-dimensional convolutions or pooling.

Our approach generalizes well enough to match state of the art on the pFDA Truth Challenge, and it benefits substantially from additional training data. We also demonstrate how it is possible to converge such models at 5x the speed of standard epoch-based training, by skipping easy examples during training.
Acknowledgements

Thanks to Mike Vella, Joyjit Daw, Michelle Gill and the NVIDIA genomics group for help with genomics tooling, as the variant calling problem was new to us before embarking on this work. Thanks also to Andrew Tao, Patrick LeGresley, Boris Ginsburg, Robert Pottorff, Ryan Prenger and Saad Godil of NVIDIA's applied deep learning research group, for insightful discussions about the neural network design and training of our model. Our methods borrow from disparate deep learning problems of speech generation, NLP, computer vision and reinforcement learning. Finally, thanks to Jason Chin, Yih-Chii Hwang and the DNANexus research team, as well as Ali Torkamani and the Scripps Research Translational Institute, for helping us with additional sources of human genome data in the public domain, beyond the precisionFDA Truth Challenge.
References

[1] A. Carroll. Evaluating the performance of NGS pipelines on noisy WGS data. https://blog.dnanexus.com/2018-01-16-evaluating-the-performance-of-ngs-pipelines-on-noisy-wgs-data/, 2018.
[2] A. Day and R. Poplin. Analyzing 3024 rice genomes characterized by DeepVariant. https://cloud.google.com/blog/products/data-analytics/analyzing-3024-rice-genomes-characterized-by-deepvariant, 2019.
[3] D. N. Freed, R. Aldana, J. A. Weber, and J. S. Edwards. The Sentieon genomics tools - a fast and accurate solution to variant calling from next-generation sequence data. BioRxiv, page 115717, 2017.
[4] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448-456. JMLR.org, 2015.
[5] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Association for Computational Linguistics, 2015.
[6] A. Kolesnikov, P.-C. Chang, A. Carroll, and J. Chin. Highly accurate SNP and indel calling on PacBio CCS with DeepVariant. https://blog.dnanexus.com/2019-01-14-highly-accurate-snp-indel-calling-pacbio-ccs-deepvariant/, 2019.
[7] P. Krusche. Haplotype VCF comparison tools. https://github.com/Illumina/hap.py, 2016.
[8] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009.
[9] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pages 2999-3007, 2017.
[10] R. Luo, F. J. Sedlazeck, T.-W. Lam, and M. C. Schatz. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nature Communications, 10(1):998, 2019.
[11] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297-1303, 2010.
[12] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
[14] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[15] G. Pereyra, G. Tucker, J. Chorowski, L. Kaiser, and G. E. Hinton. Regularizing neural networks by penalizing confident output distributions. CoRR, abs/1701.06548, 2017.
[16] R. Poplin, P.-C. Chang, D. Alexander, S. Schwartz, T. Colthurst, A. Ku, D. Newburger, J. Dijamco, N. Nguyen, P. T. Afshar, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10):983, 2018.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] A. Telenti, L. C. Pierce, W. H. Biggs, J. Di Iulio, E. H. Wong, M. M. Fabani, E. F. Kirkness, A. Moustafa, N. Shah, C. Xie, et al. Deep sequencing of 10,000 human genomes. Proceedings of the National Academy of Sciences, 113(42):11901-11906, 2016.
[19] R. Torracinta and F. Campagne. Training genotype callers with neural networks. BioRxiv, page 097469, 2016.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
[21] W. Wang, R. Mauleon, Z. Hu, D. Chebotarov, S. Tai, Z. Wu, M. Li, T. Zheng, R. R. Fuentes, F. Zhang, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature, 557(7703):43, 2018.
[22] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.