Probabilistic modeling of occurring substitutions in PAR-CLIP data
PProbabilistic modeling of occurring substitutions in PAR-CLIP data
Monica Golumbeanu , Pejman Mohammadi , and Niko Beerenwinkel
Department of Biosystems Science and Engineering, ETH Z¨urich, Basel, Switzerland SIB Swiss Institute of Bioinformatics, Basel, Switzerland Equal contribution E-mail [email protected]
Abstract
Photoactivatable ribonucleoside-enhanced cross-linking and immunoprecipitation(PAR-CLIP) is an experimental method based on next-generation sequencing for iden-tifying the RNA interaction sites of a given protein. The method deliberately insertsT-to-C substitutions at the RNA-protein interaction sites, which provides a second layerof evidence compared to other CLIP methods. However, the experiment includes sev-eral sources of noise which cause both low-frequency errors and spurious high-frequencyalterations. Therefore, rigorous statistical analysis is required in order to separate trueT-to-C base changes, following cross-linking, from noise. So far, most of the existingPAR-CLIP data analysis methods focus on discarding the low-frequency errors andrely on high-frequency substitutions to report binding sites, not taking into accountthe possibility of high-frequency false positive substitutions. Here, we introduce
BMix ,a new probabilistic method which explicitly accounts for the sources of noise in PAR-CLIP data and distinguishes cross-link induced T-to-C substitutions from low and high-frequency erroneous alterations. We demonstrate the superior speed and accuracy ofour method compared to existing approaches on both simulated and real, publicly avail-able human datasets. The model is implemented in the Matlab toolbox
BMix
RNA molecules interact with proteins and form ribonucleoprotein complexes (RNPs) actively in-volved in a plethora of essential biological processes such as translational regulation, alternativesplicing or RNA transport (Lunde et al. , 2007; Muller-McNicoll et al. , 2013). A well-known exampleof RNA-binding proteins (RBPs) consists of members of the Argonaute family, components of theRNA-induced silencing complex (RISC), which bind to diverse small RNA molecules and regulategene silencing (Meister, 2013). Gerstberger et. al. report 1,542 RBPs in humans (Gerstberger et al. , 2014), many of these having been found dysregulated in diseases including cancer (Lukong et al. , 2008; Kechavarzi et al. , 2014). Therefore, characterizing the interactions between RNA andRBPs represents an important step towards understanding RNA function.High-throughput sequencing technology allows querying the binding sites of a specific RBP in atranscriptome-wide fashion (Blencowe et al. , 2009; Kloetgen et al. , 2014). One of the recently devel-oped high-throughput sequencing-based experimental protocols aiming to identify the binding sitesof RBPs throughout the transcriptome is Photo-Activatable Ribonucleoside-enhanced Crosslinkingand Immunoprecipitation (PAR-CLIP) (Hafner et al. , 2010). According to this method (Figure 1A),1 a r X i v : . [ q - b i o . Q M ] A p r T T coverage AB U culture with a photoactivatable ribonucleoside UV crosslink isolation, RNase digestion protease digestion, cDNA library construction sequencing CT UU U xl U xl U xl C T substituted with Caligned readsreference genome T C s ub s t i t u t i on f r equen cy sequence variants cross-link background v ψ ε γ (x i ,y i )z i i ∈ G θ Figure 1: (A)
The main steps of the PAR-CLIP protocol including cell culture with 4-thiouridine( S U ), UV cross-link, isolation of the RNA-RBP complexes and sequencing of the bound RNAfragments where systematic T-to-C substitutions occur. (B) Schematic representation of the BMixrationale. The method takes as input the aligned sequencing reads and uses a three-componentmixture based on the observed substitution counts to infer whether the observed T to C substitu-tions are most likely caused by sequencing errors, sequence variants or cross-link. (C)
Graphicalrepresentation of the statistical model. For each evaluated locus i on the genome G , the coverage x i and the mismatch count y i are observed. The latent variable z i used to indicate the differentcomponents of the mixture, and the parameters (cid:15), γ, θ, ν , and ψ define the model (cf. Methods).a synthetic photoactivatable ribonucleoside such as S U (4-thiouridine) or, less commonly, S G (6-thioguanosine) is integrated into the RNA of cultured cells. Upon exposure to ultraviolet (UV)light, cross-linking of RBPs to RNA occurs. The cross-linked RNA-RBP pairs are subsequentlyisolated using immunoprecipitation with an antibody targeting the protein of interest, and the RNAfragment is retrieved upon protein digestion. A complementary DNA (cDNA) sequencing libraryis generated by reverse complementing the RNA fragments. Due to the incorporated nucleoside,systematic T-to-C (for S U ) or G-to-A (for S G ) substitutions appear in the cDNA library at theinteraction sites (Hafner et al. , 2010). Therefore, PAR-CLIP brings the advantage of having anadditional layer of evidence by introducing specific base changes at the binding sites.PAR-CLIP data is characterized by prevalent T-to-C substitutions observed at different mis-match frequencies (Hafner et al. , 2010). However, compared to RNA-Seq, PAR-CLIP has beenobserved to also introduce a large number of other substitutions, different from T-to-C, notably atlow and high mismatch frequency (see Results). This indicates the presence of noise and contam-ination in the PAR-CLIP procedure which can as well introduce erroneous T-to-C substitutions,with high potential to be mistaken for true cross-link alterations. Therefore, of great importancein PAR-CLIP data analysis is discarding the low-frequency sequencing errors, as well as high-frequency substitutions induced by contamination or sequence variants.Currently, there is a handful of methods available for analyzing PAR-CLIP data, trying to iden-tify the RNA-RBP binding sites using various techniques, such as kernel density estimation (Cor-coran et al. , 2011), non-parametric mixture models (Sievers et al. , 2012; Comoglio et al. , 2015),2ayesian hidden Markov models (Yun et al. , 2014), and binomial tests (Chen et al. , 2014). Someof the available methods focus on analyzing data from one specific type of RBP, such as, for ex-ample, AGO2 PAR-CLIP data (Erhard et al. , 2013). However, the non-cross-link, high-frequencysubstitutions are usually reported within high-confidence binding sites by most of these methods.To the best of our knowledge, only one of the currently existing PAR-CLIP analysis methods,namely WavClusteR (Sievers et al. , 2012; Comoglio et al. , 2015), accounts for these substitutions.Sievers et.al. show that cross-link loci reside in moderately altered sites (Sievers et al. PAR-CLIP and RNA-Seq reads were clipped from their adapter using the fastx clipper tool (http://hannonlab.cshl.edu) and only reads larger than 13 nucleotides were kept for furtheranalysis. The reads were aligned to the human reference genome assembly hg19 with the bowtie alignment tool version 0.12.9 (Langmead et al. , 2009) using the same parameters employed byPARalyzer and WavClusteR, -n 1 --best -m 100 -k 1 -l 50 . The data consists of PAR-CLIP cDNA sequencing reads. After alignment to the reference genome,at each position in an aligned read, the observed nucleotide can either match or differ from thereference. When the nucleotide differs from the reference, several causes are possible. First, theobserved nucleotide could be a sequence variant, caused by single-nucleotide polymorphism (SNP),or foreign, non-cross-linked RNA fragments mapped to a reference location highly similar to theirsequence (contamination). Second, if the reference is T and the corresponding read nucleotide isC, then a cross-link-induced substitution could have occurred. Third, a mismatch could have alsohappened due to sequencing error. These three events are not mutually exclusive.In order to detect RNA-protein cross-link-induced T-to-C substitutions, we model, for each po-sition i in the genome, where the reference, r i , is different from C, the probability of the observedT-to-C, A-to-C, or G-to-C substitution. We define x i as the sequencing coverage at position i , and y i as the number of times the reference nucleotide is substituted with C in all the reads covering po-3ition i . We assume that the observed T-to-C alterations are due to (i) sequencing error, (ii) SNPsor contamination, or (iii) PAR-CLIP cross-link-based substitution, whereas the observed A-to-Cand G-to-C substitutions are assumed to originate only from (i) sequencing error, or (ii) SNPs orcontamination. We ignore the cases of photo-activated sequence variants (i.e., where the referenceis A, or G, while the sequence variant is T, and is substituted to C due to photo-activation byPAR-CLIP). With this simplification, we introduce the latent random variable z i ∈ { , , } corre-sponding to the three possible reasons (i)-(iii) that can explain the observed nucleotide at locus i onthe genome. Specifically, for reference T positions, z i = 1 refers to background, z i = 2 correspondsto a sequence variant, and z i = 3 refers to an RNA-RBP cross-link. For reference A or G positions,only z i = 1 and z i = 2 are possible.We define (cid:15) as the probability of inducing a substitution due to sequencing noise. This proba-bility accounts for all the modeled nucleotide substitutions (i.e., T-to-C, A-to-C, G-to-C, C-to-T,C-to-A, and C-to-G) and is expected to be low. Consequently, at background positions ( z i = 1),the probability of occurrence of a specific substitution is (cid:15) . Furthermore, 3 (cid:15) represents the prob-ability of a nucleotide to mutate to any possible base. Therefore, in the case of sequence variantloci ( z i = 2), where one can assume that the aligned reads originate from a different sequence thanthe reference, substitutions happen with a success probability of 1 − (cid:15) . Finally, at cross-link loci( z i = 3), which can, at the same time, be affected by sequencing errors, T-to-C substitutions occurwith probability θ = (1 − γ ) (cid:15) + (1 − (cid:15) ) γ, (1)where γ corresponds to the probability of a T nucleotide to be mutated to C following photo-activation and cross-link during PAR-CLIP, i.e., to the efficiency of the protocol to induce T-to-Csubstitutions at cross-link loci. We assume that the probability θ is bounded between (cid:15) and 1 − (cid:15) ,which results in the constraint (cid:15) ≤ . ν = P ( z i = 2) the probability of a locus on the genome to be a sequencevariant, and, for the remaining cases which are not sequence variants, by ψ the probability that agenomic locus is a cross-link site. The observed data at each T reference position on the genome isthen modeled by a constrained mixture of three binomial distributions, and the probability of anobserved data point is P (( x i , y i ) | r i = T ) = (cid:88) z i =1 P (( x i , y i ) | z i , r i = T ) P ( z i | r i = T )= (1 − ψ )(1 − ν ) Bin ( y i ; x i , (cid:15) ) (cid:124) (cid:123)(cid:122) (cid:125) background ++ νBin ( y i ; x i , − (cid:15) ) (cid:124) (cid:123)(cid:122) (cid:125) sequence variant + ψ (1 − ν ) Bin ( y i ; x i , θ ) (cid:124) (cid:123)(cid:122) (cid:125) cross − link (2)where (cid:15), θ, γ, ν , and ψ are the parameters of the model, and the notation Bin ( k ; n, p ) correspondsto the probability mass function of the Binomial distribution, precisely the probability of having k successes within n trials with success probability p .In absence of a control PAR-CLIP experiment, our model readily incorporates information fromA-to-C and G-to-C alterations for a better estimation of the sequencing error (cid:15) and the probability ν . The probability of observed coverage and mismatch count for observed A-to-C and G-to-Calterations is P (( x i , y i ) | r i ∈ { A, G } ) = (1 − ν ) Bin ( y i ; x i , (cid:15) ) (cid:124) (cid:123)(cid:122) (cid:125) background + νBin ( y i ; x i , − (cid:15) ) (cid:124) (cid:123)(cid:122) (cid:125) sequence variant . (3)4he model (Figure 1C) is thus fully defined by equation 2 for reference genome T positions, andequation 3 for A and G positions in the reference genome. Using these equations, we can derivethe likelihood for the entire set of observations D = { ( x i , y i ) } i ∈G throughout the whole genome G as follows: L ( (cid:15), θ, γ, ν, ψ ) = P ( D | (cid:15), θ, γ, ν, ψ ) = (cid:89) i ∈G P (( x i , y i ) | r i ) (4)We infer all the parameters of our model by maximizing the above defined likelihood witha gradient-based nonlinear constrained optimization (Powell, 1978). To classify each T locus onthe genome as either background, sequence variant, or cross-link, we choose the maximum of theposterior probabilities of the latent variable z i : P ( z i = 1 | ( x i , y i ) , r i = T ) ∝ (1 − ψ )(1 − ν ) Bin ( y i ; x i , (cid:15) ) P ( z i = 2 | ( x i , y i ) , r i = T ) ∝ νBin ( y i ; x i , − (cid:15) ) P ( z i = 3 | ( x i , y i ) , r i = T ) ∝ ψ (1 − ν ) Bin ( y i ; x i , θ ) (5)where the ∝ symbol is used to represent proportionality between the posterior probability and theproduct between the likelihood and the prior, thus the normalization constant (same in all threecases) being omitted. Once the high-confidence cross-link T loci were identified, we report candidate RNA-RBP bindingsites by using the sequencing reads that span these loci. By binding site we denote the region on thetranscriptome where the protein of study attaches in order to fulfill a specific function. In order toconstruct the binding sites, all the reads spanning a cross-link locus are grouped into a cluster. Thelow-coverage boundaries are trimmed and overlapping clusters (at least 1 nucleotide) are groupedinto contigs and reported as candidate RNA-RBP binding sites (Supplementary Figure S1).
For the generation of realistic synthetic data, we used the AGO2 PAR-CLIP data publishedin (Kishore et al. , 2011) and we introduced systematic A-to-C substitutions. Precisely, startingfrom 2000 known random nucleotide A positions on chromosome 1, we introduced an A-to-C basechange in each sequencing read with probability µ at the respective positions. Furthermore, wechanged to C, using the same probability, the neighboring A nucleotides within an interval of 50bases centered in each of the 2000 positions. By altering the neighboring positions, we built regionswhere the incorporated A-to-C substitutions were more dense, simulating binding sites. The prob-ability µ denotes how likely an activated nucleotide T within a binding site is mutated to C in thereal PAR-CLIP protocol. The produced simulated data contains a realistic amount of sequencingerrors and contamination and is based on the alteration of a reference base different from T. Weassessed the performance of BMix, PARalyzer and WavClusteR on the simulated data. The T-to-Csubstitutions were ignored by all the applied methods on the simulated data and did not affect theiroutcome. We compared BMix to PARalyzer v1.1 (Corcoran et al. , 2011) and WavClusteR v2.0.0 (Comoglio et al. , 2015) on the produced synthetic data, as well as on publicly available PAR-CLIP datasetspublished in (Kishore et al. , 2011) and (Sievers et al. , 2012). On the simulated data, the three5 . . . . . . . . . . . . . A C s u b s titu tio n p ro b a b ility A v e r age a cc u r a cy We validated our model on synthetic and real human PAR-CLIP data and assessed its performancecompared to PARalyzer and WavClusteR methods (see Methods).
In the absence of a ground truth dataset for protein-binding sites, we generated a set of syntheticdatasets in order to evaluate our model and compare it to the existing methods WavClusteR andPARalyzer. We started from real world PAR-CLIP data and mimicked in silico the PAR-CLIPprotocol for a different substitution than T-to-C, precisely A-to-C. In this way, we kept the intrinsicnoise and contamination levels specific to PAR-CLIP data and, at the same time, introduced vali-dation loci. Furthermore, the simulated data was built independently from our model, providing anunbiased test set for all the methods. Knowing thus which A-to-C substitutions were introduced,we could test how accurate BMix as well as other methods were in detecting the altered loci. Foreach substitution probability µ in the simulation, ranging from 0 . .
9, we generated 100 datasets(cf. Methods) and computed the accuracy of BMix, WavCluster, and PARalyzer.On simulated data, BMix had a significantly higher accuracy (Wilcoxon test p < − ) than6 . . . . . . log10 coverage A C m i s m a t c h f r e q u e n c y . . . . . . log10 coverage A C m i s m a t c h f r e q u e n c y A B
Figure 3: Genome-wide A-to-C observed mismatch frequency as a function of log10 of coverage ina PAR-CLIP dataset (A) , as well as in the matched control RNA-Seq dataset (B) .the other two methods, for all the employed substitution probabilities µ (Figure 2). BMix hadon average an accuracy of 97%, compared to 90% for WavClusteR and 78% for PARalyzer. Allmethods reached their lowest accuracy at µ = 0 .
1. However, while WavClusteR and PARalyzerhad similar accuracy for this case (76% and 74%, respectively), BMix outperformed them with anaccuracy of 89%.By looking at the outcome of the three methods on a randomly chosen synthetic dataset with µ = 0 . We ultimately applied our method on three published human PAR-CLIP datasets correspondingto proteins AGO2, HUR (Kishore et al. , 2011), and MOV10 (Sievers et al. , 2012), and inferred themodel parameters for each dataset (Supplementary Table 1). Additionally, we applied WavClusteRand PARalyzer on the same data and we compared the three different methods by evaluating theirresults according to specific characteristics of the proteins such as miRNA, 3’UTR, and intronaffinity, as well as enrichment of protein-specific RNA recognition elements (RREs).By comparing the PAR-CLIP data for the AGO2 protein to matched RNA-Seq data from thesame sample, published in (Kishore et al. , 2011), we observed the expected prevalence of T-to-Csubstitutions (Supplementary Figure S3), but also a significantly larger amount of A-to-C and G-to-C alterations (one tailed Wilcoxon test p < − ) in the AGO2 PAR-CLIP dataset (Figure 3and Supplementary Figure S4) indicating a high level of contamination. Identification of AGO2 binding sites nRNAantisensem isc_RNAprocessed_pseudogenerRNAlincRNAotherm iRNAretained_intronnonsense_m ediated_decayprocessed_transcriptunannotatedprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5cross-linksequence variantbackground A rR N Am isc_R N Aantisenseprocessed_pseudogenelincR N Aotherm iR N Anonsense_m ediated_decayretained_intronunannotatedprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5BMix common sitesW avC lusteR-only sitesPAR alyzer-only sites B Figure 4: (A)
Proportion of BMix-classified loci within each Ensembl gene type retrieved us-ing the UCSC table browser (Karolchik et al. , 2004) for replicate A of the AGO2 PAR-CLIPdataset (Kishore et al. , 2011). (B)
Annotation according to the Ensembl gene types for bindingsites commonly identified by BMix and the other three methods, as well as for the additional sitesreported by PARalyzer and WavClusteR. The proportion of binding sites within each gene typeis displayed. All the Ensembl types which contained less than 0.1% sites were grouped under thename ”other” and all the sites which did not fall within any annotation were marked as ”unanno-tated” .Two replicate PAR-CLIP datasets were tested for the AGO2 protein. Since AGO2 is one of theproteins involved in RISC, its affinity to miRNAs and 3’UTRs was expected to be elevated (Hen-drickson et al. , 2008).BMix identified 15,317 binding sites for replicate A, and 9,615 binding sites for replicate Bof the AGO2 dataset. We annotated the three classes of loci reported by BMix (background,sequence variant and cross-link) according to the Ensembl gene types retrieved using the UCSCTable Browser (Karolchik et al. , 2004). A higher proportion of background and sequence variant locioverlapped with ribosomal RNA (rRNA) and unannotated regions than the cross-link loci whichmostly covered protein-coding regions (Figure 4A). Both PARalyzer and WavClusteR reportedaround 4000 more binding sites than BMix (Supplementary Table 2). Annotation of the bindingsites according to the same Ensembl types showed that a large proportion of these additional sitescovered significantly more rRNA and unannotated regions and less protein coding regions comparedto the common sites (Figure 4B). To assess the reproducibility of the three methods, each methodwas applied on each AGO2 replicate dataset independently, and the number of common miRNAsfound within the binding sites between replicates was reported. All the three methods yieldeda similar high percentage of reproducible miRNAs: 88.6% for BMix, 88.98% for PARalyzer and89.1% for WavClusteR (Figure 5A and Supplementary Table 2).For replicate A, over 95.37% of the binding sites found with BMix overlapped with the onesfound by PARalyzer, and 99.99% overlapped with WavClusteR. Similarly, for replicate B, 96.65%of the sites reported by BMix overlapped with the PARalyzer binding sites, and 99.99% with theWavClusteR sites (Supplementary Table 2). BMix reported on average 4% more of its bindingsites within 3’UTRs than the other two methods (Figure 5B and Supplementary Table 2). Over70% of the binding sites reported by BMix were less than 30 nucleotides long (SupplementaryFigure 8A), in concordance with the expected small length of miRNA targets. Furthermore, giventhat it has been previously reported that some targeted 3’UTRs may have high miRNA target-siteabundance (Garcia et al. , 2011), we investigated the number of identified binding sites in each3’UTR. We found that over 85% of the covered 3’UTRs contained only one binding site, while asmall proportion had several binding sites (Supplementary Figure 8B).8
AGO2 R e p r o d u c e d m i R N A s B AGO2
Rep. A Rep. B o v e r l a p w i t h ' U T R s C MOV10 o v e r l a p w i t h ' U T R s D HUR R e p r o d u c e d RR E s E HUR
Rep. A Rep. B o v e r l a p w i t h ' U T R s & i n t r o n s BMix PARalyzer WavClusteR
F G H log10 sample size e x e c u t i o n t i m e ( h ) I observed TC mismatch frequency % r e p o r t e d l o c i AGO2 Rep. A observed TC mismatch frequency % r e p o r t e d l o c i HUR Rep. A observed TC mismatch frequency % r e p o r t e d l o c i MOV10
Figure 5: Summary of results from applying BMix, PARalyzer and WavClusteR on human datasetsfor proteins AGO2, MOV10 and HUR. (A)
Reproducibility of the methods in terms of percentageof commonly reported miRNAs between two AGO2 PAR-CLIP replicates. (B)
Percentage ofbinding sites reported in 3’UTRs for the AGO2 datasets. (C)
Percentage of binding sites reportedin 3’UTRs for the MOV10 dataset. (D)
Reproducibility of the methods in terms of commonlyreported RREs between two HUR PAR-CLIP replicates. (E)
Percentage of binding sites reportedin 3’UTRs and introns for the HUR datasets. (F-H)
Fraction of reported loci with a particularT-to-C substitution frequency out of the total number of observed loci with that substitutionfrequency. (I)
Execution time for the three methods as function of log10 of the number of reportedbinding sites.
Identification of MOV10 binding sites
Next, we applied the three methods on a dataset for the MOV10 protein, also expected to pref-erentially bind in 3’UTRs (Sievers et al. , 2012). In absence of replicates, we could not assess thereproducibility of the methods on this dataset. However, as for the previous dataset, BMix found ahigher percentage of its binding sites (56.39%) in 3’UTRs than PARalyzer and WavClusteR whichobtained an overlap of 40.14% and 50.4%, respectively (Figure 5C and Supplementary Table 3).Furthermore, more than 96% of the binding sites reported by BMix overlapped with the ones foundby the two other methods.
Identification of HUR binding sites
Finally, we applied a similar evaluation scheme on two replicate PAR-CLIP datasets for the HURprotein, a well-characterized RBP involved in maintaining mRNA stability and regulating geneexpression (Peng et al. , 1998). It has been shown that this protein preferentially binds AU-rich re-gions in 3’UTRs of messenger RNAs, as well as intronic regions (Lebedeva et al. , 2011). Therefore,we quantified the amount of binding sites found by each method within these genomic featuresfor each replicate independently, as well as their enrichment for HUR-specific RNA recognitionelements (RREs) described in (Ma et al. , 1996).For both replicates, the binding sites found by BMix overlapped over 90% with the binding sitesreported by the other methods (Supplementary Table 4). To assess the reproducibility of the meth-ods, we evaluated the percentage of RRE-enriched sites common between the two replicates. Eventhough it reported less binding sites than the other methods, BMix reached a higher reproducibil-9ty of 50.8% compared to 49.45% obtained with PARalyzer and 46.27% reported by WavClusteR(Figure 5D and Supplementary Table 4). For both replicates, BMix had also a superior percentageof RRE-enriched sites than the other two methods (Supplementary Table 4).BMix, PARalyzer and WavClusteR were applied on each replicate HUR dataset independently.For both replicates, BMix reported more binding sites within 3’UTRs and intronic regions com-pared to the other two methods (Figure 5E). Precisely, 68.1% of the sites reported by BMix forreplicate A overlapped with 3’UTRs and intronic regions, while PARalyzer and WavClusteR at-tained 66.95% and 68.06%, respectively (Supplementary Table 4). For replicate B, despite the lowernumber of reported sites, BMix had a superior overlap with 3’UTRs, namely 66.83%, compared to59.83% and 59.37% obtained by PARalyzer and WavClusteR, respectively (Supplementary Table 4).For all the datasets, the additional cross-link loci reported by PARalyzer and WavClusteR com-pared to BMix had either a low substitution frequency and low coverage, or a high substitutionfrequency. To illustrate this, we calculated, for each method, the fraction of reported loci with aparticular T-to-C substitution frequency out of the total number of observed loci with that substi-tution frequency. PARalyzer reported all the high-frequency altered loci with no exception, whileWavClusteR learned a more lenient threshold than BMix for the low and high substitution fre-quency regions (Figures 5F-H and Supplementary Figure S5). Furthermore, the additional bindingsites reported by the two other methods were more prevalent in rRNA and unannotated regionscompared to the BMix sites (Figure 4B and Supplementary Figure S7). A large proportion of thebinding sites reported by BMix were less than 30 nucleotides long, and several covered 3’UTRscontained more than one binding site (Supplementary Figures S8-S12).We analyzed the impact of the alignment strategy on the results of our method by testing BMixon Bowtie-aligned AGO2 PAR-CLIP data (Kishore et al. , 2011), allowing for one, two and threemismatches. The number of BMix-reported binding sites increased to over 40%, from 15,317 siteswith one mismatch to 21,607 and 25,289 sites with two and three mismatches, respectively. Over50% of the identified binding sites with two or three mismatches were also detected with one mis-match (Supplementary Table 5). More than 85% of the binding sites identified with BMix usingone mismatch were also found by using two or three mismatches. Nevertheless, the configurationof the three types of loci classified with BMix changed as the number of mismatches was increased.Precisely, the fraction of reported cross-linked loci decreased, whereas the other two classes gainedin size as the number of allowed mismatches increased (Supplementary Figure S13A). On the otherhand, PARalyzer doubled the number of reported binding sites when more than one mismatch wasallowed, reporting 19,248 sites for one mismatch, 40,522 sites for two mismatches, and 51,029 sitesfor three mismatches. WavClusteR functions specifically for alignments with one allowed mismatch,therefore testing for more mismatches was not possible. We performed the same analysis with BMixby choosing TopHat (Trapnell et al. , 2009) as aligner (Supplementary Figure S13B) on the samedataset, although PAR-CLIP reads are generally too short to be efficiently used by a splice-awarealigner. Similar results were obtained with TopHat as with Bowtie, especially when one or twomismatches were allowed (Supplementary Tables 5 and 6, and Supplementary Figure S13).In terms of execution time, on one core of a Linux machine with a clock rate of 2.3GHz, BMixproved to be considerably faster than the other two methods, running on average in less than 40minon all the datasets. On the contrary, the execution time of both PARalyzer and WavClusteR onthe same machine increased with the sample size from several hours to multiple days for the HURdatasets (Figure 5I). 10
Discussion
In the presented work, we have proposed BMix, a new probabilistic model for identifying high-confidence RNA-protein interaction sites from PAR-CLIP data. BMix uses a constrained semi-supervised three-component binomial mixture model to describe the T-to-C substitutions observedat genomic loci in three categories: low-frequency errors due to sequencing noise, true cross-linksites, or high-frequency sequence variants caused by SNPs or contamination. Therefore, our modelbrings the novelty of accounting for both low and high-frequency erroneous T-to-C alterations inPAR-CLIP data. We validated and demonstrated the superior performance of BMix compared tothe methods WavClusteR v2.0.0 and PARalyzer v1.1 both on synthetic and real data.Most of the current PAR-CLIP analysis methods focus on filtering low-frequency altered loci andconsider the high-frequency substitutions as reliable indicators of cross-linking. We have observedthis behavior also within our study on real data, where PARalyzer has selected all the high-frequencyaltered loci within its reported binding sites (Figures 5F-H and Supplementary Figure S5). However,methods like PARalyzer ignore the possibility that high-frequency alterations could have beencaused by other factors such as contamination or single nucleotide variants. By comparing PAR-CLIP and matched control RNA-Seq data from the same experiment, the prevalence of highlyaltered loci was clearly observed also for non T-to-C substitutions in published data (Figure 3),which motivates the need of identifying and discarding spurious highly altered loci from the analysis.So far, only the WavClusteR method accounts for high-frequency substitutions. However, becauseit uses relative substitution frequency values instead of actual read counts, the method loosesperformance especially in low and high substitution regions, as presented in real and synthetic dataapplications (Figures 2, 5F-H and Supplementary Figures S2 and S5).In an extensive simulation study using varying PAR-CLIP substitution probability µ , we haveobserved that the accuracy of both BMix and WavClusteR attains a global maximum at µ = 0 . µ = 0 .
7, respectively and then decreases. In other words, a perfect 100% PAR-CLIP T-to-Csubstitution efficiency would not improve the result; on the contrary it would make difficult todifferentiate between induced substitutions and high-frequency errors.The Ensembl annotation of the three classes of loci reported by BMix showed that our modelcaptures more rRNA and unannotated RNA within its background and sequence variant mixturecomponents, while the reported cross-link loci mainly cover protein-coding regions and miRNAs.The PARalyzer approach, based on selecting high-frequency mutations as high-confidence bindingsites is therefore at risk of reporting a large amount of spurious alterations as cross-link loci.WavClusteR is exposed to the same risk by using relative frequencies instead of substitutionscounts, thus disregarding uncertainty in the low coverage regions. As a result, BMix identified lessbinding sites than the other two methods, at the same time keeping high the proportion of bindingsites overlapping with features of interest. The additional binding sites detected by the other twomethods overlapped more with rRNA and unannotated regions than the sites commonly identifiedwith BMix.BMix, as well as the other PAR-CLIP analysis methods, relies on the alignment of sequencingreads to a reference genome. In this work, we have chosen the standard alignment strategy employedby PARalyzer and WavClusteR, aligning the sequencing reads with Bowtie and allowing for onemismatch. However, a strict alignment strategy would potentially discard reads with viable cross-link T-to-C alterations. Due to explicit modeling of noise and contamination, our model is lesssensitive to the choice of alignment and is able to control the false positive rate even for morelenient alignment parameters. As a result, the user of BMix can pick the alignment procedure whichbest suits the data without having to maintain a too strict control on the alignment parameters.This procedure depends on multiple aspects such as, for example, the quality of the sequencing, the11ength of the reads, or the binding protein (K¨onig et al. , 2012). Ultimately, a systematic comparisonof the results obtained from different alignment strategies can be utilized for a better quantificationof the binding sites.A limitation of our method consists in the difficulty of sorting out T-to-C substitutions ofmoderate frequency which have not been introduced by PAR-CLIP. These can occur followingdiverse molecular processes such as, for example, RNA editing by ADAR enzyme (Samuel, 2011;Cattaneo and Billeter, 1992), following contamination, or as heterozygous SNPs. In this case, anyanalysis relying entirely on coverage and substitution counts does not have sufficient statisticalpower to discard these false loci, and our method can report false positives originating from theseloci. Nevertheless, these substitutions would also appear during a control experiment. A controlPAR-CLIP experiment that would include the same steps as in the normal PAR-CLIP protocol,except for inducing T-to-C alterations, could be helpful. Therefore, a potential solution to thislimitation would be to use a control PAR-CLIP or RNA-seq experiment. The control experimentswould reveal these substitutions and one can afterwards subtract them from the PAR-CLIP cross-link T-to-C alterations identified with BMix.Due to the reduction of sequencing costs, a high increase in sequencing-based experiments suchas PAR-CLIP is expected in the near future. BMix provides a rigorous probabilistic method whichis significantly faster and more accurate than the current state-of-the art methods for detectingRNA-protein interaction sites in PAR-CLIP data.
Acknowledgement
The authors thank Federico Comoglio and Cem Sievers for valuable feedback and discussions, aswell as their support with WavClusteR.
References
Blencowe, B. J. et al. (2009). Current-generation high-throughput sequencing: deepening insightsinto mammalian transcriptomes.
Genes & Dev. , (23), 1379–1386.Cattaneo, R. and Billeter, M. (1992).
Mutations and A/I Hypermutations in Measles Virus Persis-tent Infections , volume 176 of
Current Topics in Microbiology and Immunology . Springer BerlinHeidelberg.Chen, B. et al. (2014). PIPE-CLIP: a comprehensive online tool for CLIP-seq data analysis.
Genomebiology , (1), R18.Comoglio, F. et al. (2015). Sensitive and highly resolved identification of RNA-protein interactionsites in PAR-CLIP data. BMC Bioinformatics , (1), 32+.Corcoran, D. et al. (2011). PARalyzer: definition of RNA binding sites from PAR-CLIP short-readsequence data. Genome Biology , (8), R79.Erhard, F. et al. (2013). PARma: identification of microRNA target sites in AGO-PAR-CLIP data. Genome Biology , (7), R79.Garcia, D. M. et al. (2011). Weak seed-pairing stability and high target-site abundance decreasethe proficiency of lsy-6 and other microRNAs. Nat Struct Mol Biol , (10), 1139–1146.12erstberger, S. et al. (2014). A census of human RNA-binding proteins. Nature Reviews Genetics , (12), 829–845.Hafner, M. et al. (2010). Transcriptome-wide identification of RNA-binding protein and microRNAtarget sites by PAR-CLIP. Cell , (1), 129–41.Hendrickson, D. G. et al. (2008). Systematic identification of mRNAs recruited to Argonaute 2 byspecific microRNAs and corresponding changes in transcript abundance. PloS one , (5), e2126.Karolchik, D. et al. (2004). The UCSC Table Browser data retrieval tool. Nucleic Acids Research , (suppl 1), D493–D496.Kechavarzi, B. et al. (2014). Dissecting the expression landscape of RNA-binding proteins in humancancers. Genome biology , (1), R14.Kishore, S. et al. (2011). A quantitative analysis of CLIP methods for identifying binding sites ofRNA-binding proteins. Nature methods , (7), 559–64.Kloetgen, A. et al. (2014). Biochemical and bioinformatic methods for elucidating the role ofRNA-protein interactions in posttranscriptional regulation. Briefings in Functional Genomics .K¨onig, J. et al. (2012). Protein-RNA interactions: new genomic technologies and perspectives.
Nature , (February), 77–83.Langmead, B. et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences tothe human genome. Genome Biology , (3), R25.Lebedeva, S. et al. (2011). Transcriptome-wide Analysis of Regulatory Interactions of the RNA-Binding Protein HuR. Molecular Cell , (3), 340 – 352.Lukong, K. E. et al. (2008). RNA-binding proteins in human genetic disease. Trends in genetics :TIG , (8), 416–25.Lunde, B. et al. (2007). RNA-binding proteins: modular design for efficient function. Nat Rev MolCell Biol , (6), 479–90.Ma, W.-j. et al. (1996). Cloning and Characterization of HuR, a Ubiquitously Expressed Elav-likeProtein. The Journal of Biological Chemistry , (14), 8144–8151.Meister, G. (2013). Argonaute proteins: functional insights and emerging roles. Nature reviews.Genetics , (7), 447–59.Muller-McNicoll, M. et al. (2013). How cells get the message: dynamic assembly and function ofmRNA - protein complexes. Nature Reviews Genetics , (4), 275–287.Peng, S. S.-Y. et al. (1998). RNA stabilization by the AU-rich element binding protein, HuR, anELAV protein. The EMBO Journal , (12), 3461–3470.Powell, M. (1978). A fast algorithm for nonlinearly constrained optimization calculations , volume630 of
Lecture Notes in Mathematics . Springer Berlin Heidelberg.Samuel, C. E. (2011). Adenosine deaminases acting on RNA (ADARs) are both antiviral andproviral.
Virology , (2), 180 – 193. Special Reviews Issue 2011.13ievers, C. et al. (2012). Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. Nucleic Acids Research .Trapnell, C. et al. (2009). Tophat: discovering splice junctions with RNA-Seq.
Bioinformatics , (9), 1105–1111.Yun, J. et al. (2014). Bayesian hidden Markov models to identify RNA - protein interaction sitesin PAR-CLIP. Biometrics , (2), 430–440. 14 upplementary materials for:Probabilistic modeling of occurring substitutions in PAR-CLIP data Monica Golumbeanu, Pejman Mohammadi, Niko Beerenwinkel
Supplementary Figures mapped reads genomecontig (reported binding site)cluster
Figure S1: Construction of a binding site starting from high-confidence cross-link T loci (red circles).First, the reads spanning these loci (blue segments) are grouped into clusters. Low coverage marginsare trimmed off. Overlapping clusters are merged into contigs and reported as RNA-RBP bindingsites. . . . . . . log10 coverage A C m i s m a t c h f r e q u e n c y AllA CTrueA C 1 2 3 4 5 . . . . . . log10 coverage A C m i s m a t c h f r e q u e n c y AllA CBMixA C1 2 3 4 5 . . . . . . log10 coverage A C m i s m a t c h f r e q u e n c y AllA CWavClusteR A C 1 2 3 4 5 . . . . . . log10 coverage A C m i s m a t c h f r e q u e n c y AllA CPARalyzerA C
A BC D
Figure S2: BMix, PARalyzer and WavClusteR outcomes on a simulated dataset with incorporatedA-to-C alterations with µ = 0 . (A) True A-to-C induced substitutions (green) within the simulated data (grey). (B)
A-to-C substitutionsdetected by BMix (blue). (C)
A-to-C substitutions identified by WavClusteR (blue). (D)
A-to-Csubstitutions reported by PARalyzer (blue). 15 B . . . . . . log10 coverage T C m i s m a t c h f r e q u e n c y . . . . . . log10 coverage T C m i s m a t c h f r e q u e n c y Figure S3: Genome-wide T-to-C observed mismatch frequency as a function of log10 of coveragein a PAR-CLIP dataset (A) compared to the matched control RNA-Seq dataset (B) . . . . . . . log10 coverage G C m i s m a t c h f r e q u e n c y A . . . . . . log10 coverage G C m i s m a t c h f r e q u e n c y B Figure S4: Genome-wide G-to-C observed mismatch frequency as a function of log10 of coveragein a PAR-CLIP dataset (A) compared to the matched control RNA-Seq dataset (B) . A B observed TC mismatch frequency % r e p o r t e d l o c i BMixPARalyzerWavClusteR
AGO2 Rep. B observed TC mismatch frequency % r e p o r t e d l o c i BMixPARalyzerWavClusteR
HUR Rep. B
Figure S5: Fraction of reported loci with a particular T-to-C substitution frequency out of the totalnumber of observed loci with that substitution frequency for two PAR-CLIP datasets from proteinsAGO2 and HUR. 16 rocessed_pseudogenesnRNAsnoRNAantisensemisc_RNArRNAotherlincRNAretained_intronnonsense_mediated_decaymiRNAprocessed_transcriptunannotatedprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5cross-linksequence variantbackground A rRNAsense_intronicsense_overlappingprocessed_pseudogeneantisenseotherlincRNAretained_intronunannotatednonsense_mediated_decayprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5 0.6 BC rRNAantisenselincRNAotherunannotatedretained_intronnonsense_mediated_decayprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5 D rRNAantisenseotherlincRNAretained_intronunannotatednonsense_mediated_decayprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5 0.6 Figure S6: Proportion of BMix-classified loci within each Ensembl gene type retrieved using theUCSC Table Browser for the replicate B of the AGO2 dataset (A) , MOV10 dataset (B) , replicateA of the HUR dataset (C) and replicate B of the HUR dataset (D) . All the Ensembl types whichcontained less than 0.1% loci were grouped under the name ”other” and all the loci which did notfall within any annotation were marked as ”unannotated” . snRNAsnoRNAmisc_RNArRNAantisenseotherlincRNAmiRNAnonsense_mediated_decayretained_intronprocessed_transcriptunannotatedprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5BMix common sitesWavClusteR-only sitesPARalyzer-only sites A rRNAantisenselincRNAotherretained_intronunannotatednonsense_mediated_decayprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5 0.6 B rRNAantisenselincRNAotherunannotatedretained_intronnonsense_mediated_decayprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5 C rRNAantisenseotherlincRNAretained_intronnonsense_mediated_decayunannotatedprocessed_transcriptprotein_coding fraction0.0 0.1 0.2 0.3 0.4 0.5 D Figure S7: Annotation according to the Ensembl gene types for binding sites commonly identifiedby BMix and the other two methods, as well as for the additional sites reported by PARalyzerand WavClusteR for the replicate B of the AGO2 dataset (A) , MOV10 dataset (B) , replicate Aof the HUR dataset (C) and replicate B of the HUR dataset (D) . The proportion of binding siteswithin each gene type is displayed. All the Ensembl types which contained less than 0.1% siteswere grouped under the name ”other” and all the sites which did not fall within any annotationwere marked as ”unannotated” . 17 ength (nucleotides) N u m be r o f b i nd i ng s i t e s contained binding sites in each 3'U T R nu m be r o f ' U T R s A B
Figure S8: (A)
Histogram of the length of the binding sites reported by BMix for replicate A ofthe AGO2 dataset. (B)
Histogram of the number of binding sites reported by BMix in each 3’UTRfor replicate A of the AGO2 dataset.
A B
Length (nucleotides) N u m be r o f b i nd i ng s i t e s
20 60 100 140 contained binding sites in each 3'U T R nu m be r o f ' U T R s Figure S9: (A)
Histogram of the length of the binding sites reported by BMix for replicate B ofthe AGO2 dataset. (B)
Histogram of the number of binding sites reported by BMix in each 3’UTRfor replicate B of the AGO2 dataset.
A B
Length (nucleotides) N u m be r o f b i nd i ng s i t e s contained binding sites in each 3'U T R nu m be r o f ' U T R s Figure S10: (A)
Histogram of the length of the binding sites reported by BMix for replicate A ofthe HUR dataset. (B)
Histogram of the number of binding sites reported by BMix in each 3’UTRfor replicate A of the HUR dataset. 18 B Length (nucleotides) N u m be r o f b i nd i ng s i t e s contained binding sites in each 3'U T R nu m be r o f ' U T R s Figure S11: (A)
Histogram of the length of the binding sites reported by BMix for replicate B ofthe HUR dataset. (B)
Histogram of the number of binding sites reported by BMix in each 3’UTRfor replicate A of the HUR dataset.
A B
Length (nucleotides) N u m be r o f b i nd i ng s i t e s contained binding sites in each 3'U T R nu m be r o f ' U T R s Figure S12: (A)
Histogram of the length of the binding sites reported by BMix for the MOV10dataset. (B)
Histogram of the number of binding sites reported by BMix in each 3’UTR for theMOV10 dataset. f r a c t i on A f r a c t i on B Figure S13: (A)
Proportion of each BMix mixture component out of the total number of classifiedloci when Bowtie is used to align the PAR-CLIP reads with one, two or three allowed mismatches. (B)
Proportion of each BMix mixture component out of the total number of classified loci whenTopHat is used to align the PAR-CLIP reads with one, two or three allowed mismatches.19 upplementary Tables dataset replicate strand (cid:15) γ θ ν ψ(cid:15) γ θ ν ψ