What Is the Ideal Reads to See Alternative Splice Variants From Rna Sequencing
Ann Appl Stat. Author manuscript; available in PMC 2015 Mar 1.
Published in final edited grade as:
Ann Appl Stat. 2014 Mar; 8(1): 309–330.
PMCID: PMC4005600
NIHMSID: NIHMS563544
QUANTIFYING Culling SPLICING FROM PAIRED-Finish RNA-SEQUENCING DATA
David Rossell
†Academy of Warwick
Camille Stephan-Otto Attolini
‡Institute for Enquiry in Biomedicine of Barcelona
Manuel Kroiss
§LMU Munich
¶TU Munich
Almond Stöcker
§LMU Munich
Abstract
RNA-sequencing has revolutionized biomedical research and, in particular, our ability to study gene alternative splicing. The problem has of import implications for homo health, equally alternative splicing may exist involved in malfunctions at the cellular level and multiple diseases. However, the high-dimensional nature of the data and the existence of experimental biases pose serious data analysis challenges. We find that the standard information summaries used to study alternative splicing are severely limited, as they ignore a substantial amount of valuable information. Current data analysis methods are based on such summaries and are hence sub-optimal. Further, they have limited flexibility in accounting for technical biases. We propose novel data summaries and a Bayesian modeling framework that overcome these limitations and determine biases in a not-parametric, highly flexible manner. These summaries adapt naturally to the rapid improvements in sequencing engineering science. We provide efficient point estimates and dubiety assessments. The approach allows to report alternative splicing patterns for individual samples and tin can also be the ground for downstream analyses. We plant a several fold comeback in estimation hateful square fault compared pop approaches in simulations, and substantially higher consistency between replicates in experimental data. Our findings indicate the need for adjusting the routine summarization and assay of alternative splicing RNA-seq studies. We provide a software implementation in the R packet casper *
Keywords and phrases: Alternative Splicing, RNA-Seq, Bayesian modelling, Estimation
i. Introduction
RNA-sequencing (RNA-seq) produces an overwhelming corporeality of genomic data in a single experiment, providing an unprecedented resolution to accost biological problems. We focus on gene expression experiments where the goal is to study alternative splicing (Equally), which we briefly introduce. AS is an of import biological process past which cells are able to express several variants, also known as isoforms, of a single cistron. Each splicing variant gives rise to a different poly peptide with a unique structure that tin can perform dissimilar functions and respond to internal and environmental needs. AS is believed to contribute to the complication of higher organisms, and is in fact specially common in humans (Blencowe, 2006). Additionally, information technology is known to be involved in multiple diseases such equally cancer and malfunctions at the cellular level. Despite its importance, due to limitations in earlier technologies most cistron expression studies have ignored As and focused on overall gene expression.
Consider the hypothetical instance of a gene with three splice variants shown in Effigy i. The factor is encoded in the DNA in iii exons, shown as boxes in Figure 1.When the gene is transcribed as messenger RNA (mRNA), it can give ascent to three isoforms. Variant i is formed by all 3 exons, whereas variant 2 skips the second exon and variant 3 the tertiary exon. Commonly, multiple variants are expressed simultaneously at any given time. In our example, variant one makes upwards for 60% of the overall expression of the gene, variant 2 for 30% and variant iii for 10%. In practice, these proportions are unknown and our goal is to estimate them as accurately as possible.
We focus on paired-finish RNA-seq experiments, equally they are the current standard and provide higher resolution for measuring isoform expression than competing technologies, e.g. microarrays (Pepke, Wold and A., 2009). RNA-seq sequences tens or even hundreds of one thousand thousand mRNA fragments, which can then be aligned to a reference genome using a variety of software, e.g. TopHat (Trapnell, Pachter and Salzberg, 2009), SOAP (Li et al., 2009) or BWA (Li and Durbin, 2009). Throughout, we assume that the software tin can handle gapped alignments (we used TopHat in all our examples). Early RNA-seq studies used single-end sequencing, where only the left or right finish of a fragment is sequenced. In contrast, paired-end RNA-seq sequences both fragment ends. Table one shows three hypothetical sequenced fragments respective to the gene in Figure 1. 75 base pairs (bp) were sequenced from each terminate. For case, both ends of fragment 1 align to exon i. Every bit the three variants contain exon 1, in principle this fragment could have been generated past any variant. For fragment 2 the left read aligned to exons 1 and 2 (i.east. it spanned the junction between both exons), and the right read to exon 3. Hence, fragment 2 can only have been generated from variant 1. Finally, fragment three visits exons 1 and 2 and hence it could have been generated either past variants 1 or 3. The example is simply meant to provide some intuition. In practice most genes are essentially longer and take more complicated splicing patterns. Precise probability calculations are required to ensure that the conclusions are sound.
Table 1
Chromosome | Left read | Correct read | Exon path | |
---|---|---|---|---|
| ||||
Fragment 1 | chr1 | 110–185 | 200–274 | {one}, {i} |
Fragment 2 | chr1 | 361–400; 1001–1035 | 2011–2085 | {1,ii}, {3} |
Fragment 3 | chr1 | 301–375 | 1021–1095 | {1}, {2} |
… |
Ideally, one would want to sequence the whole variant, so that each fragment tin be uniquely assigned to a variant. Unfortunately, current technologies sequence hundreds of base pairs, which is orders of magnitude shorter than typical variant lengths. Current statistical approaches are based on the observation that, while virtually sequenced fragments cannot be uniquely assigned to a variant, it is possible to make probability statements. For case, fragment iii in Table 1 may have originated either from variants 1 or iii, but the probability that each variant generates such a fragment is unlike. As we shall see below, this observation prompts a direct use of Bayes theorem.
In principle, i could formulate a probability model that uses the full information, i.e. the verbal base of operations pairs covered by each fragment such equally provided in Table 1 due east.g. Glaus, Honkela and Rattray (2012)). Nevertheless, our findings bespeak that such strategies can be computationally prohibitive and deliver no obvious comeback (Department 4). Further, data storage and transfer requirements impose a need for reducing the size of the data. Several authors proposed summarizing the data by counting the number of fragments either covering each exon or each exon junction (e.g. Xing et al. (2006), Mortazavi et al. (2008), Jiang and Wong (2009)). In fact, large-calibration genomic databases study precisely these summaries, eastward.g. The Cancer Genome Atlas projection i . 1 tin can then pose a probability model that uses count data from a few categories as raw data, which profoundly simplifies ciphering. While useful, this approach is seriously limited to considering pairwise junctions, which discards relevant information. For instance, suppose that a fragment visits exons i, 2 and three. Simply adding 1 to the count of fragments spanning exons 1–2 and 2–3 ignores the joint information that a single fragment visited 3 exons and decreases the confidence when inferring the variant that generated the fragment. Our results suggest that ignoring this information tin consequence in a serious loss of precision. It is non uncommon that a fragment spans more than two exons. Holt and Jones (2008) establish a substantial proportion of fragments bridging several exons in paired-end RNA-seq experiments. In the 2009 RGASP experimental data gear up (Department 4) 38.0% and 40.9% fragments spanned ≥three exons in replicate 1 and 2, respectively (we sub-divided exons so that they are fully shared/not shared by all variants in a gene). In the 2012 ENCODE data set up we found 64.seven% and 65.2% in each replicate. The 2012 data had substantially longer reads and fragments, which illustrates the rapid advancements in applied science. As sequencing evolves these percentages are expected to increase further.
Nosotros propose novel information summaries that preserve most information relevant to culling splicing, while maintaining the computational burden at a manageable level. We record the sequence of exons visited by each fragment finish, which we refer to as exon path, and and then count the number of fragments post-obit each exon path. The left end of Fragment 2 in Table 1 visits exons 1 and ii and the correct end exon iii, which we announce as {i, 2}, {iii}. Notice that a fragment following the path {1}, {2, 3} visits the same exons, and so 1 could exist tempted to simply tape {i, two, 3} in both cases. However, the probability of observing {1, 2}, {3} for a given variant differs from {1}, {2, 3}, and hence combining the two paths would result in a potential loss of data. Table 1 contains hypothetical exon path counts for our example gene. We use these counts as the basic input for our probability model.
Paired-end RNA-seq is disquisitional for As studies. Intuitively, compared to single-cease sequencing it increases the probability of observing fragments that connect exon junctions. Lacroix et al. (2008) showed that, although neither protocol guarantees the existence of a unique solution, in practice paired finish (merely not single end) can provide asymptotically correct estimates for 99.7% of the human genes. In contrast, for unmarried-stop data the figure is ane.fourteen%. Unfortunately, much of the electric current methodology has been designed with single finish information in mind. Xing et al. (2006) codify the trouble as that of traversing a directed acyclic graph and formulate a latent variable based approach to gauge splice variant expression. Jiang and Wong (2009) propose a similar approach within the Bayesian framework. Both approaches were designed for single-finish RNA-seq information. Ameur et al. (2010) proposed strategies to detect splicing junctions, and Katz et al. (2010) and Wu et al. (2011) introduced models to approximate the percentage of isoforms skipping private exons. Even so, these approaches exercise not judge expression at the variant level.
Several authors propose strategies that employ paired-ends. Mortazavi et al. (2008), Montgomery et al. (2010), Trapnell et al. (2010) and Salzman, Jiang and Wong (2011) model the number of fragments spanning exon junctions. These approaches focus on pairwise exon connections, ignoring valuable higher-lodge information, and have limitations in incorporating important technical biases. Showtime, the sample preparation protocols usually induce an enrichment towards the 3' cease of the transcript, i.e. fragments are not uniformly distributed along the gene. Roberts et al. (2011b), Wu, Wang and Zhang (2011) or Glaus, Honkela and Rattray (2012) relax the uniformity supposition. Farther, the fragment length distribution plays an important role in the probability calculations and needs to be estimated accurately. While the approaches in a higher place acknowledge this result, they either utilise sequencing facility reports (i.e. they do not judge the distribution from the data) or they impose strong parametric assumptions. Our examples illustrate that facility reports can be inaccurate, and that parametric forms practice not capture the observed asymmetries, heavy tails or multi-modalities. Further, all previous approaches presume that fragment start and length distributions are constant across all genes. Nosotros provide empirical evidence that this assumption can exist flawed, and suggest a strategy to relax the assumption.
A business organization with current genome annotations is that they may miss some splicing variants. Our arroyo tin can be combined with methods that predict new variants such every bit Cufflinks RABT module (Trapnell et al., 2010; Roberts et al., 2011a), Scripture (Guttman et al., 2010) or SpliceGrapher (Rogers et al., 2012). This choice is implemented in our R package and illustrated in Section 4.3.
In summary, nosotros suggest a flexible framework to guess alternative splicing from RNA-seq studies, past using novel data summaries and bookkeeping for experimental biases. In Department two we formulate a probability model that goes beyond pairwise connections by considering exon paths. We model the read outset and fragment size distributions non-parametrically and allow for separate estimation within subsets of genes with like characteristics. Section 3 discusses model fitting and provides algorithms to obtain signal estimates, asymptotic credibility intervals and posterior samples. Nosotros prove some results in Section four and provide concluding remarks in Section v.
2. Probability model
We formulate the model at the cistron level and perform inference separately for each cistron. In some cases, exons from different genes overlap with each other. When this occurs we group the overlapping genes and consider all their isoforms simultaneously. It is also possible that 2 variants share only a part of an exon. We sub-divide such exons into the shared part and the part that is specific to each variant. For simplicity, from here on we refer to cistron groups merely as genes and to sub-divided exons simply every bit exons.
Consider a cistron with E exons starting at base pairs s 1, …, southE and ending at e 1, …, eE . Denote the set of splicing variants under consideration by ν (assumed to be known) and its cardinality by |ν|. Each variant is characterized by an increasing sequence of natural numbers i 1, i 2,… that indicates the exons contained therein.
2.1. Likelihood and prior
Every bit discussed in a higher place, we codify a model for exon paths. Let thousand exist the number of exons visited by the left read, and 1000′ exist that for the right read (i.eastward. m = k′ = 1 when both reads overlap a single exon). We denote an exon path by ι = (ι l , ι r ), where ι l = (ij , …, i j+m ) are the exons visited past the left-finish and ι r = (ij ′, …, i j′+k′) those by the right-end. Let 𝒫* exist the fix of all possible exon paths and 𝒫 exist the subset of observed paths, i.due east. the paths followed past at least 1 sequenced fragment.
The observed information is a realization of the random variable Y = (Y 1, …, YN ), where North is the number of paired-finish reads and Yi ∈ {i, …,𝒫*} indicates the exon path followed by read pair i. Formally, Yi arises from a mixture of |ν| discrete probability distributions, each component corresponding to a different splicing variant. The mixture weights π = (πi, …, π|ν|) give the proportion of reads generated past each variant, i.e. its relative expression. That is,
where pkd = P(Yi = grand|δ i = d) is the probability of path k under variant d and δ i is a latent variable indicating the variant that originated Yi . Permit Si and 50i announce the relative showtime and length (respectively) of fragment i. The exon path Yi is completely determined given Due southi , Li and the variant δ i . Hence,
p chiliad d = ∫∫I(Y i =k|S i =s i ,Fifty i =l i ,δ i =d)d P 50 (fifty i |δ i )d P S (s i |δ i ,50 i ),
(1)
where PL is the fragment distribution and PS is the read start distribution conditional on L. Equally discussed in Section 2.ii, by assuming that PSouth and PL are shared across sets of genes with similar characteristics it is possible to estimate them with high precision. Hence, for practical purposes we can treat pkd equally known and pre-compute them before model plumbing fixtures. Total derivations for pkd are provided in Appendix A.
Assuming that each fragment is observed independently, the likelihood function can be written equally
(two)
where is the number of reads following exon path chiliad. (2) is log-concave, which guarantees the existence of a unmarried maximum. Log-concavity is given by (i) the log office being concave and monotone increasing, (ii) being linear and therefore concave, and (iii) the fact that a composition thou ◦ f where k is concave and monotone increasing and f is concave is once more concave. To run across (3), notice that
g◦f(t z 1 + (1 −t)z two) ≥chiliad(t f(z 1) + (ane −t)f(z 2)) ≥t g◦f(z 1) + (i −t)thousand◦f(z 2)
where the first inequality is given by 1000 being increasing and f concave, and the 2nd inequality is given by yard being concave.
Nosotros complete the probability model with a Dirichlet prior on π:
In Section iv we assess several choices for qd . By default we set up the adequately uninformative values qd = ii, equally these induce negligible bias and stabilize the posterior mode by pooling it abroad from the boundaries 0 and 1. It is easy to see that (iii) is log-concave when qd ≥ 1 for all d. Given that (2) is likewise log-concave, this choice of q guarantees the posterior to be log-concave, and therefore the uniqueness of the posterior fashion.
2.2. Fragment length and read kickoff distribution estimates
Evaluating the exon path probabilities in (1) that announced in the likelihood (ii) requires the fragment start distribution PS and fragment length distribution P50 . Given that information technology is not possible to estimate (PL , PSouthward ) with precision for each private gene, we assume they are shared across multiple genes (restricting fragments to be no longer than the variant they originated from). Past default we assume that (PFifty , PS ) are common across all genes, merely we also studied posing carve up distributions according to gene length. Supplementary Section 1 shows experimental evidence that, while P50 remains substantially constant, PS can depend on gene length and the experimental setup. While this option is implemented in our R bundle, to allow a straight comparing with previous approaches here we assumed a common (P50 , PS ).
Denoting by T the length of variant δ i (in bp), we let PL (l | δ) = PL (l | T) = P(L = l)I(l ≤ T)/P(l ≤ T). That is, the conditional distribution of Fifty given δ is but a truncated version of the marginal distribution.
Further, we assume a common fragment start distribution relative to the variant length T. Provisional on 50 and T, PDue south is truncated then that the fragment ends before the stop of the variant. Specifically,
(iv)
where z = s/T and is the distribution of the relative read beginning .
To estimate PL annotation that the fragment length is unknown for fragments that span multiple exons, but it is known exactly when both ends autumn in the aforementioned exon. Therefore, we select all such fragments and estimate P50 with the empirical probability mass role of the observed fragment lengths. In order to preclude short exons from inducing a option bias, we merely use exons that are substantially longer than the expected maximum fragment length (past default > 1000bp).
Estimating the fragment start distribution PS is more challenging, as we do not know the variant that generated each fragment and therefore its relative commencement position cannot be determined. We address this issue by selecting genes that take a single annotated variant, every bit in principle for these genes all fragments should have been generated by that variant. Of course, the annotated genome does non contain all variants, and therefore a proportion of the selected fragments may non have been generated by the assumed variant. However, the annotations are expected to contain virtually common variants (i.e. with highest expression), and hence most of the selected fragments should correspond to the annotated variant. Under this assumption, we can determine the exact beginning Si and length Li for all selected fragments. A difficulty in estimating the read start distribution is that the observed (Si , Fiftyi ) pairs are truncated so that Southward i + L i < T, whereas we crave the untruncated cumulative distribution part ρ(·) in (four). Fortunately, the truncation point for each (Si , 50i ) is known and therefore one tin can simply obtain a Kaplan-Meier estimate of ρ(·) (Kaplan and Meier, 1958). We apply the role survfit in the R survival package (Therneau and Lumley, 2011).
3. Model fitting
We provide algorithms to obtain a signal estimate for π, asymptotic credibility intervals and posterior samples.
Following a 0–1 loss, every bit a point approximate we report the posterior mode, which is obtained by maximizing the product of (2) and (iii), subject to the constraint . Nosotros note that maximum likelihood estimates are obtained by simply setting qd = i in (three). This constrained optimization tin can be performed with many numerical optimization algorithms. Here we used the EM algorithm (Dempster, Laird and Rubin, 1977), as it is computationally efficient fifty-fifty when the number of variants |ν| is large. For a detailed derivation see Appendix B. As noted to a higher place, for qd > 1 the log-posterior is concave and therefore the algorithm converges to the unmarried maximum. The steps required for the algorithm are:
-
Initialize .
-
At iteration j, update .
Step two is repeated until the estimates stabilize. In our examples we required for all d. Find that pkd and xthousand remain constant through all iterations, and hence they need to be computed only one time.
We characterize the posterior uncertainty asymptotically using a normal approximation in the re-parameterized space θ d = log (π d+1/πi), d = i, …, |ν|−one and the delta method (Casella and Berger, 2001). Denote by μ the posterior style for θ = (θ1, …, θ|ν|−ane) and by Due south the Hessian of the log-posterior evaluated at θ = μ. Further, let π(θ) be the changed transformation and 1000(θ) the matrix with (d, fifty) chemical element . Detailed expressions for S, π(θ) and Thousand(θ) are provided in Appendix C. The posterior for θ tin be asymptotically approximated past Northward(μ, Σ), where Σ = Southward −ane. Hence, the delta method approximates the posterior for π with Due north (π (μ), G(μ)′ SG(μ)).
The asymptotic approximation is also useful for the following independent proposal Metropolis-Hastings scheme. Initialize θ(0) ~ T 3(μ, Σ) and notice that a prior P π(π) on π induces a prior P θ(θ) = P π(π(θ)) × |G(θ)| on θ, where G(θ) is as above. At iteration j, perform the following steps:
-
Propose θ*~ T iii(μ, Σ) and let π* = π(θ*).
-
Set up θ (j) = θ* with probability min {one, λ}, where
(five)
-
Otherwise, set θ (j) = θ (j−1).
Posterior samples can be obtained by discarding some burn-in samples and repeating the process until practical convergence is achieved. Past default we suggest 10,000 samples with a 1,000 burn-in, as it provided sufficiently high numerical accuracy when comparison 2 contained chains (Supplementary Department 2).
iv. Results
Nosotros assess the performance of our arroyo in simulations and 2 experimental data sets. We obtained the two man sample K562 replicates 2 from the RGASP projection (www.gencodegenes.org/rgasp), and ii ENCODE Project Consortium (2004) replicated samples obtained from A549 cells (accession number wgEncodeEH002625 3 ). We compare our results with Cufflinks (Trapnell et al., 2012), FluxCapacitor (Montgomery et al., 2010) and BitSeq (Glaus, Honkela and Rattray, 2012). Cufflinks is based on a probabilistic model akin to Casper, but uses exon and exon junction counts instead of full exon paths, assumes that fragment lengths are usually distributed and estimates the read start distribution in an iterative manner. FluxCapacitor is besides based on exon and exon junction counts, just uses a method of moments-type computer. BitSeq uses a full Bayesian model at the base-pair resolution (i.e. data is not summarized as counts) and estimates the read commencement distribution with a two-step procedure.
Regarding sequence alignment, for Casper, Cufflinks and FluxCapacitor we used TopHat (Trapnell, Pachter and Salzberg, 2009) with the human genome hg19, using the default parameters and a 200bp average insert size. BitSeq required aligning to the transcriptome with Bowtie (Langmead et al., 2009).
4.ane. Simulation study
We generated human genome-wide RNA-seq data, setting the simulations to resemble the K562 RGASP data in order to keep them as realistic equally possible. Figure 3 (left) shows our estimates S and 50 . We set PS and π for each cistron with 2 or more than variants to their estimates in the K562 data. For each gene we simulated a number of fragments equal to that observed in the K562 sample.
Nosotros considered a Casper-based and a Cufflinks-based simulation scenario. In the quondam we ready π and P50 to the Casper estimates (qd = ii). The 2d scenario was designed to favor Cufflinks past using its π estimates and setting PFifty to its assumed Normal distribution (hateful=200, standard deviation=twenty). We indicated the data-generating PFifty to Cufflinks, whereas the remaining methods estimated it from the data. An important departure between scenarios is that Casper estimates with qd = 2 are pooled away from the purlieus, hence π d is never exactly 0 or 1, whereas the Cufflinks estimates were often in the boundary (Supplementary Figure four). Genes with less than ten reads per kilobase per 1000000 (RPKM) were excluded from all calculations to reduce biases due to depression expression.
We estimated π from the simulated information using our arroyo with prior parameters qd = 1 and qd = 2, Cufflinks and FluxCapacitor. Table iii reports the absolute and square errors (|π d − d | and (π d − d )2) averaged across all xviii, 909 isoforms and 100 simulated datasets for both simulation settings. We also report the average squared bias and variance. The Cufflinks and FluxCapacitor MAE are over 2.five and 2.7 folds greater than that for Casper with qd = 2 (1.5 and 1.6 for qd = 1, respectively) in the Casper-based scenario. In the Cufflinks-based simulation the reductions were 1.14 and 1.24 fold (1.27 and i.38 for qd = 1). The improvements in MSE are fifty-fifty more pronounced, with an over 2 fold comeback for qd = 2 even in the Cufflinks-based simulation. Casper likewise shows a marked comeback in bias for qd = 1 and variance for qd = ii. Meet Supplementary Figure 4 for respective plots.
Table 3
Casper-based simulations | ||||
---|---|---|---|---|
MAE | MSE | Bias sqrt | Variance | |
| ||||
Casper (qd = i) | 0.094 | 0.028 | 0.004 | 0.024 |
Casper (qd = 2) | 0.055 | 0.004 | 0.004 | 0.004 |
Cufflinks | 0.141 | 0.050 | 0.028 | 0.022 |
FluxCapacitor | 0.151 | 0.054 | 0.022 | 0.032 |
Cufflinks-based simulations | ||||
---|---|---|---|---|
MAE | MSE | Bias sqrt | Variance | |
| ||||
Casper (qd = 1) | 0.100 | 0.055 | 0.021 | 0.034 |
Casper (qd = 2) | 0.111 | 0.035 | 0.032 | 0.003 |
Cufflinks | 0.127 | 0.073 | 0.045 | 0.028 |
FluxCapacitor | 0.138 | 0.078 | 0.036 | 0.042 |
Effigy 2 (top) shows the MAE for each transcript as a role of RPKM, a mensurate of overall factor expression. Casper improves the estimates for substantially all RPKM values in both simulation settings. Figure 2 (bottom) assesses the MAE vs. the hateful pairwise difference between variants in a gene (number of base of operations pairs not shared). When variants in a gene share almost exons this deviation is small, i.e. variants are harder to distinguish. Casper estimates are the most accurate at all similarity levels, with the MAE decreasing every bit variants become more differentiated. Interestingly, Cufflinks and FluxCapacitor show lower MAE as similarity increases from low to medium, simply then MAE becomes college and more variable in genes with medium-highly differentiated variants. These results illustrate the reward of using full exon paths, which provide more resolution in assigning reads to splicing variants.
Finally, we assessed the frequentist coverage probabilities for the asymptotic 95% brownie intervals (Section 3), finding that in 95.04% of the cases they contained the true value.
4.2. Experimental data from RGASP projection
The two K562 replicates were sequenced in 2009 with Solexa sequencing. The read length was 75bp and the hateful fragment length indicated in the documentation is 200bp for both replicates. Figure 3 (top, left) shows the estimated fragment length distributions. We observe that the mean length differs significantly from 200bp, and that there are important differences between replicates. Replicate ii shows a heavy left tail that indicates a subset of fragments substantially shorter than average. This distributional shape cannot be captured with the usual parametric distributions. Figure 3 (left, bottom) shows the relative first distribution. We find more sequences located near the transcript end in replicate 1, i.eastward. a college 3' bias. The differences between replicates illustrate the demand of flexibly modelling these distributions for each sample separately. In fact, nosotros plant that S differed across genes with varying length (Supplementary Section 1 and Supplementary Figure 1), the 3' bias being stronger in genes shorter than three kilo-bases.
Nosotros estimated the expression of human splicing variants in the UCSC genome version hg19 for the two replicated samples separately. Effigy 4 (left) and Table 4 compare the estimates obtained in the 2 samples. The Hateful Absolute Difference (MAD) betwixt replicates was 0.064 for Casper, 0.126 for Cufflinks (97% increase), xvi.2 for FluxCapacitor (253% increase) and 8.5 for BitSeq (31% increase). Figure 4 show a roughly linear correlation for Casper, Cufflinks and FluxCapacitor, the latter two frequently providing d = 0 in one replicate and d = 1 in the other. BitSeq avoids these boundaries but exhibits a strongly non-linear association. In terms of computational fourth dimension, all methods required roughly 10–20min on 4 processors. Because BitSeq models the data at the base pair resolution, it required substantially longer time to run on 12 cores.
Tabular array 4
K562 | Encode | |||
---|---|---|---|---|
MAD | CPU | MAD | CPU | |
Casper+ | 6.four×10−2 | eleven.1min | five.7×ten−ii | 2h 11min |
Cufflinks+ | 12.6×10−2 | 21.4min | 9.0×ten−ii | 2h 13min |
Flux+ | 16.2×10−2 | 9.0min | 12.vii×10−2 | 1h 17min |
BitSeq* | 8.5×10−2 | 1day 13h | 9.8×10−2 | 8h 40min |
These results propose that Casper provides clear advantages even with earlier sequencing technologies.
four.three. Experimental data from ENCODE project
The two A549 replicated samples were sequenced in 2012 using Illumina HiSeq 2000. The read length was 101bp and the boilerplate fragment length was roughly 300bp (Figure 3, acme right). These are substantially longer than the 2009 samples from Department 4.two, and reflect the improvement in sequencing technologies. Similar to Department iv.ii, Effigy three reveals important differences in the fragment length (meridian, right) and start (bottom, right) distributions between samples. See also Supplementary Department 1 and Supplementary Effigy 2, where S exhibits a stronger 3' bias for genes longer than v kilo-bases.
Effigy 4 (left) and Table 4 compare the estimates obtained in the 2 replicates. Similar to the RGASP study (Section 4.ii), Casper shows a roughly linear association and substantially higher consistency betwixt replicates. The MAD between replicates was 0.057 for Casper, nine.0 for Cufflinks (58% increase), 12.7 for FluxCapacitor (223% increase) and 0.098 for BitSeq (72% increase). The computational fourth dimension for Casper was comparable to that of Cufflinks, higher than FluxCapacitor and essentially lower than BitSeq. The findings show that the advantage of modeling exon path counts over pairwise exon connections remains pronounced equally technology evolves to sequence longer fragments.
Nosotros at present consider the possibility that some expressed transcripts may non be present in the UCSC genome annotations. We used Cufflinks RABT module to identify novel transcripts, then run Casper to jointly estimate their expression with UCSC transcripts. Cufflinks-RABT plant 12,512 factor islands with no new transcripts, six,229 with some new transcripts and i,527 completely new genes in sample 1. For sample 2 the figures were 11,912, 6,983 and 1,378 completely new genes. While new transcripts had negligible influence on genes with no new transcripts, in the remaining genes d decreased so that a proportion of the expression could be assigned to the new variants. For farther details encounter Supplementary Department 4. These findings advise that current genome annotations may miss a non-negligible number of expressed variants. For a careful assessment we recommend post-obit a strategy alike to ours here, i.east. combining our approach with a de novo transcript discovery method.
v. Discussion
Nosotros proposed a model to estimate the expression of a set of known alternatively spliced variants from RNA-seq data. The model improves upon previous proposals by using exon paths, which are more informative than single or pairwise exon counts, and by flexibly estimating the fragment offset and length distributions. Nosotros provided computationally efficient algorithms for obtaining point estimates, asymptotic credibility intervals and posterior samples.
We found that a fairly uninformative prior with qd = 2 improves precision relative to the typical qd = 1 equivalent to maximum likelihood interpretation. The advantages stem from the usual shrinkage argument: qd = ii pools the estimates abroad from the boundaries and reduces variance. Compared to competing approaches, we observed substantial MSE reductions in simulations and increased correlation between experimental replicates. In mod studies we found that roughly 2 sequences out of three visited > 2 exon regions distinguishing variants, suggesting that the electric current standard of reporting pairwise exon junctions adopted by most public databases is far from optimal. Reporting exon paths would let researchers to approximate isoform expression at a much college precision. Given that the methodology is implemented in the R bundle casper, we believe that it should be of great value to practitioners.
Table two
Exon path | Count |
---|---|
| |
{1}, {ane} | 2824 |
{2}, {ii} | 105 |
{iii}, {three} | 5042 |
{1}, {2} | 27 |
{1}, {i,2} | 423 |
{1}, {iii} | 127 |
{2,3}, {three} | 394 |
{1,2}, {3} | 2 |
{i}, {2,3} | xiii |
Supplementary Cloth
supplementary data
ACKNOWLEDGEMENTS
D.R. was partially supported past grants R01 CA158113-01 from the NIH (USA) and MTM2012-38337 from the Ministerio de Economía y Competitividad (Spain). C.S. was partially supported past AGAUR Beatriu de Pinós fellowship BP-B 00068 (Spain). M.Thousand. and A.Southward. were partially supported past Bayerisches Arbeitsministerium foreign trainee fellowship (Germany). The authors wish to thank Modesto Orozco for useful discussions.
APPENDIX A
DERIVATION OF EXON PATH PROBABILITIES
Here we describe how to compute the probability pkd of observing exon path k for any splicing variant d. Equivalently, we denote d past δ = (i i, …, i |δ|), where ij indicates the jth exon within d. Consider variant δ subsequently splicing, i.e. after removing the introns. The new exon start positions are given by and for chiliad = ane, …, |δ|−one. The cease of exon k is . Denote by S be the read kickoff position, Fifty the fragment length, r the read length, and let be the transcript length of δ.
The goal is to compute P(ι l = (ij , …, i j+k ), ι r = (ij ′, …, i j′+k′)|δ). Nosotros note that both ij , …, i j+thou and ij ′, …, i j′+k′ must be consecutive exons under variant δ, otherwise the probability of the path is zilch. The left read follows the exon path ι fifty = (ij , …, i j+thou ) if and merely if the read:
-
Starts in exon j, i.due east.
-
Ends in exon j + k, i.e.
Similarly, the right read follows ι r = (ij ′, …, i j′+k′) if and only if and . This implies that the desired probability can be written equally P(a 1 ≤ Southward ≤ b i, a 2 ≤ S+L ≤ b 2|δ), where
(6)
Assuming that the distribution of (S, 50) depends on δ only through its transcript length T, nosotros can write
(7)
In club to evaluate (vii) nosotros demand to estimate the fragment length distribution P(50 = 50|T) and the distribution of the read kickoff position Southward given 50. We assume that P(50|T) = P(L = l)I(fifty ≤ T)/P(50 ≤ T), i.east. the conditional distribution of L given T is simply a truncated version of the marginal distribution. Further, notice that the fragment end must happen before the terminate of the transcript, i.e. S + L − 1 ≤ T or equivalently the relative first position is truncated S/T ≤ ST = (T − Fifty + one) /T. The relative first distribution is therefore truncated, i.e. , where is the distribution of the relative read start .
Under these assumptions, the probability of observing the exon path ι l = (ij , …, i j+thou ), ι r = (ij ′, …, i j′+k′) under variant δ is equal to
where a 1, b 1, a 2 and b 2 are given in (six). Given that highly precise estimates of P(L = l) and φ (·) are typically bachelor, for computational simplicity we care for them equally known and plug them into (eight).
APPENDIX B
EM ALGORITHM DERIVATION
1. East-step
Let δ i ∈ {1, …, |ν|} be latent variables indicating the variant that reads i = 1, …, N come from. The augmented log-posterior is proportional to to
(8)
Considering δ i as a random variable, the expected value of (viii) given y and π = π (j) is equal to
(9)
ii. M-step
The goal is to maximize (9) with respect to π′. Allow γ id = P(δ i = d|yi , π (j)) and re-parameterize . Setting the partial derivatives with respect to to nothing gives the system of equations
which has the trivial solution . By plugging in , we obtain
Finally, since nosotros tin group all yi 's taking the same value and find the maximum as
(10)
re-normalizing π′ so that .
APPENDIX C
ASYMPTOTIC POSTERIOR APPROXIMATION
Here we derive an asymptotic approximation to P(π|ν, Y), the posterior distribution of the splicing variants expression π provisional on a model ν and the observed data Y. Given that π = (π1, …, π|ν|) ∈ [0, i]|ν| with , nosotros re-parameterize to θ = (θ1, …, θ|ν|−1) ∈ ℜ|ν|−1, where for d = 1, …, |ν| − i. The goal is to approximate P(θ | ν, Y) ~ N(μ, Σ). For notational simplicity, in the residuum of the department we drop the conditioning on ν.
A prior P π(π) induces a prior P θ(θ) = P π(π(θ)) × |G(θ)| on θ, where G(θ) is the matrix with (d, 50) element and changed transform , π d (θ) = πane(θ)exp{θ d−ane} for d > one.
Define f(θ) = log (P(Y|θ))+log (P θ(θ)). Upwardly to an additive abiding, f(θ) is equal to the target log-posterior distribution of θ given Y. We center the approximating Normal at the posterior mode, i.due east. μ = argmax θ∈ ℜ|ν|−1 f(θ). We set Σ = South −1where Southward is the Hessian of f(θ) evaluated at θ = μ with (50, m) element . We approximate , where π* is the posterior fashion for π provided past the EM algorithm.
Nether a π ~ Dirichlet (q) prior, unproblematic algebra gives
(11)
where is the number of reads following exon path chiliad, pkd = P(Yi = thou | δ = d) is the probability of observing path m nether variant d, the gradient for π d (θ) is every bit earlier and the Hessian is .
We consummate the derivation by providing expressions for One thousanddl and Hdlm . Let , and then
(12)
and
(13)
Footnotes
REFERENCES
- Ameur A, Wetterbom A, Feuk Fifty, Gyllensten U. Global and unbiased detection of splice junctions from RNA-seq data. Genome Biology. 2010;11:R34. [PMC free article] [PubMed] [Google Scholar]
- Blencowe BJ. Culling Splicing: New Insights from Global Analyses. Cell. 2006;126:37–47. [PubMed] [Google Scholar]
- Casella One thousand, Berger RL. Statistical inference. 2 ed. Duxbury Press; 2001. [Google Scholar]
- Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Information via the EM Algorithm. Journal of the Royal Statistical Lodge. Series B. 1977;39:one–38. [Google Scholar]
- ENCODE Projection Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Scientific discipline. 2004;306:636–640. [PubMed] [Google Scholar]
- Glaus P, Honkela A, Rattray M. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics. 2012;28:1721–1728. [PMC gratuitous article] [PubMed] [Google Scholar]
- Guttman 1000, Garber One thousand, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan Fifty, Koziol MJ, Gnirke A, Nusbaum C, Rinn JL, Lander ES, Regev A. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnoly. 2010;28:503–510. [PMC free commodity] [PubMed] [Google Scholar]
- Holt RA, Jones SJM. The new epitome of menses cell sequencing. Genome Research. 2008;xviii:839–846. [PubMed] [Google Scholar]
- Jiang H, Wong WH. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009;25:1026–1032. [PMC free article] [PubMed] [Google Scholar]
- Kaplan EL, Meier P. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association. 1958;53:457–481. [Google Scholar]
- Katz Y, Wang ET, Airoldi EM, Burge CB. Assay and pattern of RNA sequencing experiments for identifying isoform regulation. Nature Methods. 2010;7:1009–1015. [PMC free article] [PubMed] [Google Scholar]
- Lacroix Five, Sammeth Thousand, Guigo R, Bergeron A. Proceedings of the eighth international workshop on Algorithms in Bioinformatics. WABI '08. Berlin, Heidelberg: Springer-Verlag; 2008. Verbal Transcriptome Reconstruction from Short Sequence Reads; pp. fifty–63. [Google Scholar]
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10:R25. [PMC free article] [PubMed] [Google Scholar]
- Li H, Durbin R. Fast and accurate curt read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009;25:1754–1760. [PMC free commodity] [PubMed] [Google Scholar]
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen Thousand, Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966–1967. [PubMed] [Google Scholar]
- Montgomery SB, Sammeth G, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464:773–777. [PMC free commodity] [PubMed] [Google Scholar]
- Mortazavi A, Williams BA, McCue Grand, Schaeffer L, B W. Mapping and quantifying mammalian transcriptomes past RNA-Seq. Nature Methods. 2008;5:621–628. [PubMed] [Google Scholar]
- Pepke S, Wold B, A Grand. Computation for Chip-seq and RNA-seq studies. Nature Methods. 2009;6:S22–S32. [PMC free article] [PubMed] [Google Scholar]
- Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011a;27:2325–2329. [PubMed] [Google Scholar]
- Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011b;12:R22. [PMC free article] [PubMed] [Google Scholar]
- Rogers MF, Thomas J, Reddy As, Ben-Hur A. SpliceGrapher: detecting patterns of culling splicing from RNA-Seq data in the context of gene models and EST data. Genome Biol. 2012;13:R4. [PMC free article] [PubMed] [Google Scholar]
- Rossell D, Stephan-Otto Attolini C, Kroiss M, Stöcker A. Supplement to "Quantifying alternative splicing from paired-end RNA-sequencing data" 2013 [Google Scholar]
- Salzman J, Jiang H, Wong WH. Statistical modeling of RNA-Seq data. Statistical Science. 2011;26:62–83. [PMC gratis article] [PubMed] [Google Scholar]
- Therneau T, Lumley T. survival: Survival analysis, including penalised likelihood. R parcel version 2.36-x. 2011 [Google Scholar]
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. [PMC complimentary article] [PubMed] [Google Scholar]
- Trapnell C, Williams BA, Pertea M, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter Fifty. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010;28:511–515. [PMC free article] [PubMed] [Google Scholar]
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pi-Mentel H, Salzberg SL, Rinn JL, Pachter L. Differential factor and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562–578. [PMC free commodity] [PubMed] [Google Scholar]
- Wu Z, Wang 10, Zhang X. Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics. 2011;27:502–508. [PubMed] [Google Scholar]
- Wu J, Akerman M, Sun Southward, McCombie WR, Krainer AR, Zhang MQ. SpliceTrap: a method to quantify culling splicing nether single cellular conditions. Bioinformatics. 2011;27:3010–3016. [PMC free article] [PubMed] [Google Scholar]
- Xing Y, Yu T, Wu YN, Roy G, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006;34:3150–3160. [PMC free article] [PubMed] [Google Scholar]
What Is the Ideal Reads to See Alternative Splice Variants From Rna Sequencing
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4005600/
0 Response to "What Is the Ideal Reads to See Alternative Splice Variants From Rna Sequencing"
Post a Comment