Error-correction tools: Proovread


BioInf-Wuerzburg/proovread - GitHub

The main goal here is to read through the paper that introduced proovread and understand how it works internally.

Proovread is by no means as simple as you might think: it introduces many local models, and its overall design is far-sighted.

Original paper: proovread: large-scale high-accuracy PacBio correction through iterative short read consensus

Abstract

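Motivation: Second-generation sequencing-by-synthesis currently dominates. Its reads are accurate but too short, which complicates analysis. Recently, SMRT sequencing has offered a way around this by producing very long reads, but its high error rate has hindered its adoption. Hybrid approaches that combine short reads (SRs) and long reads (LRs) have therefore been developed, but the current implementations depend too heavily on high-end hardware, which limits their use.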

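Results: The authors developed a hybrid correction pipeline that runs flexibly on anything from an ordinary desktop machine to a large cluster. In genome and transcriptome tests it reached an accuracy of up to 99.9%, outperforming the existing hybrid correction tools while also yielding longer reads and higher throughput.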

Introduction

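Over the past decade, second-generation sequencing has rewritten the history of sequencing. Today, a single run of a HiSeq2500 can generate as much as 600 Gb high-quality output data, which covers a human genome about 200-fold. However, the reads are too short, which makes assembly difficult, especially in repetitive regions. As a result, many SR assemblers have appeared, such as ALLPATHS-LG (Gnerre et al., 2011), the Celera Assembler (Miller et al., 2008; Myers et al., 2000) and SOAPdenovo (Li et al., 2010).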

Repeats longer than the SRs cannot be resolved. The best current assembly strategy combines short reads with long insert libraries and additional fosmid sequencing.

Then SMRT sequencing arrived. With the latest chemistry, this approach delivers kilobase-scale reads, and it shows no sequence bias. Their third-generation sequencer, PacBio RS II, generates to date up to 400 Mb per sequencing run.

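LR accuracy is too low: about 99% for second-generation reads versus only 80%-85% for third-generation reads, and the error profiles differ as well. Although Illumina reads mainly contain miscalled bases with increasing frequency toward read ends, SMRT generates primarily insertions (10%) and deletions (5%) in a random pattern (Ross et al., 2013). SMRT can also produce circular consensus sequences (CCS), but this reduces read length and so gives up the main advantage of third-generation sequencing.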

There are currently two approaches to correcting SMRT reads:

(i) The hierarchical genome-assembly process (HGAP) uses shorter SMRT reads contained within longer reads to generate pre-assemblies and to calculate consensus sequences (Chin et al., 2013). (Drawback: it requires a coverage of 80x to 100x.)

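(ii) PacBioToCA (Koren et al., 2012) and LSC (Au et al., 2012) use Illumina SRs in a hybrid approach to correct SMRT reads. These approaches result in higher quality LRs. (Drawbacks: they need large amounts of computing resources; PacBioToCA lost >40% of the data; LSC handles transcriptome data only; PacBioToCA is integrated into WGS, which makes it awkward to invoke.)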

Advantages of this method:

(i) run on standard computers as well as computer grids and

(ii) can be easily adapted to different use cases.

Obviously, these objectives should not be at the cost of accuracy, length of corrected reads or throughput.

Implementation

Mapping—sensitive and trusted hybrid alignments


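Alignment is a major problem, especially when mapping second-generation reads onto third-generation reads; the scoring models of existing aligners cannot be used as they are.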

proovread sets up its own alignment scoring scheme based on the following assumptions:

(i) The expected error rates for SMRT sequencing are 10% for insertions and up to 5% for deletions (Ono et al., 2013; Ross et al., 2013). Thus, the costs for gaps in the LR, which correspond to deletions, are about twice as high as for gaps in the SR, which represent insertions.

(ii) Substitutions are comparatively rare (1%). This is reflected by a mismatch penalty of at least 10 times the cost of SR gaps.

(iii) The distribution of SMRT sequencing errors is random. Hence, contrasting to biological scenarios, continuous insertions or deletions are less likely, resulting in higher costs for gap extension than for gap opening.

proovread uses SHRiMP2 as its first-choice mapper. Its versatile interface allowed us to completely implement the hybrid scoring model with the following parameters: insertions are the most frequent errors and are penalized as gap open with -1. Deletions occur about half as often and are thus penalized with -2. Extensions for insertions and deletions are scored with -3 and -4, respectively. Mismatches are at least 10 times as rare, resulting in a penalty of -11 (Supplementary Table S1).

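(Penalty summary: insertions are the most common error, so their gap-open penalty is set to -1. Deletions occur about half as often, so their gap-open penalty is -2. Extensions cost more, 2 beyond the respective open penalty, giving -3 and -4. Mismatches are the rarest error, so they are penalized hardest, at -11.)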
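To make the effect of these penalties concrete, here is a minimal Python sketch that scores a column-by-column alignment string. The penalties are the ones quoted above. Everything else is an assumption for illustration: the match reward `MATCH = 5`, the op alphabet (`=`, `X`, `I`, `D`), and the convention that the first gap column pays the open penalty while each following column of the same gap pays the extension penalty. SHRiMP2's own gap accounting may differ.

```python
# Penalties quoted in the text. MATCH is a hypothetical placeholder,
# since the text does not state the match reward.
MATCH = 5
MISMATCH = -11
GAP = {
    "I": {"open": -1, "extend": -3},  # insertion in the LR (gap in the SR)
    "D": {"open": -2, "extend": -4},  # deletion in the LR (gap in the LR)
}


def score_alignment(ops: str) -> int:
    """Score per-column ops: '=' match, 'X' mismatch, 'I' insertion, 'D' deletion.

    Assumed convention: the first column of a gap costs the open penalty,
    every further column of the same gap costs the extension penalty.
    """
    score = 0
    prev = None
    for op in ops:
        if op == "=":
            score += MATCH
        elif op == "X":
            score += MISMATCH
        elif op in GAP:
            kind = "extend" if prev == op else "open"
            score += GAP[op][kind]
        else:
            raise ValueError(f"unknown alignment op: {op!r}")
        prev = op
    return score


if __name__ == "__main__":
    # Two isolated insertions beat one two-base insertion,
    # which mirrors the "errors are random" assumption.
    print(score_alignment("==I=I="), score_alignment("==II=="))
```

Under these assumptions, scattered single-base indels are cheap, runs of indels are expensive, and a single mismatch costs more than several isolated insertions, which is the behaviour that assumptions (i) to (iii) ask for.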


Bowtie2 is supported as the second choice. However, corrections using Bowtie2 lagged behind owing to a limited set of parameters regarding scoring and sensitivity. SRs can optionally be trimmed beforehand (sickle, https://github.com/najoshi/sickle) or error-corrected (Quake).

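Mapping must, of course, separate true alignments from false ones. Repeat regions naturally cause reads to pile up, and sequencing errors also distort alignment scores. We therefore assess length normalized scores in a localized context.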

This introduces the concept of bins: LRs are internally represented by a consecutive series of small bins.

Only the highest scoring alignments of each bin, not the overall highest scoring alignments, up to the specified coverage cutoff are considered for the next step—the calculation of the consensus sequence.
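The bin-wise selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not proovread's actual code: the `Aln` record, the bin size of 20, assigning an alignment to the bin that contains its midpoint, and using a fixed number of alignments per bin (`max_per_bin`) as a stand-in for the coverage cutoff are all hypothetical choices.

```python
from collections import defaultdict
from typing import NamedTuple


class Aln(NamedTuple):
    """Hypothetical record for one SR alignment on a long read."""
    start: int    # 0-based start on the long read
    end: int      # exclusive end on the long read (end > start)
    score: float  # raw alignment score


def select_per_bin(alns, bin_size=20, max_per_bin=2):
    """Keep the best alignments of each bin, ranked by length-normalized score.

    Assumptions: an alignment belongs to the bin containing its midpoint,
    and max_per_bin plays the role of the coverage cutoff.
    """
    bins = defaultdict(list)
    for aln in alns:
        midpoint = (aln.start + aln.end) // 2
        bins[midpoint // bin_size].append(aln)

    kept = []
    for index in sorted(bins):
        ranked = sorted(
            bins[index],
            key=lambda a: a.score / (a.end - a.start),  # score per aligned base
            reverse=True,
        )
        kept.extend(ranked[:max_per_bin])
    return kept


if __name__ == "__main__":
    alignments = [
        Aln(0, 10, 40), Aln(2, 12, 50), Aln(4, 14, 30),  # pile-up in one bin
        Aln(100, 110, 10),                                # weak but lonely
    ]
    for aln in select_per_bin(alignments):
        print(aln)
```

The point of ranking within each bin is visible in the usage example: the low-scoring alignment at position 100 survives because it is the best in its own bin, while the weakest of the three piled-up alignments is dropped even though it scores higher overall.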

 

Consensus call with quality computation and chimera detection

 

 

Quality and chimera trimming

untrimmed corrected LRs (isn't this already the final result we are after?)

How to trim is not as simple as one might imagine: it relies on an entropy model.

 

Iterative correction

This addresses the problem that correction is computationally demanding and time consuming.

 

Configuration and customization

The settings include scoring schemes, binning, masking, iteration procedure and post-processing.

 

Scalability and parallelization

 

 

MATERIALS AND METHODS

 

RESULTS

 

DISCUSSION


