PGM
Data Quality
Omayma Al-Awar
Regional Marketing, AMR
Feb 17, 2012
Definitions-Quality Scores
What does a Q score mean?
Each base call is assigned a quality score (Q), a logarithmic measure of the
probability P that the call is incorrect: Q = -10 log10(P).
For example, a base with a Q40 score has a probability of 1 in 10,000 of an incorrect
base call.
Q score    Probability of Incorrect Base Call    Base Call Accuracy
10         1 in 10                               90%
20         1 in 100                              99%
30         1 in 1,000                            99.9%
40         1 in 10,000                           99.99%
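The relationship behind the table can be sketched in a few lines of Python. This is a minimal illustration of the standard Phred definition, Q = -10 log10(P); the function names are hypothetical and not taken from any vendor software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print one line per Q score: error odds and base call accuracy.
    for q in (10, 20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.2f}%")
```

Because the scale is logarithmic, every 10-point increase in Q corresponds to a tenfold drop in the error probability.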
11/13 reads: T
2/13 reads: A
The consensus is T.
Do the As represent a SNP?
Do the As represent a sequencing error?
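One rough way to reason about the sequencing-error question is a binomial tail: how likely are at least 2 miscalls among 13 calls at a given per-base error rate? The sketch below assumes that all 13 calls share the same Q score and that errors are independent. Both are simplifications that the slide does not state, and counting any miscall (not only miscalls to A) makes the result an upper bound. The function names are hypothetical.

```python
from math import comb


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def prob_at_least_k_errors(n: int, k: int, p_error: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p_error): at least k miscalls in n calls."""
    return sum(
        comb(n, i) * p_error**i * (1 - p_error) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # 13 reads cover the position; 2 of them disagree with the consensus.
    for q in (20, 30):
        p = prob_at_least_k_errors(13, 2, phred_to_error_prob(q))
        print(f"Q{q}: P(at least 2 miscalls in 13 calls) = {p:.2e}")
```

The lower the per-base quality, the more plausible it is that the two As are errors; at high Q the same observation points more strongly toward a real variant.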
COMPANY CONFIDENTIAL INTERNAL USE ONLY
Data-Quality Scores
Comparison of Enterococcus faecium de novo Sequencing Data
                        MiSeq (2Gb spec)         PGM (316-100 bp)
(row label missing)     2 x 151 (overlapping)    100 (nominal)
(row label missing)     220*                     120
(row label missing)     17.4                     1.47
Yield (Gb) (Quality)    2.633 (84.2% >Q30)       0.197 (Avg. Q20)
Genome coverage         877x                     66x
Assembler               Newbler 2.6
Genome Assembly (bp)    3,053,394                2,980,036
N50                     45,964                   22,750
(row label missing)     9,709                    5,636
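Two of the table's metrics are simple to compute. The sketch below assumes that fold coverage is yield divided by genome size, with the assembly size used as a stand-in for the true genome size, so results can differ slightly from the slide's figures. It also uses the common definition of N50. The function names and the toy contig lengths are illustrative and not from the source.

```python
def fold_coverage(yield_gb: float, genome_size_bp: int) -> float:
    """Average fold coverage: sequenced bases divided by genome size."""
    return yield_gb * 1e9 / genome_size_bp


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # PGM column of the table: 0.197 Gb yield, 2,980,036 bp assembly.
    print(f"PGM coverage: {fold_coverage(0.197, 2_980_036):.0f}x")
    # Toy contig lengths, invented for illustration only.
    print("N50 of toy assembly:", n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

N50 rewards long contigs, which is why it is usually reported alongside the total assembly size when comparing assemblies.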
[Figure: nucleotide with fluor, cleavage site, and 3' block. Cycle steps: Incorporation; Detection; Deblock and fluor removal (free 3' OH end); Next cycle.]
From the blogosphere: Ion Torrent has high indel error rates
Ion Torrent do not factor indels in their data quality analysis
...omitting indels from the quality analysis hits close to home, because I've been
guilty of this too. Partly this was [because] it was easy to avoid them at first, and partly it
stems from the fact that indels don't really fit into the phred score paradigm I've
been using (that's a whole 'nother stalled blog post). I've tried to be up front about
that, but it is certainly an issue. In some applications the homopolymer reads can
be seen as just a tax on your data. For example, if I know I'm only looking for
activating substitutions in an oncogene, those must be in frame and I can discard
the reads with indels in the vicinity of my codon(s) of interest. But, in most cases
they really are an issue.
Impact on SNP calls
Conclusion:
Homopolymers are obviously an issue for this platform
28% of the erroneous SNPs were in homopolymer regions
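A figure like the 28% above can be derived by checking which erroneous SNP positions fall inside homopolymer runs. The sketch below is one way to do that, not the blog's actual method. The minimum run length of 3 is an assumption (the slide does not define a homopolymer region), and the function names, sequence, and positions are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 3):
    """Yield (start, end, base) for runs of one base at least min_len long.

    Positions are 0-based and end is exclusive.
    """
    pos = 0
    for base, group in groupby(seq):
        run_len = sum(1 for _ in group)
        if run_len >= min_len:
            yield pos, pos + run_len, base
        pos += run_len


def fraction_in_homopolymers(seq: str, positions, min_len: int = 3) -> float:
    """Fraction of the given 0-based positions that lie inside a homopolymer run."""
    if not positions:
        return 0.0
    runs = list(homopolymer_runs(seq, min_len))
    hits = sum(any(start <= p < end for start, end, _ in runs) for p in positions)
    return hits / len(positions)


if __name__ == "__main__":
    reference = "ACGTTTTAGGGC"      # toy sequence, for illustration only
    error_positions = [0, 4, 9, 11]  # hypothetical erroneous SNP positions
    print(list(homopolymer_runs(reference)))
    print(fraction_in_homopolymers(reference, error_positions))
```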
We measure the accuracy of every base in every read. Ion Torrent uses
consensus call accuracy and base call recalibration to measure their
accuracy. This means they rely on a stack of reads and a reference genome
to recalibrate their quality score.
What if you don't have a reference genome?
What if your application requires that you perform de novo sequencing?
What if you're interested in metagenomics?
Why should any microbiology lab ever consider an Ion Torrent?
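Per-base quality scores allow a single read to be assessed without a reference genome. A minimal sketch, assuming the quality string uses the Phred+33 ASCII encoding common in FASTQ files (the slides do not specify an encoding); the function names and the example string are hypothetical.

```python
def decode_phred33(quality_string: str) -> list:
    """Convert a Phred+33 encoded quality string into integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def expected_errors(quality_string: str) -> float:
    """Expected number of incorrect bases in the read: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in decode_phred33(quality_string))


def fraction_at_least(quality_string: str, threshold: int = 30) -> float:
    """Fraction of bases in the read with a Q score at or above the threshold."""
    scores = decode_phred33(quality_string)
    if not scores:
        return 0.0
    return sum(q >= threshold for q in scores) / len(scores)


if __name__ == "__main__":
    quals = "IIII5555++"  # toy string: four Q40, four Q20, two Q10 bases
    print(decode_phred33(quals))
    print(f"expected errors: {expected_errors(quals):.4f}")
    print(f"fraction >= Q30: {fraction_at_least(quals):.2f}")
```

Summaries such as the percentage of bases above Q30 are computed this way from the reads alone, with no alignment step.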
[Figure labels: Single Read; Reference Genome]
QUESTIONS?