
MiSeq vs. PGM: Data Quality

Omayma Al-Awar
Regional Marketing, AMR
Feb 17, 2012

COMPANY CONFIDENTIAL INTERNAL USE ONLY


© 2011 Illumina, Inc. All rights reserved.
Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera,
Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names
contained herein are the property of their respective owners.


MiSeq Data Quality is Superior to that of Ion Torrent

- We have higher quality scores and therefore lower error rates.
- We sequence homopolymers accurately, while they have an inherent problem in sequencing those regions. Their higher indel error rate increases the rate of false positive SNP calls and the cost of downstream validation.
- We measure the accuracy of every base, while they rely on stacks of reads and a reference. What if a reference is not present? What if you are looking for rare variants?
- Proton sensing is novel and interesting, but technically inferior for DNA sequencing. Our sequencing approach is more sensitive, requires less library amplification, and produces higher fidelity data.


Message #1: Quality Scores

- Illumina data quality is guaranteed at Q30, a probability of 1 error in 1,000 bases. Ion Torrent data quality is at Q17, or about 2 errors every 100 bases (at best Q20, with 1 error every 100 bases).
- Will that work for your application?


Definitions - Quality Scores

What does a Q score mean?

- Each base call is assigned a quality score. The score is a logarithmic (Phred) measure of the probability that the call is incorrect: Q = -10 log10(P).
- For example, a base with a Q40 score has a probability of 1 in 10,000 of being an incorrect base call.

Phred Quality Score   Probability of Incorrect Base Call   Base Call Accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1,000                           99.9%
40                    1 in 10,000                          99.99%
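The table rows follow from the standard Phred relationship Q = -10 log10(P). The sketch below reproduces them and adds Q17 for comparison. The function names are illustrative and do not come from any vendor software.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)

for q in (10, 17, 20, 30, 40):
    p = phred_to_error_prob(q)
    # 1 / p is the average number of bases per expected error
    print(f"Q{q}: about 1 error in {1 / p:,.0f} bases, accuracy {100 * (1 - p):.2f}%")
```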

Our Quality Guarantee: >75% of bases >Q30 at 2 x 150
Ion Torrent Quality Scores: Q17-Q20

Definitions - Quality Scores

Why is base quality score important?

Reads at one position: A A T T T T T T T T T T T (11/13 T, 2/13 A)

- The consensus is T.
- Do the As represent a SNP?
- Do the As represent a sequencing error?

Base quality score determines whether this is a SNP or an error!

- If the As are at Q30, we flag them as SNPs.
- If the As are at Q17, we discard them as errors.

What if you sequenced this on the Ion Torrent with an average Q score of 17?

- You will call more false positive SNPs.
- You will spend more time and money on downstream validation.
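To put numbers on the 2-in-13 example, the sketch below estimates how likely it is that two or more of 13 reads show the same wrong base through sequencing error alone. It assumes independent errors, a single per-base error rate taken from the Q score, and that a miscall is equally likely to be any of the three other bases. These are simplifying assumptions for illustration, not the model of any particular variant caller, and the function names are hypothetical.

```python
from math import comb

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def prob_at_least_k_same_miscalls(n_reads, k, q):
    """Binomial tail: chance that k or more of n_reads are miscalled as one specific base."""
    # Assumption: an error lands on each of the 3 alternative bases with equal probability
    p = phred_to_error_prob(q) / 3
    below_k = sum(comb(n_reads, i) * p**i * (1 - p) ** (n_reads - i) for i in range(k))
    return 1 - below_k

for q in (17, 30):
    prob = prob_at_least_k_same_miscalls(13, 2, q)
    print(f"Q{q}: P(2 or more of 13 reads miscalled as A) = {prob:.2e}")
```

Under these assumptions, two matching miscalls are a few hundred times more likely at Q17 than at Q30, which is why the same two As can be credible at Q30 and suspect at Q17.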

Definitions - Quality Scores

Why is base quality score critical for cancer sequencing?

- Few solid tumors that are sequenced are pure.
- Normal tissue contamination can be as high as 95%.

Example: Matched tumor/normal samples were sequenced. Results at a certain base position were:

- Normal sample: 15 As, 15 Ts
- Tumor sample: 12 As, 15 Ts, 6 Gs, and 3 Cs

How do you call it?

- The normal sample is a heterozygote A/T.
- Without high BASE QUALITY scores, the tumor sample is impossible to call accurately.
- A CONSENSUS ACCURACY score such as the one Ion Torrent (IT) generates does not distinguish between bona fide SNPs and errors in this example.
- Q scores of 30 or higher are critical for accurate calling of SNPs in complex samples.
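As a rough illustration of the tumor example, the sketch below tallies the allele fractions from the counts on the slide and compares them with the number of miscalled reads one would expect at that depth. It assumes every read at the position carries the same Q score, which real data will not; the names are hypothetical.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# Tumor sample counts from the slide
tumor_counts = {"A": 12, "T": 15, "G": 6, "C": 3}
depth = sum(tumor_counts.values())

# Fraction of reads supporting each base at this position
allele_fractions = {base: n / depth for base, n in tumor_counts.items()}

def expected_miscalls(depth, q):
    """Expected number of wrong base calls among `depth` reads at uniform quality q."""
    return depth * phred_to_error_prob(q)

for base, frac in allele_fractions.items():
    print(f"{base}: {tumor_counts[base]}/{depth} reads = {frac:.1%}")
for q in (17, 30):
    print(f"Q{q}: expected miscalled reads at this position = {expected_miscalls(depth, q):.3f}")
```

The G and C counts are small enough that the expected error count matters: the lower the per-base quality, the harder it is to separate a low-frequency tumor allele from noise.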

Data - Quality Scores

Comparison of Enterococcus faecium de novo Sequencing Data

                          MiSeq (2Gb spec)        PGM (316-100 bp)
Raw read length           2 x 151 (overlapping)   100 (nominal)
Processed read length     220*                    120
No. of reads (millions)   17.4                    1.47
Yield (Gb) (Quality)      2.633 (84.2% >Q30)      0.197 (Avg. Q20)
Genome coverage           877x                    66x
Assembler                 Velvet 1.1.06 (k=95)    Newbler 2.6
Genome Assembly (bp)      3,053,394               2,980,036
N50                       45,964                  22,750
Raw SNP Count             9,709                   5,636

Work performed in the laboratory of Dr. Tim Stinear, University of Melbourne
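The genome coverage figures can be sanity-checked as yield divided by genome size. The sketch below uses each platform's assembly size as a stand-in for genome size, which is an assumption: the slide does not say which denominator was used, so the results land close to the table values rather than exactly on them.

```python
def fold_coverage(yield_gb, genome_size_bp):
    """Average fold coverage: sequenced bases divided by genome size."""
    return yield_gb * 1e9 / genome_size_bp

# Yield and assembly size taken from the comparison table
runs = {
    "MiSeq": {"yield_gb": 2.633, "assembly_bp": 3_053_394},
    "PGM": {"yield_gb": 0.197, "assembly_bp": 2_980_036},
}

coverage = {name: fold_coverage(r["yield_gb"], r["assembly_bp"]) for name, r in runs.items()}
for name, cov in coverage.items():
    print(f"{name}: about {cov:.0f}x")
```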

From the blogosphere - Quality Scores

Dan Koboldt, MassGenomics Blog. June 22, 2011.

- Analyzed public IT data: DH10B sequenced on a 316 chip.
- Average base quality is 23.
- A lot of base quality values of 8 (he wondered if this is equivalent to Illumina Q2 scores, indicating virtually no confidence in the base).


Message #2: Ion Torrent has high indel error rates

- Illumina chemistry sequences homopolymers accurately, while Ion Torrent technology has an inherent problem in reading through homopolymers, as do pyrosequencing and other flow-based technologies. This means their indel error rate is high.
- Ion Torrent omits indels from their quality analysis!

Why should a high indel error rate matter?

- This means you will have a much higher number of false positive SNPs. You need to do more downstream validation, which means the experiment will cost you more!


Definitions - Ion Torrent has high indel error rates

Illumina Proven Reversible Terminator Chemistry

- All 4 labeled nucleotides in 1 reaction
- Higher accuracy
- No problems with homopolymer repeats

[Figure: reversible terminator nucleotide showing the fluor, its cleavage site, and the 3' block. Cycle steps: Incorporation; Detection; Deblock and fluor removal (leaving a free 3' OH end); Next cycle.]

Data - Ion Torrent has high indel error rates

MiSeq has no issues with homopolymers

[Figure: BGI HiSeq data from recent E. coli outbreak]
[Figure: BGI IT data from recent E. coli outbreak]


From the blogosphere - Ion Torrent has high indel error rates

Ion Torrent does not factor indels into its data quality analysis

Keith Robison, Omics! Omics! Blog. Sept 14, 2011.

"omitting indels from the quality analysis hits close to home, because I've been guilty of this too. Partly this was it was easy to avoid them at first, and partly it stems from the fact that indels don't really fit into the phred score paradigm I've been using (that's a whole 'nother stalled blog post). I've tried to be up front about that, but it is certainly an issue. In some applications the homopolymer reads can be seen as just a tax on your data. For example, if I know I'm only looking for activating substitutions in an oncogene, those must be in frame and I can discard the reads with indels in the vicinity of my codon(s) of interest. But, in most cases they really are an issue."

- Why do they get away with it?
- Yet he acknowledges that, in most cases, indels really are an issue.


From the blogosphere - Ion Torrent has high indel error rates

Impact on SNP calls

Dan Koboldt, MassGenomics Blog. June 22, 2011.

- Analyzed public IT data: DH10B sequenced on a 316 chip.
- Because it is a laboratory reference strain, it is expected to be genetically homogeneous, and its sequence should be identical to the reference.
- Any apparent SNPs or indels are due to sequencing errors.
- VarScan detected an astonishing 1,122,276 insertion/deletion events, reflecting an indel error rate of 0.726%, or about eight-fold higher than the substitution error rate.
- VarScan removed 87.6% of indels as clear artifacts in homopolymer runs of 4 or more bases.

What about SNPs?

- Found 142,920 SNPs (remember, the sequence should be identical to the reference!)

Conclusion:

- Homopolymers are obviously an issue for this platform.
- 28% of the erroneous SNPs were in homopolymer regions.
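The "homopolymer runs of 4 or more bases" criterion can be sketched as a simple scan of the reference. This is a minimal illustration of the idea only; it is not VarScan's actual filter, and the function names and the toy reference sequence are made up for the example.

```python
import re

def homopolymer_runs(seq, min_len=4):
    """Return (start, end, base) for each run of min_len or more identical bases.

    Coordinates are 0-based with an exclusive end.
    """
    # One base followed by at least (min_len - 1) repeats of the same base
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]

def in_homopolymer(pos, runs):
    """True if the 0-based position falls inside any listed run."""
    return any(start <= pos < end for start, end, _ in runs)

reference = "ACGTTTTTACGAAAAGCTAGC"  # toy sequence, not real data
runs = homopolymer_runs(reference)
print(runs)
print(in_homopolymer(5, runs), in_homopolymer(1, runs))
```

A variant call whose position falls inside one of these runs would be a candidate for the kind of artifact filtering described above.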


Message #3: Single base call accuracy

- We measure the accuracy of every base in every read. Ion Torrent uses consensus call accuracy and base call recalibration to measure their accuracy. This means they rely on a stack of reads and a reference genome to recalibrate their quality score.
- What if you don't have a reference genome?
- What if your application requires that you perform de novo sequencing?
- What if you're interested in metagenomics?
- Why should any microbiology lab ever consider an Ion Torrent?


Definitions - Single base call accuracy

Measuring Accuracy

- Illumina uses single base call accuracy to assess data quality.
- Ion Torrent uses consensus accuracy (looking at the accuracy of a whole stack of reads at a given genomic position).

[Figure: single base call accuracy vs. consensus call accuracy. Labels: Single Read; Reference Genome; stacked base calls T T T T T A]
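The difference between the two measures can be shown on the six stacked calls from the figure. The sketch below uses a plain majority vote for the consensus and assumes the true base is T; both are illustrative assumptions, not a description of either vendor's pipeline.

```python
from collections import Counter

def consensus_call(pileup):
    """Majority-vote base across a stack of reads, plus the fraction supporting it."""
    base, count = Counter(pileup).most_common(1)[0]
    return base, count / len(pileup)

def single_base_accuracy(pileup, true_base):
    """Fraction of individual base calls that match the true base."""
    return sum(b == true_base for b in pileup) / len(pileup)

pileup = ["T", "T", "T", "T", "T", "A"]  # stacked calls from the figure
base, support = consensus_call(pileup)
print(f"Consensus: {base} (supported by {support:.0%} of reads)")
print(f"Single base call accuracy, assuming the true base is T: {single_base_accuracy(pileup, 'T'):.1%}")
```

The consensus comes out correct even though one of the six individual calls is wrong, so a consensus figure alone says little about the quality of any single read.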


Data - Single base call accuracy: de novo assembly

Comparison of Enterococcus faecium de novo Sequencing Data

                          MiSeq (2Gb spec)        PGM (316-100 bp)
Raw read length           2 x 151 (overlapping)   100 (nominal)
Processed read length     220*                    120
No. of reads (millions)   17.4                    1.47
Yield (Gb) (Quality)      2.633 (84.2% >Q30)      0.197 (Avg. Q20)
Genome coverage           877x                    66x
Assembler                 Velvet 1.1.06 (k=95)    Newbler 2.6
Genome Assembly (bp)      3,053,394               2,980,036
N50                       45,964                  22,750
Raw SNP Count             9,709                   5,636

Work performed in the laboratory of Dr. Tim Stinear, University of Melbourne
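N50 in the table is the usual assembly contiguity metric: the contig length at which contigs of that size or larger contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths are invented for illustration and are not from the E. faecium assemblies.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

example_contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths
print(n50(example_contigs))
```

A higher N50 means the same genome was assembled into fewer, longer pieces.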

Message #4: Our sequencing is technically superior

- It's a photon vs. proton argument.
- Proton sensing is novel and interesting, but technically inferior for DNA sequencing.

Optical DNA sequencing:

- 500x more sensitive, with a better signal-to-noise ratio (SNR)
- Requires less amplification
- Produces higher fidelity data
- Safer choice when the stakes are high


Message #4: Our sequencing is technically superior

- Ion sensing requires a massive amount of PCR to overcome the native background signal in solution.
- Native PCR introduces 1 error every 9,000 bases.
- Sequence detection is analog, requiring high-speed computing to call bases.

Figure S1: Process overview. a, Overview of ion sequencing work flow. b, Prepare genomic library, DNA is fragmented, ...

QUESTIONS?
