
Massively parallel sequencing for biodiversity science

Annie Archambault, Centre de la Science de la Biodiversité du Québec

qcbs.ca
April 2013

Synonyms:
Next generation sequencing (NGS), Massively parallel sequencing, High throughput sequencing, 2nd or 3rd generation sequencing

Parallelize the sequencing process:
Producing thousands of short sequencing reads at once
REPLACES CLONING AND CLONE SCREENING
REPLACES INDIVIDUAL SEQUENCING REACTIONS

Outline
Uses for biodiversity studies
Very brief review of the 4 main platforms
Examples of experimental strategies (complexity reduction and multiplexing)
Laboratory steps and costs for 4 case studies

Disclaimer:
I still have limited experience with these instruments; my understanding comes from intensive reading.

Useful reading
Review of the chemistry and the workflow:
Myllykangas S, Buenrostro J, Ji HP: Overview of Sequencing Technology Platforms. In Bioinformatics for High Throughput Sequencing. Springer New York; 2012: 11-25.
http://www.springerlink.com/content/n6u33m1335750g57/

Review of technologies and applications in biodiversity:
Purdy KJ, Hurd PJ, Moya-Laraño J, Trimmer M, Oakley BB, Woodward G: Systems Biology for Ecology: From Molecules to Ecosystems. In Advances in Ecological Research. 2010: 87-149.
http://linkinghub.elsevier.com/retrieve/pii/B9780123850058000034

Instruments comparison

GS-FLX+ (454)
  Amplification, detection: emulsion PCR; pyrosequencing
  Detection: fluorescence
  Step: during synthesis
  At GQ Innovation Center: yes
  Unit: 1 plate (divided in )

HiSeq (Illumina)
  Amplification, detection: bridge PCR; wash after every base
  Detection: fluorescence
  Step: during synthesis
  At GQ Innovation Center: yes
  Unit: flow cell of 8 lanes

Ion PGM Sequencer (Life Technologies), 314, 316 chip
  Amplification, detection: emulsion PCR; pyrosequencing-like
  Detection: H+ ions
  Step: during synthesis
  At GQ Innovation Center: yes
  Unit: chip

PacBio RS
  Amplification, detection: no prior amplification; single-molecule real-time sequencing (SMRT)
  Detection: fluorescence
  Step: during synthesis
  At GQ Innovation Center: no
  Unit: cell

Visuals

454 GS FLX
http://454.com/products/technology.asp
http://bcove.me/7eidiq1e?width=490&height=274

HiSeq
http://www.youtube.com/watch?v=77r5p8IBwJk

Ion PGM
http://www.youtube.com/watch?v=yVf2295JqUg&feature=plcp&context=C4897380VDvjVQa1PpcFPcv91xP1YGJ31VyENe915toprCBsg2Jc%3D

PacBio RS
http://www.youtube.com/watch?v=NHCJ8PtYCFc&feature=related


Instrument comparison

GS-FLX+ (454)
  Nb reads per unit: 1 million per plate
  Read length: 350-500 bp
  Run time: 20 h
  Cost: library prep 160 $; per plate 8 200 $
  $ per Mb*: 7 $
  Preferred uses: amplicon sequencing; initial characterization; non-model species
  Type of errors: indels

HiSeq (Illumina)
  Nb reads per unit: ~200 million per lane
  Read length: 50 bp, 100 bp, 150 bp
  Run time: 8 days
  Cost: library prep 160 $; per lane 715 $ to 2 100 $ (depending on read length)
  $ per Mb*: 0.1 $
  Preferred uses: re-sequencing; frequency-based applications; NOT amplicons
  Type of errors: substitutions

Ion PGM (314, 316 or 318 chip)
  Nb reads per unit: 314: 100 000; 316: 1 million; 318: 10 million
  Read length: 35-400 bp
  Run time: 1.5-7 h
  $ per Mb*: 50 $
  Preferred uses: individual laboratories, small scale
  Type of errors: indels

PacBio RS
  Nb reads per unit: 50 000 reads per cell
  Read length: 6000 bp
  Run time: 2 h
  Cost: ~750 $ USD per sample
  $ per Mb*: 11 200 $
  Preferred uses: non-model species, long fragments, methylated fragments
  Type of errors: CG deletions, high error rates

*Cost estimate: Glenn TC. 2011. Field guide to next-generation DNA sequencers. Molecular Ecology Resources 11: 759-769.
http://onlinelibrary.wiley.com/doi/10.1111/j.1755-0998.2011.03024.x/abstract

Quantity instead of length or quality
Each read is short (75-200 bp) and bears errors: each base needs to be confirmed by many reads covering the same template region.

Template: long templates (gDNA)
  Library preparation: gDNA fragmentation + adaptors
Template: short amplicon templates
  Library preparation: amplification + adaptors


[Figure: reads are assembled or mapped by similarity; regions reaching 8X coverage yield the deduced template sequence, and some reads are excluded from further analyses.]
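The idea that many error-prone reads jointly confirm one template can be sketched with a simple majority vote per position. This is only an illustrative sketch: it assumes the reads are already mapped without gaps, and it assumes that positions below the 8X threshold are the ones left uncalled. The function name `consensus` and the `(start, sequence)` input format are hypothetical.

```python
from collections import Counter

MIN_COVERAGE = 8  # the 8X threshold shown on the slide


def consensus(aligned_reads, template_length, min_coverage=MIN_COVERAGE):
    """Majority-vote consensus from gap-free mapped reads.

    aligned_reads: list of (start, sequence) tuples (hypothetical format).
    Positions covered by fewer than min_coverage reads are reported as "N".
    """
    columns = [Counter() for _ in range(template_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < template_length:
                columns[pos][base] += 1

    called = []
    for counts in columns:
        depth = sum(counts.values())
        if depth < min_coverage:
            called.append("N")  # not enough reads to trust this position
        else:
            called.append(counts.most_common(1)[0][0])
    return "".join(called)


# Seven correct reads and one read carrying a sequencing error (G > T).
reads = [(0, "ACGT")] * 7 + [(0, "ACTT")]
print(consensus(reads, template_length=6))
```

The single erroneous base is outvoted, while the two uncovered positions at the end stay uncalled.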

Useful in biodiversity?
How to make use of 200 million reads for your biological question?
Be strategic!

Reduce the complexity of the genetic material analyzed
Combine different samples into a single run (multiplexing)

Strategies: Multiplexing
Incorporate specific KNOWN oligos (code or index) at the beginning of each fragment:
During library preparation
Read at sequencing
Sorted afterwards by deconvolution, according to the code

Roche 454: 30 (up to 130) multiplex identifiers (MID), 10 bp
Illumina: 12 index sequences, 6 bp

[Figure: Samples 1, 2 and 3 are pooled in one tube, sequenced in a single run, then sorted according to their coded sequence back into Samples 1, 2 and 3.]

Depth of coverage: GS-FLX plate = 250 000 reads / 25 barcodes: 10 000 reads per sample. Enough for you?
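Sorting pooled reads back to their samples, and the depth-of-coverage arithmetic, can be sketched as follows. This is a minimal illustration assuming exact barcode matches at the start of each read; the 6-bp barcodes and sample names are made up for the example and are not real MID or Illumina index sequences.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Sort reads by their leading barcode and trim the barcode off."""
    barcode_length = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_length])
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(read[barcode_length:])
    return dict(by_sample)


def reads_per_sample(reads_per_run, n_barcodes):
    """Expected depth per sample if the pool is perfectly balanced."""
    return reads_per_run // n_barcodes


# Hypothetical barcodes, for illustration only.
barcodes = {"ACGTAC": "sample1", "TGCATG": "sample2"}
pooled = ["ACGTACGGGG", "TGCATGCCCC", "ACGTACTTTT", "NNNNNNAAAA"]
print(demultiplex(pooled, barcodes))
print(reads_per_sample(250_000, 25))  # the GS-FLX example from the slide
```

In practice pools are never perfectly balanced, so the per-sample figure is an upper-level planning number rather than a guarantee.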

Strategies: Multiplexing
Incorporate specific KNOWN oligos (code or index) at the beginning of the UNKNOWN fragment, during library preparation.
Example of Roche 10 bp MID barcode for amplicon sequencing:

5'-CTCGTAGACTGCGTACCAATTC.............TTACTCAGGACTCAT-3'
3'-CATCTGACGCATGGTTAAG.............AATGAGTCCTGAGTAGCAG-5'

Primer LibL_B with TargetSpecific (no MID): 3'-CAATGAGTCCTGAGTAG-TargetSpecific
Primer LibL_A with MID3 with TargetSpecific: TargetSpecific-GACTGCGTACCAATTC-3'

Strategies: complexity reduction

A few organisms (1 to a few hundred): survey a few thousand loci per sample
  Enrich in gene-rich regions for gDNA sequencing
  Random genomic survey
  Transcriptome sequencing

Very many organisms (e.g. environmental studies): survey one or two loci per individual
  Amplicon sequencing with universal primers (PCR)

By hybridization
Be creative!
A few organisms:
Enrich in simple sequence repeats: hybridization to target repeats (e.g. microsatellite loci)
Enrich in gene-rich regions for genomic DNA sequencing: hybridization to a reference set of genes (e.g. target exons)

Steps:
DNA fragmentation
Hybridization to custom-made baits on beads: enrich in specific fragments (e.g. exons); bound fragments are retained, unbound fragments are discarded
Sequence the enriched pool

! Evaluate costs carefully
From 2008 to 2013?
Instruments give higher throughput
Each sequencing run is cheaper
It may be cheaper not to target specific regions

By methylation-sensitive RE
Be creative!
A few organisms: enrich in gene-rich regions for genomic DNA sequencing by eliminating methylation-rich regions (repetitive elements in plants)

Steps:
Nuclear DNA fragmentation
Insert in E. coli, which digests methylated DNA
Sequence the enriched pool

By amplification
One or a few organisms: randomly sample the whole genome
Amplification: AFLP-like, but reading sequence instead of length polymorphism
ddRAD: double digest restriction-site-associated DNA sequencing, to find SNPs
Powerful when coupled with multiplexing

Steps:
DNA fragmentation with two restriction enzymes (Enz. A and Enz. B)
Adaptor ligation (adaptor with index)
Amplification with adaptor primers
Multiplex: pool Sample 1, Sample 2, ...
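How a double digest reduces genome complexity can be sketched in silico: cut a sequence at every site of two enzymes and keep only the fragments that are flanked by one site of each enzyme and fall inside a size window. This is a simplified sketch under stated assumptions: cuts are placed at the start of each recognition site (real enzymes cut inside the site), the sequence is treated as single-stranded, and the function name `double_digest` is hypothetical. GAATTC and CCGG are the EcoRI and MspI recognition sites, used here only as an example pair.

```python
import re


def double_digest(genome, site_a, site_b, min_len, max_len):
    """Return fragments bounded by one A site and one B site, within a size window."""
    cuts = [(m.start(), "A") for m in re.finditer(re.escape(site_a), genome)]
    cuts += [(m.start(), "B") for m in re.finditer(re.escape(site_b), genome)]
    cuts.sort()

    fragments = []
    # Adjacent cut sites delimit one fragment each.
    for (pos1, enz1), (pos2, enz2) in zip(cuts, cuts[1:]):
        length = pos2 - pos1
        if enz1 != enz2 and min_len <= length <= max_len:
            fragments.append(genome[pos1:pos2])
    return fragments


# Toy sequence: A site, B site, B site, A site.
toy = "GAATTC" + "A" * 10 + "CCGG" + "T" * 4 + "CCGG" + "A" * 20 + "GAATTC"
for fragment in double_digest(toy, "GAATTC", "CCGG", min_len=10, max_len=30):
    print(len(fragment), fragment)
```

Narrowing the size window (for example `max_len=20`) drops the longer A-B fragment, which is the lever used to tune how many loci are sampled.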

Genome complexity reduction: RNA
A few samples:
Transcription (DNA > RNA), translation (RNA > protein)
Transcriptome sequencing:
  Total RNA: RNA-seq
  Reduce to mRNA only (polyA)
  Reduce to microRNA only

! Expression is driven by external conditions and by tissue type
Needs a high number of reads: Illumina preferred

Genome complexity reduction: RNA
A few organisms:
Reminder: mRNA sequences include non-coding regions (UTR)
[Figure: gene structure (5' UTR, exon, intron, exon, 3' UTR) and the corresponding mRNA (5' UTR, CDS, 3' UTR, poly-A tail AAAAAAA).]

Genome complexity reduction: Amplicon
Very many organisms:
Amplicon sequencing with universal primers for ONE locus
Environmental samples targeting ITS, 16S, CO1 (the barcode loci)
Limitation: primers may not amplify equally well in ALL target organisms
[Figure: the primers anneal to the templates of some organisms but DO NOT anneal to others.]

Case studies in biodiversity

Bartram et al. 2011. Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol.
http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Castoe et al. 2011. Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PLoS ONE.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953

Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19.
http://www.biomedcentral.com/1741-7007/9/19

Bacterial communities
Bartram et al.: Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol.
http://aem.asm.org/content/early/2011/04/01/AEM.02772-10

Objective: develop a protocol for community genetic diversity (tested on two samples; costs given for 20 samples)

Material:
Soils from arctic tundra. Total DNA extracted with FastDNA (MP Biomedicals): 20 X 5 $ = 100 $
Also includes a control bacterial mix in liquid media.

Bacterial communities
Molecular steps
Primer for hypervariable region 3 (V3) of the microbial 16S rRNA gene; 81 bp, PAGE purified: 25 X 67 $ = 1 675 $
caagcagaagacggcatacgagatCGTGATgtgactggagttcagacgtgtgctcttccgatctATTACCGCGGCTGCTGG
Primer parts: flow-cell-binding, index, Illumina primer, target gene

Amplify with high-fidelity polymerase (Phusion): 1 X 90 $ = 90 $
Extract desired length, 200-250 bp (columns): 25 X 1.5 $ = 40 $
Multiplexing: Yes, including technical replicates
Quality control for libraries (e.g. Agilent Bioanalyzer): 25 X 50 $ = 1 250 $
Sequencing: paired-end 2 x 125 bp, Illumina GAIIx (would now be HiSeq): 1 X 2 090 $ = 2 090 $
Total = 5 250 $
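The modular layout of the 81-bp fusion primer can be checked by rebuilding it from its parts. The split into four segments below follows the upper/lower case changes in the sequence as printed on the slide and the four labels given there; treating those case changes as segment boundaries is an assumption, and the replacement index `ACATCG` is a made-up example rather than one of the indexes used in the study.

```python
# Segments inferred from the case changes in the primer shown on the slide.
PRIMER_PARTS = [
    ("flow-cell-binding", "caagcagaagacggcatacgagat"),
    ("index", "CGTGAT"),
    ("Illumina primer", "gtgactggagttcagacgtgtgctcttccgatct"),
    ("target gene (16S V3)", "ATTACCGCGGCTGCTGG"),
]


def build_primer(index=None):
    """Concatenate the primer parts, optionally swapping in another 6-bp index."""
    pieces = []
    for name, seq in PRIMER_PARTS:
        if name == "index" and index is not None:
            if len(index) != len(seq):
                raise ValueError("index must keep the original length")
            seq = index
        pieces.append(seq)
    return "".join(pieces)


for name, seq in PRIMER_PARTS:
    print(f"{name:22s} {len(seq):3d} bp")
print("total", len(build_primer()), "bp")
print(build_primer(index="ACATCG"))  # hypothetical index for a second sample
```

Only the 6-bp index changes between samples, which is why one primer has to be ordered per multiplexed library.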

Bacterial communities
Sequence analyses
Bioinformatics:
Base calling and error estimation: Illumina Analysis Pipeline
Quality filtering, sorting of reads according to index sequence, contig assembly (custom made, PANDAseq)
Discard read pairs with:
  1 or more mismatches between the two overlapping fragments of the paired-end reads
  1 or more ambiguous bases
Assignment of taxonomic affiliations: naive Bayesian classification (Ribosomal Database Project, RDP classifier), cutoff 0.5
Good's coverage for each library to estimate sequence coverage (C = 1 - n1/N)
CD-HIT to cluster the arctic tundra datasets at 97% sequence identity

[Figure: raw reads > sorting by index sequence and assembly of paired-end reads (custom program) > clustering, modified single linkage (CD-HIT) > classification / diversity estimate (RDP / QIIME).]
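Good's coverage, C = 1 - n1/N, is simple to compute once sequences are clustered: n1 is the number of clusters that contain a single sequence and N is the total number of sequences. The sketch below assumes the input is just a list of cluster sizes; the function name and the example numbers are illustrative, not values from the study.

```python
def goods_coverage(cluster_sizes):
    """Good's coverage C = 1 - n1/N from a list of cluster sizes."""
    n_total = sum(cluster_sizes)
    if n_total == 0:
        raise ValueError("no sequences")
    n_singletons = sum(1 for size in cluster_sizes if size == 1)
    return 1 - n_singletons / n_total


# Four clusters holding 5, 3, 1 and 1 sequences: two singletons out of ten reads.
print(goods_coverage([5, 3, 1, 1]))
```

A value close to 1 means that few sequences fall in clusters seen only once, that is, further sequencing is unlikely to reveal many new clusters.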

Bacterial communities
Results
Total of 12 million raw reads; about 50% of the reads are discarded:
  Raw reads: 7.6 million and 4.4 million for the two technical replicates
  Post-assembly: 4.1 million and 2.4 million for the two technical replicates
Average post-assembly contig: 150 ± 11 bases (without primers); overlap: 66 ± 11 bases
Pre-clustering at 97% sequence identity
Estimated error rate (from the control library): 1 error per 5 contigs (1%/base), higher than Sanger sequencing
A contaminant was found in the growth media of a control
Duplicate arctic tundra libraries displayed a high degree of similarity:
  Comparison of phyla between the two libraries (AT1 to AT2): r = 0.999
  The majority of sequence clusters (99.57%) detected in both replicates
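The assembly of overlapping read pairs into one contig, with pairs discarded when the overlap disagrees or a base is ambiguous, can be sketched as below. This is not the PANDAseq algorithm, which scores overlaps probabilistically; it is a strict toy version that accepts only a perfect overlap, and the function names and `min_overlap` default are assumptions made for the illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair on its longest perfect overlap, else return None."""
    rev = reverse_complement(reverse)
    if "N" in forward or "N" in rev:
        return None  # discard pairs with ambiguous bases
    for overlap in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        if forward[-overlap:] == rev[:overlap]:
            return forward + rev[overlap:]
    return None  # no perfect overlap: discard the pair


# Toy 24-bp amplicon read from both ends with 16-bp reads (8-bp overlap).
amplicon = "ACGTTGCAAGGCTTAACCGGTATC"
fwd = amplicon[:16]
rev = reverse_complement(amplicon[8:])
print(merge_pair(fwd, rev, min_overlap=6))
```

Because each base in the overlap is read twice, requiring agreement there is a cheap way to lower the error rate of the assembled contigs, at the cost of discarding many pairs.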

Isolation of novel microsatellite loci

Castoe et al. 2011. Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PLoS ONE.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953

Objective: Discover novel microsatellite loci from diverse organisms
Material: Total genomic DNA from many different organisms (results are reported for only 3 species; I calculate costs for 8)

Also read: Jennings et al. 2011. Multiplexed Microsatellite Recovery Using Massively Parallel Sequencing. Molecular Ecology Resources.
http://doi.wiley.com/10.1111/j.1755-0998.2011.03033.x

Isolation of novel microsatellite loci

Genome complexity reduction: None, direct sequencing
Material: Genomic DNA (5 µg), one individual per species: 8 X 5 $ = 40 $
Library preparation for Illumina sequencing (likely on ~8 to 10 species): 8 X 160 $ = 1 280 $
Multiplexing: Yes, at the sequencing facility, during library preparation
Sequencing platform: Illumina GAIIx, 120 bp paired-end (would now be HiSeq 2000): 1 X 2 090 $ = 2 090 $
Total: 3 400 $
Primers then need to be ordered for each locus: ~700 $ for 50 loci, up to 5 500 $ for 8 species
Total (with primers): 9 000 $

Isolation of novel microsatellite loci

Bioinformatics: Simple, no assembly, no comparison to a reference genome
In a Perl script:
  Identify reads that contain perfect SSRs: 2-mer to 6-mer motifs, repeated at least 6 times
  Sort by SSR type (de-multiplex)
  Design primers (with Primer3)
  Discard the primer pairs that also occur in other reads
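The first step, finding reads with perfect repeats of a 2- to 6-bp motif repeated at least 6 times, can be sketched with a back-referencing regular expression. This is a Python illustration of the rule stated on the slide, not the authors' Perl script; skipping single-base runs such as AAAA... is an added assumption.

```python
import re

# A 2-6 bp motif (captured), followed by at least 5 more copies of itself.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{5,}")


def find_perfect_ssrs(read):
    """Return (motif, repeat_count, start) for each perfect SSR in a read."""
    hits = []
    for match in SSR_PATTERN.finditer(read):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # assumption: ignore homopolymer runs
        repeats = len(match.group(0)) // len(motif)
        hits.append((motif, repeats, match.start()))
    return hits


print(find_perfect_ssrs("GGG" + "AC" * 8 + "TTT"))
print(find_perfect_ssrs("TT" + "GATA" * 6 + "CC"))
```

The lazy quantifier `{2,6}?` makes the shortest motif win, so `ACACAC...` is reported as an `AC` repeat rather than an `ACAC` repeat.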

Isolation of novel microsatellite loci

Results:
Number of raw reads: not reported
Used 5 million paired-end reads per sample (a 1X coverage)
Mean sequence length: not reported
Between 150 000 and 540 000 potential loci (containing microsatellites)
Primers designed for 72 000 to 174 000 loci, depending on species
With extra stringency (only 3- to 6-mers, >7 repeats): 200 to 2 000 loci
Primers not tested for amplifiability

Conclusions: Large variation in the number and proportion of motifs (3-mer, 4-mer) among the different organisms.

ddRAD-seq
Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PLoS ONE.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135

Objective: Aim at 10 000 SNPs, random and genome-wide, at 10X coverage

Material: Total DNA extracted from 54 P. leucopus, one population (Qiagen kits): 54 X 5 $ = 270 $

ddRAD-seq
Complexity reduction: Yes, complex: digestion, annealing, size selection, many purification steps
Multiplexing: Yes, 54 samples
From simulations that include genome size and nucleotide frequencies, they estimate a need for 400 000 reads per individual, for fragments of 300 ± 30 bp

Platform: Two lanes of GAII (now HiSeq 2000), paired-end: 2 X 2 010 $ = 4 020 $

ddRAD-seq

[Figure: structure of a ddRAD library fragment: PCRprimer1, adaptor P1 (Oligo1.1 / Oligo1.2, one of 48), gDNA insert, adaptor P2 (Oligo2.1 / Oligo2.2), PCRprimer2 (1 of 12).]
PCRprimer1 (46 bp): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACG

Many oligos, combining an index on the 5' side and on the 3' side: 110 X ~30 $ = 3 300 $
Digestions and PCR amplifications
Many purifications and precise size selection (Pippin Prep)
Enzymes + purification = 1 250 $
Big total = ~9 000 $

ddRAD-seq
Sequence analyses

Initial sequence processing:
  De-multiplex, accepting 1 bp mismatch in the 4 bp barcode
  Assign each read to a single individual
  Collapse identical reads to one sequence, retaining the frequency

No reference genome; not the Stacks package:
  Compute pairwise distances between all reads (BLAT)
  MCL to group similar reads (ortholog inference)
  Count unique sequences in a cluster (= locus); count how many are beyond the ploidy level (= % of error-containing reads)
  Align orthologs (MUSCLE)
  Write alignments as reference-ordered SAM/BAM files
  GATK UnifiedGenotyper, to genotype

Error rate: ranged from 0.18 to 0.22% per nucleotide (1/10 reads)
Technical replicates? No
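The initial processing (barcode assignment that tolerates one mismatch, then collapsing identical reads while keeping their frequency) can be sketched as follows. The two 4-bp barcodes are made up for the example; for a one-mismatch tolerance to be unambiguous, real barcodes must differ from each other at three or more positions. Leaving reads unassigned when they match zero or several barcodes is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatch=1):
    """Return the single barcode within max_mismatch of the read start, else None."""
    tag = read[:len(barcodes[0])]
    matches = [bc for bc in barcodes if hamming(tag, bc) <= max_mismatch]
    return matches[0] if len(matches) == 1 else None


def demultiplex_and_collapse(reads, barcodes):
    """Per barcode, count how often each distinct insert sequence occurs."""
    per_individual = defaultdict(Counter)
    for read in reads:
        barcode = assign_barcode(read, barcodes)
        if barcode is not None:
            per_individual[barcode][read[len(barcode):]] += 1
    return dict(per_individual)


barcodes = ["ACGT", "TGCA"]  # hypothetical 4-bp barcodes
reads = ["ACGTGGAATT", "ACGAGGAATT", "TGCAGGAATT", "GGGGGGAATT"]
print(demultiplex_and_collapse(reads, barcodes))
```

Collapsing to unique sequences with counts keeps the later all-against-all comparison tractable, since each distinct sequence is compared only once.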

ddRAD-seq results
The 54 wild Peromyscus from the same population
Total reads: not reported (~2 X 21 million)
Assigned to an individual: not reported
Discarded: 5.4% of reads
SNP discovery:
  Variable regions (loci): 6 200 found
  Polymorphic sites for >70% of individuals: 16 000 sites found
In an analysis on samples from different populations:
  SNPs in multi-SNP loci: >80%
  These multi-SNP loci are usually excluded in other analyses

Phylogenies with polyploids

Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19.
http://www.biomedcentral.com/1741-7007/9/19

Objective: Phylogenies in polyploid grasses with a recent, rapid radiation (in a time- and cost-effective experimental design)

Phylogenies with polyploids

Material: Total DNA extracted from 60 individuals of 11 different polyploid Poa species: 60 X 5 $ = 300 $
Complexity reduction:
Amplify 3 cp genes (rpl32-trnL, rpoB-trnC and trnH-psbA) and two nuclear genes (DMC1 and CDO504) from each of the 60 samples
  Primers: 10 X 5 $ = 50 $
  Enzymes: 1 X 150 $ = 150 $
Target amplicons are < 500 bp
Pool the 5 different PCR products for one individual

Phylogenies with polyploids

Multiplexing: Yes, addition of double-stranded adaptors with MID barcodes by ligation.
Design 64 different barcodes (with 3 technical replicates)
  64 X 2 X 9 $ = 1 160 $
  64 X 2 X 6 $ = 800 $

TitaniumAdapterA (25 bp) + MID barcode + T:
  CGTATCGCCTCCCTCGCGCCATCAG + ACGAGTGCGT + T
  GCATAGCGGAGGGAGCGCGGTAGTA   TGCTCACGCA
A + TitaniumAdapterB:
  A + CTGAGCGGGCTGGCAAGGCGCATAG
      GACTCGCCCGACCGTTCCGCGTATC

Ligation of barcode-adapters to the pools of amplicons: 1 X 100 $ = 100 $
Purification: 64 X 1.5 $ = 96 $
Pool, quality control: 1 X 50 $ = 50 $
Sequencing platform: plate of a Roche 454 with Titanium 2010 chemistry: 1 X 2 140 $ = 2 140 $
Total = ~4 900 $

Phylogenies with polyploids

Bioinformatics: Galaxy platform (free)
Sort the gene regions by regular expression (REGEX) matching of the gene-specific primers
Discard:
  low-quality reads
  short sequences
  reads matching no MID barcode
Calculate the error rate by counting SNPs in the chloroplast regions
Detect and discard PCR recombinants:
  Alleles that occurred at <5% for a species, OR
  Both ends of the allele do not match the same common allele
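Two of the steps above lend themselves to a short sketch: sorting reads into gene regions with regular expressions built from the gene-specific primers, and dropping alleles that occur at less than 5% within a species. The study ran these steps in Galaxy; the Python below is only an illustration, the primer patterns are invented placeholders (not the real DMC1 or CDO504 primers), and the second recombinant criterion (both allele ends matching the same common allele) is not implemented.

```python
import re

# Hypothetical primer patterns; [CT] shows how a degenerate base can be encoded.
GENE_PRIMERS = {
    "DMC1": re.compile(r"^ACGT[CT]G"),
    "CDO504": re.compile(r"^GGCCTT"),
}


def sort_by_gene(reads, primers=GENE_PRIMERS):
    """Group reads by the first gene-specific primer pattern they start with."""
    sorted_reads = {gene: [] for gene in primers}
    sorted_reads["unmatched"] = []
    for read in reads:
        for gene, pattern in primers.items():
            if pattern.match(read):
                sorted_reads[gene].append(read)
                break
        else:
            sorted_reads["unmatched"].append(read)
    return sorted_reads


def filter_rare_alleles(allele_counts, min_fraction=0.05):
    """Keep alleles that make up at least min_fraction of the reads."""
    total = sum(allele_counts.values())
    return {a: n for a, n in allele_counts.items() if n / total >= min_fraction}


print(sort_by_gene(["ACGTTGAAA", "GGCCTTAAA", "TTTTTT"]))
print(filter_rare_alleles({"allele1": 60, "allele2": 38, "allele3": 2}))
```

A fixed frequency cutoff is a blunt tool: it removes rare recombinants, but it would also remove a genuinely rare allele, which is one reason technical replicates are useful.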

Phylogenies with polyploids

Results:
121 000 raw reads. Length: 40 to 775 bp (mean 278 bp).
  111 200 (92%) match a gene-specific primer
  70 601 reads (58%) remained after barcode sorting and quality control
  Useful sequence for 281 out of 320 (88%) targets = 12% missing
Sequence error rate: 0.13%
PCR recombination: 2.9% of CDO504 reads and 14% of DMC1 reads
Technical replicates: P. costiniana: identical alleles (at the < 2 bp level), but one extra allele (= PCR error)
Two distinct copies (and more) of each nuclear gene deduced:
  DMC1 copies differ at 19 bases (4.0%) and CDO504 copies at 35 bases (8.5%), plus 7-bp and 4-bp indels
  One extra gene copy discovered for CDO504, showing a 57-bp deletion in an intron

Phylogenies with polyploids


Number of sequence reads obtained for each marker/individual combination.
A - After quality control and barcode deconvolution. B - Useful sequence reads remaining after alignment and editing.

Percentage of useful reads gained for each nuclear gene copy and allele, including recombinant reads.

Phylogenies with polyploids

Results: Phylogenetic analyses
Timing of polyploidization: took place before the Australian and the American species diverged
Extensive haplotype sharing between taxa currently treated as different species
Nuclear gene networks showed incongruence both with each other and with the chloroplast gene networks
Tasmania-mainland differentiation detected
On the local scale, strong spatial genetic structure detected using two of the chloroplast markers
  This suggests a smaller neighborhood for seed dispersal than for pollen dispersal

To remember
Diversity of protocols and experimental designs (Be creative!)
Budget: 3 000 $ to 9 000 $; the main costs can be primers and library preparation
Standards are rapidly increasing:
  Technical replicates required

Challenge: Data analysis
No standard analytical protocol (custom, in-house developed)
No standard calculation of error rate
Initial steps are computer intensive (30 million short reads)

Results:
Half of the reads are discarded
Many target loci will be missing
Unequal proportion of technical replicates in the final dataset
Prone to PCR recombination and chimeric assemblies

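Comparison

Arctic soil
  Platform: GA II, one lane
  Nb samples: 24
  Total cost: 5 250 $
  % reads retained: 53 %
  Nb clean unique reads: 4.1 million vs 2.4 million for the technical replicate
  Tech. reps: Yes
  Error rate: 1%
  Missed target: NA

Microsat
  Platform: GAIIx, one lane
  Nb samples: ? (~8)
  Total cost: 3 400 $ (without primers)
  Tech. reps: No
  Error rate: NA

SNP (ddRAD)
  Platform: GA II, two lanes
  Nb samples: 54
  Total cost: 9 000 $
  % reads retained: 95%
  Nb clean unique reads: 7 000 loci with SNPs
  Tech. reps: No
  Error rate: 0.18-0.22% per base

Polyploids
  Platform: GS FLX, plate
  Nb samples: 61
  Total cost: 4 900 $
  % reads retained: 58 %
  Nb clean unique reads: 70 601
  Tech. reps: Yes
  Error rate: 0.13%
  Missed target: 12 %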

Thank you!
