qcbs.ca
April 2013
Synonyms:
Next generation sequencing (NGS), Massively parallel sequencing, High throughput sequencing, 2nd or 3rd generation sequencing
Useful reading
Review of the chemistry and the workflow:
Myllykangas S, Buenrostro J, Ji HP: Overview of Sequencing Technology Platforms. In Bioinformatics for High Throughput Sequencing. Springer New York; 2012: 11-25.
http://www.springerlink.com/content/n6u33m1335750g57/
Instruments comparison
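Fields: platform; amplification and chemistry; detection; detection step; at GQ Innovation Center; unit
GS-FLX+ (454)
  Detection: fluorescence, during synthesis
  At GQ Innovation Center: yes
  Unit: 1 plate (divided in ...)
HiSeq (Illumina)
  Detection: fluorescence, during synthesis
  At GQ Innovation Center: yes
  Unit: flow cell of 8 lanes
Ion PGM Sequencer (Life Technologies), 314 and 316 chips
  Amplification and chemistry: emulsion PCR, pyrosequencing-like
  Detection: H+ ions, during synthesis
  At GQ Innovation Center: yes
  Unit: chip
PacBio RS
  Amplification and chemistry: no prior amplification; single-molecule real-time sequencing (SMRT)
  Detection: fluorescence, during synthesis
  At GQ Innovation Center: no
  Unit: cell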
Visuals
454 GS FLX
http://454.com/products/technology.asp
http://bcove.me/7eidiq1e?width=490&height=274
HiSeq
http://www.youtube.com/watch?v=77r5p8IBwJk
Ion PGM
http://www.youtube.com/watch?v=yVf2295JqUg&feature=plcp&context=C4897380VDvjVQa1PpcFPcv91xP1YGJ31VyENe915toprCBsg2Jc%3D
PacBio RS
http://www.youtube.com/watch?v=NHCJ8PtYCFc&feature=related
Instrument comparison
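Fields: platform; Nb reads per unit; read length; run time; cost; $ per Mb*; preferred uses; type of errors
GS-FLX+ (454)
  Nb reads per unit: 1 million per plate
  Read length: 350-500 bp
  Run time: 20 h
  Cost: library prep 160 $; per plate 8 200 $
  $ per Mb*: 7 $
  Type of errors: indels
HiSeq (Illumina)
  Read length: 50 bp, 100 bp, 150 bp
  Run time: 8 days
  Cost: library prep 160 $; per lane 715 $ to 2 100 $ (depending on read length)
  $ per Mb*: 0.1 $
  Type of errors: substitutions
Ion PGM
  Read length: 35-400 bp
  Run time: 1.5-7 h
  $ per Mb*: 50 $
  Type of errors: indels
PacBio RS
  Read length: 6000 bp
  Run time: 2 h
  $ per Mb*: 11 200 $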
*Cost estimate: Glenn TC. 2011. Field guide to next-generation DNA sequencers. Molecular Ecology Resources 11:759-769. http://onlinelibrary.wiley.com/doi/10.1111/j.1755-0998.2011.03024.x/abstract
Figure: reads, assembly/mapping by similarity, 8X coverage
Useful in biodiversity?
How can you make use of 200 million reads for your biological question?
Be strategic!
Reduce the complexity of the genetic material analyzed
Combine different samples into a single run (multiplexing)
Strategies: Multiplexing
Incorporate specific KNOWN oligos (code or index) at the beginning of each fragment:
Added during library preparation
Read at sequencing
Reads sorted by sequence deconvolution according to their code
Roche 454: 30 (up to 130) multiplex identifiers (MID), 10 bp
Illumina: 12 index sequences, 6 bp
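The sorting of reads by their code can be sketched in a few lines of Python. This is a minimal illustration, not a production demultiplexer: the function name `demultiplex`, the 6 bp indexes, the sample names and the reads are all hypothetical, and the sketch assumes exact barcode matches at the very start of each read (no error tolerance).

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Sort reads into samples by their leading barcode and trim the barcode off."""
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    k = lengths.pop()
    by_sample = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:k], "unassigned")
        # Keep unassigned reads whole so they can be inspected later.
        insert = read[k:] if sample != "unassigned" else read
        by_sample[sample].append(insert)
    return dict(by_sample)

# Hypothetical 6 bp indexes and reads, for illustration only.
barcodes = {"ATCACG": "sample_1", "CGATGT": "sample_2", "TTAGGC": "sample_3"}
reads = ["ATCACGGGTTAACC", "CGATGTACGTACGT", "ATCACGTTTTGGGG", "NNNNNNACGTACGT"]
sorted_reads = demultiplex(reads, barcodes)
```

Real pipelines usually tolerate one mismatch in the index; exact matching is used here only to keep the sketch short.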
A single run
Sample 1 Sample 2 Sample 3
Depth of coverage: GS-FLX plate = 250 000 reads / 25 barcodes: 10 000 reads per sample. Enough for you?
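To judge whether 10 000 reads per sample is enough, the arithmetic can be written out. The sketch below uses the plate and barcode numbers from the slide; the mean-coverage formula (reads x read length / target size) is the usual back-of-the-envelope estimate, and the 400 bp read length and 500 kb target size are assumptions chosen only for illustration.

```python
def reads_per_sample(reads_per_unit, n_barcodes):
    """Reads available to each sample if the unit is shared equally."""
    return reads_per_unit // n_barcodes

def mean_coverage(n_reads, read_length_bp, target_size_bp):
    """Expected average depth: total sequenced bases divided by target size."""
    return n_reads * read_length_bp / target_size_bp

# Numbers from the slide: one GS-FLX plate shared by 25 barcoded samples.
per_sample = reads_per_sample(250_000, 25)

# Assumed values (not from the slide): 400 bp reads, 500 kb target region.
depth = mean_coverage(per_sample, 400, 500_000)
```

In practice barcodes are never represented equally, so the per-sample figure is an upper bound for the least-represented samples.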
Strategies: Multiplexing
Incorporate specific KNOWN oligos (code or index) at the beginning of the UNKNOWN fragment, during library preparation
Example of Roche 10 bp MID barcode for Amplicon sequencing
5'-CTCGTAGACTGCGTACCAATTC.............TTACTCAGGACTCAT-3'
3'-CAATGAGTCCTGAGTAG
Target-specific
By hybridization
Be creative!
A few organisms:
Unbound (discarded)
Beads
DNA fragmentation
By methylation-sensitive RE
Be creative!
A few organisms: enrich for gene-rich regions in genomic DNA sequencing
Elimination of methylation-rich regions (repetitive elements in plants)
By amplification
One or a few organisms: randomly sample the whole genome
Amplification: AFLP-like, but with sequence polymorphism instead of length polymorphism
ddRAD: double digest restriction-site-associated DNA sequencing, to find SNPs
DNA fragmentation
Enz.A Enz.A Enz.B
Adaptor ligation
By amplification
One or a few organisms: randomly sample the whole genome
ddRAD: double digest restriction-site-associated DNA sequencing
Powerful when coupled with multiplexing
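How a double digest samples the genome can be illustrated with an in silico digest. This is a simplified sketch under stated assumptions: the function names are hypothetical, the cut position is taken to be the start of the recognition site (real enzymes cut at an offset within it), and the EcoRI (GAATTC) and MspI (CCGG) recognition sequences and the toy sequence are illustrative choices only.

```python
import re

def cut_sites(seq, site):
    """Start positions of every (possibly overlapping) occurrence of a recognition site."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]

def double_digest(seq, site_a, site_b, min_len, max_len):
    """Fragments bounded by one site of each enzyme and inside the size window."""
    cuts = sorted([(pos, "A") for pos in cut_sites(seq, site_a)] +
                  [(pos, "B") for pos in cut_sites(seq, site_b)])
    kept = []
    for (start, enz1), (end, enz2) in zip(cuts, cuts[1:]):
        # Keep only A-B or B-A fragments: A-A and B-B fragments lack one adaptor.
        if enz1 != enz2 and min_len <= end - start <= max_len:
            kept.append((start, end))
    return kept

# Toy sequence: GAATTC at 0 and 94, CCGG at 26 and 40.
toy = "GAATTC" + "A" * 20 + "CCGG" + "T" * 10 + "CCGG" + "A" * 50 + "GAATTC"
narrow = double_digest(toy, "GAATTC", "CCGG", 20, 40)
wide = double_digest(toy, "GAATTC", "CCGG", 20, 60)
```

Changing the size window changes how many fragments are retained, which is the lever ddRAD uses to tune the number of loci per individual.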
Enz.A Enz.A Enz.B
Index
Adaptor
Transcriptome sequencing
Total RNA: RNAseq
Reduce to mRNA only (polyA)
Figure: gene structure (5' UTR, exon, intron, exon, 3' UTR) and polyadenylated mRNA (5' UTR, CDS, 3' UTR, AAAAAAA), with the sites where primers anneal
Castoe et al. 2011. Rapid Microsatellite Identification from Illumina Paired-End Genomic Sequencing in Two Birds and a Snake. PLoS ONE. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030953
Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PLoS ONE. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135
Griffin et al. 2011. A next-generation sequencing method for overcoming the multiple gene copy problem in polyploid phylogenetics, applied to Poa grasses. BMC Biology 9: 19. http://www.biomedcentral.com/1741-7007/9/19
Bacterial communities
Bartram et al: Generation of multi-million 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl. Environ. Microbiol.
http://aem.asm.org/content/early/2011/04/01/AEM.02772-10
Objective: develop a protocol for community genetic diversity (test for two samples, costs for 20 samples)
Material:
Soils from arctic tundra
Total DNA extracted with FastDNA (MP Biomedicals)
Also includes a control bacterial mix in liquid media
20 X 5 $ = 100 $
Bacterial communities
Molecular steps
Primer for hypervariable region 3 (V3) of the microbial 16S rRNA
81 bp, PAGE-purified:
caagcagaagacggcatacgagatCGTGATgtgactggagttcagacgtgtgctcttccgatctATTACCGCGGCTGCTGG
25 X 67 $ = 1 675 $
Primer segments (5' to 3'): flow-cell-binding, index, Illumina primer, target gene
Amplify with a high-fidelity polymerase (Phusion)
Extract the desired length, 200-250 bp (columns)
Multiplexing: yes, including technical replicates
Quality control for libraries (e.g. Agilent Bioanalyzer)
Costs: 1 X 90 $ = 90 $; 25 X 1.5 $ = 40 $; 25 X 50 $ = 1 250 $; 1 X = 2 090 $
Total = 5 250 $
Bacterial communities
Sequence analyses
Bioinformatics:
Base calling and error estimation: Illumina Analysis Pipeline
Quality filtering, read sorting according to index sequence, contig assembly (custom made, PANDAseq)
Index seq.
Raw reads
Custom program
Discard pairs with:
1 or more mismatches between the two overlapping fragments of the paired-end reads
1 or more ambiguous bases
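The two discard rules can be sketched as a small filter. This is a simplified illustration, not the custom program or PANDAseq: the function names are hypothetical, and the sketch assumes the overlap length is already known, whereas a real assembler has to find the overlap itself.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def assemble_pair(read1, read2, overlap):
    """Return the merged contig, or None if the pair must be discarded."""
    if overlap <= 0 or overlap > min(len(read1), len(read2)):
        raise ValueError("overlap must be between 1 and the shorter read length")
    # Rule 2: one or more ambiguous bases in either read.
    if "N" in read1 or "N" in read2:
        return None
    read2_rc = reverse_complement(read2)
    # Rule 1: one or more mismatches in the overlapping region.
    if read1[-overlap:] != read2_rc[:overlap]:
        return None
    return read1 + read2_rc[overlap:]

# Toy pair from the 14 bp fragment ACGTTGCAAGGCTA, overlapping by 4 bases.
contig = assemble_pair("ACGTTGCAAG", "TAGCCTTG", 4)
mismatch = assemble_pair("ACGTTGCAAG", "TAGCCTTC", 4)
ambiguous = assemble_pair("ACGTNGCAAG", "TAGCCTTG", 4)
```

Such a strict filter trades read count for accuracy, which is consistent with roughly half of the raw reads being discarded in this study.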
Assignment of taxonomic affiliations: naive Bayesian classification (Ribosomal Database Project, RDP classifier), cutoff 0.5
Good's coverage for each library, to estimate sequence coverage: C = 1 - n1/N, where n1 is the number of clusters observed only once and N is the total number of sequences
CD-HIT to cluster the arctic tundra datasets at 97% sequence identity
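Good's coverage is simple enough to compute directly from cluster assignments. A minimal sketch, assuming each sequence has already been given a cluster label (the function name and the labels below are hypothetical):

```python
from collections import Counter

def goods_coverage(cluster_labels):
    """Good's coverage C = 1 - n1/N from one cluster label per sequence."""
    counts = Counter(cluster_labels)
    n_total = sum(counts.values())                             # N: all sequences
    if n_total == 0:
        raise ValueError("no sequences given")
    n_singletons = sum(1 for c in counts.values() if c == 1)   # n1: clusters seen once
    return 1 - n_singletons / n_total

# Toy library: 10 sequences, 2 of them in singleton clusters, so C = 1 - 2/10.
labels = ["otu1"] * 5 + ["otu2"] * 3 + ["otu3", "otu4"]
coverage = goods_coverage(labels)
```

A value close to 1 suggests that further sequencing of the same library would mostly recover clusters that have already been seen.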
Bacterial communities
Results
Total of 12 million raw reads; about 50% of the reads are discarded:
Raw reads: 7.6 million and 4.4 million for the two technical replicates
Post-assembly: 4.1 million and 2.4 million for the two technical replicates
Average post-assembly contig: 150 ± 11 bases (without primers); overlap: 66 ± 11 bases
Pre-clustering at 97% sequence identity
Estimated error rate (from the control library): 1 error per 5 contigs (1%/base), higher than Sanger sequencing
A contaminant was found in the growth medium of the control
Duplicate arctic tundra libraries displayed a high degree of similarity:
Phylum-level comparison between the two libraries (AT1 vs AT2): r = 0.999
The majority of sequence clusters (99.57%) were detected in both replicates
Also read: Jennings et al. 2011. Multiplexed Microsatellite Recovery Using Massively Parallel Sequencing. Molecular Ecology Resources. http://doi.wiley.com/10.1111/j.1755-0998.2011.03033.x
8 X 5 $ = 40 $ 8 X 160 $ = 1 280 $
Primers were not tested for amplifiability
Conclusions: large variation in the number and proportion of motifs (3-mer, 4-mer) among the different organisms
ddRAD-seq
Peterson et al. 2012. Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species. PLoS ONE. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0037135
Objective: aim at 10 000 SNPs, random and genome-wide, at 10X coverage
Material: total DNA extracted from 54 P. leucopus from one population (Qiagen kits)
54 X 5 $ = 270 $
ddRAD-seq
Complexity reduction: yes, complex (digestion, annealing, size selection, many purification steps)
Multiplexing: yes, 54 samples
Simulations that include genome size and nucleotide frequencies estimate that 400 000 reads per individual are needed for the 300 ± 30 bp fragments
2 X 2 010 $ = 4 020 $
ddRAD-seq
Figure: ddRAD library construct (adaptor P1, gDNA insert, adaptor P2); oligo labels: Oligo1.1, Oligo1.2 (one of 48), Oligo2.1
PCR primer 1 (46 bp): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACG
Many oligos; indexes are combined on the 5' and on the 3' ends
Digestions and PCR amplifications
No reference genome; the Stacks package is not used
Compute pairwise distances between all reads (BLAT)
MCL to group similar reads (ortholog inference)
Count unique sequences in a cluster (= locus); count how many are beyond the ploidy level (= % error-containing reads)
Align orthologs (MUSCLE)
Write alignments as reference-ordered SAM/BAM files
GATK UnifiedGenotyper to genotype
Error rate: ranged from 0.18 to 0.22% per nucleotide; 1/10 reads
Technical replicates? No
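The ploidy-based error estimate can be sketched as follows. This is one possible reading of the step, not the authors' code: within each cluster (one individual at one locus), the most abundant unique sequences up to the ploidy level are treated as true alleles, and reads carrying any further unique sequence are counted as error-containing. The function name and the toy clusters are hypothetical.

```python
from collections import Counter

def error_read_fraction(clusters, ploidy=2):
    """Fraction of reads whose sequence exceeds the ploidy level in its cluster."""
    total = 0
    excess = 0
    for reads in clusters:
        # Unique sequences ordered from most to least abundant.
        ranked = Counter(reads).most_common()
        total += len(reads)
        # Anything beyond the first `ploidy` unique sequences cannot be a true allele.
        excess += sum(count for _, count in ranked[ploidy:])
    if total == 0:
        raise ValueError("no reads given")
    return excess / total

# Toy data for a diploid: the first cluster has a third, rare sequence (1 read).
clusters = [
    ["AAT"] * 6 + ["AAC"] * 3 + ["AGT"],
    ["GGC"] * 10,
]
fraction = error_read_fraction(clusters, ploidy=2)
```

This only detects errors that create extra sequences in a cluster, so it is best read as a lower bound on the share of error-containing reads.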
ddRAD-seq results
54 wild Peromyscus from the same population
Total reads: not reported (~2 X 21 million)
Assigned to an individual: not reported
Discard 5.4% of reads
SNPs discovered:
Variable regions (loci): 6 200 found
Polymorphic sites for >70% of individuals: 16 000 sites found
In an analysis on samples from different populations:
SNPs in multi-SNP loci: >80%
These multi-SNP loci are usually excluded in other analyses
Amplify 3 cp genes (rpl32-trnL, rpoB-trnC and trnH-psbA) and two nuclear genes (DMC1 and CDO504) from each of the 60 samples
Target amplicons are < 500 bp
Primers: 10 X 5 $ = 50 $
Enzymes: 1 X 150 $ = 150 $
A - TitaniumAdapterB
A + CTGAGCGGGCTGGCAAGGCGCATAG GACTCGCCCGACCGTTCCGCGTATC
Estimate the error rate from SNPs called in the chloroplast regions
Detect and discard PCR recombinants:
Alleles that occurred at <5% for a species, OR
Both ends of the allele do not match the same common allele
Percentage of useful reads gained for each nuclear gene copy and allele, including recombinant reads.
To remember
Diversity of protocols and experimental designs (Be creative!)
Budget: 3 000 $ to 9 000 $; the main costs can be primers and library preparation
Standards are rapidly increasing:
Technical replicates required
Results:
Half of the reads are discarded
Many target loci will be missing
Unequal proportions of technical replicates in the final dataset
Prone to PCR recombination and chimeric assemblies
Comparison
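Fields: study; platform; Nb samples; total cost; % reads retained; Nb clean unique reads; tech. reps; error rate; missed target
Arctic soil: 24 samples; total cost 5 250 $; 53 % reads retained; tech. reps: yes; error rate 1%; missed target: NA
Microsat: samples ? (? 8); tech. reps: no; NA
SNP (ddRAD): 54 samples; total cost 9 000 $; 95% reads retained; tech. reps: no
Polyploids: GS FLX, plate; 61 samples; total cost 4 900 $; 58 % reads retained; 70 601 clean unique reads; tech. reps: yes; error rate 0.13%; missed target 12 %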
Thank you!