Leveraging Massively Parallel Pyrosequencing For Functional and Evolutionary Genomics in Ferns.

Joshua Der and Paul Wolf
Dept. Biology and Center for Integrated Biosystems,
Utah State University
Genomics in non-model organisms
• Traditional genome sequencing methods are cost prohibitive, especially for large genomes
• Expressed Sequence Tags (ESTs) are a genome-wide proxy for functional components of the genome
• New-generation sequencing technologies: low cost per base, massively high throughput
• Roche 454 pyrosequencing: long read lengths, enabling de novo genome assembly in non-model organisms
Roche 454 Pyrosequencing
• DNA fragmentation (nebulization)
• Fragment size selection and adapter ligation
• Single stranded DNA library isolation
• Bind library molecule to sequencing bead (1 molecule/bead)
• Clonal amplification in emulsion PCR
• Bead deposition in PicoTiterPlate
• Pyrosequencing and image processing

Fern evolution
• Sister clade to seed plants
• Evolved and maintained independent, photosynthetic gametophyte and sporophyte generations
[Figure: land plant phylogeny with tip labels Seed Plants, Ferns, Lycophytes, Bryophytes]

Alternation of generations
[Figure: fern life cycle. Labels: zygote (2n), meiosis, haploid spores (n), egg (n), sperm (n), syngamy]
Fern genetics
• Recessive alleles are not masked in haploid
gametophytes
• Gametophytes and sporophytes can be
vegetatively propagated
• Controlled crosses and double haploid lines
(selfed individuals are homozygous at all loci)
• Gametic-phase segregation and recombination
can be directly observed from gametophytes
• Genome function can be examined in haploid
and diploid phases independently
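
The point about doubled haploid lines can be illustrated with a toy simulation. This is a minimal sketch, assuming a hypothetical model in which a sporophyte genotype is a list of allele pairs and loci segregate independently; the function names are illustrative only.

```python
import random

def meiosis(sporophyte, rng):
    """Return one haploid spore: one allele drawn per locus (loci assumed unlinked)."""
    return [rng.choice(pair) for pair in sporophyte]

def self_gametophyte(gametophyte):
    """Selfing of a single gametophyte: egg and sperm are mitotic copies of the
    same haploid genotype, so the zygote has two identical alleles per locus."""
    return [(allele, allele) for allele in gametophyte]

def is_homozygous(sporophyte):
    """True if both alleles match at every locus."""
    return all(a == b for a, b in sporophyte)

rng = random.Random(1)
parent = [("A", "a"), ("B", "b"), ("C", "c")]  # heterozygous at every locus
spore = meiosis(parent, rng)                   # haploid gametophyte genotype
doubled_haploid = self_gametophyte(spore)      # sporophyte from a selfed gametophyte

print(spore, doubled_haploid, is_homozygous(doubled_haploid))
```

Whatever alleles the spore happens to carry, the selfed sporophyte is homozygous at all loci, which is why a single generation of selfing is enough to produce a fully inbred line.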
Challenges in fern genetics
• Large genome sizes (avg. 10 Gb; humans = 3.2 Gb; Arabidopsis = 0.157 Gb)
• Large chromosome numbers (avg. n=57; n=700 in Ophioglossum)
• History of polyploidy and hybridization
• Linkage map and ESTs for Ceratopteris
• No fern genomes have been sequenced or funded
Bracken fern,
Pteridium aquilinum
• Worldwide distribution
• Economically important
• Highly adaptable and phenotypically plastic
• Well established culture techniques
• Model for fern gametophyte development
and pheromonal sex determination
• Ancient polyploid with diploid gene
expression (isozymes)
• Genome size: 1C = 9.8 Gb
Research Questions: Genome
• Are we able to determine complete chloroplast and mitochondrial genome sequences?
• How much of the genome is composed of repetitive sequences (transposable elements, SSRs)?
• What proportion of the genome is protein coding?

Research Questions: Transcriptome
• What genes and ontology groups are expressed in the reproductive haploid phase in ferns?
• Are any of these genes homologous with reproductive genes in bryophytes or flowering plants?
• Is there a signature of past polyploidy in the transcriptome?
• Do genes expressed in the haploid phase experience purifying selection?
The data: Genome
• Total genomic sequences derived from DNA extracted in CTAB and purified on a CsCl gradient
• Combination of 454 Standard FLX and Titanium
• 711,178 reads
• 216.19 Mb total sequence
[Figure: histogram of genomic 454 read lengths. x-axis: read length in bp (maximum = 1363); y-axis: number of reads]
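
To put these totals in context, the short calculation below derives the mean read length and the approximate depth of coverage. It is a back-of-the-envelope sketch that uses only numbers stated on the slides (the read count, the total sequence, and the 1C genome size of 9.8 Gb reported for bracken); the variable names are illustrative.

```python
total_sequence_bp = 216.19e6  # total genomic 454 sequence
n_reads = 711_178             # number of genomic reads
genome_size_bp = 9.8e9        # 1C genome size of Pteridium aquilinum

mean_read_length = total_sequence_bp / n_reads
coverage = total_sequence_bp / genome_size_bp

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"genome coverage:  {coverage:.3f}x")
```

The reads average about 304 bp and sample roughly 2% of the genome (about 0.02x), so this is a low-coverage genome survey and not a full genome assembly.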

Genome: assembly
# reads assembled: 294,497
# contigs: 76,866
# singletons: 1,500
Total contigs + singletons: 78,366
Average contig size: 476.91 bp
Largest contig size: 52,181 bp
Total consensus: 37.37 Mb
[Figure: histogram of genome assembly (MIRA). x-axis: sequence length in bp; y-axis: number of sequences]
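
Summary statistics of this kind are easy to recompute from a list of sequence lengths. The sketch below is illustrative only: `assembly_summary` is a hypothetical helper, the toy lengths are not the bracken data, and the last line simply cross-checks the totals reported on the slide.

```python
def assembly_summary(lengths):
    """Summarize a list of assembled sequence lengths (in bp)."""
    total = sum(lengths)
    return {
        "n_sequences": len(lengths),
        "total_bases": total,
        "mean_length": total / len(lengths),
        "largest": max(lengths),
    }

# Toy lengths for illustration only (not the bracken assembly).
toy_lengths = [120, 480, 950, 52181]
summary = assembly_summary(toy_lengths)
print(summary)

# Cross-check of the reported numbers: sequences x mean length ~ total consensus.
reported_total_mb = 78_366 * 476.91 / 1e6
print(f"{reported_total_mb:.2f} Mb")
```

The cross-check gives 37.37 Mb, which agrees with the reported total consensus.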
Genome: chloroplast genome
Ribosomal RNAs 4
Transfer RNAs 29
Photosystem I 5
Photosystem II 15
Cytochrome 6
ATP synthase 6
Rubisco 1
Chlorophyll biosynthesis 3
NADH dehydrogenase 11
Ribosomal proteins 22
RNA polymerase 4
Miscellaneous proteins 5
Hypothetical proteins 4
Pseudogenes 2
Genome: gene content
• Gene finder: GlimmerHMM (trained on Arabidopsis)
• 7.27 Mb are putative exons (19.46%)
[Figure: histogram of exon size. x-axis: exon length in bp; y-axis: frequency]
[Figure: pie chart. Exon 19.46%, noncoding 80.54%]
Genome: microsatellites
Repeat motif length      #
dinucleotide          5564
trinucleotide          470
tetranucleotide         15
pentanucleotide         78
hexanucleotide           1
Total                 6128
[Figure: histogram of microsatellite repeats. x-axis: number of repeats; y-axis: frequency]
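
The microsatellite scan can be sketched with a regular expression that finds perfect tandem repeats of 2 to 6 bp motifs. This is not the SSRIT implementation: the function names and the minimum of 5 repeat units are assumptions chosen for illustration, and the thresholds used for the counts above are not stated on the slides.

```python
import re

MOTIF_LENGTHS = (2, 3, 4, 5, 6)  # dinucleotide to hexanucleotide

def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit ('ATAT' and 'AA' are not)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))

def find_ssrs(seq, min_repeats=5):
    """Return sorted (start, motif, n_repeats) for perfect tandem repeats.

    min_repeats is an assumed threshold, not a value taken from the slides.
    """
    seq = seq.upper()
    hits = []
    for k in MOTIF_LENGTHS:
        # A k-base motif followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. (ATAT)n, already counted as (AT)n
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# Toy sequence: an (AT)6 and a (GAC)5 repeat embedded in short flanks.
demo = "GG" + "AT" * 6 + "CCGT" + "GAC" * 5 + "TT"
hits = find_ssrs(demo)
print(hits)  # [(2, 'AT', 6), (18, 'GAC', 5)]
```

Filtering out non-primitive motifs keeps each repeat tract in a single motif-length class, which matters when tallying counts by class as in the table above.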

The data: Transcriptome
• Full-length enriched, normalized cDNA sequences derived from mature gametophyte total RNA
• Reads were vector screened and quality trimmed:
• 681,722 reads
• 254.00 Mb total sequence
[Figure: histogram of cleaned reads. x-axis: cleaned read length in bp (maximum = 624); y-axis: number of sequences]

Transcriptome: assembly
                         MIRA (1º)    CAP3 (2º)
# 2º contigs:                    0        5,905
# 1º contigs:               50,020       32,801
# singletons:                  638          183
# unigenes:                 50,658       38,889
mean unigene size:        637.7 bp     685.8 bp
largest unigene size:     4,489 bp     4,897 bp
total consensus:          32.30 Mb     26.67 Mb
[Figure: histogram of transcriptome unigenes (CAP3). x-axis: unigene length in bp; y-axis: number of sequences]
Transcriptome: BLAST
• 38,889 unigenes blasted against NCBI nr database (blastx)
• E-value cutoff: 1.0E-10
• 17,788 unigenes had no match in the database
[Figure: E-value distribution. x-axis: E-value (1e-X); y-axis: hits]
[Figure: sequence similarity distribution. x-axis: #positives/alignment-length (%); y-axis: hits]
[Figure: pie chart. Positive BLAST hit 54%, no BLAST hit 46%, no BLAST result 0%]
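
The E-value filter can be sketched as follows. This assumes tab-delimited BLAST output in the standard 12-column layout (query ID in column 1, subject ID in column 2, E-value in column 11); the slides do not say which output format was used, and the IDs in the example are hypothetical.

```python
def best_hits(tabular_lines, evalue_cutoff=1e-10):
    """Keep the lowest-E-value hit per query at or below the cutoff."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_cutoff and (query not in best or evalue < best[query][1]):
            best[query] = (subject, evalue)
    return best

# Hypothetical records: unigene_1 has two passing hits, unigene_2 has none.
records = [
    ["unigene_1", "protA", "80.0", "100", "20", "0", "1", "300", "1", "100", "1e-30", "120"],
    ["unigene_1", "protB", "60.0", "90", "36", "0", "1", "270", "5", "94", "1e-12", "60"],
    ["unigene_2", "protC", "40.0", "50", "30", "0", "1", "150", "9", "58", "1e-5", "40"],
]
hits = best_hits("\t".join(r) for r in records)
print(hits)

# Cross-check of the reported numbers: share of unigenes without a hit.
no_hit_fraction = 17_788 / 38_889
print(f"{no_hit_fraction:.1%}")
```

The cross-check gives about 45.7%, consistent with the 46% "no BLAST hit" slice of the pie chart.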
Transcriptome: BLAST
[Figure: HSP/HIT coverage distribution (distribution of full length transcripts). x-axis: HSP/HIT coverage in %; y-axis: hits]

Transcriptome: BLAST top-hit species distribution
[Figure: bar chart of BLAST hits per top-hit species. x-axis: BLAST hits (0 to 5,000); species listed as in the chart]
Physcomitrella patens
Vitis vinifera
Picea sitchensis
Ricinus communis
Populus trichocarpa
Arabidopsis thaliana
Oryza sativa
Sorghum bicolor
Glycine max
Zea mays
Gossypium hirsutum
Medicago truncatula
unknown
Adiantum capillus-veneris
Ceratopteris richardii
Nicotiana tabacum
Marchantia polymorpha
Solanum tuberosum
Chlamydomonas reinhardtii
Alsophila spinulosa
Ginkgo biloba
Micromonas sp.
Pteris vittata
Elaeis guineensis
Pinus taeda
Solanum lycopersicum
Micromonas pusilla
Triticum aestivum
Gossypium barbadense
others
Transcriptome: GO annotation
[Figures: direct GO count (number of sequences per GO term) for each of three ontologies. Terms are listed as in the charts; truncated labels are kept as extracted]

Biological Process
P:cellular process
P:metabolic process
P:transport
P:biosynthetic process
P:protein modificati...
P:protein metabol...
P:cellular compone...
P:transcription
P:translation
P:response to stress
P:nucleobas...
P:generation ...
P:carbohydra...
P:catabolic process
P:cellular amino ac...
P:signal transduction
P:lipid metabol...
P:photosynthesis
P:biological_process
P:response to abiot...

Cellular Component
C:plastid
C:membrane
C:mitochondrion
C:cytoplasm
C:nucleus
C:plasma membrane
C:intracellular
C:thylakoid
C:ribosome
C:cytosol
C:extracellular region
C:endoplasm...
C:cell wall
C:vacuole
C:cell
C:cytoskeleton
C:Golgi apparatus
C:nucleolus
C:peroxisome
C:cellular_component

Molecular Function
F:binding
F:catalytic activity
F:nucleotide binding
F:hydrolase activity
F:protein binding
F:transferase activity
F:kinase activity
F:transporter activity
F:DNA binding
F:structural molecu...
F:nucleic acid binding
F:RNA binding
F:molecular_function
F:transcription fact...
F:signal transduc...
F:receptor activity
F:transcripti...
F:translation fact...
F:nuclease activity
F:enzyme regulat...
Transcriptome: paleopolyploidy
• Evaluate the distribution of synonymous substitution rates (Ks) for duplicate gene pairs
• Ks is a proxy for time
• Constant birth-death rate of duplicate genes results in an exponential decrease in frequency over time
• Mixture model analysis to separate significant components of the distribution
• Significant low peak at Ks = 1.20
[Figure: histogram of Ks values for duplicate gene pairs. x-axis: Ks value (0.0 to 2.0); y-axis: frequency]
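
The idea of separating components of a Ks distribution can be sketched with a small expectation-maximization (EM) fit of a two-component Gaussian mixture. This is an illustration only: the slides do not name the mixture-model software or the component distributions, the data here are synthetic, and a real analysis would also model the exponential background of young duplicates and choose the number of components statistically.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_two_gaussians(data, mu=(0.2, 1.0), sigma=(0.3, 0.3), weight=(0.5, 0.5), n_iter=200):
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, sigmas)."""
    mu, sigma, weight = list(mu), list(sigma), list(weight)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pk / total for pk in p])
        # M-step: re-estimate each component from its weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
            weight[k] = nk / len(data)
    return weight, mu, sigma

# Synthetic Ks values: a young-duplicate component plus an older peak near 1.2.
rng = random.Random(0)
ks_values = [abs(rng.gauss(0.3, 0.1)) for _ in range(500)]
ks_values += [abs(rng.gauss(1.2, 0.15)) for _ in range(300)]

weights, means, sigmas = fit_two_gaussians(ks_values)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

A component whose mean sits well away from zero, like the second one here, is the kind of signal that is interpreted as a burst of duplication such as an ancient polyploidy event.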
Future work
• Try to extract complete mitochondrial sequence
• Train gene finders with transcriptome data
• Sequence sporophyte transcriptome
• Screen population samples for microsatellite variation
• Population genetics: gene flow, demographics, population
structure
• Test for selective sweeps to ID candidate genes for adaptation,
reproductive isolation, and speciation
• Develop linkage maps from controlled crosses
• RNA-seq to measure expression levels
Acknowledgements
• Utah State University
• Aaron Duffy - gametophyte culture advice
• Mike Pfrender - lab space and equipment for RNA extraction
• Center for Integrated Biosystems -
CIBR students grant for transcriptome sequencing and
complimentary 454 Titanium genome sequencing
• VP for Research - additional funds for transcriptome
sequencing
• University of British Columbia
• Mike Barker - 454 advice and transcriptome assembly
• Katrina Dlugosch - transcriptome sequence cleaning scripts
• Pennsylvania State University
• Claude dePamphilis - transcriptome analysis
• Norman Wickett - transcriptome analysis
• Sara Elgin, HHMI Genomics Education Partnership, Washington
Univ.- funding and sequencing 454 FLX standard genomic data
• Jeff Boore, Genome Solutions - genome sequencing funding.
• Keithanne Mockaitis, Center for Genomics and Bioinformatics,
Indiana University - transcriptome library preparation and
sequencing
Data processing:
Genome
1. de novo sequence assembly: MIRA
2. BLASTn search against fern chloroplast genomes
to identify chloroplast contigs. In silico finishing.
3. Identify putative exons: GlimmerHMM
4. Identify microsatellites: SSRIT
5. Identify transposable elements: REPCLASS
6. Summaries and statistical analyses
Data processing:
Transcriptome
1. Adapter and primer trimming: SeqClean and Snowhite.pl
(custom script by Katrina Dlugosch)
2. de novo sequence assembly: MIRA
3. Secondary assembly: CAP3
4. BLASTx against nr to find homologous proteins
5. Functional annotation (GO) based on homology transfer
from BLAST hits: blast2go
6. Functional description of gametophyte transcriptome
7. Ks analysis of duplicate genes and past polyploidy
8. Summaries and statistical analyses
