Leveraging massively parallel pyrosequencing for functional and evolutionary genomics in ferns
Joshua Der and Paul Wolf
Dept. Biology and Center for Integrated Biosystems,
Utah State University
Genomics in non-model organisms
• Traditional genome sequencing methods are cost-prohibitive,
especially for large genomes
• Expressed Sequence Tags (ESTs) are a genome-wide
proxy for functional components of the genome
• New-generation sequencing technologies:
low cost per base, massively high throughput
• Roche 454 pyrosequencing: long read lengths, enabling
de novo genome assembly in non-model organisms
Roche 454 Pyrosequencing
• DNA fragmentation (nebulization)
• Fragment size selection and adapter ligation
• Single-stranded DNA library isolation
• Bind library molecule to sequencing bead
(1 molecule/bead)
• Clonal amplification in emulsion PCR
• Bead deposition in PicoTiterPlate
• Pyrosequencing and image processing


Fern evolution
• Sister clade to seed plants
• Evolved and maintained independent,
photosynthetic gametophyte and
sporophyte generations
[Figure: land plant phylogeny with tips labeled Seed Plants, Ferns, Lycophytes, Bryophytes]
Alternation of generations
[Figure: fern life cycle diagram with labels zygote (2n), meiosis, haploid spores (n), egg (n), sperm (n), syngamy]
Fern genetics
• Recessive alleles are not masked in haploid
gametophytes
• Gametophytes and sporophytes can be
vegetatively propagated
• Controlled crosses and doubled haploid lines
(selfed individuals are homozygous at all loci)
• Gametic-phase segregation and recombination
can be directly observed from gametophytes
• Genome function can be examined in haploid
and diploid phases independently
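The homozygosity point above can be illustrated with a toy simulation. This is a minimal sketch, assuming three unlinked, hypothetical loci (A/a, B/b, C/c) and intragametophytic selfing, where egg and sperm are both produced by mitosis from the same haploid gametophyte. The function names are illustrative and do not come from the talk.

```python
import random

def meiosis(sporophyte, rng):
    """Return one haploid spore: one allele per locus (unlinked loci assumed)."""
    return [rng.choice(pair) for pair in sporophyte]

def self_gametophyte(gametophyte):
    """Egg and sperm come from the same haploid plant, so both carry the same allele."""
    return [(allele, allele) for allele in gametophyte]

rng = random.Random(0)
# Hypothetical diploid parent, heterozygous at three loci.
parent = [("A", "a"), ("B", "b"), ("C", "c")]
spore = meiosis(parent, rng)           # haploid genotype of the gametophyte
offspring = self_gametophyte(spore)    # diploid sporophyte from selfing
is_fully_homozygous = all(a == b for a, b in offspring)
```

Whatever alleles the spore happens to inherit, the selfed sporophyte carries two identical copies at every locus, which is why a single generation of selfing yields a doubled haploid line.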
Challenges in fern genetics
• Large genome sizes (avg. 10 Gb;
humans = 3.2 Gb; Arabidopsis = 0.157 Gb)
• Large chromosome numbers (avg. n=57;
n=700 in Ophioglossum)
• History of polyploidy and hybridization
• Linkage map and ESTs for Ceratopteris
• No fern genomes have been sequenced or funded
Bracken fern,
Pteridium aquilinum
• Worldwide distribution
• Economically important
• Highly adaptable and phenotypically plastic
• Well established culture techniques
• Model for fern gametophyte development
and pheromonal sex determination
• Ancient polyploid with diploid gene
expression (isozymes)
• Genome size: 1C = 9.8 Gb
Research Questions: Genome
• Are we able to determine complete chloroplast
and mitochondrial genome sequences?
• How much of the genome is composed of
repetitive sequences (transposable elements, SSRs)?
• What proportion of the genome is protein coding?


Research Questions: Transcriptome
• What genes and ontology groups are expressed in
the reproductive haploid phase in ferns?
• Are any of these genes homologous with
reproductive genes in bryophytes or flowering plants?
• Is there a signature of past polyploidy in the
transcriptome?
• Do genes expressed in the haploid phase experience
purifying selection?
The data: Genome
Total genomic sequences
derived from DNA extracted
in CTAB and purified on a
CsCl gradient
• Combination of 454
Standard FLX and Titanium
• 711,178 reads
• 216.19 Mb total sequence
[Figure: histogram of genomic 454 read lengths. x-axis: read length (maximum = 1363); y-axis: number of reads]
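The read count, total sequence, and maximum read length reported above are the kind of summary that can be computed directly from a FASTA file. The following is a minimal sketch, assuming plain FASTA text as input; the function name `read_length_summary` and the toy reads are hypothetical, and this is not the script used for the talk.

```python
def read_length_summary(fasta_text):
    """Summarize read lengths from FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequences may wrap over several lines
    if current is not None:
        lengths.append(current)
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }

# Toy input (made up for illustration only).
toy_reads = ">r1\nACGTACGT\n>r2\nACG\nTT\n>r3\nA\n"
summary = read_length_summary(toy_reads)

# Mean read length implied by the slide's totals: 216.19 Mb over 711,178 reads.
implied_mean_bp = 216.19e6 / 711_178
```

Dividing the reported totals gives a mean genomic read length of roughly 300 bp, which is plausible for a mix of Standard FLX and Titanium runs.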


Genome: assembly
# reads assembled: 294,497
# contigs: 76,866
# singletons: 1,500
Total contigs + singletons: 78,366
Average contig size: 476.91 bp
Largest contig size: 52,181 bp
Total consensus: 37.37 Mb
[Figure: histogram of genome assembly (MIRA). x-axis: sequence length (largest contig = 52181 bp); y-axis: number of sequences]
Genome: chloroplast genome
Ribosomal RNAs 4
Transfer RNAs 29
Photosystem I 5
Photosystem II 15
Cytochrome 6
ATP synthase 6
Rubisco 1
Chlorophyll biosynthesis 3
NADH dehydrogenase 11
Ribosomal proteins 22
RNA polymerase 4
Miscellaneous proteins 5
Hypothetical proteins 4
Pseudogenes 2
Genome: gene content
• Gene finder: GlimmerHMM
(trained on Arabidopsis)
• 7.27 Mb are putative exons
(19.46%)
[Figure: histogram of exon size. x-axis: exon length; y-axis: frequency]
[Figure: pie chart. Exon 19.46%, Noncoding 80.54%]
Genome: microsatellites
Repeat motif length #
dinucleotide 5564
trinucleotide 470
tetranucleotide 15
pentanucleotide 78
hexanucleotide 1
Total: 6128
[Figure: histogram of microsatellite repeats. x-axis: number of repeats; y-axis: frequency]
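Counts by motif length like those above come from scanning the assembly for perfect tandem repeats. The sketch below shows one way to do this with a regular expression. It is not SSRIT, and the threshold of five repeats (`min_repeats=5`) is an assumption for illustration, since the slides do not state the settings used.

```python
import re
from collections import Counter

MOTIF_CLASSES = {2: "dinucleotide", 3: "trinucleotide", 4: "tetranucleotide",
                 5: "pentanucleotide", 6: "hexanucleotide"}

def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d)
                   for d in range(1, n))

def find_microsatellites(seq, min_repeats=5):
    """Yield (start, motif, n_repeats) for perfect tandem repeats of 2-6 bp motifs."""
    seq = seq.upper()
    for k in MOTIF_CLASSES:
        # One motif copy followed by at least (min_repeats - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. 'ATAT', already counted as 'AT'
                yield m.start(), motif, len(m.group(0)) // k

def count_by_class(seq, min_repeats=5):
    """Tally repeats by motif length class."""
    return Counter(MOTIF_CLASSES[len(motif)]
                   for _, motif, _ in find_microsatellites(seq, min_repeats))

# Toy sequence (made up): one (AT)8 repeat and one (GAA)6 repeat.
toy = "GGC" + "AT" * 8 + "CCGTC" + "GAA" * 6 + "TTC"
counts = count_by_class(toy)
```

Homopolymer runs and motifs that are multiples of a shorter unit are skipped so that each repeat is counted once, under its shortest motif. Imperfect or compound repeats are not handled by this sketch.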


The data: Transcriptome
Full-length enriched,
normalized cDNA sequences
derived from mature
gametophyte total RNA
Reads were vector screened
and quality trimmed:
• 681,722 reads
• 254.00 Mb total sequence
[Figure: histogram of cleaned reads. x-axis: cleaned read length (maximum = 624); y-axis: number of sequences]


Transcriptome: assembly
                        MIRA (1º)    CAP3 (2º)
# 2º contigs:           0            5,905
# 1º contigs:           50,020       32,801
# singletons:           638          183
# unigenes:             50,658       38,889
mean unigene size:      637.7 bp     685.8 bp
largest unigene size:   4,489 bp     4,897 bp
total consensus:        32.30 Mb     26.67 Mb
[Figure: histogram of transcriptome unigenes (CAP3). Total unigenes = 38889; mean length = 685.76 bp; total bases = 26.67 Mb. x-axis: unigene length (largest transcript = 4897 bp); y-axis: number of sequences]
Transcriptome: BLAST
• 38,889 unigenes blasted against
NCBI nr database (blastx)
• E-value cutoff: 1.0E-10
• 17,788 unigenes had no match
in the database
[Figure: E-value distribution. x-axis: E-value (1e-X); y-axis: hits]
[Figure: sequence similarity distribution. x-axis: #positives/alignment-length; y-axis: hits]
[Figure: pie chart. Positive BLAST hit 54%, No BLAST hit 46%, No BLAST result 0%]
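Applying the E-value cutoff amounts to filtering the BLAST output and keeping the best hit per unigene. The sketch below assumes the standard 12-column BLAST tabular format, where the E-value is the eleventh column, and assumes that hits with an E-value at or below 1e-10 are kept. The toy rows and the function name `best_hits` are hypothetical; the talk itself used blast2go for this step.

```python
def best_hits(tabular_text, evalue_cutoff=1e-10):
    """Map each query ID to its lowest-E-value hit that passes the cutoff."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > evalue_cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

# Toy BLAST rows (made up for illustration only).
toy = "\n".join([
    "u1\tprotA\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-50\t200",
    "u1\tprotB\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-12\t80",
    "u2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-3\t35",
])
hits = best_hits(toy)

# Arithmetic behind the reported split of unigenes with and without a hit.
unigenes_with_hit = 38_889 - 17_788
percent_with_hit = 100 * unigenes_with_hit / 38_889
```

With the numbers reported above, 21,101 of 38,889 unigenes had a hit, which matches the 54% and 46% split shown in the pie chart.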
Transcriptome: BLAST
[Figure: HSP/HIT coverage distribution (distribution of full length transcripts). x-axis: HSP/HIT coverage in %; y-axis: hits]


Transcriptome: BLAST
Top-Hit species distribution
[Figure: bar chart of top-hit species. x-axis: BLAST hits (0 to 5,000). Species listed:]
Physcomitrella patens
Vitis vinifera
Picea sitchensis
Ricinus communis
Populus trichocarpa
Arabidopsis thaliana
Oryza sativa
Sorghum bicolor
Glycine max
Zea mays
Gossypium hirsutum
Medicago truncatula
unknown
Adiantum capillus-veneris
Ceratopteris richardii
Nicotiana tabacum
Marchantia polymorpha
Solanum tuberosum
Chlamydomonas reinhardtii
Alsophila spinulosa
Ginkgo biloba
Micromonas sp.
Pteris vittata
Elaeis guineensis
Pinus taeda
Solanum lycopersicum
Micromonas pusilla
Triticum aestivum
Gossypium barbadense
others
Transcriptome: GO annotation
Biological Process
[Figure: bar chart of direct GO counts. x-axis: #Seqs (0 to 2,500); y-axis: GO term. Terms listed:]
P:cellular process
P:metabolic process
P:transport
P:biosynthetic process
P:protein modificati...
P:protein metabol...
P:cellular compone...
P:transcription
P:translation
P:response to stress
P:nucleobas...
P:generation ...
P:carbohydra...
P:catabolic process
P:cellular amino ac...
P:signal transduction
P:lipid metabol...
P:photosynthesis
P:biological_process
P:response to abiot...
Transcriptome: GO annotation
Cellular Component
[Figure: bar chart of direct GO counts. x-axis: #Seqs (0 to 3,500); y-axis: GO term. Terms listed:]
C:plastid
C:membrane
C:mitochondrion
C:cytoplasm
C:nucleus
C:plasma membrane
C:intracellular
C:thylakoid
C:ribosome
C:cytosol
C:extracellular region
C:endoplasm...
C:cell wall
C:vacuole
C:cell
C:cytoskeleton
C:Golgi apparatus
C:nucleolus
C:peroxisome
C:cellular_component
Transcriptome: GO annotation
Molecular Function
[Figure: bar chart of direct GO counts. x-axis: #Seqs (0 to 3,500); y-axis: GO term. Terms listed:]
F:binding
F:catalytic activity
F:nucleotide binding
F:hydrolase activity
F:protein binding
F:transferase activity
F:kinase activity
F:transporter activity
F:DNA binding
F:structural molecu...
F:nucleic acid binding
F:RNA binding
F:molecular_function
F:transcription fact...
F:signal transduc...
F:receptor activity
F:transcripti...
F:translation fact...
F:nuclease activity
F:enzyme regulat...
Transcriptome: paleopolyploidy
• Evaluate the distribution of
synonymous substitution rates
(Ks) for duplicate gene pairs
• Ks is a proxy for time
• Constant birth-death rate of
duplicate genes results in an
exponential decrease in
frequency over time.
• Mixture model analysis to
separate significant components
of the distribution
[Figure: histogram of Ks values for duplicate gene pairs. x-axis: Ks value (0.0 to 2.0); y-axis: frequency]
• Significant low peak at Ks = 1.20
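The logic of the Ks analysis can be sketched with a small mixture-model fit. The example below is a toy illustration only: it simulates Ks values as an exponential background (constant gene birth and death) plus one normal component (a burst of duplicates from a polyploidy event), then recovers the components with a simple EM loop. The component shapes, starting values, and simulated parameters are all assumptions, and this is not the mixture-model software or the data used in the talk.

```python
import math
import random

def fit_exp_plus_normal(ks, n_iter=200):
    """EM fit of a two-component mixture: exponential background + one normal peak."""
    n = len(ks)
    # Crude starting values (assumptions): equal weights, peak at the median.
    w = 0.5                      # weight of the normal component
    rate = n / sum(ks)           # rate of the exponential background
    mu = sorted(ks)[n // 2]
    sigma = 0.5
    for _ in range(n_iter):
        # E-step: probability that each Ks value belongs to the normal peak.
        resp = []
        for x in ks:
            p_exp = (1 - w) * rate * math.exp(-rate * x)
            p_norm = (w * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                      / (sigma * math.sqrt(2 * math.pi)))
            resp.append(p_norm / (p_exp + p_norm))
        # M-step: weighted maximum-likelihood updates.
        r_sum = sum(resp)
        w = r_sum / n
        mu = sum(r * x for r, x in zip(resp, ks)) / r_sum
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, ks)) / r_sum
        sigma = max(math.sqrt(var), 1e-3)
        rate = (n - r_sum) / sum((1 - r) * x for r, x in zip(resp, ks))
    return {"weight": w, "mu": mu, "sigma": sigma, "rate": rate}

# Simulated Ks values (NOT the bracken data): 600 background duplicates
# and 400 duplicates from a hypothetical event centered at Ks = 1.2.
rng = random.Random(42)
ks_values = ([rng.expovariate(4.0) for _ in range(600)]
             + [rng.gauss(1.2, 0.2) for _ in range(400)])
fit = fit_exp_plus_normal(ks_values)
```

A peak that stands out from the exponential background, as the normal component does here, is the kind of signature interpreted as evidence of an ancient genome duplication.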
Future work
• Try to extract complete mitochondrial sequence
• Train gene finders with transcriptome data
• Sequence sporophyte transcriptome
• Screen population samples for microsatellite variation
• Population genetics: gene flow, demographics, population
structure
• Test for selective sweeps to ID candidate genes for adaptation,
reproductive isolation, and speciation
• Develop linkage maps from controlled crosses
• RNA-seq to measure expression levels
Acknowledgements
• Utah State University
• Aaron Duffy - gametophyte culture advice
• Mike Pfrender - lab space and equipment for RNA extraction
• Center for Integrated Biosystems -
CIBR students grant for transcriptome sequencing and
complimentary 454 Titanium genome sequencing
• VP for Research - additional funds for transcriptome
sequencing
• University of British Columbia
• Mike Barker - 454 advice and transcriptome assembly
• Katrina Dlugosch - transcriptome sequence cleaning scripts
• Pennsylvania State University
• Claude dePamphilis - transcriptome analysis
• Norman Wickett - transcriptome analysis
• Sara Elgin, HHMI Genomics Education Partnership, Washington
Univ. - funding and sequencing of 454 FLX standard genomic data
• Jeff Boore, Genome Solutions - genome sequencing funding.
• Keithanne Mockaitis, Center for Genomics and Bioinformatics,
Indiana University - transcriptome library preparation and
sequencing
Data processing:
Genome
1. de novo sequence assembly: MIRA
2. BLASTn search against fern chloroplast genomes
to identify chloroplast contigs. In silico finishing.
3. Identify putative exons: GlimmerHMM
4. Identify microsatellites: SSRIT
5. Identify transposable elements: REPCLASS
6. Summaries and statistical analyses
Data processing:
Transcriptome
1. Adapter and primer trimming: SeqClean and Snowhite.pl
(custom script by Katrina Dlugosch)
2. de novo sequence assembly: MIRA
3. Secondary assembly: CAP3
4. BLASTx against nr to find homologous proteins
5. Functional annotation (GO) based on homology transfer
from BLAST hits: blast2go
6. Functional description of gametophyte transcriptome
7. Ks analysis of duplicate genes and past polyploidy
8. Summaries and statistical analyses
