
GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University

Genomes and gene contents


[Figure: gene counts of the genomes shown; values on the slide: 17,000; 45,000; 6,000; 10,000; 30,000; 25,000]

Duplicate genes in the genome

Arabidopsis gene families*

*: Families are clusters from Markov clustering, using all-against-all BLAST E-values as the distance measure

Gene function and duplication

What's the consequence?


Focus I: Duplication Mechanism and Loss Rate

Gene
Duplications

Mechanisms

Preferential
retention

Consequences

Duplication mechanisms

Whole genome duplication



Tandem duplication

Segmental duplication

Replicative transposition

Lineage-specific gains in plants and animals

Substantially more recent duplicates in plants than in animals


Mostly due to frequent whole genome duplications in plants

Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice          10115                    6743               28467                             35.5 (23.7)**
Arabidopsis   5984                     3990               21936                             27.3 (18.2)**
Human         811                      811                21954                             3.7
Mouse         1265                     1265               24041                             5.3

*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to percentage total based on normalized gains.
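
The normalization in the footnote can be sketched as follows. This is a minimal illustration, assuming the plant gains are simply scaled by 100/150 (the ratio of the human-mouse to the Arabidopsis-rice divergence time); the function and variable names are hypothetical, and the rounded results agree with the table to within one gene.

```python
# Divergence times from the table footnote (million years ago).
PLANT_DIVERGENCE_MYA = 150   # Arabidopsis-rice
ANIMAL_DIVERGENCE_MYA = 100  # human-mouse

# (lineage-specific gains, genes in families analyzed), copied from the table.
TABLE = {
    "Rice": (10115, 28467),
    "Arabidopsis": (5984, 21936),
    "Human": (811, 21954),
    "Mouse": (1265, 24041),
}
PLANTS = {"Rice", "Arabidopsis"}


def normalized_gain(organism):
    """Scale plant gains to the shorter animal divergence time (assumed scheme)."""
    gains, _ = TABLE[organism]
    scale = ANIMAL_DIVERGENCE_MYA / PLANT_DIVERGENCE_MYA if organism in PLANTS else 1.0
    return gains * scale


def percent_total(gain, organism):
    """Gain as a percentage of the genes in the families analyzed."""
    return 100.0 * gain / TABLE[organism][1]


for name, (gains, _) in TABLE.items():
    norm = normalized_gain(name)
    print(name, round(norm),
          round(percent_total(gains, name), 1),
          round(percent_total(norm, name), 1))
```

Even after normalization, the percentages for rice and Arabidopsis remain several times higher than those for human and mouse.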

Gain vs. Loss

3 rounds of whole-genome duplications in the Arabidopsis lineage


~82% of duplicates from the last round were lost in the past 40 million years

15,000* -> 30,000 -> 60,000 -> 120,000

Arabidopsis
Genome duplications + tandem duplications - gene losses = gene content: 21,000**

*: Number of orthologous groups in shared families between Arabidopsis and rice.


**: Number of genes in shared families.
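
The numbers on this slide fit a simple doubling model, sketched below. The assumption (not stated on the slide) is that each of the three whole-genome duplications exactly doubles the 15,000 ancestral orthologous groups and that tandem duplications are ignored; the names are hypothetical. Whether the slide's ~82% figure was derived this way is also an assumption.

```python
ANCESTRAL_GROUPS = 15_000   # orthologous groups in shared families (slide footnote *)
WGD_ROUNDS = 3              # whole-genome duplications in the Arabidopsis lineage
OBSERVED_GENES = 21_000     # genes in shared families today (slide footnote **)

# Expected gene count if every duplicate from every round had been retained.
expected_without_loss = ANCESTRAL_GROUPS * 2 ** WGD_ROUNDS

# Fraction of that expected content that is no longer present.
overall_loss = 1 - OBSERVED_GENES / expected_without_loss

print(expected_without_loss)        # genes expected with no loss
print(f"{overall_loss:.1%}")        # overall fraction lost under this model
```

Under these assumptions the observed gene content is a small fraction of what three lossless doublings would produce, which is the point of the "gain vs. loss" comparison.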

Age distribution of animal duplicates

Steady decay in the number of duplicates


Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT)

Ks: rate of nucleotide substitutions in codon sites that do not affect amino acid identity

Shiu et al., 2006

Plant duplicate age distribution

Apparent peak at ~0.18 instead of zero Ks


Frequent whole genome duplication (WGD), TD, SD (maybe), and RT (in some plants)

Shiu et al., 2004

Genome remodeling in polyploids

Natural and synthetic polyploids

[Figure: natural and synthetic polyploids; genome sizes shown: ~314 Mb, ~257 Mb, ~348 Mb, ~203 Mb; time label: 20,000 yr]

Experimental approaches

Genome-wide polymorphism monitored by tiling array

[Figure: tiling array design, with probes tiled along the genome; labels: gap, resolution, array with ~6 million features]

Genome-wide Single Feature Polymorphism

Mid-parent (MP) vs. Arabidopsis suecica (As)

Polyploid   SFP
Natural     58,517
Synthetic   503

Genome-wide Single Feature Polymorphism

Genome-wide polymorphism monitored by tiling array


Gene

Pseudogene

Transposon

Genome-wide Single Feature Polymorphism

Duplication or deletion

MP duplication or
As deletion

Genome Survey Sequencing

Sequence ~40-60 Mb of the Arabidopsis suecica genome
0.15-0.2X coverage, will be done next week!

Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant

Ultra-high throughput
20-30 Mb per run, each run 5 hours
Will be 100 Mb per run early 2007

Cost efficient
~$0.3/kb

Read length rather limited
~100 bp per read now
Will be ~200 bp early 2007
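
As a rough back-of-the-envelope sketch, the throughput and cost figures above translate into runs and dollars as follows. The function names are hypothetical, and the calculation assumes that cost scales linearly with sequence output and that partial runs are rounded up.

```python
import math

COST_PER_KB_USD = 0.3   # from the slide: ~$0.3/kb
HOURS_PER_RUN = 5       # from the slide: each run takes 5 hours


def runs_needed(target_mb, mb_per_run):
    """Number of whole runs needed to reach the target output."""
    return math.ceil(target_mb / mb_per_run)


def cost_usd(target_mb):
    """Sequencing cost, assuming linear scaling (1 Mb = 1000 kb)."""
    return target_mb * 1000 * COST_PER_KB_USD


for target_mb in (40, 60):          # survey target: ~40-60 Mb
    for mb_per_run in (20, 30):     # current throughput: 20-30 Mb per run
        n = runs_needed(target_mb, mb_per_run)
        print(f"{target_mb} Mb at {mb_per_run} Mb/run: {n} runs, "
              f"{n * HOURS_PER_RUN} h, ~${cost_usd(target_mb):,.0f}")
```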

For more information contact:

Andreas Weber (aweber@msu.edu )


David DeWitt (dewittd@msu.edu )
Or Shin-Han Shiu (shius@msu.edu )

Seminar on instrumentation:
9/29, Friday, 1pm, 1415 BPS

Summary: Gene duplication and polyploidy

Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.

In plants, whole genome duplication is common, but gene loss also occurred frequently.

After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.

After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.

Clustered polymorphisms are mostly located in pseudogenes and transposons.

Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.

Focus II: Differential Retention of Duplicates

Gene
Duplications

Mechanisms

Preferential
retention

Consequences

Duplicate genes in the genome

Arabidopsis gene families*

*: Families are clusters from Markov clustering, using all-against-all BLAST E-values as the distance measure

Large gene families in plants

One of the largest gene families

Normalized gain: % expanded OGs

Large family sizes do not necessarily indicate higher expansion rates

Ancestral family sizes and gene gains

Large ancestral families tend to have more lineage-specific gains, but with many exceptions

Differential expansion of functional categories

GO: Gene Ontology

Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism

Differences in Duplicability

Duplicability
The propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend
Category
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity

Arabidopsis

Human

Kinase superfamily sizes among eukaryotes

Organism                     Number of genes   Kinase superfamily   Percent of total genes
Arabidopsis thaliana         25,814            1041                 4.0
Oryza sativa subsp. indica   ~35,000           1607                 3.6
Chlamydomonas reinhardtii    ~12,200           414                  3.4
Plasmodium falciparum        5,334             94                   1.8
Plasmodium yoelii            7,681             70                   0.9
Caenorhabditis elegans       19,484            417                  2.1
Drosophila melanogaster      13,808            262                  1.9
Anopheles gambiae            15,088            216                  1.4
Ciona intestinalis           15,852            316                  2.0
Fugu rubripes                33,609            632                  1.9
Mus musculus                 22,444            495                  2.2
Homo sapiens                 22,980            472                  2.1
Saccharomyces cerevisiae     6449              113                  1.8
Candida albicans             6,164             95                   1.5
Neurospora crassa            10082             104                  1.9
Schizosaccharomyces pombe    4945              109                  2.2

Shiu & Bleecker, 2003

Kinase families in rice and Arabidopsis

Gene count differences among families indicate differential expansion

Shiu et al., 2004

Estimation of ancestral RLK family size

Kinase phylogeny of Arabidopsis and rice RLKs

440 speciation points

rice

Arabidopsis
A.

A.

WAK

B.

B.

LRR VIII, X, XII

Shiu et al., 2004

Development vs. resistance/defense RLKs

Shiu et al., 2004

Contradiction

Plant genes involved in development tend to have high duplicability

Resistance/Defense RLKs: High duplicability
Developmental RLKs: Low duplicability
Animal tyrosine kinases: Low duplicability
Transcription factors: High duplicability

Selection for expansion

Depends on the level of variation of the signals

Summary: differential retention

Longevity and duplicability of plant genes

Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High        Enzymes in central metabolic pathways
Low             Low         ??

Focus III: Functional Consequences

Gene
Duplications

Mechanisms

Preferential
retention

Consequences

Functional Consequences of Duplication

Functional divergence and conservation


Is it because of changes in cis-regulatory elements or in coding sequences?

How are duplicates retained: subfunctionalization or neofunctionalization?

Divergence in gene expression

Develop pipelines for cis-element prediction and

Expression data

Clusters of
genes with similar
expression profiles

Cis-regulatory
logic

Machine learning

Experimental
validations

Motif functional
prediction

Over-represented
sequence motifs
in 5' regions

Divergence in post-translational modification

Conservation of phosphorylation sites across species

SACE: Saccharomyces cerevisiae (budding yeast)
CAGL: Candida glabrata
CAAL: Candida albicans
CATR: Candida tropicalis
NECR: Neurospora crassa
DEHA: Debaryomyces hansenii

Detailed Functional Studies of Duplicate Genes

Functional analyses of DDF1 and DDF2 transcription factors


Derived from a recent whole genome duplication in Arabidopsis
Related to the well-known CBF factors involved in cold and drought stress
Arabidopsis thaliana

Promoter
GFP

Knockouts

DDFs

Binding
targets

Arabidopsis lyrata
Promoter
GFP

Overexpression
studies

Interacting
proteins

Knockouts

DDFs

Binding
targets

Overexpression
studies

Interacting
proteins

Focus IV: Protein space


Gene
Duplications

Mechanisms

Preferential
retention

Consequences

Tiling array analysis of transcriptome


Human Chr 21, 22

Kapranov et al., 2002

Posterior probability p(F|coding)

Performance of the CI measure

Known Arabidopsis exon and intron 90-300bp

Arabidopsis small proteins that are not annotated
Correctly predicted 19 out of 20 (95%).

Yeast sORFs with translation evidence
Correctly predicted 98 out of 114 (86%).

In intergenic sequences of the Arabidopsis genome
3,274 sORFs identified

Coupling with tiling array expression

Hybridization intensities for feature types

Summary: Novel coding genes

Many unannotated regions in the genomes are expressed.

Using the CI measure, many proteins that were not annotated but
with evidence of expression from yeast and Arabidopsis are identified
correctly.

Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.

Using tiling array data, we found that many of these novel coding
regions are expressed.

Acknowledgement

Lab members

University of Chicago
Justin Borevitz
Xu Zhang

Kousuke Hanada

University of Wisconsin
Sara Patterson
Rick Vierstra

Melissa Lehti-Shiu

University of Missouri
Scott Peck

Michigan State University


Many
Rong Jin, Comp Sci & Eng
Yue-Hua Cui, Stat & Prob
Startup fund

Cheng Zou

Emily Eckenrode

Recent completion

Genome remodeling in polyploids

Genome duplication occurs frequently in plants


What is the fate of duplicates?
How fast do gene losses occur?
Is there any preference in genes retained?

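[Figure: schematic of duplicate fate after genome duplication. Genes A, B and C, D with duplicate copies A1/A2, B1/B2, C1/C2, D1/D2, plus E1/E2; Ng = 10; the duplicated gene set is shown again at time points t1 and t2.]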

Comparing degrees of expansion


Arabidopsis:
~25,000 proteins

Rice prediction:
~66,000 genes

Combined set

Gene/domain
families

unique

GO:0001
Shared

ui = 1
Pairwise distance

ei = 4
Putative
orthologous groups

All orthologous groups

Total unexpanded = Σ ui
Total expanded = Σ ei
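
The counting step at the end of this workflow can be sketched as below. It is a minimal illustration that assumes an orthologous group (OG) counts as "expanded" when a lineage holds more than one gene in it, and as "unexpanded" when it holds exactly one; that definition and all names are assumptions, not taken from the slide.

```python
def expansion_summary(og_sizes):
    """Summarize one family (or GO category) in one lineage.

    og_sizes: number of genes the lineage has in each orthologous group.
    Returns (unexpanded count u, expanded count e, percent expanded OGs).
    """
    expanded = sum(1 for n in og_sizes if n > 1)      # assumed: >1 gene means expanded
    unexpanded = sum(1 for n in og_sizes if n == 1)   # assumed: single-copy means unexpanded
    total = expanded + unexpanded
    percent_expanded = 100.0 * expanded / total if total else 0.0
    return unexpanded, expanded, percent_expanded


# Hypothetical family matching the slide's example values ui = 1 and ei = 4.
u, e, pct = expansion_summary([1, 2, 3, 2, 4])
print(u, e, pct)
```

Expressing the gain as a percentage of expanded OGs rather than as a raw gene count keeps large families from dominating the comparison simply because they are large.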

Major questions on gene duplication

When: timing of gene duplications, e.g. N = 10

Domain gains in rice and Arabidopsis

Gain in one lineage does not necessarily predict gain in the other

Identify novel small coding genes

Determine base composition probabilities


Coding
sequences

CDS
parameters

Non-coding
sequences

NCDS
parameters

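\[ P_c(\mathrm{AAA}) = \frac{\#\ \text{of AAA}}{\#\ \text{of all NNN}} \]

\[ P_c(\mathrm{T} \mid \mathrm{AAA}) = \frac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})} \]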
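
A minimal sketch of how these two quantities could be estimated from a set of training sequences follows. It applies the ratio of 4-mer to 3-mer frequencies literally and ignores codon position, so it corresponds to a single parameter table rather than the six feature tables (c1 to c6); the function names are hypothetical.

```python
from collections import Counter


def kmer_freqs(seqs, k):
    """Relative frequency of every k-mer over all overlapping windows."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def conditional(seqs, context, nxt):
    """P(nxt | context) as the frequency ratio P(context + nxt) / P(context)."""
    p_context = kmer_freqs(seqs, len(context))
    p_joint = kmer_freqs(seqs, len(context) + 1)
    return p_joint.get(context + nxt, 0.0) / p_context[context]


training = ["AAATAAAC"]  # toy sequence, for illustration only
p_aaa = kmer_freqs(training, 3)["AAA"]
p_t_given_aaa = conditional(training, "AAA", "T")
print(p_aaa, p_t_given_aaa)
```

On very short inputs the 3-mer and 4-mer window totals differ noticeably, so the ratio deviates from a plain count ratio; with realistic training sets the two converge.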

Feature tables

c1

c2

c3

c4

c5

c6

Calculate posterior probability

\[ P(\mathrm{CDS} \mid S) = \frac{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS})}{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\,P(\mathrm{NCDS})} \]

Setting up the Bayes

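Priors

\[ P(\mathrm{CDS} \mid S) = \frac{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS})}{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\,P(\mathrm{NCDS})} \]

\[ P(\mathrm{CDS}) = P(\mathrm{NCDS}) = \frac{1}{2} \]

\[ P(\mathrm{CDS}_1) = P(\mathrm{CDS}_2) = \dots = P(\mathrm{CDS}_6) = \frac{1}{2} \times \frac{1}{6} \]

\[ P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) = \sum_{m=1}^{6} P(S \mid \mathrm{CDS}_m)\,P(\mathrm{CDS}_m) \]

S = ATG TTC TAC TTT G

\[ P(S \mid \mathrm{CDS}_1) = P_{c1}(\mathrm{ATG})\,P_{c1}(\mathrm{T} \mid \mathrm{ATG})\,P_{c2}(\mathrm{T} \mid \mathrm{TGT})\,P_{c3}(\mathrm{C} \mid \mathrm{GTT})\,P_{c1}(\mathrm{T} \mid \mathrm{TTC}) \dots \]
\[ P(S \mid \mathrm{CDS}_2) = P_{c2}(\mathrm{ATG})\,P_{c2}(\mathrm{T} \mid \mathrm{ATG})\,P_{c3}(\mathrm{T} \mid \mathrm{TGT})\,P_{c1}(\mathrm{C} \mid \mathrm{GTT})\,P_{c2}(\mathrm{T} \mid \mathrm{TTC}) \dots \]
\[ \vdots \]
\[ P(S \mid \mathrm{CDS}_6) = P_{c6}(\mathrm{ATG})\,P_{c6}(\mathrm{T} \mid \mathrm{ATG})\,P_{c4}(\mathrm{T} \mid \mathrm{TGT})\,P_{c5}(\mathrm{C} \mid \mathrm{GTT})\,P_{c6}(\mathrm{T} \mid \mathrm{TTC}) \dots \]
\[ P(S \mid \mathrm{NCDS}) = P_{n}(\mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{TGT})\,P_{n}(\mathrm{C} \mid \mathrm{GTT})\,P_{n}(\mathrm{T} \mid \mathrm{TTC}) \dots \]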
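
The posterior P(CDS | S) with six reading-frame hypotheses and one non-coding hypothesis can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes +1 pseudocounts, drops the initial 3-mer term (treated as equal under every hypothesis), and approximates the reverse-strand tables c4 to c6 by scoring the reverse complement with the forward tables. All names and the toy training sequences are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"
K = 3  # context length, as in Pc(T | ATG)


def revcomp(seq):
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def train(seqs, n_phases):
    """Count next-base occurrences per (phase, 3-mer context), with +1 pseudocounts.

    n_phases=3 keeps one table per codon position; n_phases=1 is phase-agnostic.
    """
    counts = defaultdict(lambda: {b: 1.0 for b in BASES})
    for s in seqs:
        for i in range(K, len(s)):
            counts[(i % n_phases, s[i - K:i])][s[i]] += 1.0
    return counts


def log_lik(seq, counts, n_phases, offset=0):
    """Log-likelihood of the base-to-base transitions in seq (initial 3-mer omitted)."""
    total = 0.0
    for i in range(K, len(seq)):
        row = counts[((i + offset) % n_phases, seq[i - K:i])]
        total += math.log(row[seq[i]] / sum(row.values()))
    return total


def posterior_cds(seq, cds_counts, ncds_counts):
    """P(CDS | S) with priors P(CDS_m) = 1/2 * 1/6 and P(NCDS) = 1/2."""
    log_prior_frame = math.log(0.5 / 6)
    cds_terms = [
        log_prior_frame + log_lik(strand, cds_counts, 3, offset)
        for strand in (seq, revcomp(seq))   # forward frames, then reverse frames
        for offset in range(3)
    ]
    ncds_term = math.log(0.5) + log_lik(seq, ncds_counts, 1)
    top = max(cds_terms + [ncds_term])      # subtract the maximum to avoid underflow
    numerator = sum(math.exp(t - top) for t in cds_terms)
    return numerator / (numerator + math.exp(ncds_term - top))


# Toy training data, for illustration only.
cds_counts = train(["GCTGAAGCAGAT" * 10], n_phases=3)
ncds_counts = train(["ATTTAATA" * 15], n_phases=1)
coding_like = posterior_cds("GCTGAAGCAGATGCTGAA", cds_counts, ncds_counts)
noncoding_like = posterior_cds("ATTTAATAATTTAATA", cds_counts, ncds_counts)
print(coding_like, noncoding_like)
```

Working in log space and subtracting the largest term before exponentiating keeps the products of many small probabilities from underflowing on longer sequences.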

Coding Likelihood (CL)

Sliding windows of a sequence

\[ CL = \frac{\sum P(\mathrm{CDS} \mid S_n)}{n} \]
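
One reading of the coding likelihood is the mean of the posterior P(CDS | S_n) over the n sliding windows of a sequence. The sketch below follows that reading; the window length, the step size, and the `posterior` callback are assumptions, since the slide does not specify them.

```python
def coding_likelihood(seq, posterior, window=30, step=3):
    """Average a per-window posterior over sliding windows of seq.

    posterior: any function mapping a subsequence to P(CDS | subsequence).
    window, step: hypothetical defaults, not taken from the slide.
    """
    last_start = max(len(seq) - window, 0)
    scores = [posterior(seq[s:s + window]) for s in range(0, last_start + 1, step)]
    return sum(scores) / len(scores)


# Dummy scorer for illustration: treats windows starting with "A" as coding.
def dummy_posterior(sub):
    return 1.0 if sub.startswith("A") else 0.0


cl = coding_likelihood("ACACAC", dummy_posterior, window=2, step=1)
print(cl)
```

A sequence shorter than the window is scored as a single window, so the function always returns a value for non-empty input.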

Simulation based on NCDS (introns)

