Gene Family

GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University
Genomes and gene contents

17,000
45,000
6,000
10,000
30,000
25,000
Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters of Markov clustering using all-against-all BLAST E values as distance measures
Gene function and duplication
What's the consequence?
Focus I: Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, preferential retention, consequences
Duplication mechanisms
Whole genome duplication

Tandem duplication
Segmental duplication
Replicative transposition
Lineage-specific gains in plants and animals
Substantially more recent duplicates in plants than in animals

Mostly due to frequent whole genome duplications in plants
Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice          10115                    6743               28467                             35.5 (23.7)**
Arabidopsis   5984                     3990               21936                             27.3 (18.2)**
Human         811                      811                21954                             3.7
Mouse         1265                     1265               24041                             5.3
*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to the percentage of total based on normalized gains.
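The normalization in the footnote can be checked with a few lines of arithmetic. The sketch below assumes that plant gains are simply scaled by 100/150 (the ratio of the human-mouse to the Arabidopsis-rice divergence time) and that animal gains are left unscaled; the variable and function names are illustrative only.

```python
# Divergence times from the table footnote (million years ago).
PLANT_DIVERGENCE_MYA = 150   # Arabidopsis-rice
ANIMAL_DIVERGENCE_MYA = 100  # human-mouse

# organism: (lineage-specific gains, genes in families analyzed, is_plant)
rows = {
    "Rice": (10115, 28467, True),
    "Arabidopsis": (5984, 21936, True),
    "Human": (811, 21954, False),
    "Mouse": (1265, 24041, False),
}

def normalized_gain(gain, is_plant):
    """Scale plant gains to the shorter animal divergence time (assumed rule)."""
    scale = ANIMAL_DIVERGENCE_MYA / PLANT_DIVERGENCE_MYA if is_plant else 1.0
    return gain * scale

results = {}
for organism, (gain, n_genes, is_plant) in rows.items():
    norm = normalized_gain(gain, is_plant)
    pct_raw = 100 * gain / n_genes    # "% total" column
    pct_norm = 100 * norm / n_genes   # value in parentheses for plants
    results[organism] = (norm, pct_raw, pct_norm)
    print(f"{organism:12s} normalized={norm:8.1f} "
          f"%total={pct_raw:5.1f} %normalized={pct_norm:5.1f}")
```

Under this assumption the Arabidopsis normalized gain comes out at about 3989.3, one below the 3990 shown in the table, which is presumably a rounding difference.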
Gain vs. Loss
3 rounds of whole-genome duplication in the Arabidopsis lineage
~82% of duplicates from the last round were lost in the past 40 million years
Gene number by successive doubling with no loss: 15,000* -> 30,000 -> 60,000 -> 120,000
Arabidopsis: genome duplications + tandem duplications - gene losses = gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
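The numbers on this slide are consistent with a simple doubling calculation. The sketch below is a naive back-of-the-envelope model, assuming each whole-genome duplication exactly doubles the gene count and ignoring tandem duplications; the variable names are illustrative.

```python
ancestral_groups = 15_000   # orthologous groups shared with rice (slide value)
wgd_rounds = 3              # whole-genome duplications in the Arabidopsis lineage
observed_genes = 21_000     # genes in shared families today (slide value)

# Gene count expected if every duplicate had been retained.
expected_no_loss = ancestral_groups * 2 ** wgd_rounds

# Net shortfall relative to the no-loss expectation.
net_loss_fraction = 1 - observed_genes / expected_no_loss

print(f"expected with no loss: {expected_no_loss:,}")
print(f"net fraction missing:  {net_loss_fraction:.1%}")
```

This net figure covers all three rounds together, so it is not the same quantity as the loss rate quoted for the last round alone.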
Age distribution of animal duplicates
Steady decay in the number of duplicates

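Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT)
Ks: number of synonymous substitutions per synonymous site, i.e. nucleotide substitutions in codons that do not change the encoded amino acid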
Shiu et al., 2006
Plant duplicate age distribution
Apparent peak at ~0.18 instead of zero Ks

Frequent WGD, TD, SD (maybe), and RT (in some plants)
Shiu et al., 2004
Genome remodeling in polyploids
Natural and synthetic polyploids
~314 Mb
~257 Mb
20,000 yr
~348 Mb
~203 Mb
Experimental approaches
Genome-wide polymorphism monitored by tiling array
Gap
Resolution
Genome
Tiled probes
Array
~6 million features
20,000 yr
Genome-wide Single Feature Polymorphism
Mid-parent (MP) vs. Arabidopsis suecica (As)
Polyploid    SFP
Natural      58,517
Synthetic    503
Genome-wide polymorphism monitored by tiling array

Gene
Pseudogene
Transposon
Duplication or deletion
MP duplication or As deletion
Genome Survey Sequencing
Sequence ~40-60Mb of the Arabidopsis suecica genome

0.15-0.2 X coverage, will be done next week!
Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
Ultra-high throughput
20-30 Mb per run, each run 5 hours
Will be 100Mb per run early 2007
Cost efficient
~$0.3/kb
Read length rather limited

~100bp per read now
Will be ~200bp early 2007
For more information contact:
Andreas Weber (aweber@msu.edu)
David DeWitt (dewittd@msu.edu)
Or Shin-Han Shiu (shius@msu.edu)
Seminar on instrumentation:
9/29, Friday, 1pm, 1415 BPS
Summary: Gene duplication and polyploidy
Gene duplication has occurred frequently in eukaryotes, but most duplicates are lost.
In plants, whole genome duplication is common, but gene loss has also occurred frequently.
After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.
After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.
Clustered polymorphisms are mostly located in pseudogenes and transposons.
Survey sequencing is necessary to determine whether some coding genes have become pseudogenes without being deleted.
Focus II: Differential Retention of Duplicates
Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters of Markov clustering using all-against-all BLAST E values as distance measures
Large gene families in plants
One of the largest gene families
Normalized gain: % expanded OGs
Large family sizes do not necessarily indicate higher expansion rates
Ancestral family sizes and gene gains
Large ancestral families tend to have more lineage-specific gains, but with many exceptions
Differential expansion of functional categories
GO: Gene Ontology
Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism
Differences in Duplicability
Duplicability
The propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend
Category
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity
Arabidopsis
Human
Kinase superfamily sizes among eukaryotes

Organism                     Number of genes   Kinase superfamily   Percent of total genes
Arabidopsis thaliana         25,814            1041                 4.0
Oryza sativa subsp. indica   ~35,000           1607                 3.6
Chlamydomonas reinhardtii    ~12,200           414                  3.4
Plasmodium falciparum        5,334             94                   1.8
Plasmodium yoelii            7,681             70                   0.9
Caenorhabditis elegans       19,484            417                  2.1
Drosophila melanogaster      13,808            262                  1.9
Anopheles gambiae            15,088            216                  1.4
Ciona intestinalis           15,852            316                  2.0
Fugu rubripes                33,609            632                  1.9
Mus musculus                 22,444            495                  2.2
Homo sapiens                 22,980            472                  2.1
Saccharomyces cerevisiae     6449              113                  1.8
Candida albicans             6,164             95                   1.5
Neurospora crassa            10082             104                  1.9
Schizosaccharomyces pombe    4945              109                  2.2
Shiu & Bleecker, 2003
Kinase families in rice and Arabidopsis
Gene count differences among families indicate differential expansion
Shiu et al., 2004
Estimation of ancestral RLK family size
Kinase phylogeny of Arabidopsis and rice RLKs
440 speciation points
[Figure: rice vs. Arabidopsis kinase phylogeny. Panel A: WAK. Panel B: LRR VIII, X, XII.]
Shiu et al., 2004
Development vs. resistance/defense RLKs
Shiu et al., 2004
Contradiction
Plant genes involved in development tend to have high duplicability
Resistance/Defense RLKs: high duplicability
Developmental RLKs: low duplicability
Animal tyrosine kinases: low duplicability
Transcription factors: high duplicability
Selection for expansion
Depend on the level of variations of the signals
OR
T
Summary: differential retention
Longevity and duplicability of plant genes
Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High        Enzymes in central metabolic pathways
Low             Low         ??
Focus III: Functional Consequences
Functional Consequences of Duplication
Functional divergence and conservation

Is it because of changes in cis-regulatory elements or in coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?
Divergence in gene expression
Develop pipelines for cis-element prediction and
Expression data
Clusters of genes with similar expression profiles
Cis-regulatory logic
Machine learning
Experimental validations
Motif functional prediction
Over-represented sequence motifs in 5' regions
Divergence in post-translational modification
Conservation of phosphorylation sites across species
SACE: budding yeast
CAGL: Candida glabrata
CAAL: Candida albicans
CATR: Candida tropicalis
NECR: Neurospora crassa
DEHA: Debaryomyces hansenii
Detailed Functional Studies of Duplicate Genes
Functional analyses of DDF1 and DDF2 transcription factors

Derived from recent whole genome duplication in Arabidopsis
Related to the well-known CBF factors involved in cold and drought stress
Arabidopsis thaliana
Promoter
GFP
Knockouts
DDFs
Binding
targets
Arabidopsis lyrata
Promoter
GFP
Overexpression
studies
Interacting
proteins
Knockouts
DDFs
Binding
targets
Overexpression
studies
Interacting
proteins
Focus IV: Protein space

Tiling array analysis of transcriptome

Human Chr 21, 22
Kapranov et al., 2002
Posterior probability p(F|coding)
Performance of the CI measure
Known Arabidopsis exons and introns, 90-300 bp
Arabidopsis small proteins that are not annotated
Correctly predicted 19 out of 20 (95%).
Yeast sORFs with translation evidence
Correctly predicted 98 out of 114 (86%).
In intergenic sequences of the Arabidopsis genome
3,274 sORFs identified
Coupling with tiling array expression
Hybridization intensities for feature types
Summary: Novel coding genes
Many unannotated regions in the genomes are expressed.
Using the CI measure, many proteins from yeast and Arabidopsis that were not annotated but have evidence of expression are identified correctly.
Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
Using tiling array data, we found that many of these novel coding regions are expressed.
Acknowledgement
Lab members
University of Chicago
Justin Borevitz
Xu Zhang
Kousuke Hanada
University of Wisconsin
Sara Patterson
Rick Vierstra
Melissa Lehti-Shiu
University of Missouri
Scott Peck
Michigan State University

Many
Rong Jin, Comp Sci & Eng
Yue-Hua Cui, Stat & Prob
Startup fund
Cheng Zou
Emily Eckenrode
Recent completion
Genome remodeling in polyploids
Genome duplications occur frequently in plants

What is the fate of duplicates?
How fast do gene losses occur?
Is there any preference in genes retained?
[Figure: schematic of genes A, B, C, D and the duplicate copies A1/A2, B1/B2, C1/C2, D1/D2, E1/E2 at time points t1 and t2; Ng = 10]
Comparing degrees of expansion

Arabidopsis:
~25,000 proteins
Rice prediction:
~66,000 genes
Combined set
Gene/domain
families
unique
GO:0001
Shared
ui = 1
Pairwise distance
ei = 4
Putative
orthologous groups
All orthologous groups

Total unexpanded = Σ ui
Total expanded = Σ ei
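The tally at the end of this pipeline can be sketched as follows. This is only one reading of the slide: it assumes that, within family i, an orthologous group (OG) counts as unexpanded (u_i) when the lineage has exactly one gene in it and as expanded (e_i) when it has more than one. The family names and OG sizes are made-up toy data.

```python
# Hypothetical input: for each family, the number of genes from one lineage
# found in each putative orthologous group.
families = {
    "family_A": [1, 2, 3, 2, 2],  # one unexpanded OG, four expanded (u=1, e=4)
    "family_B": [1, 1, 1, 2],
}

def tally(og_sizes):
    """Return (u_i, e_i): counts of unexpanded and expanded OGs in one family."""
    unexpanded = sum(1 for n in og_sizes if n == 1)
    expanded = sum(1 for n in og_sizes if n > 1)
    return unexpanded, expanded

per_family = {name: tally(sizes) for name, sizes in families.items()}
total_unexpanded = sum(u for u, _ in per_family.values())
total_expanded = sum(e for _, e in per_family.values())

# Normalized gain expressed as the percentage of expanded OGs.
percent_expanded = 100 * total_expanded / (total_unexpanded + total_expanded)

print(per_family)
print(f"total unexpanded={total_unexpanded}, total expanded={total_expanded}, "
      f"% expanded OGs={percent_expanded:.1f}")
```

Counting OGs rather than genes keeps a single very large OG from dominating the comparison between families.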
Major questions on gene duplication
When: timing of gene duplications, e.g. N = 10
Domain gains in rice and Arabidopsis
Gain in one lineage does not necessarily predict gain in the other
Identify novel small coding genes
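Determine base composition probabilities
Coding sequences -> CDS parameters
Non-coding sequences -> NCDS parameters
\[ P_c(\mathrm{AAA}) = \frac{\#\ \text{of AAA}}{\#\ \text{of all NNN}}, \qquad P_c(\mathrm{T} \mid \mathrm{AAA}) = \frac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})} \]
Feature tables: c1, c2, c3, c4, c5, c6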
Calculate posterior probability
\[ P(CDS \mid S) = \frac{P(S \mid CDS)\,P(CDS)}{P(S \mid CDS)\,P(CDS) + P(S \mid NCDS)\,P(NCDS)} \]
Setting up the Bayes
Priors
\[ P(CDS) = P(NCDS) = \frac{1}{2} \]
\[ P(CDS_1) = P(CDS_2) = \cdots = P(CDS_6) = \frac{1}{2} \cdot \frac{1}{6} \]
\[ P(S \mid CDS)\,P(CDS) = \sum_{m=1}^{6} P(S \mid CDS_m)\,P(CDS_m) \]
S = ATG TTC TAC TTT G
\[ P(S \mid CDS_1) = P_{c1}(ATG)\,P_{c1}(T \mid ATG)\,P_{c2}(T \mid TGT)\,P_{c3}(C \mid GTT)\,P_{c1}(T \mid TTC) \cdots \]
\[ P(S \mid CDS_2) = P_{c2}(ATG)\,P_{c2}(T \mid ATG)\,P_{c3}(T \mid TGT)\,P_{c1}(C \mid GTT)\,P_{c2}(T \mid TTC) \cdots \]
\[ P(S \mid CDS_6) = P_{c6}(ATG)\,P_{c6}(T \mid ATG)\,P_{c4}(T \mid TGT)\,P_{c5}(C \mid GTT)\,P_{c6}(T \mid TTC) \cdots \]
\[ P(S \mid NCDS) = P_{n}(ATG)\,P_{n}(T \mid ATG)\,P_{n}(T \mid TGT)\,P_{n}(C \mid GTT)\,P_{n}(T \mid TTC) \cdots \]
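The Bayes setup above can be sketched in code. This is a minimal illustration, not the pipeline used in the talk: it assumes tables c1-c3 are counted from coding sequences at codon phases 0-2 and c4-c6 from their reverse complements, adds a pseudocount of 1 to avoid zero probabilities, and trains on two made-up toy sequences. All function names are hypothetical.

```python
import math
from collections import Counter

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_kmers(seqs, phase=None):
    """Count 3-mers and 4-mers; if phase is 0/1/2, keep only k-mers starting at that codon phase."""
    tri, quad = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - 2):
            if phase is None or i % 3 == phase:
                tri[s[i:i + 3]] += 1
                if i + 4 <= len(s):
                    quad[s[i:i + 4]] += 1
    return tri, quad

def log_p_first(table, triplet, pseudo=1.0):
    # P(AAA) = # of AAA / # of all NNN, smoothed with a pseudocount (assumption).
    tri, _ = table
    return math.log((tri[triplet] + pseudo) / (sum(tri.values()) + 64 * pseudo))

def log_p_next(table, triplet, base, pseudo=1.0):
    # P(T | AAA): 4-mer count normalized over the four possible next bases.
    _, quad = table
    total = sum(quad[triplet + b] for b in "ACGT")
    return math.log((quad[triplet + base] + pseudo) / (total + 4 * pseudo))

def log_lik(seq, pick_table):
    """pick_table(i) returns the table used for the triplet starting at position i."""
    ll = log_p_first(pick_table(0), seq[:3])
    for i in range(len(seq) - 3):
        ll += log_p_next(pick_table(i), seq[i:i + 3], seq[i + 3])
    return ll

def posterior_cds(seq, cds_tables, ncds_table):
    """P(CDS | S) with priors P(CDS) = P(NCDS) = 1/2 and P(CDS_m) = 1/12."""
    terms = []
    for m in range(6):
        strand, offset = divmod(m, 3)  # frames 1-3 forward, 4-6 reverse
        pick = lambda i, s=strand, o=offset: cds_tables[3 * s + (i + o) % 3]
        terms.append(log_lik(seq, pick) + math.log(1 / 12))
    ncds = log_lik(seq, lambda i: ncds_table) + math.log(1 / 2)
    top = max(terms + [ncds])  # subtract the maximum for numerical stability
    cds_sum = sum(math.exp(t - top) for t in terms)
    return cds_sum / (cds_sum + math.exp(ncds - top))

def train(cds_seqs, ncds_seqs):
    rc = [revcomp(s) for s in cds_seqs]
    cds_tables = ([count_kmers(cds_seqs, p) for p in range(3)]
                  + [count_kmers(rc, p) for p in range(3)])
    return cds_tables, count_kmers(ncds_seqs)

# Toy training data (made up for illustration only).
cds_train = ["ATGGCTGCCGCAGCTGGCGCCGCTGCAGCCTAA" * 3]
ncds_train = ["ATTTTAAATATATTTAAATTATATAAATTTTAT" * 3]
cds_tables, ncds_table = train(cds_train, ncds_train)

p_coding = posterior_cds("GCTGCCGCAGCTGGC", cds_tables, ncds_table)
p_noncoding = posterior_cds("TTTAAATATATTTAA", cds_tables, ncds_table)
print(f"P(CDS | coding-like S)     = {p_coding:.4f}")
print(f"P(CDS | non-coding-like S) = {p_noncoding:.4f}")
```

Working in log space avoids numerical underflow when S is long, since the likelihood is a product of many small probabilities.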
Coding Likelihood (CL)
Sliding windows of a sequence
\[ CL = \frac{\sum P(CDS \mid S_n)}{n} \]
Simulation based on NCDS (introns)