DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University
[Figure: values 6,000; 10,000; 30,000; 25,000]
Gene Duplications: Mechanisms, Preferential retention, Consequences
Duplication mechanisms
Tandem duplication
Segmental duplication
Replicative transposition
Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice          10115                    6743               28467                             35.5 (23.7)**
Arabidopsis   5984                     3990               21936                             27.3 (18.2)**
Human         811                      811                21954                             3.7
Mouse         1265                     1265               24041                             5.3
*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses are the percentages of the total based on normalized gains.
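The normalization in the footnote can be sketched as a short calculation. This is a minimal sketch: it assumes the plant gains are scaled by 100/150 so that they are comparable with the human-mouse interval. The names `ROWS`, `normalize`, and `percent` are illustrative, not from the slides. Rounding may differ by one from the table (5984 × 100/150 ≈ 3989.3 against the tabulated 3990).

```python
# Values transcribed from the table; divergence times from its footnote.
DIVERGENCE_MYA = {"plant": 150, "mammal": 100}  # Arabidopsis-rice, human-mouse

# organism: (lineage-specific gains, genes in families analyzed, lineage)
ROWS = {
    "Rice": (10115, 28467, "plant"),
    "Arabidopsis": (5984, 21936, "plant"),
    "Human": (811, 21954, "mammal"),
    "Mouse": (1265, 24041, "mammal"),
}


def normalize(gains, lineage, reference="mammal"):
    """Scale gains to the reference divergence time (assumed: a simple ratio)."""
    return gains * DIVERGENCE_MYA[reference] / DIVERGENCE_MYA[lineage]


def percent(gains, family_genes):
    """Gains as a percentage of the genes in the families analyzed."""
    return 100 * gains / family_genes


for organism, (gains, family_genes, lineage) in ROWS.items():
    norm = normalize(gains, lineage)
    print(f"{organism:12s} raw {percent(gains, family_genes):5.1f}%  "
          f"normalized {percent(norm, family_genes):5.1f}%")
```

Even after normalization, the two plant lineages show several times the percentage gain of human or mouse.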
Arabidopsis: 15,000*, 30,000, 60,000, 120,000
Genome duplications + tandem duplications - gene losses = gene content: 21,000**
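The arithmetic behind this balance can be sketched as follows. The sketch assumes that the four figures (15,000 to 120,000) represent three successive whole-genome doublings of a 15,000-gene ancestor, and it ignores tandem duplications, so the computed loss is only a lower bound. The function name is illustrative.

```python
def content_without_loss(ancestral_genes, doublings):
    """Gene count after repeated whole-genome duplication with no gene loss."""
    return ancestral_genes * 2 ** doublings


ancestral = 15_000   # starting gene count on the slide
observed = 21_000    # present-day gene content on the slide

expected = content_without_loss(ancestral, 3)
net_loss = expected - observed          # lower bound: tandem gains are ignored
fraction_lost = net_loss / expected

print(expected, net_loss, round(fraction_lost, 3))
```

Under these assumptions, most duplicates must have been lost for the gene content to stay near 21,000.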
Ks: synonymous substitution rate, i.e. the rate of nucleotide substitutions at codon sites where the change does not alter the encoded amino acid
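To illustrate what "does not affect amino acid identity" means, the sketch below classifies codon differences between two aligned coding sequences as synonymous or nonsynonymous using the standard genetic code. It is a deliberate simplification and not a Ks estimator: a real Ks divides by the number of synonymous sites and corrects for multiple substitutions. The function name and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_differences(seq1, seq2):
    """Count differing codons that keep (synonymous) or change the amino acid."""
    syn = nonsyn = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1      # e.g. GCT -> GCC, both alanine
        else:
            nonsyn += 1   # e.g. AAA (lysine) -> AGA (arginine)
    return syn, nonsyn


print(classify_differences("ATGGCTAAA", "ATGGCCAGA"))
```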
[Figure: sizes ~314 Mb, ~257 Mb, ~348 Mb, ~203 Mb; time label 20,000 yr]
Experimental approaches
[Figure: tiling array. Labels: Genome, Tiled probes, Gap, Resolution, Array (~6 million features), 20,000 yr]
Polyploid   SFP
Natural     58,517
Synthetic   503
Pseudogene
Transposon
Duplication or deletion
MP duplication or As deletion
Cost efficient: ~$0.3/kb
Seminar on instrumentation:
9/29, Friday, 1pm, 1415 BPS
Large ancestral families tend to have more lineage-specific gains, but with many exceptions
GO: Gene Ontology
Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism
Differences in Duplicability
Duplicability: the propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend
Category (columns: Arabidopsis, Human)
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity
Organism                     Total genes   Kinase superfamily   Percent
Arabidopsis thaliana         25,814        1,041                4.0
Rice                         ~35,000       1,607                3.6
Chlamydomonas reinhardtii    ~12,200       414                  3.4
Plasmodium falciparum        5,334         94                   1.8
Plasmodium yoelii            7,681         70                   0.9
Caenorhabditis elegans       19,484        417                  2.1
Drosophila melanogaster      13,808        262                  1.9
Anopheles gambiae            15,088        216                  1.4
Ciona intestinalis           15,852        316                  2.0
Fugu rubripes                33,609        632                  1.9
Mus musculus                 22,444        495                  2.2
Homo sapiens                 22,980        472                  2.1
Saccharomyces cerevisiae     6,449         113                  1.8
Candida albicans             6,164         95                   1.5
Neurospora crassa            10,082        104                  1.9
Schizosaccharomyces pombe    4,945         109                  2.2
[Figure: panels A and B; labels: Arabidopsis, WAK]
Contradiction
Resistance/Defense RLKs: High duplicability
Developmental RLKs: Low duplicability
Animal tyrosine kinases: Low duplicability
Transcription factors: High duplicability
Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High
Low             Low         ??
[Unplaced figure labels: OR, T]
Expression data
Clusters of genes with similar expression profiles
Cis-regulatory logic
Machine learning
Experimental validations
Motif functional prediction
Over-represented sequence motifs in 5' regions
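One common way to look for over-represented motifs in the 5' regions of a co-expression cluster is a hypergeometric test on k-mer presence. The sketch below is not necessarily the method used in this work. The sequences, gene names, and planted hexamer are toy placeholders, the function names are hypothetical, and no multiple-testing correction is applied.

```python
from collections import Counter
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the k-mer."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def enriched_kmers(cluster_ids, upstream, k=6, alpha=0.05):
    """Return {k-mer: p-value} for k-mers enriched in the cluster's 5' regions."""
    N, n = len(upstream), len(cluster_ids)
    genome_hits, cluster_hits = Counter(), Counter()
    for gene, seq in upstream.items():
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        genome_hits.update(present)        # genes in the genome with this k-mer
        if gene in cluster_ids:
            cluster_hits.update(present)   # genes in the cluster with this k-mer
    result = {}
    for kmer, hits in cluster_hits.items():
        p = hypergeom_tail(hits, genome_hits[kmer], n, N)
        if p < alpha:
            result[kmer] = p
    return result


# Toy 5' regions: g1-g3 form the cluster and share the planted hexamer CACGTG.
UPSTREAM = {
    "g1": "TTTTTTCACGTGAAAAAA",
    "g2": "GGGGGGCACGTGCCCCCC",
    "g3": "TATATACACGTGGAGAGA",
    "g4": "ACACACACACACACACAC",
    "g5": "GTGTGTGTGTGTGTGTGT",
    "g6": "AGAGAGAGAGAGAGAGAG",
    "g7": "CTCTCTCTCTCTCTCTCT",
    "g8": "AATTAATTAATTAATTAA",
    "g9": "CCGGCCGGCCGGCCGGCC",
    "g10": "ATCATCATCATCATCATC",
}

res = enriched_kmers({"g1", "g2", "g3"}, UPSTREAM)
print(res)
```

Counting presence per gene rather than total occurrences keeps a single repeat-rich promoter from dominating the statistic.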
Promoter GFP
Knockouts
DDFs
Binding targets
Overexpression studies
Interacting proteins
Arabidopsis lyrata
Using the CI measure, many yeast and Arabidopsis proteins that were not annotated but have evidence of expression are identified correctly.
Using tiling array data, we found that many of these novel coding
regions are expressed.
Acknowledgement
Lab members
University of Chicago
Justin Borevitz
Xu Zhang
Kousuke Hanada
University of Wisconsin
Sara Patterson
Rick Vierstra
Melissa Lehti-Shiu
University of Missouri
Scott Peck
Cheng Zou
Emily Eckenrode
Recent completion
[Figure: gene tree. Node labels: A, B, C, D; tip labels: A1, B1, A2, B2, C1, D1, C2, D2, E1, E2 (repeated at t1 and t2); other labels: Ng =, 10]
Rice prediction: ~66,000 genes
[Figure labels: Combined set; Gene/domain families; unique; Shared; GO:0001; ui = 1; Pairwise distance; ei = 4; Putative orthologous groups; = ei]
Gain in one lineage does not necessarily predict gain in the other
CDS: CDS parameters
Non-coding sequences: NCDS parameters
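\[
P_c(\mathrm{AAA}) = \frac{\#\ \text{of AAA}}{\#\ \text{of all NNN}},
\qquad
P_c(\mathrm{T} \mid \mathrm{AAA}) = \frac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})}
\]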
Feature tables: c1, c2, c3, c4, c5, c6
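P(CDS | S)

Priors:
\[
P(\mathrm{CDS}_1) = P(\mathrm{CDS}_2) = \dots = P(\mathrm{CDS}_6) = \frac{1}{2} \cdot \frac{1}{6}
\]

Likelihoods:
\begin{align*}
P(S \mid \mathrm{CDS}_1) &= P_{c1}(\mathrm{ATG})\, P_{c1}(\mathrm{T} \mid \mathrm{ATG})\, P_{c2}(\mathrm{T} \mid \mathrm{TGT})\, P_{c3}(\mathrm{C} \mid \mathrm{GTT})\, P_{c1}(\mathrm{T} \mid \mathrm{TTC}) \cdots \\
P(S \mid \mathrm{CDS}_2) &= P_{c2}(\mathrm{ATG})\, P_{c2}(\mathrm{T} \mid \mathrm{ATG})\, P_{c3}(\mathrm{T} \mid \mathrm{TGT})\, P_{c1}(\mathrm{C} \mid \mathrm{GTT})\, P_{c2}(\mathrm{T} \mid \mathrm{TTC}) \cdots \\
&\;\;\vdots \\
P(S \mid \mathrm{CDS}_6) &= P_{c6}(\mathrm{ATG})\, P_{c6}(\mathrm{T} \mid \mathrm{ATG})\, P_{c4}(\mathrm{T} \mid \mathrm{TGT})\, P_{c5}(\mathrm{C} \mid \mathrm{GTT})\, P_{c6}(\mathrm{T} \mid \mathrm{TTC}) \cdots \\
P(S \mid \mathrm{CDS}_n) &= P_{n}(\mathrm{ATG})\, P_{n}(\mathrm{T} \mid \mathrm{ATG})\, P_{n}(\mathrm{T} \mid \mathrm{TGT})\, P_{n}(\mathrm{C} \mid \mathrm{GTT})\, P_{n}(\mathrm{T} \mid \mathrm{TTC}) \cdots
\end{align*}

[Remaining equation fragments, not recoverable: 6; P(CDS | S n); CL; n]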
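The likelihood products above can be sketched as a third-order Markov chain scored under a coding and a non-coding model and then combined with Bayes' rule. This is a simplified sketch under stated assumptions: one coding table stands in for the six frame-specific tables c1 to c6, pseudocounts are added to avoid zero probabilities, the prior of 1/2 for coding is the six priors of 1/2 · 1/6 summed, and the training sequences are toy data. All function names are hypothetical.

```python
from collections import Counter
from math import exp, log

BASES = "ACGT"


def train(seqs):
    """Count 3-mers and 4-mers in the training sequences."""
    tri, tetra = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - 2):
            tri[s[i:i + 3]] += 1
        for i in range(len(s) - 3):
            tetra[s[i:i + 4]] += 1
    return tri, tetra


def log_likelihood(s, model, pseudo=1.0):
    """log P(S | model): first 3-mer, then each base given the previous three."""
    tri, tetra = model
    total = sum(tri.values())
    # P(XYZ) = (# of XYZ) / (# of all NNN), smoothed with pseudocounts
    ll = log((tri[s[:3]] + pseudo) / (total + 64 * pseudo))
    for i in range(3, len(s)):
        ctx, base = s[i - 3:i], s[i]
        # P(base | ctx): count of ctx+base over the count of 4-mers starting with ctx
        ctx_total = sum(tetra[ctx + b] for b in BASES)
        ll += log((tetra[ctx + base] + pseudo) / (ctx_total + 4 * pseudo))
    return ll


def posterior_coding(s, cds_model, ncds_model, prior_cds=0.5):
    """P(CDS | S) by Bayes' rule, computed in log space for stability."""
    l_c = log_likelihood(s, cds_model) + log(prior_cds)
    l_n = log_likelihood(s, ncds_model) + log(1 - prior_cds)
    m = max(l_c, l_n)
    return exp(l_c - m) / (exp(l_c - m) + exp(l_n - m))


# Toy training data (assumptions, not from the slides).
cds_model = train(["ATGGCTGCTGCTGCTGCTGCTTAA"])
ncds_model = train(["TATATATATATATATATATATATA"])

print(posterior_coding("GCTGCTGCTGCT", cds_model, ncds_model))
print(posterior_coding("TATATATATA", cds_model, ncds_model))
```

Extending the sketch to the frame-specific form would mean training one table per codon position and strand, then cycling through the tables along the sequence as in the products for CDS1 to CDS6.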