Beruflich Dokumente
Kultur Dokumente
Garland Science
MO ECU AR
E Ell S 4TH ED I TION
*Missing the glossary and index, and some of the reference pages at the end
of the chapters, they are useless and take a lot of time to scan, other than
that the scan is perfect.
xi
Contents
PCR 163
Evolution 297
Diseases 467
Factors 497
Glossary 719
Index 737
xii
Detailed Contents
Chapter 1
The complex relationship between amino acid
sequence and protein structure 25
The a -helix. 25
Nucleic acids 3
Polypeptides 4
Chapter 2
and function 6
2.1 pLOIDY AND THE CEIJ. CYCLE 30
REPLICATIOI'I / 7
Mitosis is the normal form of cell division 31
semi-discontinuous 9
Independent assortment 34
EXPRE.SSION 12
2.3 STRUCfURE.AND FUNCTION OF
Most genes are expressed to make polypeptides 14
CHROMOSOMES 36
5' capping 18
in different organisms 39
3' polyadenylation 18
Replication of a mammaJia n chromosome involves
processing 19
Telomeres have speCialized structures to preserve
unive rsal 22
2.4 STUDYlNG HOMAN CHROMOSOMES 43
4.3 PRINCIPLES OF CELl. SIGNAlING 102 Additional recombination and mutation mechanisms
Signaling molecules bind to specific re ceptors in contribute to receptor diversity in B cells. but
responding cells to trigger altered cell behavior 102 not T cells 130
Some signaling molecules bind intracellular receptors The monospecificity ofIgs and TCRs is due to allelic
that activate target genes directly 103 exclusion and light chain exclusion 131
Signaling through cell surface receptors often FURTHER READING 132
involves kinas e cascades 105
Signal transduction pathways often use small
intermediate intracellular s ignaling molecules 106
Chapter 5
Synapti c sig naling is a specialized form of cell Principles of Development 133
signal ing that does not require the activatio n of 5.1 AN OVERVIEW OF DEVELOPMF.NT 134
transcription factors 107
Animal models of development 135
4.4 CEll PROUFElWION, EN.E5CENCE,AND
5.2 CELL SPECIALIZAl10N DURI""G
PROGRAMMED CEll DEATH 108
DEVUOl'MENT 136
Most of the cells in mature animals are non-dividing
cells, but some tissues and cells turn over rapidly 108 Cells become specialized through an irreversib le
Mitogens promote cell proliferation by overcoming series of hierarchical decisions 136
braking mechanisms that restrain cell cycle The choice between alternative cell fates often
dep ends on cell position 137
progression in Gl 109
Cell proliferation limits and the concept of cell Sometimes cell fate can be specified by lineage
senescence 111 rather th an position 138
Large numbers of our cells are naturally 5.3 P,\TTF.RN PORMATION IN DEVELOPMENT 139
programmed to die 111 Emergence of the body plan is dependent o n axis
The importance of programmed cell death 11 2 speCification and polarization 139
Apoptosis is performed by caspases in response Pattern formation often depends on grad ients of
to death s ignals or s ustai ned cell stress 113 s ignaling molecules 141
Extrinsic pathways: s ignaling through cell Homeotic mutations reveal the molecular basis of
s urface death receptors 11 3 pOSitional identity 142
Intrinsic pathways: intracellular responses to
cell stress 113 5.4 MORPHOGHNESIS 144
4.5 STEM CEUS AND DIFFEHENTtATION 114 Morphogenesis can be driven by changes in cell
shape and size 144
CeU specialization involves a directed series of
Major morphogenetic changes in the embryo
hierarchical decisions 114
result from changes in cell affinity 145
Stem ceUs are rare self-renewing progenitor cells 115
Cell proliferation and apoptosis are important
Tissue stem cells allow specific adult tissues to be
morphogenetic mechanisms 145
replenished 115
Stem cell niches 11 7 5.5 I!ARlY HUMAl\J DEVELOPMENT:
Stem cell ren ewal versns differentiation 11 7 FERTILIZATION TO GASTRULATIO/l. 146
Embryo nic s tem cells and embryonic germ cells During fertilization the egg is activated to form
are pluripotent 11 7 a unique individual 146
Origins of cultured embryonic stem cells 118 Cleavage partitions the zygote into many smaller
Pluripotency tests 119 cells 147
Embryonic germ cells 119 Mammalian eggs are among the smallest in the
animal kingdom. and cleavage in mammals is
4.6 L\IMUNESYSTEM CELLS: FUNCTION exceptional in several ways 148
TH,ROUGH DIVERSITY 119
Only a small percentage of the cells in the early
The innate immune system provides a rapid
mammalian embryo give rise to the mature
response based on general pattern recognition
organism 148
of pathogens 121
Implantation 150
The adaptive immune system mounts highly specific
Gastrulation is a dynamic process by which cells
immune responses that are enhanced by memory
of the epiblast give rise to the three germ layers lSI
~lli In
Humoral immunity depends on the activities of 5.6 NEURAL DEVE!..OPMENT 154
soluble antibodies 123 The axial mesoderm induces the overlying
In cell-media ted immun ity. T cells recognize cells ectoderm to develop into the nervous system 155
containing fragm ents of foreign proteins 125 Pattern formation in the neural tube involves the
T-cell ac tivation 127 coordinated expression of genes along two axes 155
The unique organization and expression oflg and Neuronal differentiation involves the combinatorial
TCR genes 128 activity of transcription factors 157
Detailed Contents XV
Chapter 9
More than 3000 human genes synthesize a wide
RNAs 287
GENOME 257
HETEROCHROMAfIN AND TRAJl;SPOSO:'ll
determine 261
A few human L1NE-I eleruents are active
CONCLUSION 293
Chapter 10
fragments 271
genes 28 1
amount of functional noncoding DNA in
expression 285
Gene duplication can permit increased gene dosage
The globin superfa mily illustrates divergence in Elongation of the transcript 349
gene regulation and function after gene Termination of transcription 349
duplication 317 Many other proteins modulate the activity of the
T\\'O or !hree major whole genome duplication basal transcription apparatus 349
e'\-ents have occurred in vertebrate lineages since Sequence-specific DNA-binding proteins can bind
the split from tunicates 318 close to a promoter or at more remote locations 350
~Iajor chromosome rearrangements have occurred Co-activators and co-repressors influence
duri ng mammalian genome evolution 319 promoters without binding to DNA 352
In heteromorphic sex chromosomes, the smaller 11.2 CHll0MATlN CONPOR.\MTION: DNA
chrOinosome is limited to one sex and is mostly MEfHYtATION AND THE HISTONF CODE 353
non -reco mbining with few genes 320 Modifications ofillstones in nucleosomes may
The pseudoautosomal regions have changed rapidly comprise a histone code 353
during evolution 322 Open and closed chromatin 354
Human sex chromosOlnes evolved after a sex ATP-dependent chromatin remodeling
determining locus developed on one autosome, complexes 356
causing it to diverge from its homolog 323 DNA methylation is an important control in gene
Abundant testis- expressed genes on the Y expression 357
chromosome are mostiy maintained by Methyl-CpG-binding proteins 358
intrachromosomal gene conversion 324 DNA methylation in development 359
X-chromosome inactivation developed in response Chromatin states are maintained by several
to gene depletion from the Y chromosome 326 interacting mechanisms 360
10_-1 OUlI PL\CE IN THE TREE OF LIrE 326 The role of HPI protein 36 1
Molecular phylogenetics uses sequence alignments A role for small RNA molecules 361
to construct evolutionary trees 326 A role for nuclear localization 361
Evolutionary trees can be constructed in different No single prime cause? 362
ways, and their reliability is tested by statistical The ENCODE project seeks to give a comprehensive
methods 328 overview of transcription and its control 362
The G-value paradox: organism complexity is Transcription is far more extensive than
not simply related to the number of (protein-coding) previously imagined 362
genes 329 Predicting transcription start sites 364
Striking lineage-specific gene family expansion 11.3 EPIGENETIC MEMORY AND IMPRINTING 365
often involves environmental genes 332 Ep igenetic memory depends on DNA methylation,
Regulatory DNA sequences and other noncoding and possibly on the polycomb and trithorax
DNA have significantly expanded in complex groups of proteins 365
metazoa ns 333 X-inactivation: an epigenetic change that is heritable
Mutations in cis-regulatory sequences lead to from cell to daughter cell, but not from parent
gene expression differences that underlie ro~M 3~
morphological divergence 333 Initiating X-inactivation: the role ofXlST 366
Lineage-specific exons and cis-regulatory elements Escaping X- inactivation 367
can originate from transposable elements 335 At imprinted loci, expression depends on the
Gene family expansion and gene loss/ inactivation parental origin 367
have occurred recently in human lineages, but Prader-Willi and Angelman syndromes are classic
human-specific genes are very rare 337 examples of imprinting in humans 368
Comparative genomic and phenotype-led studies Two questions arise about imprinting: how is it
seek to identify DNA sequences important in done and why is it done? 370
defining humans 339 Paramutations are a type oftransgenerational
CONCLUSION 341 epigenetic change 370
purrrHER READI"G 342 Some genes are expressed from only one allele but
independently of parental origin 371
Chapter 11 11.4 ONE! GENF. -MORE THAN ONE PROTEIN 372
Many genes have more than one promoter 372
Human Gene Expression 345
Alternative splicing allows one primary transcript
1 Ll PROMOTERS AND THE PRIMARY to encode multiple protein isoforms 373
TRANSCRIPT 347 RNA editing can change the sequence of the
Transcription by RNA polymerase 1I is a multi-step mRNA after transcription 374
process 347 11 .5 CONTROL OF GENE EAl'RESS(ON AT TILE
Defining the core promoter and transcription LFVEI.OFTRANSLATION 375
start site 347 Further controls govern when and where a
Assembling the basal transcription apparatus 348 mRNA is translated 375
Detailed Contents xix
The discovery of many small RNAs that regulate The protein interactome provides an important
gene expression caused a paradigm shift in gateway to systems biology 399
cell biology 376 Defining nucl eic acid-protein interactions is
MicroRNAs as regulators of translation 376 critical to understanding how genes function 401
MicroRNAs and cancer 378 Mapping protein-DNA interactions in vitro 401
Some unresolved questions 378 Mapping protein-DNA interactions in vivo 402
COl\ClUSION 378 CONCLUSION 403
FURTHER READING 379 FURTHER READING 404
Chapter 12 Chapter 13
Studying Gene Function in the Human Genetic Variability and Its
Post-Genome Era 381 Consequences 405
12.1 STU D\Cl NG GENE FUNCTlON:AN OVERVIEW 362 13.1 TYPES OFVARL\T10N BETWEEN HUMAN
Gene function can be studied at a variety of GENOMES 406
different levels 383 Single nucleotide polymorphisms are numerically
Gene expression studies 383 the most abundant type of genetic variant 406
Gene inactivation and inhibition of gene Both interspersed and tandem repeated sequences
expression 383 can show polymorphic va riation 408
Defining molecular partners for gene Short tandem repeat polymorph isms: the
products 383 workhorses of family and forensic studies 408
Genomewide analyses aim to integrate analyses Large-scale variatio ns in copy number are
of gene function 384 surprisingly frequent in human genomes 409
12.2BIOlNFORMATlCAPPROACHES TO 13.2 DNA VAMAGEAND REPAIR MECHANISMS 41 1
STUDYING GENE FUNCTION 364 DNA in cells tequires constant maintenance
Sequence homology searches can provide val uable to repair damage and correct errors 411
clues to gene function 385 The effects of DNA damage 413
Database searching is often performed with a DNA replication, transcription, recombination,
model of an evolutionarily conserved sequence 386 and repair use multi protein complexes that
Comparison with documented protein domains share components 415
and motifs can provide additional cl ues to Defects in DNA repair are the cause of many human
gene function 388 diseases 415
Complementation groups 416
12.3 SnlD~lNG GENE FUNCTION BY SELECTIVE
GENE INACTIVATION OR MODIFICATION 389 J3.3 PATHOGENIC DNA VAAlANTS ~l6
Clues to gene function can be inferred trom Deciding whether a DNA sequence change is
different types of genetic manipulation 389 pathogenic can be difficult 416
RNA interference is the primary method for evaluating Single nucleotide and other small-scale changes
gene function in cultured mammalian cells 390 are a common type of pathogenic change 417
Global RNAi screens provide a systems-level Missense mutatio ns 417
approach to studying gene function in cells 392 Nonsense mutations 418
Inactivation of genes in the getm line provides the Changes that affect splicing of the primary
most detailed informatio n on gene function 392 transcript 419
Frameshifts 420
12.4 PROTEOMICS, PROTEIN-PROTIlIN
Changes that affect the level of gene expression 421
INTERACTIONS, AND PROTJ::IN-VNA
Pathogenic synonymous (silent) changes 422
INTERACTIONS 393
Variations at shorr tandem repeats are occasionally
Proteomics is largely concerned with identifying
pathogeni c 422
and characterizing proteins a t the biochemical
Dynamic mu tations: a special class of
and functional levels 393
pathogenic microsatelJite variants 423
Large-scale protein-protein interaction studies
Variants that affect dosage of one or more genes
seek to define functional protein networks 395
may be pathogenic 425
Yeast two-hybrid screening relies on reconstituting
a functional transcription factor 396 13.4 MOLECUlAR PATHOLOGY: UNDERSTANDING
Affinity purification-mass spectrometry is widely THE EFFECf OF VARl.\l'<"TS 426
used to screen for protein partners of a test The biggest disthlction in molecular pathology is
protein 397 between loss-of-function and gain-of-function
Suggested protein-protein interactions are often changes 428
validated by co-immunoprecipitation or Allelic heterogeneity is a common feature of
pull-down assays 398 loss-of-function phenotypes 429
x:x Detailed Contents
Loss-of-function mutations produce dominant Lod scores of +3 and -2 are the criteria for linkage
phenotypes when there is haploinsufficiency 429 and exclusion (for a single test) 453
Dominant-negative effects occur when a mutated For whole genome searches a genomewide threshold
gene product interferes with the function of the of significance must be used 455
normal product 43 I 14.4 M UUIPOINT MAPPING 455
Gain-of-function mutations often affect the way in The CEPH families were used to construct marker
which a gene or its product reacts to regulatory framework maps 455
signals 432 Muitipoint mapping can locate a disease locus on a
Diseases caused by gain of function of G-prote in fram ework of markers 456
coupled hormone receptors 433
14.5 FlNE· MAPPING USING EXTENDED PEDIGREES
Allelic homogeneity is not always due to a gain of
function 433
ANDANCESTRALHAPLOTYPES 457
Autozygosity mapping can map recessive conditions
Loss-of-function and gain-of-function mutations in
effiCiently in extended inbred families 457
the same gene will cause different phenotypes 433
Identifying shared ancestral chromosome segments
13.5 THE QUEST FOR GENOTYPE-PH ENOTYPE allows high-resolution genetic mapping 459
CORRELATIONS 435
14.6 DIFFICULTIES WITH STANDARD LOD SCORE
The phenotypic effect of loss-of-function mutations
ANALYSIS 460
depends on the residual level of gene function 435
Errors in genotyping and misdiagnoses can generate
Genotype-phenotype correlations are especially
spurious recombinants 461
poor for conditions caused by mitochondrial
Computational difficulties limit the pedigrees that
mutations 436
can be analyzed 462
Variability within families is evidence of modifier
Locus heterogeneity is always a pitfall in human
genes or chance effects 437
gene mapping 462
CONCLUSION 436 Pedigree-based mapping has limited resolution 462
FURTHER RfAOING 438 Characters with non-Mendelian inheritance cannot
be mapped by the methods described in this
Chapter 14 chapter 463
Genetic Mapping of Mendelian CONCLUSION 46.~
Characters 441 FURTHER READING 464
14.1 TilE ROLE OF RECOMBINATION IN GENETIC
MAPPING 442
Chapter 15
Recombinants are identified by genotyping parents Mapping Genes Conferring Susceptibility to
and offspring for pairs of loci 442 Complex Diseases 467
The recombination fraction is a measure of the 15.1 FAMJLYSTUDIES OF COMPLEX DISEASES '168
genetic distance between two loci 442 The risk ratio (Al is a measure of familial clustering 468
Recombination fractions do not exceed 0.5. however Shared family environment is an alternative
great the distance between two loci 445 explanation for familial clustering 469
Mapp ing functions define the relationship betwee n Twin studies suffer from many limitations 469
recombination fraction and ge netic distance 445 Separated monozygotic twins 470
Chiasma counts give an estimate of the total map Adoption studies are the gold standard for
le ngth 446 disentangling genetic and environmental factors 471
Recombination events are distributed non-randomly J 5.:! SEGREG~nON ANALYSIS 471
along chromosomes. and so genetic map distances Complex segregation analysis estimates the most
may not correspond to physical distances 446 likely mix of genetic factors in pooled family data 471
14.2 MAPPING A DISEASE LOCUS 44~ J 5." LflIfKAGE ANALYSIS OP COMPLEX
Mapping human disease genes depends on genetic CHARACTERS 473
markers 448 Standard lod score analysis is usually inappropriate
For linkage analysis we need informative meioses 449 for non· Mendelian characters 473
Suitable markers need to be spread throughout the Near-Mendelian families 473
genome 449 Non-parametric linkage analysis does not require a
Linkage analysis norm ally uses either fluorescently genetic model 474
labeled microsatellites or SNPs as markers 450 Identity by descent versus identity by state 474
14.3 TWO-POIi\'T Jl.W'I'ING 451 Affected sib pair analysis 474
Scoring recombinants in human pedigrees is not Linkage analysis of complex diseases has several
always simple 451 weaknesses 475
Comp uterized lod sCOre analysis is the best way to Significance thresholds 475
analyze complex pedigrees for linkage between Striking lucky 476
Mendelian characters 452 An example: linkage analysis in schizophrenia 476
Detailed Contents xxi
15.4 ASSOCIATION STUDlP.s AND UNKAGE Mouse models have a special role in identifying
DISEQUlLillRJUM 477
human disease genes 503
disequilibrium 479
clues for research 504
segments 480
Rearrangements that appear balanced under the
weaknesses 485
identifica tion 508
STUDIES 491
unrelated affected patients 512
origins 491
The gene underlying a disease may not be an
A complete account of genetic susceptibili ty will Further studies a re often necessary to confirm that
require contributions from both the common the correct gene has been identified 515
Chapter 16
Functional analysis of SNPs in sequences with no
16.7 HOWWFLL HAS DISEASE GENE ATM: the initial dete ctor of damage 555
IDENTIFICATION WORK1ID? 53J Nibrin and the MRN complex 555
Most variants that cause Mendelian disease have CHEK2: a mediator kinase 555
been identified 531 The role ofBRCAll2 556
Genomewide association studies have been very p53 to the rescue 556
successful, but identifying th e true functional Defects in the repair machinery underlie a variery
variants remains difficult 532 of cancer-prone genetic disorders 556
Clinically useful findings have been achieved in a Microsatellite instabiliry was discovered through
few complex diseases 532 research on familial colon cancer 557
Alzheimer disease 532 17.6GENOMI:WIDEVlEWS OF CANCER !i59
Age-related macular degeneration (ARMD) 533 Cytogenetic and microarray analyses give
Ecze ma (alOpic dermatitis) 533 genomewide views of structural changes 559
The p rohlem of hidden heritabiliry 534 New sequencing technologies allow genomewide
CO:l:CLUSJO/I. 534 surveys of sequence changes 559
FURTHER READ£.l\IG 535 Further techniques provide a genomewide view of
epigenetic changes in tumors 560
Geno mevvide views of gene expression are used to
Chapter 17 generate expression signatures 560
Cancer Genetics 537 17.7 UNRAVELlNGTl-lE MULTI-SUG£
17.1 '1111'. EVOLUTION OF CANCER 539 EVOLUTION OF A TUMOR 5(11
I 7.2 ONCOGENES 540 The microevolution of colorectal cancer has been
Oncogenes function in growth signaling pathways 541 particularly well do cumented 561
Oncogene activation involves a gain of function 542 17.11 INTEGRATING TilE DATA: CANCER AS CELL
Activation by amp lification 542 BIOLOGY 564
Activation by point mutation 543 Tumorigenesis should be considered in terms of
Activation by a translocation that creates a pathways, not individual genes 564
novel chimeric gene 543 Malignant tumors must be capable of stimulating
Activa tion by translocation into a angiogenesis and metastasizing 565
transcriptionally active chromatin region 544 Systems biology may eventually allow a unified
Activation of oncogenes is only oncogenic under overview of tumor development 566
certain circumstances 546 CONCLUSION 566
17.3 TlJMOR SUPPRESSOR GENES 546 FURTHER RfADlNG 567
Retinoblastoma provided a paradigm for
understanding tumor suppressor genes 546
Some tumor suppressor genes show variations on Chapter 18
the fWo-hit paradigm 547 Genetic Testing of Individuals 569
Loss of heterozygosiry has been widely u sed as a
18, I WHAT TO TEST AND WHY 571
marker to locate tumor suppressor genes 548
Tumor suppressor genes are often silenced Many different rypes of sample can be used for
genetic testing 571
epigenetically by methylation 549
RNA or DNA? 572
17.-l eEl L CYCLE DYSREGULATION IN CANCER 550 Functional assays 572
Three key tumor suppressor genes control eve nts
18.2 SCANNING A GENE FOR MIJTATIONS 572
in G, phase 551
A gene is normally scanned for mutations by
pRb: a key regulator of progression through
sequencing 573
G, phase 551
A variery oftechniques have been used to scan a
p53: the guardian of the genome 551
gene rapidly for possible mutations 574
CDKN2A: one gene that encodes fwO key
Scanning methods based on detecting
regulatory proteins 552
mismatches or heteroduplexes 574
17.5INSTABIUTY OF THE GENOME 553 Scanning methods based on single-strand
\~ario us methods are used to survey cancer cells for conformation analysis 576
chromosomal changes 553 Scanning methods based on translation: the
Three main mechanisms are responsible for the protein truncation test 576
chromosome instability and abnormal Microarrays allow a gene to be scanned for almost
karyotype , 553 any mutation in a single operation 576
Telomeres are essential for chromosomal s tabiliry 554 DNA methylation patterns can be detected by a
D:\.-\ damage sends a signal to p53, whi ch initiates variety of methods 577
proreni,·e responses 555 Unclassified variants are a major problem 578
Detailed Contents xxiii
Even if no single test gives a strong prediction, Nuclear transfer has been used to produce
maybe a battery oEtests will 623 genetically modified domestic mammals 646
Much remains unknown about the clinical validity Exogenous promoters provide a convenient way
of susceptibility tests 624 of regulating transgene expression 646
Risk of type 2 diabetes: the Framingham and Tetracycline-regulated inducible transgene
Scandinavian studies 625 expression 647
Risk of breast cancer: the study ofPharoah Tamoxifen-regulated inducible transgene
and colleagues 625 expression 647
Risk of prostate cancer: the study of Zheng Transgene expression may be influenced by position
and colleagues 627 effects and locus structure 648
Evidence on the clinical utility of susceptibility
20.3 TARGETED GENOME MODIFICATION
testing is almost wholly lacking 627
AND G~'Iffi fNACI1VAl 'ION INVH'O 648
H),5 popm \TION SCREENING 628 The isolation of pluripotent ES cell lines was a
Screening tests are not diagnostic tests 629 landmark in mammalian genetics 648
Prenatal screening for Down syndrome defines an Gene targeting allows the production of animals
arbitrary threshold for diagnostic testing 629 carrying defined mutations in every cell 649
Acceptable screening programs must fit certain Different gene-targeting approaches create null
crileria 631 alleles or subtie point mutations 650
\ hat would screening achieve? 631 Gene knockouts 650
ensitivity and specificity 63 1 Gene knock-ins 650
ChOOSing subjects for screening 633 Creating point mutations 651
An ethical framework for screening 633 Microbial site-specific recombination systems allow
Some people worry that prenatal screening conditional gene inactiva Uon and chromosome
programs might devalue and stigmatize engineering in animals 651
affected people 633 Conditional gene inactivation 653
Some people worry that enabling people with Chromosome engineering 653
genetic diseases to lead normal lives spells Zinc finger nucleases offer an alternative way of
trouble for future generations 634 performing gene targeting 654
6 HIE ~E\'" PARADIGM: PREDICT AND Targeted gene knockdown at the RNA level involves
PREVENT? 634 cleaving the gene transcripts or inhibiting their
translation 656
·CLUSrON 636
In vivo gene knockdown by RNA interference 656
r ""THEn READING 637 Gene knockdown with morpholino antisense
oligonucleotides 656
Chap ler20
ZUA RANDOM MUTAGENESIS AND LARGE· SCALE
Genetic Manipulation of Animals for
ANlMAL MUTAGENESIS SCREEN 657
, Iodeling Disease and Investigating Gene Random mutagenesis screens often use chemicals
Function 639 that mutate bases by adding ethyl groups 658
, ER \~ EW 640 Insertional mutagenesis can be performed in ES
A lride range of species are used in animal modeling 640 cells by using expression· defective transgenes as
). • animal models are generated by some kind gene traps 658
of nificially designed genetic modification 640 Transposons cause random insertional gene
~ e types of phenotype analysis can be inactivation by jumping within a genome 658
penanned on animal models 642 Insertional mutagenesis with the Sleeping
Beauty transposon 659
\KL'lJG TRANSGENIC ANIr.w..5 643
piggyBac-mediated transposition 660
lkans mc animals have exogenous DNA inserted The International Mouse Knockout Consortium
me germ line 643
seeks to knock out all mouse genes 660
i':!=ICM~·r microinjection is an established method
,- g some transgenic animals 644 20.5 USING GENEnCALLY MOorHED ANIMALS
Tn;n:,".mes can also be inserted into the germ line TO MODEL DISEASE AND DISSECT GEl'lffi
,;a germ cells, gametes, or pluripotent cells derived FUNCTION 661
from the early embryo 645 Genetically modified animals have furthered our
Gen transfer into gametes and germ cell knowledge of gene function 661
precursors 645 Creating animal models of human disease 662
Gene transfer into pluripotent cells of the early Loss-of-function mutations are modeled by selectively
embryo or cultured pluripotent ste m cells 645 inactivating the orthologous mouse gene 662
Gene ran sfer into somatic cells (animal Null alleles 663
~ 645 Humanized alleles 663
Detailed Contents xxv
Gene knockouts to model loss of tumor Integration of therape utic genes into host
CONCLUSION 672
vectors 702
understood 678
fWD THERAPEUTIC GENE REPAIR 704
PROTEIN~,ANDVACCINES 6110
Antisense oligonucleotides can induce exon
transgenic animals 68 1
2J.6GENETHERAPY IN PRACllCF. 709
TffERAP, 667
but progress toward effective treatment is slow 714
Transdifferentiation 694
Chapter 1
KEY CONCEPTS
The great bulk of eukaryotic genetic information is stored in the DNA found in the nucleus.
A tiny amount is also stored in mitochondrial and chloroplast DNA.
DNA molecules are polymers of nucleotide repeat units that consist of one of four types of
nitrogenous base, plus a sugar, plus a phosphate.
The backbone of any DNA molecule is a sugar-phosphate polymer, but it is the sequence of
the bases attached to the sugars that determines the identity and genetic function of any
DNA sequence.
• DNA normally occurs as a double helix, comprising two strands that are held together by
hydrogen bonds between pairs of complementary nitrogenous bases.
1tansmission of genetic information from cell to cell is normally achieved by copying the
complementary DNA molecules that are then shared equally between two daughter cells.
Genes are discrete segments of DNA that are used as a template to synthesize a fun ctional
complementary RNA molecule.
Most genes make an RNA that will serve as a template for making a polypeptide.
Vario us genes make RNA molecules that do not encode p olypeptide. Such noncoding RNA
often helps regulate the expression of other genes.
Uke DNA, RNA molecules are polymers of nucleotide repeat units that consist of one of four
types of nitrogenous base (three of these are the same as in DNA). plus a slightly different
sugar, plus a phosphate.
Unlike DNA, RNA molecules are usually single-stranded.
To become functional, newly synthesized RNA must undergo a series of maturation steps
- sugar ---(i)- sugar -G-- sugar - 11 - sugar ~. sugar - G - 5'1/ ' ""- I 0'1/ ""- I
C H H C1' C H H Cl'
Nucleic acids and polypeptides are linear sequences of simple Figur. 1.1 Repeat units in nucleic acids.
(A) The linear backbone of nucleic acids
repeat units consists of alte rnating phosphate and sugar
residues. Attached to each sugar is a base.
Nucleic acids The basic repeat unit (pal e peach shading)
DNA and RNA have very similar structures, Both are large polymers with long consists of a base + sugar + phosphate = a
linear backbones of alternating residues of a phosphate and a five -carbon sugar. nucleotide. The suga r has fi ve carbon atoms
Attached to each sugar residue is a nitrogenous base (Figure 1.IA), The sugars in numbered l ~ to 5' . (6) In DNA, the sugar is
DNA and RNA differ, in either lacking or possessing, respectively, an -OH group deoxyribose. (C) In RNA, the sugar is ribose,
altheir 2' -carbon positions (Figure LIB, C), In deoxyribonucleic acid (DNA), the w h ic h differs from deoxyribose in having a
hydroxyl (OH) group attached to carbon 2'.
sugar is deOl,-yribose; in ribon ucleic acid (RNA), the sugar is ribose,
Unlike the sugar and phosphate residues, the bases of a nucleic acid molecule
valY, The sequence of bases identifies the nudeic acid and determines its func
tion, Four types of base are commonly foun d in DNA: adenine (A), cytosine (C),
guanine (G), and thymine (T), RNA also has four major types of base, Three of
them (adenine, cytosine, and guanine) also occur in DNA, but in RNA uracil (U)
repl aces thymine (Figure 1.2A),
NH, NH, 0
16 41 11 6 16 7
/N~
~
7 C 5 1/ C , 5 N7 C , ,5
oC ........... 5 _ N
C~ N9/
I N :?' C '\.
I I CH "IC II I I CH I 8
3
H
9
OH
thymine (T) uracil (U) I
CH2 0
II CH II "'l'c_b/ J
H, 'I 2' 1 H
/ c ,, 5/ 3 c 5
HN 4 C HN / '"' ........... CH
OH OH
31 ICH 31 IICH
c c lFigur.1 .2 Purines, pyrimidines, nucleosides, and
7 '1 " N /. 7'2 " N /· nucleotides. (A) Four nitrogenous bases (A.. C, G. and
o I H o 1H
T) occu r in DNA, and four nitrogenou s bases (A, C. G,
and U) are fou nd in RNA. A and G are purines; C, T, and
(e) N
IH, NH, U are pyrimidines. (6) A nucleoside is a base + sugar
6
c,," 7 J
C "
residue; in thi s case, it is adenosine. (C) A nucleotide
is a nucleoside + a phosphate group that is attached
to the 3' or 5' carbon of the sugar.The two examples
II ~"CH
/ N" N-?- " CH
-?- C shown here are adenosine 5'-mo nophosphate (AM P)
31 II
C~ N9/ 8
1N
I
HC""" / 4
C CH
7 ' 2 " N /.
and 2'-deoxycytidine 5'·trip hosphate (dCTP).The bold
lines at the bottom of the ribose and deoxyribose
) -"""';N ,
3
o 1 rings mean that the plane of the ring is at an angle of
o o 0 0 90 0 with respect to the plane of the chemical groups
that are linked to t he 1' 104' carbon atoms w ithin the
-0 - P - O
II
-o - ~L o - ~ I!.... o -ko ring. If the plane of the base is represented as lying on
I
0-
I
CH
I
0-
I
0-
I
0-
I
CH
the surface of the page, the 2' and 3' carbo ns of the
suga r could be viewed as projecting upward out of the
"1/0",, " 1/0",, page, and the oxygen atom as projecting downward
C H H C t" C H H Cl
4'I'c_c/
H3' I 2' 1 H
1 4'I'c_ c(1
H3' I 2' 1 H
below its su rfa ce. Ph osphate groups are num bered
sequentially (a.. p,1, etc.), according to their distance
OH OH OH H from the su ga r ring.
4 Chapter 1: Nucleic Acid Structure a nd Gene Expression
Purine
Ad enine OO.€- "'\Oi.. -~ adenosine monophosphate adenosine diphosp hate (ADP) adenosine triphosphate (ATP)
(AMP)'
2 _: g ~ ~ nosine guanosine mono phosphate guanosine d i phosphate (GOP) guanosine triphosphate (GTP)
(G MP)'
Pyrimidine
Cytosine cytid ine cytidine monophosphate (C MP)a cytidin e diphosphate (CDP) cytidine triphosphate (CTP)
Thymi ne thymidine thymidine monophosphate thymidine d iphosphate (TOP) thym idine triphosphate (TIP)
ITMP)'
Uracil urid ine uridine mono phosphate (UMP);! uridine diphosphate (U DP) uridine triphosphate (UTP)
~Nucleosjde monophosphates are alternatiyely na med as fo ll ows: AMP, adenylate; GMP, guanylate; CMP, cytidylate; TMP, thymidylate; UMP, urid ylate.
t>where the sugar is ri bose, the nucleotide is AMP; where the sugar is deoxyribose, the nucleotide is dAMP. This pattern applies through out the table.
No te that TMP. TOP, and TTP are not normally found in cells.
Bases consist of heterocyclic rings of carbon and nitrogen atoms, and can be
divided into two classes: purines (A and G). which have two interlocked rings.
and pyrimidines (e, T, and U) , which have a single ring. In nucleic acids. each
base is attached to carbon l ' (one prime) of the sugar; a sugar with an attached
base is called a nucleoside (Figure 1.2B). A nucleoside with a phosph ate group
attached at the 5' or 3' carbon of the sugar is Ole basic repeat unit of a DNA strand
and is called a nucleotide (Figure 1.2e and Table I. t) .
Polypeptides
Proteins are composed of one or more polypeptide molecules that may be modi
fied by the addition of carbohydrate side chains or other chemical groups. Like
DNA and RNA. polypeptide molecules are polymers that are a linear sequence of
repeatlng units. The basic repeat unit is called an amino acid. An amino acid has
a positively charged amino gro up (-NH 2) and a negatively charged carboxylic
acid (carboxyl) gro up (-eOOH). These are connected by a central a-carbon ato m
th at also bears an identifying side chai n that determines the chemical nature of
tbe amino acid. POlypeptides are formed by a condensation reaction between the
amino group of one amino acid and the carboxyl group of the next, to form a
repeating backbone, where the side chain (called an R-group) can differ from one
ami no acid to another (figure 1.3).
l he 20 different common amino acids can be categorized according to their
edJai ns:
bibic amino acids (Figure 1.4A) carry a side chain with a net positive charge
pb,·siologicai pH;
DNA, RNA, AND POLYPEPTIDES 5
CH, I N CH
I I
l I. .;, C
I I. o
;C,0- 0; ' 0
C
NH, H,N NH, HC=NH
basic acidic
ICi
CH,
I
; C,
CH, HC CH
I I II
CH, CH, He CH
I I ~/ I
CHI CH -CH3 C CH,
o
C
; ,
NH2 o
/ , NH] I I I I
OH OH OH SH
101
H 0
I I II
HN- C -C-OH
I
CH,
I H,C
/CH, I 1°
CH, H]C CH 2
CH CH
H,C
/, / , CH, I ,/ Figu'.1 ~4 R groups of the 20 common
H CH, CH, H,C CH, CH, amino acids, grouped according to
glycine alanine valine leucine isoleucine proline chemical class. Th ere are 11 polar amino
(Gly;G) {Ala; A) (Val; V) (leu; L) {lie; I) (Pro; P) acids, divided into three classes: (A) basic
amino acids (pos itively charged); (8) acidi c
amino acid s (negatively charged); and
CH, (el uncharged polar amino acid s bearing
CH, I CH, three different types of chemical group.
I ;C, ;CH, I Polar chemical gro ups are highlighted. (0) In
CH, HC CH HC C-C
I I II I II II addition, a fourth class is composed of nine
5 HC CH HC C nonpolar neutral amino acids. Amino acids
C
I ~/ ~ / ,/ w ithin each class are chemically very similar.
CH, CH CH NH
Side-chain carbon atoms are numbered
methionine phenylalanine tryplophan from [he central a-carbon ato m (see the
(Met;M) (Phe; F) (Trp; W) lysi ne side chain). In proline, [he A group's
side chain co nn ects to the ami no acid's -NH2
nonpolar
group as well as to its central a-carbon atom.
6 Cha pter 1: Nuclei< Add Structure and Gene Expression
Ionic IO nic interac tions occur between charged groups. They can be very strong in
crystals, but in an aqueous environment the charged groups are shield ed by both
water molecules and ions in solution and so are quite wea k. Nevertheless. they
can be very important in biological function, as in enzyme-su bstrate recog nition
Van derWaals Any two atoms in close proximity show a wea k attractive bonding interac tion
forces (van derWaals attract io n) as a resu lt of their fluctuating electrical charges. When
atoms become extremely close, they repel each other very strongly (va n de r
Waals repu lsion). Although individual van derWaals attractions are very wea k, the
cumu lative effect of many such attractive forces can be important when there is a
very good tit between the surfaces of two macromolecules
Hyd rophobic Water is a polar molecule. Hydropho bic molecu les or che mical groups in an
forces aqueous environment tend to cluster. This minimizes their disruptive effect s on
the complex network of hydrog en bonds between water molecules. Hydrophobic
groups are said to be held together by hydrophobic bond s, although the basis of
In general, polar amino acids are hydrophilic, and nonpolar amino acids are
hydrop hobic. Glycine, with its very small side chain, and cysteine (whose -SH
. group is not as polar as an - OH groupl occupy intermedia te positions on the
hydrophilic-hydrophobic scale.
As described below, the side chains can be modified by the addition of various
chemical groups or sugar chains.
BOX 1.1 THE IMPORTANCE OF HYDROGEN DON DING IN NUCLEIC ACIDS AND PROTEINS
Intermolecular hydrogen bonding in nucleic acids
codons bind to tRNA during translation . Many regulatory RNAs,
This is important in permitting the formation of the following double
such as microRNAs, control the expression of selected target
SHa nded nucl eic acids:
genes by base pairing to complementary sequences at the RNA
Double-stranded DNA The stability of the double helix is level.
mai ntained by hydrogen bonding between A- T and C-G base
Intramolecular hydrogen bond in g in nuclei c acids
p2l1rs. The individual hydrogen bonds are weak, but in eukaryotk
This is particularly prevalent in RNA molecu les. Intramolecular base
celts the two strands of a DNA helix are held together by between
pairing can form hairpins that may be crucially important to the
tens ofthousands and hundreds of millions of hydrogen bonds.
st ruct u re of some RN As such as rRN A and tRN A (see Figure 1.9), and as
o A-R/'.JA duplexes. Hydrogen bonds form naturally between DNA
targets for ge ne regu lation.
and RNA d uring tran scription, but the base pairi ng is transient
beouse the RNA migrates away from DNA as it matures. Intramolecular hyd rogen bonding in proteins
DoubJf:-scronded RNA. This occurs stably in the genomes of some Severa l characteristic elements of protein second ary stru cture, such
vJru:ses.1t also arises tra nsiently in cells during gene expression. as a-helices and ~ - pleated sheets, arise because of hydrogen bonding
ie r example, during RNA splicing, small nuclear RN A molecules between side cha ins of different amino acids on the same polypeptide
bind to complementary sequences in pre-mRNA, and mRNA chain.
NUCLEIC ACID STRUCTURE AND DNA REPLICATION 7
break them . No ncoval ent bonds, however, are constantly being made and broken
H )' I rl H
o H
at physiological temperatures (see Box 1.1 ). I
Figure 1.5 A 3', 5' -phosphodiester bond.
1.2 NUCLEIC ACID STRUCTURE AND DNA fhe phosphodiester bond (pale peach
shading) joins the 3' carbon atom of one
REPLICATION sugar to the 5' carbon atom of the next
sugar in the sugar- phosp hate ba<kbone of a
DNA and RNA structure nucleic acid.
,Aj (6 ) H
~
~2c -
sogac ;" 0 sugar
/\ N -H .. · .. &
~9;
sugar
" 3 N =C 2 1/
3
N = C2
0
'" , 1 ------ sugar
" •/ \ ...-
' . j \ C - N -----
~
N
N -c 8T j N- C
/ -H&-.....[rNj \
'~/,ICS-H
"; \\ N .. · .. H - \
\\C - c/,1 1
C
/ s """N/7 5 6, N _ 3...
3N\
c-{l "
I C -H
c"""
8 N
I 5- S
C C
~O-
1
C - Cs
H
H
8;-1;1
/ H'/"O \ 4 5
CH3
/
H 7 \ O .... · M
H- N/ "
1
\
H
A hydrogen
bond T G H C
Flgur.1 .6 AT and GC base pairs. (A) AT base pairs have two connecting hydrogen bonds (doned red lines); (8) GC base pairs have
three. Fractional positive ch arges and frac tiona l negative charges are shown by 8+ and 0- , re spectively.
8 Chapter 1: Nucleic Acid Structure and Gene Expression
Because the phosphodiester bonds link carbon ato ms number 3' and number 5'
3'
5' of successive sugar residues, the two ends of a linear DNA strand are different.
The 5' end has a terminal sugar residue in which carbon atom numbe r 5' is not
linked to another sugar residue. The 3' end has a terminal sugar resid ue whose 3'
carb on is not involved in phosphodiester bonding, The two strands of a DNA
duplex are described as being anti-parallel to each other because the 5'-.3' direc
tion of one DNA strand is the opposite to that of its partner, according to Watson
Crick base-pairing rules (Figure 1.8).
pitch
Genetic information is encoded by the linear sequence of bases in the DNA 3.6 nm
strands. The two strands of a DNA duplex have complementary seq uences, so the
sequence of bases of one DNA strand can therefore readily be inferred from that
major
of the other strand. It is usual to describe DNA by writing the sequence of bases groove
of one strand only, in the 5'-.3' direction, which is the direction of synthesis of
new DNA or RNA from a DNA 'template. When describing the sequence of a DNA ,,
region encompassing two neighboring bases (a dinucleotide) on one DNA strand, ,,
it is usual to insert a 'p' to denote a connecting phosphodiester bond. So, a CG ,,
,,
base pair means a C on one DNA strand is hydrogen-bonded to a G on the com ,,
plementary strand, but CpG represents a deoxycytidine covalently linked to a ,,
neighboring deoxyguanosine on the same DNA strand (see Figure 1.8). 5' ,,
Unlike DNA, RNA is normally single·stranded except for certain viruses that
,
1 0m ):
have double·stranded RNA genomes. However, to perform certain cell functions
two RNA molecules may need to associate transiently to for m base pairs, and
Figute 1r1 Features of the DNA double
intermolecular hydrogen bonding also permits the for mation of transient RNA hel ~ . The two DNA strands wind round eac h
DNA duplexes (see Box 1.1). oth er, producing a minor groove and a major
In addition, hydrogen bonding can occur between bases within a single groove In the double helix. The double helix
stranded RNA (or DNA) molecule to produce structurally and functionally impor' has a pitch of 3.6 nm and a radius of 1 nm
tant stretches of double·stranded sequence. Hairp in structures may be formed per cum.
with stems that are stabilized by hydrogen bonding between bases (Figure 1.9AJ.
lntrachain base pairing causes certain RNA molecules to have complex struc
tures (Figure 1.9B).
In double-stranded RNA, A pairs with U instead ofT. Although G usually pairs
with C, sometimes G-U base pairs are formed (see the example in Figure 1.9B),
Although not particularly stable, G-U base pair ing does not significantly distort
the RNA- RNA helix.
S' end 3' end
H OH
'( H 12' h' H
O= p -o- I/~ - ~, I <l'
I 1'( H H C
o I~ /1 "
I e ':::::" G a CH,
CH ........ I
" I/O~I 0
C H H ct' I
" 1"c - c'/1
'o-p~o
I
H 3' 1 "I H H a
a
I H I,'
H C-
I,'
c" H
o= p-o' 1/ , ,, 14'
I t'C H H C
o ........
I~ /
C 0
1"
CH,
I
CH 0 G .......
•.•••• ".
1
" I '/ ~ I 0
c"H H Ct ' I
4'1't_c/ I -o-P =0
H 3'1 2'1 H H ~ I=lgl.l,.1.6 Anti-parallel nature of the DNA
O= p-o-
'I H H " ,
I /T - ~,I .f
I,' H double helix. The two anti-parallel DNA
strands run in opposite directions in lin king
1 "C H H C 3' to S' carbon atoms in the sugar residues.
o I~ /1 , This double-stranded trinucleotide has the
1 T ,...... A a jH, sequence 5' pCpGpT-OH 3'YS' pApCpG- OH 3',
CH, 0 where p sta nds for a phosphate group an d
"1/ ~ I 0 -OH 3' represents the 3' terminal hydroxyl
C H HC t' 1
4'I't_t/ 1 -o-P=o group. Thi s is conventionally ab breviated to
.~
give the S' ~ 3' sequence of nucleotid es on only
"3'1 "I" one strand, either as S'-CGT-3' (blue strand) or
OH "
r~ ,~ as S'-ACG-3' (purple strand).
NUCLEIC ACID STRUCTU RE AND DNA REPLICATION 9
1
hydrogen bond A Hydrogen bonds between the sequences
1~
forma tion highlighted by dark pink shading within this
single-stranded nucleic acid (shown here
5, end
G'"C I acceptor arm as RNA) can stabilize folding back to form a
GAG C~·-G hairpin with a double-stranded stem.
A C A.. ·U
{B} Extensive intramolecular base pairing in
C. .C Urn- " A
U .. ,A transfer RNA. The tRNAGly shown here as an
Darm G' ''C example illustrates the classical cloverleaf
I G· ·· ( ~ c;l ~ ~ ~ U Um!A tRNA structure. There are three hairpins (the
uGA ~Y UGGU 1mi b ~~_G o arm, anticodon arm, and the T141C arm) plus
G : : 1 'm eG ! I
a stretch of base pairing between 5' and
G~ A G A A U U AG T'f'C
5' AGACCAC C. A AGAGCC 3'
... [ ~:C"'G
: :g oem
3' terminal sequences (called the acceptor
arm because the 3' end is used to attach an
amino acid). Note that tRNAs always have the
antlcodon arm J:::~
C- mSC same number of base pairs in the stems of the
V
anlicodon
different arms of thei r cloverleaf structure and
that the anticodon at the cen ter of the middle
loop identifies the tRNA according to the
amino acid it will bear. The minor nucleotides
depi((ed are: 0, S,6-dihydrouridine;
Replication is semi-conservative and semi-discontinuous 'V, pseudourid ine (S-ribosyluracil); m SC.
5-methylcytidine; m 1A. l-methyladenosine;
For new DNA synthesis (replication) to begin, the two DNA strands of a helix Um, 2' -O-methyluridine.
need to be unwound by the enzyme helicase. The two unwound DNA strands
then each serve as a template for DNA polymerase to make complementary DNA
strands, using the four deoxynudeoside triphosphates (dATp, dCTp, dGTP, and
dTTP). Two daughter DNA duplexes are formed, each identical to the parent mol
ecule (Pigure 1.101. Each daugh ter DNA duplex contains one strand from the
parent molecule and one newly synthesized DNA strand, so the replication proc
3'
ess is semi~conservative. 5'
DNA replication is initiated at specific points, called origins of replication,
generatingY-shaped replication forks, where the parental DNA duplex is opened
up. The an ti-parallel parental DNA strands serve as templates for the synthesis of parental
The overall direction of chain growth is 5'.....>3' for one daughter strand, the
leading strand, but 3'->5' for the other daughter strand, the lagging strand
(figure 1.11 ). The reactions catalyzed by DNA polymerase involve the addition of
a deoxynudeoside monophosphate (dNMP) residue to the free 3' hydroxyl group
of the growing DNA strand. However, only the leading strand always has a free 3'
hydroxyl group that allows continuous elongation in the same direction in which
the replication fork moves.
The direction of syn thesis of the lagging strand is opposite to that in which the 3'
replication fork moves. As a result, strand synthesis needs to be accomplished in
a progressive series of steps, making DNA segments that are typically 100-1000
nucleotides long (Okazaki fragments). Successively synthesized fragm ents are
eventually joined covalently by the enzyme DNA ligase to ensure the creation of
two complete daughter DNA duplexes. Only the leading strand is synthesized
continuously, so DNA synthesis is therefore semi-discontinuous.
5' • ~ 3' S"
DNA polymerases sometimes work in DNA repair and new ongmal original
- 3'
new
recombination ' - - -----l
daugh ter
I
daughter
duplex duplex
The machinery for DNA replication relies on a variety of proteins (80)[ 1.2) and
RNA primers, and has been highly conserved during evolution. However, the Figure 1.1 0 Semi-conservative DNA
complexity of the process is greater in mammalian cells, in terms of the numbers replication. The parental DNA dupl ex
of different DNA polymerases (TobIe 1.3), and of their constituent proteins and consists of two comp lementary,
anti-parallel DNA stra nds that unwind to
rubunits.
serve as templates for the synthesis of
Most DNA polymerases in mammalian cells use an individual DNA strand as
new complementary DNA strands. Each
a template for synthesizing a complementary DNA strand and so are DNA completed daughter DNA duplex contains
directed DNA polymerases. Unlike RNA polymerases, DNA polymerases nor one of the two parental DNA strands plus
m ally require the 3'-hydroxyl end of a base-paired primer strand as a substrate. one newly synthesized DNA strand. and is
TIlerefore, an RNA primer, synthesized by a primase, is needed to provide a free structurally identical to the original parental
3' OH group for the DNA polymerase to start synthesizing DNA. DNA duplex.
10 Chapter 1: Nucleic Acid Structure and Gene Expression
strand
Topoisomerases- start th e process of DNA unwind ing by DNA polym erases-for synthesizing new DNA strands. New
breaking a single DNA strand, releaSing the tension holding the cellular DNA synthesis normally depends on an existing DNA
helix in its coiled and supercoiled form. strand template that is read by a DNA--directed DNA polymerase.
Helicases-unwind the double helix at the replication fork, once This complex aggregate of protein subunits often also provides
supercoiting has been eliminated by a topoisome rase. DNA proofreading and DNA repair function s (see Table 1.3). This
Single-stranded binding proteins-maintain the stability o f means that any wrongly incorporated bases can be identified,
the replication fork. Single-stranded DNA is very vu lnerable removed, and repaired. DNA can also be synthesized from an
to enzymatic attack; the bound proteins protect it from being RNA template, u sing an RNA-directed DNA polymerase (a reverse
deg raded . transcriptase). The ends of linea r chromosomes are copied using a
Primases-enzymes that attach a small complementary RNA reversetranscriptase (telomerase).
sequence (a primer) to single-stranded DNA at the replication DNA lig ases- needed to seal nicks that remain in newly
fork. The RNA primer provides the 3' OH needed by DNA synthesized DNA after t he RNA primers have been removed and
polymerase t o begin synthesis (unlike RNA polymerases, DNA the small gaps filled by DNA polymerase. The DNA lig ases catalyze
polyrn erases can not initiate new strand synthesis from a bare the formation of a phosphodiester bond between unattached but
single-stranded template but require an initiating molecule w ith a adjacent 3' hydroxyl and 5' phosphate groups.
free 3' OH grou p o nto which deoxynucleoside triphosphates can
be attached to bu ild a complementary strand).
NUCLEIC ACID STRUCTURE AND DNA REPLICATION 11
Polymerase Family Standard DNA replication Additional or alternative roles in DNA repair, recombination, etc.
I) (delta) B main polymera se t hat synthesizes lagging multiple roles in DNA repai r
strand
l (iota) Y translesion synthesise; poss ible roles in base excision repair b and
mismatch repaire
A (lambda) X
double-strand break repair; VO) recombination9; base excision repairb
~ (mu) X
Interspersed rep eat reverse t ra nscriptases (LINE· 1 or occasionally converts mRNA and other RNA into eDNA, wh ich can integrate elsewhe re
endogenous retrovirus elements) into the genome
Telomerase reverse transcriptase (Tert) replicates DNA at the ends of linear chromosomes
.rrerminal deoxynucleotide transferase. bBa se excision repair identifies and rem oves inappropria te bases or inappropriately m odi fied bases.
'Translesion synthesis involves the repl ication of DNA past damaged DN A (lesions) on the template strand. dlnterstrand crosslin k repair is the repair
of high ly cytotoxic lesions w here covalent DNA bonds have been formed between the DNA strands. eMismatch repair is a form of DNA repair that
corrects mistakes arising when noncomplementary nucleotides form a base pair. f Nucleotide excision repair is used to fi x helix·distorting lesions.
950 matic hypermutation and VDJ recombination are mechanisms used in B cells to d iversify immunoglobulin sequences.
DNA polymerases can continue to synthesize new DNA strands opposite a lesion
in the template DNA (translesion synthesis) and they can contribu te to the
uence diversity of immunoglobulins (e.g. by introducing many base changes
in coding sequences) and so assist in the recognition of numerous foreign anti
gens by the immune system.
Single linear Multiple linear Single circular MUltiple circular Mixed (linear +
circular)
DNAGENOMES
Do ub le~stranded (ds)DNA some viruses; a segmented dsDNA mitochondria; multipartite viruses; a very few bacteria,
very few bacteria, viruses; eukaryotic chloroplasts; many some bacteria e.g. Agrobacterium
e.g. Borrelia nuclei bacteria and Archaea tumefaciens
RNAGENOMES
Single~st r ande d (ss)RNA some viruses segmented ssRNA a very few viruses
viruses
See Figure 1.12 for examples of viral genomes and for explanations of segmented and multipartite viruses.
Viruses have developed many different strategies to infect and subvert cells,
and their genomes show extraordinary diversity when compared with cellular
genomes [fable 1.4 and Figure 1.121. Because RNA replication has a much higher
error rate than DNA replication, viral RNA genomes have a higher mutational
load than DNA genomes. Although viral RNA genomes are generally quite small,
the elevated mutation rate permits more rapid adaptation to changing environ·
mental conditions. RNA viruses usually replicate in the cytoplasm; DNA viruses
generally replicate in the nucleus.
Retroviruses are unusual RNA viruses both because they replicate in the
nucleus and also because their RNA replicates via a DNA intermediate. The
single-stranded RNA genome is converted into a single-stranded cDNA using a
viral reverse transcriptase. The single-stranded viral cDNA is then converted into
double-stranded DNA, by using a DNA polymerase from the host cell. Other viral
proteins then help insert this double-stranded DNA into the host cell's chromo
somal DNA. It can remain there for long periods or be used to synthesize new
viral RNA genomes that are packaged as new virus particles.
o o
which is therefore a positive-wand genome
~® ~® (+), or be the opposite sense (antiSense)
parvoviruses but no DN A Intermediate of the genome, a negative-strand genome
e.g. hepatitis A H. Some single-stranded (+) RNA viru ses
(e.g. retroviruses) go through a DNA
e.g. M13, fd, fl e.g. hepatitis
~0 delta virus intermediate. and some double-stranded
e.g. rhabdoviruses DNA viruses (e.g. hepatitis B) go through
a replicative RNA form. (S) Segmented
~® and multipartite genomes. Segmenred
replicate via DNA intermediate genomes are ones in which the genome is
e.g. retroviruses divided into multiple d ifferent nucleic acid
molecules, each specifying a messenger RNA
double-stranded devoted to making a single po lypep tide.
o
linear circular linear circular For example, the genome of an influenza
virus is partitioned into eight different
IIII !I II ! ! i , iii i iii! Ii " l i i i i i p jj iiiiii
negative single-strand RNA molecules. In
herpes virus,). reoviruses some segmented genomes, each of the
different m olecules is packaged in a separate
, illil"!!", " i! ~ virus particle (capsid). Such genomes
papovaviruses
IB)
DNA RNA
~
gemini virus j influenza virus
(bipartite; two
circular double
stranded DNA
-.
- - (segmented negative
Slrand RN A)
mo lecules in
a double capsid)
Normally, only one of the two DNA strands in a duplex serves as a template for
RNA synthesis. During transcription, double-stranded DNA is bound by RNA
polymerase. The DNA is then unwound, enabling the DNA strand that will act as
il template for RNA synthesis (the template strand) to form a transient double
stranded RNA-DNA hybrid with the growing RNA strand.
The RNA transcript is complementary to the template strand of the DNA and
has the same 5'->3' direction and base sequence (except that U replaces T) as the
opposite, nontemplate DNA strand. The nontemplate strand is often called the
sense strand, and the template strand is often called the antisense strand (Figure
J.l3).
In documenting gene sequences, it is customary to show only the DNA Figure ... 13 Transcribed RNA is
sequence of the sense strand. The orientation of sequences relative to a gene no[ complementary to one strand of DNA.
mally refers to the sense strand. For example, the 5' end of a gene refers to The nucleotide sequence of transcribed
RNA is normally identical to that of the
sense strand, except that U replaces T, and
5' is complementary to that of the template
r r+
3· j ,- ,. 3·
5·
strand. The nucleotide at the extreme
S' end of a primary ANA transcript carries a
5' ppp -: ~ 1~ !! i i ! ! : : t : OH 3'
template (antisense) s t ra nd
S' triphosphate group that may later
undergo modification; the 3' end has a free
RNA hydroxyl group.
14 Chapter 1: Nucleic Acid Structure and Gene Expression
sequences at the 5' end of the sense strand, and upstream or downstream
sequences flank the gene at its 5' or 3' ends, respectively, with reference to the
sense strand. For transcription to proceed efficiently, various proteins (transcrip
tion factors) must bind to particular DNA sequence elements (collectively called
a promoter) that are often located close to and upstream of a gene. The bound
transcription factors serve to position and guide the RNA polymerase.
RNA polymerases synthesize RNA from four nucleotide precursors: ATp, CTp,
GTp, and UTP. Elongation involves the addition of the appropriate ribonucleoside
monophosphate residue (AMP, CMP, GMP, or UMP) to the free 3' h ydroxyl group
at the 3' end of the growing RNA strand. These nucleotides are derived by split
ting a pyrophosphate residue (PP;) from their approp riate ribonucleoside tri
phosphate (rNTPJ precursors. Only the initiator nucleotide at the extreme 5' end
of a primary transcript carries a 5' triphosphate group.
285 rRNAa, 185 rRNA"', 5.85 rRNN localized in the nucleolus; RNA polymerase I produces a single primary
transcript (455 rRNA) that is cleaved to give the three rRNA classes listed here
II mRNAh, miRNA', most snRN As d and snoRNAs e RNA polymerase II tra nscripts are unique in being subject to capping and
polyade nyl ation
III 5$ rRNN, tRNAf, U6 snRNA9, 7SL RNAh, various the promoter for some ge nes transcribed by RNA polymerase III (e.g. 5S rRNA,
other small noncoding RNAs tRNA, 7SL RNA) IS interna l to the gene; for others, it is located upstream of the
gene (see Figure 1.15)
3Riboso mal RNA. bMessenger RNA. cMicroRNA. dSmall nuclear RNAs. eSmall nucleolar RN As. frransfer RNA. 9U6 snRNA is a component or the
spliceosome, an RNA-protein complex that removes unwanted noncoding seque nces from newly formed RNA transcripts. h7SL RNA forms part of the
sign al recognition particle, which has an importan t role in the transport of newly synthesized proteins.
RNA TRANSCRIPTION AND GENE EXPRESSION 15
-
-30 - 20 -10 +1 eukaryotic genes encoding polypeptides.
Polypeptide-encoding genes are transcribed
by ANA polymera se 11. The promoters are
defined by short sequence elements located
(B)
in regions just upstream of the transcription
- 1000 -900 -800 - 700 -600 - 500 -400 -300 - 200 -100 +1 sta rt site (+ 1). (A) The (}globin gene
<" ,~ "}) < ») . > ~ )~J.
promoter includes a TATA box (orange), a
CAAT box (purple), and a GC box (blue).
(B) The glucocorticoid receptor gene is
unusual in possessing 13 upstream GC
For a gene to be transcribed byRNApolymerase II the DNA must first be bound
boxes: lOin the normal orientation, and
by general transcription factors, to form a preinitiation complex. General tran 3 in the reverse orientation (a/(emative
scription factors required by RNA polymerase 1I includeTFlLA, TFIIB, TFIID, TFIIE, orientations for GC box elements ate
TFIII; and TFlIH. These transcription factors may themselves comprise several indicated by chevron directions).
components. For example, TFIID consists of the TATA-box-binding protein (TBP;
also found in association with RNA polymerases I and III) plus various TBP
associated factors (TAF proteins). The complex that is required to initiate trans
cription by an RNA polymerase is known as the basal transcription apparatus
and consists of the polymerase plus aJJ of its associated general transcription
factors.
In addition to the general transcription factors required by RNA polymerase
II, s pecific recognition elements are recognized by tissue-restricted transcription
factors. For example, an enhancer is a cluster of cis-acting short sequence ele
ments that can enhance the transcriptional activity of a specific eukaryotic gene.
Unlike a promoter, which has a relatively constant position with regard to the
transcriptional initiation site, enhancers are located at variable (often consider
able) distances from their transcriptional start sites. Furthermore, their function
is independent of their orientation. Enhancers do, however, also bind gene regu
latory proteins. The DNA between the promoter and enhancer sites loops out,
which brings the two different DNA sequences together and allows the proteins
bound to the enhancer to interact with the u'anscription factors bound to the
promoter, or with the RNA polymerase.
A silencer has similar properties to an enhancer but it inhibits, rather than
stimulates, the transcriptional activity of a specific gene.
5S rRNA gene
1 .",0",;P.;00 of geo.
IB)
"
,
gu ----- - ag
E2 ,
91J -- ,g
,El Flgure1.16 The process of RNA .splicing.
(A) In this example, the gene contains three
exons and twO introns. (S) The primary RNA
~ cleave p,;ma<y RNA "'O''';PIa' ,
and discard intronic sequences 1 and 2
transcript is a conti nuous RNA copy of the
gene and contains sequences transcribed
(C) from exons (El, E2, and E3) and introns.
E> El
" (C) The primary transcript is cleaved at
lOS to > 10,000 nucleotides < 20 nucleotides Figure 1.17 Three consensus DNA
sequences in introns of complex
splice branch splice
exon
T eTA C
·cNr CGAf" · ····f¥f~¥¥t¥¥¥¥N¥~G ~
exon
II correspond to three functionally important
regions. Tw o ofthe region s, the splice donor
site and the splice acceptor site, span the 5'
and 3' boundaries o f the intron. Th e branch
site is an additional important region that
Although the conserved GT and AG dinucleotides are crucial for splicing, they typically occ urs less than 20 nucleotides
are not sufficient to mark the limits of an intron. The nucleotide sequences that upstream of the splice acceptor site. The
are immediately adjacent to them are also quite highly conserved, constiruting nudeotides shown in red in these three
splice junction consensus sequences (Hgure 1.17). A third conserved intronic consensus sequences are almost invarian t.
sequence th at is also important in splicing is known as the branch site and is typi The other nucleo tides detailed in both
the in tron and the exons are those most
cally located no more than 40 nucleotides upstream of the intron's 5' terminal AG
commonly fou nd at each position. In some
(see Figure 1.17). Other exonic and intronic sequences can promote splicing
in stances, two nudeotides may be equally
(splice enhancer sequences) or inhibit it (splice silencer sequences), and muta co mmon, as in the case of C and T near the
tions in these sequences can cause disease. 3' end of the intron . Where N app ears, any of
The essential steps in splicing are as follows: the four nucJeotides may occur.
• Nucleophilic attack ofthe intron's 5' terminal G nucleotide by the invariant A
of the branch site consensus sequ ence, to form a lariat-shaped strucrure.
Cleavage of the exonlintron junction at the splice donor si teo
Nucleophilic attack by the 3' end of the upstream exon of the splice accep tor
site, leading to cleavage and release of the intronic RNA in the form of a lariat,
aud the splicing together of the two exonic RNA segments (Figure 1.18).
El A E2
of Ul and U2 and has variants ofU4 and U6 snRNA.
Once a splice donor site is recognized by the spliceosome, it scans the RNA 5'--"Z '"'------ ~ O H
I
3'
sequence until it meets the next splice acceptor site (signaled as a targe t by the
upstream presence of the branch site consensus sequence).
nucleophilic attack by A
In branch ~ ite at the 5'
Specialized nucleotides are added to the ends of most RNA termina l G of innoni( RN/I
polymerase II transcripts
(B) spike
1n addition to RNA splicing, the ends of RNA polymerase II transcripts undergo acceptor
m odifications: the 5' end is capped by adding a variant guanine by using an U
site
E1 ,..., E2
unusual phosphodiester bond, and a long sequence of adenines is added to the G
3' end. As well as protecting the ends from cellular exonucleases, these rnodifica 5'_ " A AG !D!lii3I 3'
l
U1 snRNP binds to
Interaction between the splice dono r and splice acce ptor sites is stabilized splice donor site. and
by (el the binding of a multi-snRNP particle that contains the U4, US, and U6 U2 snRNP binds to
snRNAs. The US snRNP binds simultaneously to both the splice donor and splice brancn site
acceptor sites. Their cleavage releases the intronic sequence and allows (D) El (B) splice
and E2 t o be spli ced together. acceptor site
" E2
A AG 3'
5' capping
Shortly after the initiation of synthesis of primary RNA transcripts that will
become mRNA, a methylated nucleoside (7-methylguanosine, m7G) is linJ<ed by
binding of U4/U5/ U6
snRNP complex;
a 5'-5' phospho diester bond to the first 5' nucleotide. This is described as the U5 snRNP binds to
capping of the 5' end ofthe transcrip t (Plgure 1_20); the caps of snRNA gene tran donor and acceptor sites
scripts may undergo additional modification. The 5' cap may have several
functions: (C )
U4 U6
molecules are rapidly degraded);
US
to facilitate transport of mRNAs hom the nucleus to the cytoplasm;
i
cleavage of splice
donor and acceptor
3' polyadenylation Intronic sequence sites, and splicing
of E1 to E2
Transcrip tion by both RNA polymerase I and III stops after the enzyme recog
nizes a specific transcription termination site, However, the 3 ends of mRNA 1
(D)
m olecules are determined by a post-transcriptional cleavage reaction. The El E2
sequence MUAAA (someti mes AUUAAA) signals the 3' cleavage fo r most 5' 3'
polymerase II transcripts.
Cleavage occurs at a sp ecific site 15-30 nucleotides downstream of the
MUAAA sequence, although the primary transcript may continue for hundreds
Or even thousands of nucleotides past the cleavage point. After cleavage has OH OH
occurred, the enzyme poly(A) polymerase sequentially adds adenylate (AMP)
residues to the 3' end (about 200 in the case of mammalian mRNA). This polya
denylation reaction (FIgure 1.2 I) produces a poly(A) tail that is thought to:
• Help transport mRNA to the cytoplasm.
Stabilize at least some mRNA molecules in the cytoplasm.
Enhance recognition of mRNA by the ribosomal machinery. .!. 5'-5'
Histone genes are unique in producing mRNA th at does not become poly T P
triphosphate
linkage
adenylated; termination of their transcription nevertheless also involves 3' cleav
age of the primary transcript.
I
p
I
o
Figur. 1.20 The 5' cap of a eukaryotic mRNA. The nucleotide compo nents 5. 1
in pink represent the reSidue of the original 5' end of a eukaryottc pre-mRNA.
The primary pte-mRNA tran script begins with a nucleotide that contains a
4,VO~Ul'
purine (Pu) base and a 5' triphosphate group. However, as the pre-mRNA IM',I
undergoes processing, the end phosphate gro up at the 5' end is excised with o 0
II I
a phosp hatase to leave a 5' di phosphate group, and a specialized nucleotide is
cova lently joined to form a cap that will protect mRNA from exonuclease attack CH,
and assist in the initiation of tran slation. The cap nucleotide (with base shown
in red) is first formed when a GTP residue is Cleaved to generate a guanosine
o
mono phosphate that is then added through a 5'-5' triphosphate linkage (pale
I
peach shadi ng) to the diphosphate group of the original purine end nucleotide.
Subsequently nitrogen atom 7 of the new 5' terminal G is methylated. In mRNAs 4,V~~1'
syntheSized in ve rtebrate cells, the 2' carbon atom of the ribose of each of the I~,I
two adjacent nucleotides, the o rigina l purine end-nucleotid e and its neighbor, o 0
are also methylated, as illustrated in thiS example. m7G, 7-methylguan osine; , I
N, any nucleotide. CH,
RNA PROCESSING 19
,
tcans"'plion by RNA polyme'as. " comple)(es required for polyadeny lation:
CPSF (cleavage and polyadenylation
(8) specificity factor) and CStF (cleavage and
7
5' m Gppp AAUAAA 3' sti mulation factor) that cooperate to identify
a polyadenylation signal downstream of the
15-30
termination codon in the RNA transcript and
nucleOlides
13' po",denylation
(D) AMP residues are subsequently added by
poly(A) polymerase to form a poly(A) tail.
(D)
1
5' m Gppp AA UAAA AAAAAAA -- - · AAA -OH 3'
spacer spacer
r-"--. nn
-
lSS ~
5.85 285
\\
, 1
Figure 1. 2 2 The major rRNA species are
"anwipHo" of DNA synthesized by cleavage of a shared
(8) primary transcript. (A) In human cells, the
_~ 3 '
18S, 5.85, and 285 rRN As are encoded by a
I 455 rRN A
"
1 ""vag. of 455
single tran scription unit that is 13 kb long.
It occurs w ithin tandem repeat units of
1
5' I = 3' 415 rRNA
produces a 13 kb primary transcript (455
rRNA) that then undergoes a complex series
cleavage of 415
of post-transcri ptional cleavages.
, ,
int~m edia te
(C- E) Ultimately, individual 18S, 28S, and 5.85
(D) rRNA mOlecules are relea sed. The 185 rRNA
5' _ _
1 3' 20S rRNA will form part of the small ribosomal subunit.
325 rRN A
The 5.85 rRN A binds to a complementary
In addition to the sequence of cleavage reactions (see Figure 1.22), tile pri
mary rRNA transcript also undergoes a variety of base-specific modifications.
This extensive RNA processing is undertaken by many different small nucleolar
RNAs that are encoded by abo ut 200 different genes in the human genome.
Mature tRNAmolecules also undergo extensive base modifications, and about
10% of the bases in any tRNA are modified versions of A, C, G, or U. Common
examples of modified nucleosides include dihydrouridine, which has extra
hydrogens at carbons 5 and 6; pseudouridine, an isomer of uridine; inosine
(deami nated guanosine); and N,N'-dimethylguanosine.
(C)
1
lranslatloo
of exon 2. (D) During post-translational
modification the 147·amino acid precursor
polypeptide underg oes cl eavage to remove
N ·Ai •............... ..A,rg, IPJO.. .. Fh~.l c
ils N-terminal methionin e residue, to
generate the mature' 46-residue j}-glo bin
~ j POst:tran~lationa!
I
protein. The flanking Nand C symbolS to
~J, mOdlficatlon the left and rig ht. respect ively, in «() and
(0) (0) depict th e N-terminu-s IN) and
Nl Va!: .. · .. " :'&rig. . :· .... ·•· .... ..Arg Pro· .. His l e ( -terminus (C).
TRANSLATION, POST-TRANSLATIONAL PROCESSING, AND PROTEIN STRUCTURE 21
~~r ...
UAC C<C
9 ' ••
".
site. The sma ll ribosomal subunit (405 in
eukaryotes) binds mRNA, which is scanned
5' AUG GGG :r 5' AUG GGG Th\C along its 5' UTR in a 5' ~3' direction until
mRNA
/ small ribosomal unit
the Start codon is identified, an AUG located
w ithin a larger con sensus sequence (see
the text) . An initiator tRNAMet carrying a
methionine residue binds to the P site with
IC) 10)
its anticodon in regi ster with the AUG start
codon. (8) The appropriate ami noacyl tRN A
Met- Gly Met : -r G1.l Tyr
I is bound to the A site wit h its anticodon
I I
A A A A base-pairing w ith the next codon {GGG in
C C C C
this case, specifying glycine). (C} The rRN A
( C
~ C C
in the large subunit catalyzes peptide bond
~-4c forma tion, resulting in the methionine
y-~! ffi CE(
:::
::: UAC
AUG detaching from its tRN A and being bound
instea d to the glycine attached to the
5' AUG GaG OA-C 5' AU_G GGG
tRNA held at the A site. (D) The ribosome
trans locates along the mRNA so that the
ribosome moves
along by one codo n tRNA bearing the Met-Gly dipeptide is
bound by the P site. The next aminoa cyl
IE)
tRNA (here, carrying Tyr) binds to the A site in
P site A site P site A site
preparation for new peptide bond formation.
(H, CH, (El Peptide bon d formation . The N atom of
I I the amino group of the amino acid bound to
s 5 the tRN A in the A site makes a nu cleophilic
I (H,
I attack on the carboxyl C atom of the amino
CH,
acid held by the tRN A b ound to the P site.
I I
CHl .....------...._ H CH 2 0 H
I 0< " I I II I
H.l N -CH-C= O H2N - CH- C= O H2 N-CH - C - NH - 'CH-C =O
I I OH OH OH 0
I
OH 0 OH 0
,JOt, ad.oCt, I
o
I
0
--t JOt'~deCt, I
o
I
0
I I _ I
O=p-O-
I
O=p-O
O= p- ~ O=p-O
I I I I
o 0 o 0
I I
tRN~Mt\ tRNiAGI~
! !
tRN~IM' tRN'AGfy
Eac h tRNA has its own anticodon, a trinucleotide at the center of the anti
codon arm (see Figure 1.9B) that provides the necessary specificity to interpret
the genetic code. For an amino acid to be added to a growing p olypeptide, the
relevant codon of the mRNA molecule must be recognized by base pairing with a
complementary anticodon on the appropriate amino acyl !RNA molecule. This
hap pens on the ribosome. The small ribosomal subuni t binds the mRNA, and the
large subunit has two sites for binding aminoacyl tRNAs, namely a P (peptidyl)
site and an A (aminoacyl) site (Plgure J _24).
The cap at the 5' end of messenger RNA mol ecules is important in initiating
translati on. It is recognized by certain key proteins that bind the small ribosomal
subunit, and these initiation fa ctors hold the mRNA in place. In cap-dependent
translation initia tio n, the ribosom e scans the 5' UTR of the mRNA in the 5'--.3'
direction to find a suitable initiation codon, an AUG that is found within the
Kozak consensus sequence 5' -GCCPuCCAUGG-3' (where Pu represents purine).
The most important determinants are the G at position +4 (immediately follow
ing the AUG codon), and the purine (preferably A) at -3 (three nucleotides
upstrea m ofthe AUG codon) .
22 Chapter 1: Nucleic Add Structure and Gene Expression
~ l lYS CM
CAG I Gin
GM I G!u
GAG UM I STOP
UAG
amino acid specified by each, as read in the
5' ~3 ' direction from the mRNA sequence.
MCI Asn
MU
CAC
CAU
I His GAC
GAU
I Asp UAC
UAU
I Tyr The interpretations of the 64 codons in the
'universal' genetic code are shown in black
immediately to the right of the codons.
ACA Sixty-one codon sspecify an ami no acid.
ACG I CCA
CCG I GCA I Ala
GCG
UCA I
UCG Ser
Ac e Thr CCC Pro GCC UCC Three STOP codons (UAA, UAG, and UGAI do
ACU CCU GCU UCU not encode any amino acid. The genetic code
for mitochondrial mRNA (mtDNA) conforms
STOP I AGA
AGG
IA<g CGAI Arg
CGG GGA I Gly
GGG
UGA I STOP Trp
UGG ' Trp
to the universal code except for a few
variants. For example, in the mitochondrial
AGC I CGC GGC UGe genetic code in humans and many other
AGU Ser CGU GGU UGU Cys
species four codons are used differently:
Me< AUA . Ue
AUG I Met
AUC I lie
CUA I
CUG l eu
cue
GUA I
GUG
GUC Val
UUA I
UUG Leu
I
UUC Phe
UGA encodes tryptophan instead of being a
STOP codon, AUA encodes methionine, and
instead of encoding arginine, AGA and AGG
AUU CU U GUU UUU are STOP codons.
TRANSLATION, POST-TRANSLATIONAL PROCESSING, AND PROTEIN STRUCTURE 23
Although more than 60 codons can specify an amino acid, Ihe number of dif
TABLE 1.6 RULES FOR BASE
ferenl cytoplasmic tRNA molecules is quite a bit smaller, and only 22 types of
PAIRING CAN BE RELAXED
mitochondrial tRNA are made. The interpretation of more than 60 sense codons
(WOBBLE) AT POSITION 3
with a much smaller number of different tR NAs is possible because base pairing
OF A CODON
in RNA is more flexible than in DNA. Pairing of codon and anticodon follows the
normalA- U and G--C rules for the first two base positions in a co don. However, at Base at 5' end of Base recognized
the third position there is some flexibility (base wobble), and GV base pairs are tRN A anticodon at 3' end of mRNA
tolerated here (Table 1.6). codon
The gene tic code is the same throughout nearly all life forms. However, mito A U only
chondria and chloroplasts have a limited capacity for pro tein synthesis, and dur
ing evolution their genetic codes have diverged slightly from that used at.cyto C G only
plasmic ribosomes. Translation of nuclear-encoded mRNA continues until one
G (or!)" C orU
of three stop codons is encountered (UM, UAG, or UGA) but in mammalian
mitochondria there are four possibilities (UM, UAG, AGA, and AGG). U AorG
The meaning of a codon can also be dependent upon the sequence context;
tI,at is, the nature of the nucleotide sequence in which it is embedded. Depending "Inosine (I) is a deam inated form of
guanosine.
on the surro unding sequence, some codons in a few types of nuclear-enco ded
mRNA can be interpreted differently from normal. For example, in a wide variety
of cells the stop codon UGA is alternatively interpreted as encoding seleno
cysteine with some nuclear-encoded mRNAs, and VAG can sometinles be inter
preted to encode glutamine.
--~~~--~----------~-------4
Target amino acid,s)
Notes
Methylatio n (CH))
Lys achieved by methylases; reversed by demethylases
Carboxylation ((OO H)
Glu achieved by y-carboxylase
T"T"hls is especially common wh en Asn is in the sequ ence : Asn-X-{Serffhr), where X is any amino acid o ther than Pro. bHyd roxylysine. CAt ( -terminus
polypeptide. dAt N-terminus o f poly peptide. eroform S-pa lmitoyllink.
24 Chapter 1: Nucleic Acid Structure and Gene Expression
attached carbohydrate); if they are, they have a single sugar residue, N-acetyl
glucosamine, attached to a serine or threonine residue. However, proteins that
are secreted from cells or transported to Iysosomes, the Golgi apparatus, or the
plasma membrane are routinely glycosylated. In these cases, the sugars are
assembled as oligosaccharides before being attached to the protein.
Two major types of glycosylation occur. Carbohydrate N-glycosylation
in volves attaching a carbohydrate group to the nitrogen atom of an asparagine
side chain, and O-glycosylation entails adding a carbohydrate to the oxygen ato m
of an OH group carried by the side chains of certain amino acids (see Table 1. 7).
Proteoglycans are proteins with attached glycosaminoglycans (polysaccha
rides) that usually include repeating disaccharide units containing glucosamine
or galactosamine. The best-characte rized proteoglycans are components of the
e \uacellular matrix, a complex network of macromolecules secreted by, and sur
rO Wlding, cells in tissues or in culture systems.
Addition of lipid groups
Some proteins, notably membrane proteins, are modified by the addition offatty
acyl or prenyl groups. These added groups typically serve as membrane anchors,
hydrophobic amino acid sequences that secure a newly synthesized protein
within either a plasma membrane or the endoplasmic reticulum (Table 1.8).
Anchoring a protein to the outer layer of the plasma membrane involves the
attachment of a glycosylphosphatidylinositol (GPl) group. This glycolipid group
contains a fatty acyl group tha t Serves as the membrane an chor; it is linked suc
cessively to a glycerophosphate unit, an oligosaccharide unit, and finally
tluough a phospho ethanolamine unit-to the C-terminus of the protein. The
entire protein, except the GPI anchor, is located in the extracellula r space.
Post-translational cleavage
The primary translation product may also undergo internal cleavage to generate
a smaller mature product. Occasionally the initiating methionine is cleaved from
Primary the linear sequence of amino acids in a can vary en ormously in length from a few to
polypeptide thousands of amino acids
Secondary the path that a polypeptide backbone follows varies along the length of the polypeptide; com mon
within local regions of the prima ry structu re elements of secondary structure include the a·helix
and p-plea ted sheet
the overall three-dimensional structure of a ca n take various forms (e.g. globular, rod-like. tube.
polype ptide, arising from the combin ation of all coil, sheet)
Qu:;ternary the aggregate structure of a multimeric protein can be stabilized by disulfide bridges between
(comprising more than one subunit, which may subunits or ligand bi nding, and other factors
be of more than one type)
TRANSLATION, POST-TRANSLATIONAL PROCESSING, AND PROTEIN STRUCTURE 25
the primary translation product, as during the synthesis of Il-globin (see Figure
1.23C, 0). More substantial polypeptide cleavage is observed during the matura
tion of many proteins, including plasma proteins, polypeptide hormones,
neuropeptides, and growth factors. Cleavable signal sequences are often used to
mark proteins either for export or for transport to a specific intracellular location.
A single mRNA molecule can sometimes specify more than one functional
polypeptide chain as a result of post-translational cleavage of a large precllisor
polypeptide (Figure 1.26).
(AI
1
mRNA m Gppp AAAAM.-- shown in deep blue. It is confined to the 3'
L--J sequence of exon 2 and the 5' sequence of
5' UTR "anSiation ~'
UTR exon 3. (8) Exon 1 and the 5' part of exon 2
(C) speCify the 5' untransJated regi on (5' UTR),
I1 24' 25 . 110 i
preproinsulin N IMel Ala lPhe Pile _ ASill c and th e 3' end of exon 3 specifies the
3' UTR. The UTRs are transcribed and so
are present at the ends of the mRNA. (e) A
post-trans,ationa' l
1 24 cleavage
primar y translation produce. p rep roi nsulin,
leader sequence [Met Ala I
has 110 residues and is cleaved to give
(01
(D) a 24·residue N-terminal/eader sequence
1
Plie (that is required for the protein to cross the
c'ea"ge of ,coinsulin
1
1 35
(A) (8)
1 Flgu,.1.2.7 The structure of a standard
Phe- Met Leu a-helix and an amphipathk a-helix_
OArg 'Ala
(A) The structure of an ((-helix is stabilized
4 Ala . by hydrogen bonding between the oxygen
OArg Gly of the carbonyl group «(=0) of each peptide
H
Ser Leu 2 bond and the hydrogen on the peptide bond
amide group (NH) of the fourth amino acid
O Alg '\O~4/.:IT
OAre :-.. Ala away, making the helix have 3.6 amino acid
H O Arg Pro Leu residues per turn. The side chains of each
3 amino acid are located on the outside of the
1 C ~- N
: Q \
helix; there is almost no free space w ithin
the helix. Note that only the backbone of the
H
, "'
C-N-C /5 polypeptide is shown, and some bonds have
6C1 been om itted for clarity. (B) An amphipathic
N- 0"
:
8 amino acids and hydrophobic amino acids
\ H C M located on different surfaces. Here we show
- N " \.
C-N an end view of such a helix: five positively
o \ charged arginine residues are clustered on
H C9
one side of the helix, whereas the opposing
H \ I
, 10 /C - N-~
side has a series of hydrophobic amino
N- C 0
acids (mostly Ala, Leu, and Gly). The lines
I '"o
w ithin the circle indicate neighboring
residues-the initiator methionine (position
1) is connected to a leucine (2), w hich
is connected to an arginine (3), w hich Is
adjacent to an alanine (4), and so on.
in proteins that perform key cellular functions (such as transcription factors,
where they are usually represented in the DNA-binding domains). Identical
a-helices with a repeating arrangement of nonpolar side chains can coil round
each other to form a particularly stable coiled coil. Coiled coils occur in many
fibrous proteins, such as collagen of the extracellular matrix, the muscle protein
tropomyosin, a-keratin in hair, and fibrinogen in blood clots.
The \3-pleated sheet
B-Pleated sheets are also stabilized by hydrogen bonding but, in this case, they
occur between opposed peptide bonds in parallel or anti-parallel segments of
the same polypeptide chain (Figure 1.28). B-Pleated sheets occur-often together
with a-helices-at the core of most globular proteins.
(A) (B)
".
o
'"
..0 0
"' .
..p o
" Fl,ure 1..28 The structure of a Ii-pleated
sheet. Hydrogen bonding occurs here
between [he carbonyl ((=0) oxygens
and amide (NH) hydrogens on adjacent
seg ments of (A) parallel and (8) anti-parallel
p-pleated sheets. (Adapted from Lehninger
AL, Nelson DL & (ox MM (1993) Principles of
Biochemistry, 2nd ed. With permission from
WH Freeman and Company.)
FURTHER READING 27
,
SH S~ Flgur. 1.29 lntrachain and interchain
disulfide bridges in human insulin.
GIVEQCCTS
, I C SLY Q lEN YeN
I Disulfide bridges (- 5-5-) form by a
sH SH
condensation reaction between the
sulfhydryl (-SH) groups on the side chains
SH sH of cysteine residues.They can form between
J I
F V N Q H LeG S H L V E A L Y L V C G E RG F F Y T P K T cySteine side chains within the same
C T S I
1~_______ lnte(Charn -_______
disulfide bonds
1
Sr S.
F V N Q H LeG S H L v E A L Y L V C G ERG F F YTPK T
The ~-turn
Hydrogen bonding can occur betvveen amino acids that are even nearer to each
other within a polypeptide. When this arises between the peptide bond CO group
of one amino acid residue and the peptide bond NH group of an amino acid resi
due three places farth er along, this results in a hairpin ~-turn. Abrupt changes in
the direction of a polypeptide enable compact globular shapes to be achieved.
These ~-turns can connect parallel or anti-parallel strands in ~-pleated sheets.
Higher-order structures
Many more complex structural motifs, consisting of combinations of the above
structural modules, form protein domains. Such domains are often crucial to a
protein's overall shape and stability and usually represent functi onal units
involved in binding other molecules. Another important determinant of the
StTucture (and function) of a protein is disulfide bridges. They can form between
the sulfur atoms of sulfhydryl (- SH) groups on two amino acids that may reside
on a single polypeptide chain or on two polypeptide chains (Figure 1.29).
In general, the primary structure of a protein determines the set of secondary
structures that, together, generate the protein's tertiary structure. Secondary
structural motifs can be predicted from an analysis of the primary structure, but
the overall tertiary structure cannot easily be accurately predicted. Finally, some
proteins form complex aggregates of polypeptide subunits, giving an arrange
ment known as the quaternary structure.
FURTHER READING
;\gfis PF, Vendeix FA & Gfaham WD (2007) tRNA's wobble Preiss T & Hentze MW (2003) Starting the protein sy nthesis
decoding of the genome: 40 yea rs of modification. J. Mol. BioI. machine: eukaryoti c tra nslation initiation. BioEssays 25,
366,1-13. 1201-1211.
C.lvo 0 & Manley JL (2003) Strange bedfellows: polyadenylation San der OM. Big Picture Book of Viruses. http://www.virology.net!
factors at the promoter. Genes Dev. 17, 1321-1327. Bi g_Vi rology/ BVHomePage.html
Canadine CR, Drew HR, Luisi B &Travers AA (2004) Understanding Wea r MA &Cooper JA (2004) Capping protein: new insights into
DNA. The Molecule And How It Works, 3fd od. Academic mechanism and regulation. Trends Biochem. Sci. 29,418-428.
press. Whitford D (2005) Protei n Structure And Function. John Wiley.
Crain PF, Rozenski J & McCloskey JA. RNA Modification Database.
http:!~ i brary.med.u tah.ed u/RNAmods
cedorova 0 & Zingler N (2007) Group II introns: struct ure, folding
and splicing mechanisms. BioI. Chem. 388, 665-678.
Ga rcia-Diaz M & Bebenek K (2007) Multiple functions of DNA
polymerases. (rit. Rev. Plant Sci. 26, 105-1 22.
Chapter 2
KEY CONCEPTS
Chromosomes have two funda men tal roles: the faithful transmission and appropriate exp ression of
genetic infor mation,
Prokaryotic chromosomes contain circular doub le-stra nded DNA mo lecules tha t are relatively pro tein
free, h ut eukaryotic chromosomes consist of linear double-st randed DNA m olecu les com plexed
lhrough out their lengths with pro teins.
Chromatin is the DNA-protein matrix of euka ryo tic ch romosomes. Th e comp lexed p roteins serve
structu raJ roles. including comp acting the DNA in different ways, and also regula tory roles,
Ch romosomes undergo major changes in the cell cycle, notably at S phase when they rep licate and at
M phase wh en the replicated ch romoso mes becom e separated an d allocated to two daughter cells.
DNA re plication at S phase produces two double-stranded daughter DNA molecules th at are
held togeth er at a specialized region, the centromere. VVhen the da ughter DNA molecules remain held
togethe r Iik.e this they are known as sister chromatids, but once they sepa rate at M phase they become
individ ual chromosomes.
At the metaphase stage of M phas e the chromosomes are so highly co ndensed that gene expression is
un iformly shut down. But this is th e optimal tim e for viewing them under the microscope. Staining
wi th dyes that bi nd preferen tially to GC -rich or AT-rich regions ca n give re producible chro mosome
ba ndi ng patterns that allow different chromosom es (Q be differentiated.
During inte rphase, the long pe ri od of the cell cycle that separates successive M phases, chromosomes
have general ly very long extended co nformations an d are invisib le under op tical mic rosco py. The
extended struc ture means that genes ca n be exp ressed effiCiently.
Even during interp hase some chromoso mal regions always rema in highly condensed and
transcriptionally inactive (hetero chromatin). whereas others are ex te nded to allow gene expression
(euch rom atin).
Sperm and egg cells have one co py of each chro mosome (they are haploid). but most cells are diplOid,
having two sets of chro mosomes.
Fert ilization of a ha plo id egg by a haploid sperm generates the d iploid zygo te fro m which all other
body ceUs arise by cell division.
In mitosis a cell divides to give two daughter cells, each with th e same num ber and typ es of
Meiosis is a specialized form of cell divis ion that occurs in certain cells of the testes and ovaries to
produce haploi d sperm and egg cells. DUlin g meiosis new genetiC combinations are randomly created.
partly by exchanging sequences between maternal an d paternal ch romosomes.
Three ~es of func tional elem ent are nee ded fo r eukaryo tic chromosomes (Q transmit DNA faithfully
from mo ther cell to dau ghter cells: the ce ntromere (ensures correct chromosome segregation at cell
division); replication ori gins (initiate DNA replication ); a nd telo mere s (cap the ch romosomes to stop
the internal DNA from being degraded by nucleases).
An abnormal number of chromosomes can sometimes occur but this is often leth al if present in m ost
cell s or the body.
Structural chromosome ab normalities arisin g from breaks in chromoso mes can cause genes to be
delete d or incorrectly expressed.
Having the co rrect number and structure of chromosomes is not enough. They m ust also h ave the
correct parental o rigin because certain genes are preferentially expressed on either paternally or
matern ally inherited chromosomes.
30 Chapter 2: Chromosome Structure and Function
M PJ1o~ siste r chromatids separate to give two Figure 2.1 Changes in chromosomes and
cNomosomes that are distribUted into two daughter cells DNA content during the cell cycle. The cell
cycle show n at the right includes a very short
I rJ!
Uti
chromosomes =2n
:,f-J----> (1
chromosomes =4n
)+ J
chromosomes == 2n
M phase, w hen the chromosomes become
extremely highly condensed in preparation
for nuclear and cell d ivision . Afterwards,
cells enter a long period of growth ca lled
~~ ~
interphase, during which chromosomes
are enormously extended so tha t genes
can be expressed. Interpha se is divided
late S phase~ two DNA double ~
into three phases: G1 , S (when the DNA
helices Per chromosome G2 M "\
palred <~"~~ ~
DNA is duplicated in 5 phase. After the DNA
double helix has been duplicated. the two
double helices ~V. /
resulting double helices are held together
chromosomes"" 2"
DNA"" 4C
tightly along their lengths (by specialized
protein complexes called cOhesins) until
i DNA replication M phase. As the chromosomes condense
at M phase they are now seen to con sist of
two sister chromatids, each containing a
early S phase: one ONA double helix per chromosome
._t
A small subset of diploid body cells constitute the germ line that gives rise to
gametes (sperm cells or egg cells). In humans, where n ~ 23, each gamete con
tains one sex chromosome plus 22 non-sex chromosomes (aulosomes). In eggs,
the sex chro moso me is always an X; in sperm il may be either an XOr a Y. After a
haploid sperm fertilizes a haploid egg, the resulting diploid zygote and a1mosl all
ofils descendant cells have the chromosome constitution 46,XX (female) or 46,XY
(male) (Figure 2.2).
1 A
production
Cells outside the germ line are somatic cells. Human somatic cells are usually
diploid but, as will be described later, there are notable exceptions. Some types of
non-dividing cell lack a nucleus and any chromosomes, and so are nulliploido
Other cell types have multiple chromosome sets; they are naturally polyploid as
G i i
46,XY
As an embryo develops through fetus, infant, and child to adult, many cell cycles
are needed to generate the required number of cells. Because many cells have a
limited life span, there is also a continuous requirement to generate new cells,
Figure 2.2 The human life cycle, from a chromosomal viewpoint. Haploid
egg and sperm cells originate from diploid precursors in the ovary and testis in
women and men, respectively. All eggs have a 23,X chromosome constitution,
1 cell growth. division,
and development
1
representing 22 autosomes plus a single X sex chromosome. A sperm can carry
t
either sex chromosome, so that the chromosome constitution is 23,X (SO%) and
23,Y (50%). After fertilization and fu sion of the egg and sperm nuclei, the diploid
zygote w ill have a chromosome constitution of either 46,XX or 46,XY, depending
on which sex chromosome the fertilizing sperm carried. After many cell cycles,
this zygote gives rise to all cells of the adult body, almost all of whi ch will have
the same chromosome complement as the zygote from which they originated. 46,XX 46,XY
32 Chapter 2: Chromosome Structure and Function
nucleolus -8* u + u
~~
~3 phase
'1" 1 c::ce:::::=J
---- -~ ------~
cytokine sis
----
~1 ~
-~ ------- ~ ----
i
Figure 2.3 Mitosis and cytokinesis.
(A) Mitotic stages and cytokinesis. Late in
interphase, the duplicated chromosomes
are still dispersed in the nucleus, and the
nucleolus is distinct Early in prophase, the
first stage of mitosis, the centrioles (which
were previously duplicated in interphase)
begin to separate and migrate to opposite
poles of the cell, where they will form
the spindle poles. In prometaphase, the
nuclear envelope breaks down, and the
now highly condensed chromosomes
become attached at their centromeres to
the array of microtubules e)(tending from
the mitotic spindle. At metaphase, the
chromosomes all lie along the middle of
anaphase the mitotic spindle, At anaphase, the sister
metaphase chromatids separate and begin to migrate
toward opposite poles of the cell, as a result
even in an adult organism. All of these cell divisions occur by mitosis, which is of both shortening of the microtubules and
the normal process of cell division throughout the human life cycle. Mitosis further separation of the spindle poles. The
ensures that a single cell gives rise to two daughter cells that are each genetically nuclear envelope forms again around the
identical to the parent cell, barring any errors that might have occurred durin daughter nuclei during telophase and the
DNA replication. During a human lifetime, there may be something like 10 1 9 chromosomes decondense, completing
mitotic divisions. mitosis. Constriction of the cell begins.
The M phase of the cell cycle includes various stages of nuclear division During cytokinesis, filaments beneath the
plasma membrane constrict the cytoplasm,
(prophase, prometaphase, metaphase, anaphase, and telophase), and also cell
ultimately producing two daughter cells.
division (cytokinesis)' which overlaps the final stages of mitosis (Figure 2.3). In
(B) Metaphase-anaphase transition.
preparation for cell division, the previously highly extended duplicated chromo Metaphase chromosomes aligned along
somes contract and condense so that, by the metaphase stage of mitosis, they are the equatorial plane (dashed line) have
readily visible when viewed under the microscope. their sister chromatids held tightly
The chromosomes of early S phase have one DNA double helix, but after DNA together (by cohesin protein complexes)
replication two identical DNA double helices are produced (see Figure 2.1), and at the centromeres. The transition to
they are held together along their lengths by multisubunit protein complexes anaphase is marked by disruption of the
called cohesins. Recent data suggest that individual co he sin subunits are linked cohesin complexes, thereby releasing the
sister chromatids to form independent
together to form a large protein ring. Some models suggest that multiple cohesin
chromosomes with their own centromeres,
rings encircle the two double helices to entrap them along their lengths, or which are then pulled by the microtubules
cohesin rings form round the individual double helices and then interact to of the spindle in the direction of opposing
ensure that the two double helices are held tightly together. poles (arrows).
La1er, vvhen the chromosomes have undergone compaction in preparation
for cell di\~sion, the cohesins are removed from all parts of the chromosomes
aparr from the centromeres. As a result, by prometaphase when the chromo
somes can now be viewed under the light microscope, individual chromosomes
can be seen to comprise two sister chromatids that are attached at the centro
mere by the residual cohesin complexes that continue to bind the two DNA heli
ces at this position.
Later still, at the start of anaphase, the residual cohesin complexes holding the
sister chromatids together at the centromere are removed. The two sister chro
matids can now disengage to become independent chromosomes that will be
pulled to opposite poles of the cell and then distributed equally to the daughter
MITOSIS AND MEIOSIS 33
cells (see Figure 2.3). Inte raction between the mitotic spinelle and the centromere
is crucial to this process and we will consider this in detail in Section 2.3.
iJ'I,'ary
1~ !
testis
(A) Diploid primordial germ ce lls migrate
to the embryon ic gonad (the female
CJ.\t(. . . \,.""-7 0
.t'.t. .t.\t
'"'...-,8) I'; 1
r ) .,/ "'"
(8)
females). (6) Th ese undergo further mitotic
divisions, growth, and differentiation to
1
XJ produce diploid primary spermatocytes
.t'.t.
o ~ (!).t. 00()
.t.
and diploid primary oocytes, which ca n
enter meiosis. (C) Meiosis I. After ONA
duplications, the cells become tetraplOid
but then divid e to produce two diploid cells.
.t. J. In male gametogenesis. the cell division is
symmetrica l, generating identical diplOid
p" ma',
oocyte 0"'.'-'\
....
'. ~.:
;;:'
I
(C)
l ) .,'spermalocyle
~
m", second ary spermatocytes. ln female meiosis
I, by contrast, the division is asym metric;
[~ \,
I
./ \, the secondary oocyte is much large r than
~
I ~,. \
~
pola,
bodies 1
®J C.~. . 0J ®
~ without prior DNA synthesis to give haploid
cell products. In male gametogenesis, this
i;j I spermatld s
division is agai n symmetrical. producing two
IE)
.t. .t. .t. .t. haplOid spermatids from each secondary
~ ~ ~ ('1.,
mature spermatocyte. In female meiosis II , the egg
egg produced is much larger than the second
=~J
mature spermatozoa (also discarded) polar body. (E) Maturation of
spermatids produces fou r spermatozoa.
34 Chapter 2: Chromosom e Structure and Function
1 meiosis
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X sperm 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X sperm 5
The second division of meiosis is identical to mitosis, but the first division has
,
important differences. Its purpose is to generate genetic diversity between the
daughter cells. This is done by two mechanisms: independent assortment of
paternal and maternal homologs, and recombination. IA)
Independent assortment
Every diploid cell contains two chromosome sets and therefore has two copies
(homologs) of each chromosome (except in the special case of the Xand Y chro
mosomes in males). One homolog is paternally inherited and the other is mater
nally inherited. During meiosis I the maternal and paternal homologs of each
pair of replicated chromosomes undergo synapsis by pairing together to form a
bivalent. After DNA replication, the homologous chromosomes each comprise
two sister chromatids, so each bivalent is a four-stranded structure at the m eta
)0
! pairin g
phase plate. Spindle fibers then pull one complete chromosome (two chroma 19)
tids) to either pole. In humans, for each of the 23 homologous pairs, the choice of
which daughter cell each homolog enters is independent. This allows 223 or about
8.4 x lOB different possible combinations of parental chromosomes in the gam
etes that might arise from a single meiotic division (FIgure 2.5).
Recombination
The five stages of prophase of meiosis 1 (Flgw-e 2.6) begin during fetal life and, in !
»
crossing over
human females, can last for decades. During this extended process, the homologs
within each bivalent normally exchange segments of DNA at randomly posi (C)
tioned but matching locations. At the zygotene stage (Figure 2.6B), a proteina
I
ceous synaptonemal complex form s between closely apposed homologous
chromosomes. Completion of the synaptonemal complex marks the start of the
pachyrene stage (Figure 2.6C). during which recombination (crossover) occurs.
Crossover involves physical breakage of the DNA in one paternal and one mater
nal chromatid, and the subsequen t joining of maternal and paternal fragments.
The mechanism that allows alignment of the homologs is not known (see
Figure 2.6A, B), although such close apposition is required for recombination.
Located at intervals on the synaptonemal complex are very large multiprotein
! partial separation
Fi gure 2 .7 Metaphase I to production of gametes. (A) At metaphase I, the bivalents (A) :-- metaphase plate
align on the metaphase plate, at the cen ter of the spindle apparatus. Contraction of
spindle fibers draws the chromosomes in the direction of the spindle poles (a rrows).
tBl The transition to anapha se Iocc urs at the co nsequent rupture of t he chias mata.
(C) Cytokinesis segregates the two chrom osome sets, each to a di fferent primary
spermatocyte. Note that, as shown in this pa nel, after recombination during prophase
I the chromatids share a single centromere but are no longer identical. (0 ) Meiosis II
in eac h primary spermatocyte, w hich does not include DNA replication, generates
unique genetic combinations in the haploid seconda ry spermatocytes. Only 2 o f the
possible 23 different human chromosomes are depicted, for clarity, so only i2 (Le. 4) of
the possible 2 23 (8,388,608) possible combina tions are illustrated. Although oogenesis
can produce only one function al haploid ga mete per meiotic division (see Figure 2.4),
the processes by which genetic diversity arises are the same as in spe rmatoge nesis.
Location all tissues specialized germ line cells in testi s and ovary
DNA replication and cell division normally one round of only one round of replication per two cell divisions
rep lication per cell division
Duration of prophase short (.... 30 min in human cells) can take decades to complete
Recombination rare and abnormal during each meiosis; normally occurs at least once in each
chromosome arm after pairing of maternal and paternal homologs
Relationship between daughter cells genetically identical genetically different as a result of independent assortmenr of
homologs and recombination
supercoiling to form chromatin. Even in the interphase nucleus, when the DNA
is in a very highly extended form, the 2 nm thick DNA double helix undergoes at
least two levels of coiling that are directed by binding of basic histone proteins.
First, a 10 nrn thick fila ment is formed that is then coiled into a 30 n m thick chro
matin fiber. The chromatin fiber undergoes looping and is supported by a scaf
fold of nonhistone p ro teins (Figure 2,8A).
The nucleosome is the fundamental unit of DNA packaging: a stretch of
147 bp of double-stranded DNA is coiled in just less than two turns aro und a
central core of eight histone proteins (two molecules each of the core histones
H2A, H2B, H3, and H4, all highly conserved proteins) (Figure 2.SB). Adjacent
nucleosomes are connected by a short length (8-114 bp) oflinker DNA; the length
of the linker DNA varies both be tween species and between regions of the
genome. Electron micrographs of suitable preparations show nucleosomes to
have a string ofbeads appearance (Figure 2.Se). This first level of DNA packaging
is the only one that still allows transcrip tional activity.
The N- terminal tails of the core histones protrude from the nucleosomes (see
Figure 2.8B). Specific amino acids in the histone tails can undergo various types
of post-transla tional modification, notably acetylation, phosphorylation, and
methylation. As a resui t, different proteins can be bou nd to the chromatin in a
way that affec ts chromatin conde nsation and the local level of transcrip tional
activity. Additional histone genes encode variant for ms of the core histones that
(A)
20m
ONA G
uncoiling to enable
high-level gene
e)(ptesslon " - , .
fOfmat ion of
looped domain
10nm
histOl'le core
30 ()m
chromatin
fiber
I? "
(e) (D»
"'Iv
" -,/ ) ~N
N
Figure 2.8 From DNA double helix to interphase chromatin. (Al Binding of basic histon e proteins. The 2 nm thic k DNA double helix bind s basic his tones to
.!.ndergo (oiling, fo rming a 10 nm thick nuc!eosome filament that is further coiled into the 30 nm chromatin fib er. In interpha se (he chromatin fiber is organized
;m o looped domains with -50-200 kb of DNA attached to a scaffold of nonhistone acidic protein s. High levels of ge ne expression require loca l uncoiling of the
ch rom atin fiber to give the 10 nm nucleosom al fil aments. (B) A nucleosome consists of almost two turns of DNA wrapped round an octamer of core histones
:',1/0 each of H2, H3A, H3B, and H4). Note the extensive a.-helical structure of histones and their protruding N-terminal tails. (C) Electron micrograp h of 10 nm
"'tl deosomal filaments. (D) Cross section view of the 30 nm chromatin fi ber showing one turn of the solenoid [the outer red box corre sponds to that shown in
..All. The additional Hl histone, w hich b inds to linker DNA, is importa nt in organiZing the structure of (h e 30 nm chroma tin fiber. !(A) adapted from Grunstein M
1992) Sci. Am. 267, 68. With permi ssion fro m Sci entific American Inc., and Alberts B, Johnson A, Lewis J et 031. (20OS) Molecular Bio logy o f the Cell, 5th ed. Garland
5de ncerraylor & Francis lLC. (B) adapted from Alberts B, Johnso n A, lew is J et 031. (2008) Molecular Biology of th e Cell, 5th ed. Garland SCien celTaylor & Franci s
~lC from fi gures by Jakob Waterborg, University of Missouri-Kansas City. (C) courtesy of Jakob Waterborg, University of Missouri-Kansas Cit y. (D) adapted
'::om Klug A (1985) Proceedings, RWWeich Fede ration Conference, Chem. Res. 39, 133. With permi ssion from The Welch Foundation.)
38 Chapter 2: Chromosome Structure and Function
The mitotic spindle is formed from microtubules (polymers of a complex until the beg inning of M phase. At that point the
heterodimer of a -t ubulin and p-tubulin) and m icrotubule-associated centrosome complex splits in two and the two halves begin to
proteins. At each of the two spindle poles in a dividing cell is a separate. Each daughter centrosome deve lops its own array of
centrosome that seeds the outward growth of microtubule fibers microtubules and begins to migrate to one end of the cell. where it wi ll
and is the major microtubule-organizing center of the cell. Because form a spind le pole (Figure 1C).
their constituent tubullns are synthesized in a particular direction, the Three different forms of microtubule fiber occu r in the fully formed
microtubule fibers are polar, with a minus H end (the one next to the mitotic spindle:
centromere) and a plus (+) end (the distal growing end). Polar fibers. which develoPilt..pr,ophase. extend from the two
Each centrosome is composed of a fibrous matrix containing a poles of the spindle toward the equator.
pair of centrioles. which are short cylindrical structures. composed of Kinetochore fibers. which develop at prometaphase, connect the
microtubules and associated proteins; the two centrioles are arranged large m ulti protein structure at the centromere o f each chromatid
at right angles (Figure lA). During G 1, the two centrioles in a pair (the ki netochore; Figure 18) and the spindle poles.
separate. and d uring S phase, a daughter centriole begins to grow at Astral fibers form around each centrosome and extend to the
the base of each mother centriole until it is fully formed during G2• periphery of the cell.
The two centriole pairs remain dose together in a single centrosomal
(A)
cenl~osome +
V y-tubulin nng complex
-
"\ \.? .
\ '//
metaphase
plate
+ \
\,
1\
kinetochore
- - t--- microtubules
.~
",",rIOle'
l --\>
~ +_ +
kinetochore
microtuhules
kinetochore ..
anaphase, the kinetochore micro tubules pull the previously paired sister chro centromere
matids toward opposite poles of the spindle. Kinetochores at the centromeres
control assembly and disassembly of the attached microtubules, which drives i TCACATGAT 80-90 bp !TGAmCCGAA1
1~A",Gc: >c'9,",,0%J~.+T ) ! ACTAAAGGcn I
TG",r"Ac,"r",ALCC
chromosome movement. CDE 1 COE 11 CDE 111
In the budding yeast Saccharomyces cerevisiae, the sequences that specify
centromere function are very short, as are other functional chromosomal ele
ments (FIgUTe 2.(0). The centromere element (CEN) is about 120 bp long and telomere
contains three principal sequence elements, of which the central one, CDE ll, is tandem repeats based 00 the general formula
particularly important for attaching microtubules to the kinetochore. A centro (TG)1_3 TG 2-3
meric CEN fragment derived from one S. cerevisiae chromosome can replace the
centromere of another yeast chromosome with no apparent consequence.
autonomous replicating sequence
S. cerevisiae centromeres are highJy unusual because they are very small and
the DNA sequences specify the sites of centromere asse mbly. They do not have -50 bp
counterparts in the centromeres of the fission yeast Schizosaccharomyces pombe 1 'lTIl
or in those of multicellular animals. Centromere size has increased during '::: ." 7 u
imperfecl copies
eukaryote evolution and, in complex organisms, centromeric DNA is dominated core element of core element
by repeated sequences that evolve rapidly and are species-specific. The relatively
rapid evolution of centromeric DNA and associated proteins might contribute to Figure 2. 10 In S. cerevi5loe, chromosome
the reproductive isol ation of emerging species. function is uniquely dependent on
Although centromeric DNA shows remarkable sequen' \ heterogeneity short defined DNA sequence elements.
S. cerevisioe centromeres are very short
across eukaryotes, centromeres are universally marked by the presence of a
(often 100- 11 0 bp) and, unusually for
centromere-specific variant of h istone H3, generically known as CenH3 (the eukaryotic centromeres, are composed of
human form of CenH3 is named CENP-A). At centro meres, CenH3/CENP-A defined sequence elements. There are three
replaces the normal histone H3 and is essential for attachment to spindle micro contiguous centromere DNA elements
tubules. Depending on centromere organization. different numbers of spindle «DEs), of which CDE II and COE III are the
microtubules can be attached (FIgure 2.1 I). most important functiOnally. Telomeres
Mammalian centromeres are particularly complex. They often extend over are composed of tandem TG-rich repeats.
several megabases and contain some chromosome-specific as well as repetitive Autonomous replicati ng sequences are
DNA. A major component of human centromeric DNA ~satellite DNA, whose defined by short AT-rich sequences. The
structure is based on tandem repeats of a 171 bp monomer. Adjacent repeat units three types of short sequence can be
combined with foreign DNA to make an
can show minor variations in sequence, and occasional tandem amplification of
artificial ch romosome in yeast cells.
a sequ ence of several slightly different neighboring repeats results in a higher
order repeat organization. This type of a -satellite DNA is characteristic of cen
tromeres and is marked by 17 bp recognition sites for the centromere-binding
protein CENP-B.
Unlike the very small discrete centromeres of S. cerevisine, the much larger
centromeres of other eukaryotes are not dependent on sequence organization
alone. Neithe r specific DNA sequences (e.g. a-satellite DNA) nor the DNA
binding protein CENP-B are essential or sufficient to dictate the assembly of a
functi onal mammalian centromere. Poorly understood DNA characteristics
specify a chromatin conformation that somehow. by means of sequence
independent mechanisms, controls the formation and maintenance of a func
tional centromere.
~mlcrotUbUle
L.......J
125 bp
understood. ~
(8) S. pombe regional centromere
The (TTAGGG)" array of a human telomere often spans about 10-15 kb (see
~~•••[
Figure 2.12A). A very large protein complex (called shelterin, or the telosome)
contains several components that recognize and bind to telomeric DNA. Of these
components. two telomere repeat binding factors (TRFl and TRF2) bind to
double-stranded TTAGGG sequences.
As a result of natu«ifdifficulty in replicating the lagging DNA strand at the
e;\treme end of a telomere (discussed in the next section), the G- rich strand has a OTR IMR core IMR , OTR
Single-stranded overhang at its 3' end that is typically 150-200 nucleotides long cent/al domain
(see Figure 2.12AJ. This can fold back and form base pairs with the other, C-rich, 40-100 k.b
strand to form a telomeric loop known as theT-loop (Figure 2.12B, C).
Ie) H. sapIens regional centromere
Flgur. 2.11 Differences in eukaryotic centromere organization. In all
,i
It~~~
~ t]'9.I']·r·
eukaryotic centromeres a centromere-specific variant of histone H3 (generically - , 9
t II ~
called CenH3) is implicated in microtubule binding. However, the extent to which i .I ' :1 1 I :
the centromere extends across the DNA of a chromosome and the number of
microtubules involved can vary enormously. (Al The budding yeast S. cerevisiae
has the simplest form of centromere organization, a point centromere, with just
125 bp of DNA wrapped around a single nucleosome; each kinetochore makes
only one stable microtubule attachment during metaphase. (B) In the fi ssion
• • • CiGC
yeast 5. pombe, centromeres have multiple microtubule attachments clustered in lype II type I type II
Cl-saleilites a-satellites a.-satellites
a region occupying 35- 110 kb of DNA (the average chromosome has more than l J
4 Mb of DNA). The microtubule attachment sites are clustered in a non-repetitive 0.1--4 Mb
<ore sequence w hich is flanked by different t ypes of repeat sequence (IMR,
innermost; OTR. outermost). (C) Human centromeres have multiple microtubule
(0) C. elegans
attachment sites and are clustered over regions of up to 4 Mb of DNA. Higher-order diffuse centromere
~ ~~~~ ~
a-satellite DNA repeats are prominent at the centromeres. ln addition to binding
to nu(leosomes containing a CenH3 protein, CENP-A. they also have binding sites
for the CENP-B pro tein. (D) In some eukaryote species, such as the nematode
Coenorhabdiliselegons. the centromere is very diffuse: kinetochores that bind
spindle microtubules are distributed across the w hole length of the chromosome,
and the chromosomes are said to be holocentric. IAdapted from Vagnarelli P, Ribeiro •••••• c:o~ ••
SA & Earnshaw WC (2008) FE85 Lett. 582, 1950-1959. With permission from Elsevier.} telomere telomere
42 Chapter 2: Chromosome Structure and Function
Saccharomyces cerevisiaf TG l _3
Schizosaccharomycespombe
Paramecium TTGGGG
Trypanosoma TAGGGG
Chlamydomonas TTTTAGGG
ArabidopsiS TTTAGGG
Nematodes HAGGe
Vertebrates TTAGGG
"In the direction toward the end of the chromosome. Note: although telomere function is conserved
th roughout eukaryotes and telomeric repeat sequences are generally strongly conserved, the
telomeres of arthropods, such as Drosophila, are radically different in structure, being composed
of long DNA repeats that are unrelated to the TG-rich oligonucleotide repeats found in other
eukaryotes.
/
R-banding-this is essentially the reverse of the G-banding pattern. The
chro mosomes are heat·denatured in saline before being s tained with Giemsa.
referred to as the Paris nom enclature.
elsa describes the type of abnormality and the chromosome bands or sub -bands (M us musculus)
Q21.2
IA) 400, (6) 550, and Ie) B50 bands per
Q21.3 ./ haploid set, al lowing the visual subdivision
of bands into sub-bands and sub-sub
bandSas the resolution increases. (Adapted
from Cross & Wolstenholme (200 1). Human
Cytogenetics: Constitutional Analysis, 3rd ed,
(DE Rooney, ed.). With permission of Oxford
University Press.]
generate an oligonucleotide or nucleic acid probe. The longer the probe, the
more specific is its binding. Oligonucleotide probes are often 15-50 nucleotides
long and are chemically synthesized. Nucleic acid probes often range in size from
several hundred nucleotides to a few kilobases long and are often purified frag
ments of genomic DNA or DNA copies of RNA transcripts.
1
? 3
6
f 10 11
t
12
Flgur.. l .. 15 Prometaphase mitotic
chromosomes from lymphocytes of a
19 20 21 22 x
I
y
(DE Rooney, ed.). Wi[h permission of Oxford
UniverSity Press.)
STUDYING HUMAN CHROMOSOMES 47
1 ,3be,in9 and
denaturation
l
~e~ature DNA
rnsrtu 1
labeling and
denaturation
interphase chromosomes are shown in Figures
2. 17A and 2.17B, respectively. (B) Chromosome
painting uses a heterogeneous collection of
homogeneous - - - - - - - -..... 1( netl!rogenC!O\lS DNA. p robe labeled DNA probes that derive originally
ONA probr ~ (chrommome palmI from multiple regions of a defined type of
chromosome. The fragments in the probe
cocktail are simultaneously labeled w ith a
single type of fluorophore and w ill hybridize
allow to anneal,
expose to UV and to multiple regions of a single chromosome,
visualize fluorescence giving an aggregate fluorescent signal that
spa ns an entire chromosome. Figures 2. 18 and
2.9 show practical applications of painting
"
single probe bound J chromosome paint bound
metaphase and interphase chromosomes,
respectively.
48 Chapter 2: Ch romosome Structure and Function
..~
' -~
I~
~
JDo
FIgur. 2.19 Molecular karyotyping
with the SKY procedure. The analyzed
chromosom es are from a normal human
male. (A) Metaphase chromosome banding
~
oJ. "-
,vt.\ _ _, ... by the inverted OPI method, whic h is
similar to G-banding but accentuates
I .,. \. ; ~ the heterochromatic region of some
.... I
(B) (C)
same metaphase as in (A) after hybridizatio n
with chromosome paints. To obtai n suita ble
probes, indiVidual human chromosomes
were first fractionated according to size and
(A)
base composition by flow cytometry, and
different fluorophores, and simultaneously hybridized to normal metaphase sorted into pools of the same chromosome
chromosomes. The ratio of the two color signals is then compared along the type. DNA was purified and amplified from
length of each chromosome to identify chromosomal or subchromosomal differ the individual chromosome pools and then
ences between the two sources. Often a test source, such as a tumor cell's DNA, is flu orescently labeled for use as hybridization
referenced against a normal control source, to identify regions of the genome probes. Probes were labeled with at lea st
that have been amplified or lost in the test sample (Figure 2.20). one, and as many as fi ve, fluoro phore
As an alternative to standard CGH, variant methods have more recently been combinations and then pooled and
developed in which the two genomic DNAs to be compared are hybridized to hybridized to the metaphase chromosomes.
Outputs are RGB display colors. «) The
large collections of purified DNA fragments representing uniformly spaced
same metaphase spread, but here digital
regions ofthe genome. instead of to metaphase chromosomes. As a spin-off from imaging allows the fluorescence outputs
the Human Genome Project, comprehensive sets oflarge purified DNA fragments from the different chromosome paint signals
(DNA clones) have been sequenced and ordered into linear maps corresponding in (B) to be assigned artificial p se udocolors.
to each chromosome. In array CGH the target DNA is a collection of such clones. The pseudocolors are assigned according
typically at a resolution of one clone per megabase, that are deposited in defined to threshold wavelength values for
geometric arrays of microscopic spots on a solid support (a microarray). the detected fluorescence and help to
1It!"
addilion of di stinguish the different chromosomes.
Cot-l DNA Arrows indicate the sex chromoso mes. (From
isolatIOn and labeling of
wttole-genomic control DNA ' -
with mod3)nlne
(which fluo,esces <ed)
~
V
isolati on and labeling of
whole-genomic tumor DNA
with FITe (which fluoresces green)
Padilla-Nash HM, Barenboim-S tapleton L,
Difilippantonio MJ & Ried T (2007) Nature
Protocols 1, 3129-3142. With permission
from Macmillan Publishers Ltd.]
<e~d gainof
and genomic DNA from an individual who has
a normal karyotype (the control DNA) . Next,
0' ,.glon tlo the two genomes are differen tially labeled;
~ll
DNA 'h'ft'
II
~~ ; ~ ~~ ~~ ~~ §~ ~~
~~ ~~ ~~
~~ ~ ~ UU
§§ ~6 BB 88
NUMERICAL
STRUCTURAL
In sertion 46,XX,ins(21(p13q2 1q311 a rearrangement of one copy of chromosome 2 by insertion of segment 2q21-q3 1 intO a
breakpoint at 2p13
Marker 47,XX,+mar indicates a cell that contains a marker chromosome (an extra unidentified chromosome)
Reciprocal 46,XX,t(2;6i(q3S;p21.31 a balanced reciprocal translocation with breakpoints at 2q35 and 6p21.3
rrans location
Robertson ian 45,XY,der{14;21)(q 1O;ql 0) a balanced camer of a 14;21 Robertsonian translocation. q lOis not really a chromosome
translocation (gives band, but indicates the centromere; der is used when one chromosome from a translocation
rise to one derivative Is present
chromosome)
46,XX,der(14;21 Hq l0;ql 0).+2 1 an ind ividu al wi th Down syndrome possessing one normal chromosome 14, a Robertsonlan
translocation 14;21 chromosome, and two normal copies of chromosome 21
This is a short nomenclature; a more complica ted nomenclature is deli.ned by the ISCN that allows complete description of any chromosome
abnormality; see Shaffer LG, Tommerup N (eds) (2005) ISCN 2005: An International System for Human Cytogenetic Nomenclature. Karger.
;he others. it may fail to be incorporated into one of the two daughter nuclei.
Chromosomes that do not enter the nucleus of a daughter cell are eventually
.egraded.
Wxoploidy
IT
8
Dispermy is the principal ca use, accounting for 66% of cases. Triploidy is also
caused by diploid gametes that arise by occasional faults in meiosis; fertilization
n
of a diplOid ovum and fertilization by a diploid sperm account for 10% and 24%
of cases, respectively. (8) Tetraploidy involves normal fertilization and fusion of
gametes to give a normal zygote. Subsequently, however, tetraploidy arises by
fraternal twin zygotes, or immediate descendant cells therefrom, within the early
embryo. Abnormalities that would otherwise be lethal (such as triploidy) may not
be lethal in mixoploid individuals.
Aneuploidy mosalcs are common. For example, mosaicism resulting in a pro
portion of normal cells and a proportion of aneuploid (e.g. trisomic) cells can be
ascribed to nondisjunction or chromosome lag occurring in one of the mitotic
divisions of the early embryo (any monosomic cells that are formed usually die
out). Polyploidy mosaics (e.g. human diploid/triploid mosaics) are occasionally
found. As gain or loss of a haploid set of chromosomes by mitotic nondisjunction
is extremely unlikely, human diploid/triploid mosaics most probably arise by
fusion of the second polar body with one of the cleavage nuclei of a normal dip
loid zygote.
Clinical consequences
Having the wrong number of chromosomes has serious, usually lethal, conse
quences (Table 2.5). Even though the extra chromosome 21 in a person with tri
somy 21 (Down syndrome) is a perfectly normal chromosome, inherited from a
normal parent, its presence causes multiple abnormalities that are present from
birth (congenital). Embryos with trisomy 13 or trisomy 18 can also survive to
term, but both result in severe developmental malformations, respectively Patau
syndrome and Edwards syndrome. Other autosomal trisomies are not compati
ble with life. Autosomal monosomies have even more catastrophic consequences
than trisomies, and are invariably lethal at the earliest stages of embryonic life.
The developmental abnormalities associated with monosomies and trisomies
must be the consequence of an imbalance in the levels of gene products encoded
on different chromosomes. Normal development and function depend on innu
merable interactions between gene products that are often encoded on different
chromosomes. For at least some of these interactions, balancing the levels of
gene products encoded by genes on different chromosomes is critically impor
tant; each chromosome probably contains several genes for which the amount of
POLYPLOIDY
Triploidy (69,XXX or 69,XYV) 1- 3% of all conceptions; almost never born live and do not
survive long
ANEUPLOIDY (AUTOSOMES)
Trisomy (one extra usu ally lethal during embryonic or fetal a stages, but individuals
chromosome) with trisomy 13 (Patau syndrome) and trisomy 18 (Edward s
syndrome) may su rvive to term; those with tri somy 21
(Down syndrome) may survive beyond age 40
Additional sex chromosomes individuals with 47.XXX. 47,XXY, or 47,XYV all experience
relatively minor problems and a normal lifespa n
lacking a sex chromosome although 4S.Y is never viable, in 4S,X (Turner syndrome), about
99% of cases abort spontaneously; survivors are of normal
intell igence but are infertile and show minor physical diagnostic
characteristics
a)n humans, the embryonic period spans fertilization through to the end of the eighth week of
development. Fetal development then begins and la sts until birth.
CHROMOSOME ABNORMALITIES 53
~_e product made is tightly regulated. Altering the relative numbers of chromo
!iIlDles will affect these interactions, and reducing the gene copy number by 50%
uld be expected to have more severe consequences than providing three gene
copies in place of two.
Having extra sex chromosomes can have far fewer ill effects than having an
ema autosome. People with 47,XXX or 47,XYY karyotypes often functi on within
normal range, and 47,XXY men have relatively minor problems in compari
son with people witll any autosomal trisomy. Even monosomy (in 45,X women)
can have remarkably few major consequences-although 45,Y is always lethal.
Because normal people can have either one or two X chromosomes, and
either no or o ne Y, there must be special mechanisms that allow normal function
ith variable numbers of sex chromosomes. For the Y chromosome, this is
DeCauSe the very few genes it carries are very largely focused on determining
maleness. The mammalian X chromosome is also a special case because
X·chromosome inactivation in females controls the level of X-encoded gene
products independently of the number of X chromosomes in the cell.
It is not obvious why triploidy is lethal in humans and other animals. With
three copies of every autoso me, the dosage of autosomal genes is balanced and
>llO uld not cause problems. Triploids are always sterile because triplets of chro
mosomes cannot pair and segregate co rrectly in meiosis, but many triploid plants
are in all other respects healthy and vigorous. The lethality in animals may be due
;0 an imbalance between products encoded on the X chromosome and auto
'.Omes, for which X-chromosome inactivation would be unable to compensate.
variety of structural chromosomal abnormalities result from
"TIi srepair or recombination errors
Chromosome breaks occur either as a result of damage to DNA (by radiation or
rltemicals, for example) or through faults in the recombination process.
Chromosome breaks occurring during Gz phase affect only one of the two sister
chromatids and are called chromatid breaks. Breaks arising during G, phase, if
oot repaired before S phase, become chromosome breaks (affecting both chroma
rids). Cellular enzyme systems recognize and try to repair broken chromosomes.
:e pai rs involve either joining two broken ends together or adding a telomere to a
broken end.
During the cell cycle, special mechanisms normally prevent cells with ume
paired chromosome breaks from entering mitosis; however, as long as there are
!!O free bro ken ends, cells can move towa rd cell division as required. When cel
lar repair mechanisms encounter unwanted free chromosome ends produced
~' chromoso m e breakage, tluee outcomes are possible. First, the damage may be
",eparable, whereupon the cell is programmed to self-destruct. Alternatively, the
damage may be correctly repaired, or it may be incorrectly repaired and can
",slllt in chromsomes with structural abnormalities.
In addition to faulty DNA repair, errors in recombination also produce struc
m ra1 chromosome abnormalities. After homologs pair during meiosis, they are
",bjected to recombination mechanisms that ensure breakage and re-joining of
non -sister chromatids. If, however, recombination occurs between mispaired
'lomologs, the resulting products may have structural abnormalities.
In addition to meiotic recombination, other natural cellular processes involve
Ibe breakage and rejoining of DNA. Notably, a form ofrecombination also occurs
nln urally in Band T cells in which the cellular DNA undergoes programmed rear
raogements to make antibodies and T cell receptors. Abnormalities in these
:ecombination processes can also cause structural chromosomal abnormalities
;,bar may be associated with cancer.
Structural ch romosome abnormalities are often the result of incorrect joining
iDgether of two broken chromosome ends. If a single chromosome sustains two
aks, incorrect joining of frag ments can result in chromosome material being
r (deletion), Switched round in the reverse direction (inversion), or included in
('ircular chromosome (a ring chromosome) (Figure 2.22). Structurally abnor
IDa! chromosomes with a single centromere can be stabl y propagated through
";;"[ICcessive rounds of mitosis. However. any re paired chromosome that lacks a
Cl'Iltro mere (acentric) or possesses two centromeres (dicentric) will normally not
egate stably at mitosis and will eventually be lost.
54 Chapter 2: Chromosome Structure and Function
(A) 2 breaks in same arm (B) 2 breaks in different arms Flgu ... 2.l1 Stable outcomes after
incorrect repair of two breaks on a
single chromosome. (A) Incorrect repa ir
of two breaks (green arrows) occurring
in the same chromosome arm can
involve loss of the central fragment (here
containing hypothetical regions e and f)
and re-joining of the termin al fragments
(deletion), or inversion of the central
fragment through 180" and rejoining of
the ends to the terminal fragments (called
a paracentric inversion because it does
not involve the centromere). (9) When
IA\
two breaks occur on different arms of the
Inversion join broken sa me chromosome. the central frag ment
ends (encompassing hypothetica l regions b to f
in this example) may invert and re-join the
• , , terminal fragments (pericentric inversion ).
Alternatively, because the central fragment
b b
contains a centromere, the two ends can be
joined to form a stable ring chromosome,
• wh ile the acentric distal fragments are lost
d d d Like other repaired chromosomes that retain
ring a centromere, ring chromosomes can be
9 chromosome stably propagated to daug hter cells.
• b
9 9
~f/ ,a,e"ile -
satellite
acentric
fragments
exchanged
I --- \
centric and
acentric fragments
exchanged !
centric end acentric
fragm ents exchanged
( i -~)
/'
\
stable in unstable in stable in los t at
mitosIs mitos is mitosis m itosis
Figure 2.23: Reciprocal and Robertsonian translocations. (Al Reciprocal translocation. The
derivative chromosomes are stable in mitosis when one acentric fragment is exchanged for another;
when a centric fragm ent is exchanged for an acentric fragment, unstable acentric and dicentri,
chromosomes are produced. (B) Robertsonian translocatio n. This is a highly specialized reciprocal
translocation in which exchange of centric and ace ntric fragments produces a dicenuic chromosome
that is nevertheless stable in mitosis, plus an acentric chromosome that is lost in mitosis w ithout any
effect on the phenotype. It occurs exclusively aher breaks in the shott arms of the human acrocentric
chromosomes 13, 14, 15, 21, and 22, As illustrated on the right, the short arm of the acrocentric
chromosomes consists of three regions: a proximal heterochromatic region (co mposed of highly
repetitive non coding DNA), a distal heterochromatic region (called a chromosome satellite), and a thin
connecting region of euchromatin (the satellite stalk) composed of tandem rRNA genes. Breaks that
occur close to the centromere can result in a dicentric chromosome in which the two centro meres are
so close that they ca n function as a single centromere. The loss ofthe small acentric fragment has no
pheno typic consequences because the on ly genes lost are rRN A genes that are also present in large
copy number on the other acrocentric chromosomes.
I~
in a carrier of a balanced reciprocal
II
~
L
translocation. Other modes of seg regation
are also possible, for example 3:1
segregation. The relative frequency of each
possible gamete is not readily predicted.
, -_ _ _ outcomes of meiosis - - -.... The risk of a carrier having a child with
( )
each of the possible outcomes depends
on its frequency in the gametes and also
1 £
on the likelihood of a conceptus with that
abnormality developing to term. See the
~
book by Gardner and Sutherland (in Further
(II
Reading) for discussion.
gametes
fertilization
by a normal
gamete
® zygotes
I i II -
normal balanced
carrier
---
partial
trisomy
partial
monosomy
( outcomes of meiosis
)
! ), ! J. ! ),
(iJ) (/21
QD ~ OJ (II
'-../
\1
fertilization
bya normal
gamete
~
® ® ® r® r® \1 -...... ~
~ It )
(li n) (---=-- ( ~h) ( ~)
normal
/ balanced trisomy monosomy monosomy trisomy
ca rrier 14 14 21 21
Figurel.lS Possible outcomes of meiosis in a carrier of a Robertsonian translocation. Carriers are asymptomatic but often
produce unbalanced gametes that can result in a monosomic or trisomic zygote. The two monosomic zygotes and the trisomy 14
zygote in thiS example would not be expected to develop to term.
correct parental origin. Rare 46,XX conceptuses in which both complete haploid
genomes originate from the same parent (uniparental diploidy) never develop
correctly, and experimentally induced uniparental diploidy in other mammals
also results in abnormal development. Even for particular chromosomes, having
both homologs derived from the same parent (wtiparental disomy) can cause
abnormality. Parental origin of chromosomes is important because some genes
are expressed differently according to the parent of origin; such genes are subject
to imprinting.
Uniparental diploidy in humans arises most ofren by fertilization of a faulty
egg that lacks chromosomes by a diploid sperm (produced by chance chromo
some doubling) or by the activation of an unovulated oocyte (parthenogenesis).
resulting in each case in a 46,XX karyotype. The resulting conceptuses develop
abnormally.
Normal human (and mammalian) development involves the production of
both an embryo that develops into a fetus and also extra-embryonic membranes,
including the trophoblast, that act to support development and give rise to the
placenta. Uniparental diploidy changes the important balance between the
embryo or fetus and its supporting membranes. Paternal uniparental diploidy
produces hydatidiform moles, abnormal conceptuses that develop to show wide
spread hyperplasia (overgrowth) of the trophoblast but no fetal parts; they have a
significant risk of transformation into choriocarcin oma. Maternal uniparental
diploidy results in ovarian teratomas, rare benign tumors of the ovary that consist
of disorganized embryonic tissues but are lacking in vital extra-embryonic
membranes.
Uniparental disomy is thought to arise in most cases by trisomy rescue. In
such cases a trisomic conceptus that would otherwise probably die before birth
gives rise to a 46,XX or 46,XY embryo by chromosome loss. The chromosome loss
occurs through faulty mitotic division very early in development in a cell that has
58 Chapter 2: Chromosome Structure and Function
inverted terminal
duplicated
161
. ~
II ' I I!
~
!
,
normal
l)
! acentric
normal Inverted
inverted
dicentric
the capacity to give rise to all the different cell types in the organism (a totipotent
cell). Th e progeny of this cell are euploid and form the embryo, and all the aneu
ploid cells die. If each of the three copies of a trisomic chromosome has an equal
chance of being lost, there is a two in three chance that a single chromosome loss
will lead to the normal chromosome constitution and a one in three chance of
uniparental disomy (either paternal or maternal),
Uniparental disomy may often go undiagnosed because there are no obvious
phenotypic effects, and it is detected only when it results in characteristic syn
dromes. In such cases, the chromosome involved contains imprinted genes that
are expressed incorrectly. Uniparental disomy can involve heterodisomy (one
parent contributes non-identical homo logs, most probably by trisomy rescue).
Sometimes, however, the two chromosome copies inherited from one parent are
identical (isodisomy) . This can happen by monosomy rescue when a monosomic
embryo that would otherwise die achieves euploidy by selective duplication of
the monosomic chromosome.
CONCLUSION
In this chapter we have described fundamental aspects of chromosome srructure
and function, with an emphasis on human chromosomes and the generation of
chromosome abnormalities in humans. There are close similarities in the ways
that eukaryotic chromosomes are structured, notably in the way that DNA is
complexed with histones and organized into higher-order structures.
FURTHER READING 59
But there are also important differences. During eukaryote evolution, cen
tromeres have become progressively larger and more complex, andin more com
plex eukaryotes they are domi nated by rapidly evolving species-specific re
petitive DNA sequences. The mammalian sex chromosomes have unique
characteristics and we will consider them in greater detail in Chapter 10 when we
consider chromosome evolution. Specialized mammalian X-inactivation and
imprinting systems will also be explored in greater detail in the context of gene
expression in Chapter 11, as will the relationship between chromatin structure
and gene expression.
We will also provide many illustTative examples on how chromosome ab nor
mali ties cause inherited disease and cancer in Chapters 16 and 17 that also con
tain fuller descriptions of some of the technologies used to analyze chromo
somes, such as array CGH.
FURTHER READING
Chromosome structure and function Rooney DE (2001) (ed.) Human Cytogenetics: Constitutional
Analysis, 3rd ed. Oxford University Press.
Belmont AS (2006) Mitotic chromosome structu re and
Rooney DE (2001) (ed.) Hum an Cytogenetics: Malignancy and
condensation. Curro Opin. Cell Bior. 18, 632-638.
Acquired Abnormalities, 3rd ed. Oxford University Press.
Blackburn EH (2001) Sw itching and Signa ling at the telomere.
Schinzel A (2001) Catalogue of Unbalanced Chromosome
CeI/ 106,661-673.
Aberrations in Man. Walter de Gruyter.
Courbet 5, Gay 5, Arnoult N et al. (2008) Replication fork
movement sets chromatin loop size and origin choice in Shaffer lG & Bejani BA (2006) Medical application s of array CGH
mammalian cells. Nature 455, 557-$60. and the transformation of clinical cytogenetics. Cytogenet.
Haering CH, Farcas AM, Arumugam Pet al. (2008) The cohesin
Genome Res. 115,303-309.
ring concatenates sister DNA molecules. Narure 454,
Therman E & Susman M (1992) Human Chromosomes: Structure,
297-301. Behavior and Effects, 3rd ed. Springer.
Herrick J & Bensimon A (2008) Global regulation of genome Trask BJ (2002) Human cytogenetics: 46 chromosomes, 46 years
duplication in eukaryotes: an overview from the
an d counting, Nat. Rev. Genet. 3, 769-778.
epifluorescence microscope. Chromosoma 117,243-260.
Human cytogenetic nomenclature
louis EJ & Vershinin AV (2005) Chromosome ends: different
Shaffer lG & Tommerup N (eds) (2005) ISeN 2005: An
sequences may provide conserved functions. BioEssays
Internationa l System for Human Cytogenetic Nomenclature.
27,685-69 7.
Karger.
M eaburn KJ & Misteli T (2007) Chromosome territories.
Nature 445, 379- 381. Web-based cytogenetic resources
Riet hman H (2008) Human telomere structure and biology.
Useful compilations are available on certain sites such as
Annu. Rev. Genomics Hum. Genet. 9, 1-19.
www.kumc.e du/g ec/prof/cytogene.html
Vagnarelli P, Ribeiro SA & Earnshaw WC (2008) Centromeres: old
tales and new tools. FEBS Lerr. 582, 1950-1 959.
KEY CONCEPTS
The pattern of inheritance of a character follows Mendel's rules if its presence, absence, or
specific nature is normally determined by the genotype at a single locus.
Mendelian characters can be dominant, recessive, or co-dominant. A character is dominant
if it is evident in a heterozygous person, recessive if not.
Human Mendelian characters give pedigree patterns that are often recognizable, although
seldom as unambiguous as the results of laboratory breeding experiments because of the
limited size and non-ideal structure of most human families.
Genetically controlled characters can be dichotomous (present or absent) or continuo us
(quantitative). The genes controlling quantitative characters are called quantitative trait
loci.
Most characters depend on more than one genetic locus, and often also on environmental
factors. Such characters are called complex or multifactorial.
Polygenic theory explains why many quantitative characters show a normal (Gaussian)
distribution in a population.
Polygenic threshold theory provides a framework for understanding the genetics of
dichotomous mul tifactorial characters, but it is not a tool for predicting the characteristics
of individuals.
• Subject to certain conditions, gene frequencies and genotype frequ encies are related by the
Ha rdy-Weinberg formula.
• Consanguineous matings contribute disproportionately to the incidence of rare autosomal
recessive conditions; ignoring this when using the Hardy-Weinberg formul a leads to
incorrect gene frequencies.
• Gene frequencies in populations are the result of a combination of random processes
(genetic drift) and the effects of new mu tation and natural selection.
Heterozygote advantage often explains why an autosomal recessive condition is particularly
common in a population.
62 Chapter 3: Genes in Pedigrees and Populations
- -----
Genetics as a science started with Gregor Mendel's experiments in the 1860s. BOX 3.1 NOMENCLATURE OF
Mendel bred and crossed pea plants, co unted the numbers of plants in each gen HUMAN GENES AND ALLELES
eration that manifested certai n traits , and deduced mathematical rules to explai n
his results. Patterns of inheritance in humans (and in all other diploid organisms Hu ma n genes have up per-case
that reproduce sexually) follow the same principles that Mendel identified in his italicized names; exa mples are CFTR
peas, and are explained by the same basic concepts. and CDKN2A.
The process of di scovery often
Genes can be d efined in two ways :
involves seve ral competing research
A gene is a determi nant, or a co-determinant, of a character that is inherited
groups, so that the same gene may
in accordance with Mendel's rules.
initially be referred to by several
A gene is a functional unit of DNA.
different na mes. Eventua lly an officia l
name is aSSigned by the HUGO (H uman
In this chapter we will be exploring genes through the first of these defini Gen ome Organ isatio n) Nomen cl ature
tions. Later chapters will consider genes as DNA. A major aim of human molecu Co mmittee, and thi s isthe name
lar genetics is to understand how genes as functi onal DNA sequences determine that shou ld be used henceforth.
the observable characters of a person. Official names ca n be fo und on the
A locus (plural loci) is a unique chro mosomal location definin g the position nomenclature webs ite (http "i/
of an individual gene or DNA sequence. Thus, we can speak of the ABO blood www.genenames.org) or from entries
gro up locus, the Rhesus blood group locus, and so on. In genome browser prog rams such as
En sembl (described in Cha pter 8).
Alleles are alternative versions of a gene. For example, A, B, and 0 are alterna
The formal nomenclature for alleles
rive alleles at the ABO locus. Box 3.1 shows the formal rules for nomenclatu re of uses the gene na me followed by an
alleles. However, for pedigree in terpretation, alleles are usually simply denoted asterisk and then the allele name or
by upper- and lower-case ve rsions of the same letter, so that the genotypes at a number, for example, PGM·" HLA-A *3 1.
locus would be written AA, Aa, or aa. The upper-case letter is conve ntionally used When only the allele name is used, it
for the allele that determi nes the do minant charac ter (see below) . can be w ritten *', "31 . However, the
The genotype is a list of the alleles prese nt at one or a number ofloci. rules for nomenclatu re of alleles are less
Phenotypes, characters, or traits are the observable properties of an organ rigorously adhered to in publications
ism. The means of observation may range from simple inspection to so phis ti than those for genes.
cated laboratory investigations.
A person is homozygous at a locus if both alleles at that locus are the same,
and heterozygous if they are different. For different purposes we may check mo re
or less carefully whether the two alleles at a locus are really the same. For the
pedigree interp retatio n in this chapter, we will be concerned only with the phe
notypic conseque nces of a person's genotype. We will describe a person as
homozygous if both alleles at the locus in question have the same phenotyp ic
effect, even though inspection of the DNA sequence might reveal differences
between them.
A person is hemizygous if they have only a single allele at a locus. This may be
because the lo cus is on the X or Y chro mosome in a male, or it may be because
one copy of an autoso mal locus is deleted.
A character is dominant if it is manifested in a heterozygous person, recessive
if not.
This Is a short selective list of especially useful, reliable, and only patchy rewrites, so that the early part of an entry may not reflect
stable resou rces; many other useful databases may be found by current understa nding.
sea rching. The Genetic Association Database (http://geneticassociationd b
OMIM (http://www,ncbLn lm.nih.gov/omi m).TheOn line .nih.gov), maintained by the US Nationa l Institute on Aging, can be
Mendelian Inheritance in Man data base is the m OSt reliable single sea rched for a list of genes and pub1ications reporting possible genetic
source of information on human Mendelian characters and th e susceptibility factors for multifactorial diseases. At the time of writing
underlying genes. The index nu mbers quoted throughout this book it is at an ea rly stage of development, but it should become a valua ble
(e,g. OMIM 193500) give direct access to the releva nt entry. OMIM resource for accessing information that is otherwise dispersed over
co ntains about 20,000 entries, which may be seque nced genes, many individ ual publications.
characters or diseases associated with known sequenced genes, or Genecards (http://www.genecards.org). from the Weizmann
characters that are inherited in a Mendelian way but for which no gene Institute in Israel, contains about 50,000 au tomatically generated
has yet been identified. 50me entries describe characters that are not entries. mostly rela ting to specific human genes. It gives access to a
normally Mendelian. In those cases the OMIM en try will concentrate large amou nt of biological information about each gene.
on any Mendelian or near-Mendelian subset and may therefore GeneTests (http://www.genecl inics.org) is a databa se of human
not give a balanced picture of the overall etiology. Each entry is a gen etic diseases, maintai ned by the US National Institutes of Health
detailed historically ordered review of the genetics of the character, and ai med mainly at clinicians. It includes brief clinical and genetic
with subsidiary cli nical and other information, and a very useful list reviews of about 500 of the most common Mendelian diseases. There
of referen ces. Entries have accumulated te)(t over many years with is more clinical informatio n than in OMIM.
they are the characters of choice for follo,,~ng the transmission of a chromosomal
segment through a pedigree, as described in Chapter 13. Directly observed char
acteristics of proteins (such as electrophoretic mobility or enzyme activity) usu
ally display Mendelian pedigree patterns, but these characters can be influenced
by more than one locus because of post-translational modification of gene prod
ucts. Disease states or other typical traits that reflect the biochemical action of a
protein within a cellular context are less often entirely determined at a single
genetic locus. The failure or malfunction ofa developmental pathway that results
in a birth defect is likely to involve a complex balance of factors. Thus. the com·
mon birth defects (for example, cleft palate, spina bifida. or congenital heart dis·
ease) are rarely Mendelian. Behavioral traits such as IQ test performance or
schizophrenia are still less likely to be Mendelian-but th ey may still be geneti
cally determined to a greater or smaller extent.
Non-Mendelian characters may depend on two. three. or many genetic loci.
We use multifactorial here as a catch-all term covering all these possibilities.
More specifically. the genetic determination may involve a small number of loci
(oligogenic) or many loci each of individually small effect (polygenic); or there
may be a single major locus with a polygenic background-that is. the genotype
at one locus has a major effect on the phenotype, butthis effect is modified by the
cumulative minor effects of genes at many other loci. In reality. characters form a
continuous spectrum, from perfectly Mendelian through to truly polygenic
(FIgure 3.1 1. Superimposed on this there may be a greater or sm aller effect of
enviro nmental factors.
For dichotomous characters (charac ters that yo u either have or do not have,
such as extra fingers) the underlying loci are envisaged as susceptibility genes,
whereas for quantitative or continuous characters (such as height or weight)
they are seen as quantitative trait loci (QTLs). Any of these characters may tend
to run in families, but the pedigree patterns are not Mendelian and do not fit any
standard pattern.
FJgur. 3.1 Spectrum of genetic
In a further layer of complication, a common hu man condition such as eliabe
determination. ABO blood group depends
tes is likely to be very heterogeneous in its causation. Some cases may have a (with rare exceptions) on the gen otype at
simple Mendelian cause. some might be entirely the result of environmental fac just one locus. the ABO locus at ch romosome
tors, while the majority of cases may be multifactorial. Such conditions are called 9q34. Rhesus hemolytic disease of the
complex. newborn depends on the genotypes of
mother and baby at the RHO locu s at
ABO Rhesus chromoso me 1p36, but also on mother and
blood Ilemolytic Hirschsprung adu lt
group disease disease stature baby's being ABO compatible. Hirschsprung
disease depends on the interac tion of several
ge netic loci. Adult stature is determ ined by
the cumulative small effects of many loci.
Environ mental factors are also important in
Mendelian polygenic the etiology of Rhesu s hemolytic disease,
(monogenic) Hirschsprung disease, and ad ult height.
64 Chapter 3: Genes in Pedigrees and Populations
o sex unknown!
not stated
~ brothers
twin brothers
••
88
affected
carrier (optional)
.. ~
miscarriage
.. ~ 6 offspring,
80 dead sex unknown!
not stated
III
1 2
1 2
1 2 3 4 5 6 ,
8 9 10 11 12
IV
1 2 3 4
1 2
1 2 3 4 5 6 1 8
"' 1 2 3 4 5 6 7 8
"'
,v ,v
1 2 3 4 5 6
Figure 3.5 Pedigree pattern of an X-linked recessive Figure 3.6 Pedigree pattern of an X-linked dominant condition . The risk for the
condition. The females marked with dots are defin itely individual marked with a query is negligibly low if male. but 100% if female.
carriers; individuals 1113 and/ or IV4 may also be carriers,
but we do not know. The risk for the individual marked
with a Query is 1 in 2 males or 1 in 4 of all offspring.
"'
,v Figure 3.7 Hypothetical pedigree pattern
1 2 3 4 of a V-linked condition.
66 Chapter 3: Genes in Pedigrees and Populations
Autosomal dominant inheritance (Figure 3.3): X-linked dominant inheritance (Figure 3.6):
An affected person usually has at least one affected parent (for It affects either sex, but more females than males.
exceptions see Figures 3.14 and 3.21). Usually at least one parent is affected.
It affec ts either sex. Females are often more mildly and more varia bly affected than
h is transmitted by either sex. males.
A child with one affected an d one unaffected pa re nt has a 50% The child of an affected female, regardless of its sex, has a 50%
chance of being affected (th is assumes that the affected person is chance of being affected.
heterozygous, w hich is usua ll y true for rare conditio ns). For an affected male, all his daughters but none of his sons are
Autosomal recessive inheritance (Figure 3.4): affected.
Affected people are usually born to unaffected parents. V-linked inheritance (Figure 3.7):
Parents of affected peop le are usually asymptomatic carriers. It affects only males.
There is an increased incidence of parental consanguinity. Affected males always have an affected father (unless there is a
It affects either sex. new mutation) .
After the birth of an affected child, each subsequent child has a All sons of an affected man are affected.
25% chance of being affected (assuming that both parents are Mitochondrial inheritance (Figure 3.10):
phenotypically norma! ca rriers). It affect s both sexes.
X-linked recessive inheritance (Figure 3.5): It is usually inherited from an affected mother (but is often caused
It affects mainly males. by de novo mutations with the mother unaffected).
Affec ted males are usually b orn to unaffected parents; the mother It is not tran sm itted by a fa ther t o his children.
is normally an asymptomatic carrier and may have affected male There are highly variable clinical manifestations.
relatives.
Females may be affected if the father is affected and the mo the r is
a carrier, or occasionally as a resul t of non-random X-inactivation.
There is no male-to-male transmission in the pedigree (but
matings of an affected mal e and carrier fema le can give th e
appearance of male-ta-male transmission; see Figure 3.20).
they use X·inactivation (sometimes called lyonization after its discoverer, Dr ;.IA
,;,I_ _ _ _ _ _- - - - - - - ,
Mary Lyon).
Early in emblyogenesis each cell somehow counts its number of X chromo·
somes, and then permanently inactivates all X chro mosomes except one in each
somatic cell. Inactivated X chromosomes are still physically present. and on a
standard karyotype they look entirely no rmal, but the inactive X remains Con·
densed throughout the cell cycle. Most genes on the chromosome are perma·
nently silenced in somatic cells. The mecha nism is discu ssed in Chapter J 1. In
interphase cells the inactive X may be seen under the microscope as a Barr body
or sex chromatin (Plgl.lre 3,8). Regardless of the karyotyp e, each somatic cell
retains a single active X~
• XY males keep their single X active (no Barr body) .
XX fe males inactivate one X in each cell (o ne Barr b ody).
Females with Turner syndrome (45,X) do not inactivate thei r X (no Barr
body).
Males with Klinefelter syndrome (47,XXY) inactivate one X (one Barr body). IBI
• 47,XXX females inactivate two X chromosomes (two Barr bodies).
Flgur. 3..8 A Barr body. (A) A cell from an XX female has one inactivated
X chromosome and shows a single Barr body. (8) A cell from a 49,XXXXY male
has three inactivated X chromosomes and shows three Ba rr bodies. (Courtesy of
Malcolm Ferguson-Smith, University of Cambridge.)
MENDELIAN PEDI GR EE PATTERNS 67
~~~~
~~~~
~~~~
fl9ur~ 3.11 Biased ascertainment. In eac h
of these 16 pedigrees. both pa rents are
ca rriers of an autosomal recessive condition.
Overall, thei r 32 children show the expected
1:2:1 distribution of genotypes (AA, 8I32;Aa,
16/ 32; aa, 8/32). However, jf families can be
recognized only through affected children,
~~~~
only the families shown in the shaded area
will be picked up, and the proportion of the
children in th at sample who are affected
will be 8/14, not 1/4. Statistica l methods
are available for correcting such biased
ascertainment and recove ring the true ratio.
70 Chapter 3: Genes in Pedigrees and Populations
The segregation ratio for a family with n children, under complete truncate ascertainment, can
be estimated from the following formulae, which give t he proportions of sibships with different
numbers of affected children:
(1/4)' att children affected
(n, l ) (1/4)'-1 (3/4) all except one child affected
(n,2) (1/4)'-' (3/4)' all except two children affected
etc.
(n,x) mean s nl![x!(n - x)l]' where n! (n factorial) mean s
nx (n - 1) x (n - 2) x(n - 3) x .. . x 2x 1
For the mathematically ind ined, this is an examp le of a tru ncated binomial distribution, a bin omia l
expansion of (1/4 + 314)" in which the last term (no affected chi ldren) is omitted.
The overall proportion of affected children ca n now be calcu lat ed. For example, for every 64
th ree-child families we have:
1 family with 3 affected total 3 affected, 0 unaffected
9 families w ith 2 affected total 18 affected, 9 unaffected
27 families with 1 affected total 27 affected, 54 unaffected
{27 families with no affected- but these families will not be ascertai ned]
Overa ll (among ascertained families) 48 affected, 63 unaffected
Apparent segregation ratio 48/(48 + 63) = 48/1 11 = 0.432
Correcting the seg regation ratio
If: p = the true (u nbiased) segreg ation ratio
R = the number of affected children
5 = the number of affected singleto ns (children who are the on ly affected chi ld in the family)
T = the total number of children
N = the num ber of sibships
For com plete truncate ascertainment, p = (R - 5)/{T - 5).
= =
For example, for the three-child fami lies ill ustrated above, p (48 - 27)/( 111 - 27) = 21184 0.25.
For single selection, p = (R - N)/(T - N).
MENDELIAN PEDIGREE PATTERNS 71
chains. But even with these extensions, Beadle and Tatum's hypothesis cannot be
used to imply a one-to -one correspondence between entries in the OMIM cata
logue and DNA transcription units.
The genes of classical genetics are abstract entities. Any character that is
determined at a single chromosomal location will segregate in a Mendelian pat
tern-but the determinant may not be a gene in the molecular geneticist's sense
of the word. Fascio-scapulo-humeral muscular dystrophy (severe but non-lethal
weakness of certain muscle groups; OMIM 158900) is caused by small deletions
of sequences at4q35-but, althe time of writing, nobody has yet fmUld a rel evant
protein -coding sequence at that location, despite intensive searching and
sequencing. This example is unusual-most OMIM entries probably do describe
the consequences of mutations affecting a single transcription unit. However,
there is still no one-to-one correspondence between phenotypes and transcrip
tion units because of three types of heterogeneity:
o Locus heterogeneity is where the same clinical phenotype Can result from
mutations at anyone of several different loci. This may be due to epistasis,
where gene A controls the action of gene B, or lies upstream of gene B in a
pathway; alternatively genes A and B may function in separate pathways that
affect the same phenotype.
o AUelic heterogeneity is where many different mutations within a given gene
can be seen in different patients with a certain genetic condition (explored
more fully in Chapter 16). Many diseases show both locus and allelic
heterogenei ty.
o Clinical heterogeneity is used here to describe the situation in which muta
tions in the same gene produce two or more different diseases in different
people. For example, mutations in the HPRT gene can produce either a form
of goutor Lesch-Nyhan syndrome (severe mental retardation with behavioral
problems; OMIM 300322). Note that this is not the same as pleiotropy. That
term means that one mutation has mUltiple effects in the same organism.
Most mutations are pleiotropic if you look carefully enough at the pheno
type.
Locus heterogeneity
Hearing loss provides good examples of locus heterogeneity. When two people
with autosomal recessive profound congenital hearing loss marry, as they often
do, the children most often have normal hearing. It is easy to see that many dif
ferent genes would be needed to construct so exquisite a machine as the cochlear
hair cell, and a defect in any of those genes could lead to deafness. If the muta
tions causing deafness are in different genes in the two parents, all their children
will be double heterozygotes-but assuming that both parental conditions are
recessive, the children will have normal hearing (Figure 3.12).
When two recessive characters mayor may not be caused by mutations at the
same locus, a test cross between homozygotes for the two characters (in experi
mental organisms) can provide the anSwer. This is called a complementation
test. If the mutations in the two parental stocks are at the same locus, the progeny
will not have a wild-type aUele, and so will be phenotypically abnormal. [fthere
are two different loci, the progeny are heterozygous for each of the two recessive
characters, and therefore phenotypically normal, as illustrated in Table 3.1 .
Very occasionally alleles at the same locus can complement each other
(interallelic complementation). This might happen if the gene product is a pro
tei n that dimerizes, and the mutant alleles in the two parents affect different
parts of the protein, such that a heterodimer may retaln some function. In gen
eraJ , however, if two mutations complement each other, it is reasonable to assume
that they involve different loci. Cell-based complementation tests (fusing cell
lines in tiss ue culture) have been important in sorting out the genetics of human
phenotypes such as DNA repair defects, in which the abnormality can be observed
in cultured cells. Hearing loss provides a rare opportunity to see complementa
tion in action in human pedigrees.
Locus heterogeneity is only to be expected in conditions such as deafness,
blindness, or learning disability. in which a rather general pathway has failed; but
even with more specifi c pathologies, multiple loci are very frequent. A striking
example is Usher syndrome. an autosomal recessive combination of hearing loss
and progressive blindness (retinitis pigmentosa), which can be caused by muta
tions at any of 10 or more unlinked loci. OMIM has separate entries for known
examples oflocus heterogeneity (defined by linkage Or mutation analysis). but
there must be many undetected examples still contained within single entries.
Clinical heterogeneity
Sometim es several apparently distinct human phenotypes turn out all to be
caused by different allelic mutations at the same locus. The difference may be
one of degree-mutations tl,at partly inactivate the dystrophin gene produce
Becker muscular dystrophy (OMIM 300376). whereas mutations iliat completely
inactivate the same gene produce the similar but more severe Duchenne muscu
lar dystrophy (lethal muscle wasting; OMIM 310200). At other times the differ
ence is qualitative-inactivation ofilie androgen receptor gene causes androgen
insenSitivity (46,XY embryos develop as females; OMIM 313700), but expansion
of a run of glutamine codons wiiliin ilie same gene causes a very different
disease, spinobulbar muscular atrophy or Kennedy disease (OMIM 3 13200).
These and other genotype-phenotype correlations are discussed in mo re depth
in Chapter 13.
III
Mendelian polygenic
(monogeniC)
II
III
'I
II
III
III
IA) (S)
~ ,~ ~I
3 4 5
III II I t 1
n ~I ~~
~
Figur.3.21 Complications to the basic Mendelian patterns (7): new mutations. (AI A new autosomal dominant mutation. The
pedigree pattern mimics an autosomal or X-linked recessive pattern . (8) A new mutation in X-linked recessive Duchenne muscular
dystrophy. The three 9randparental X chromosomes were di stinguished by using genetic marke rs; here we distinguish them with
three different colors (ignoring recombination). III, has the grandpaternal X, whi ch has acq uired a mutation at one of four possible
points in the pedig ree-each with very different implications for ge netic counse ling:
If 1111 carries a new mutation, the recurrence risk for all family members is very low.
If II, is a germinal mosaic, there is a significant (but hard to quantify) risk for her future children, but not for those of her three
sisters.
If 111 was the result of a single mutant sperm, her own future offspring have the sta nda rd recurrence risk for an X-linked recessive
trait, but th ose of her sisters are free of risk.
If I, was a germinal mosaic, all four sisters in generation II have a significant (but hard to quantify) risk of being carriers of the
conditi on.
• The mutation causes abno rmal prolife ration of a cell that would normally (A)
~I ~I ~~ ~I
Molecular studies can be a great help in these cases. Sometimes it is possible
to demonstrate directly that a normal father is producing a proportion of mutant
sperm. Direct testing of the germ line is not feasib le in women, but other acces
sible tissues such as fibroblasts or hair roots can be examined for evidence (B)
of mosaicism. A negative result on somatic tissues does not rule out germ·line II, III ~
--
mosaicism, but a positive result, in conjunction with an affected child, proves it. .
Individual 112 in Figu.re 3.22 is an example.
Chimeras
Mosaics start life as a single fertilized egg. Chimeras, in contrast, are the result of
-_ ...........
fusion of two zygotes to form a single embryo (the reverse of twinning), or alter· Ftgun 3.11 Germ·line and somatic
natively of limited colonization of one twin by cells from a non-identical co-twin mosaicism in a dominant disease.
(Figure 3.23). Chimerism is proved by the presence of too many parental aUeles (A) Although the grandparents in this
at several loci (in a sample that is prepared from a large number of cells). If just pedigree are unaffected, individuals 112 and
1112 both suffer from familial adenomatous
one locus were involved, one would suspect mosakism for a single mutation,
polyposis, a dominant inherited form of
rathe r than the much rarer phenomenon of chimerism. Blood-grouping centers co loreccal cancer (OMIM 175100; see
occasionally discover chimeras among normal donors, and some intersex Chapter 17). The four grandpa rental copies
patients turn out to be XX/XY chimeras. A fascinating example was described by of chromosome 5 (where the pathogenic
Strain et al. (see Further Reading). They showed that a 46,XY/46,XX boy was the gene is known to be located) were
result of two embryos amalgamating after an in uitra fertilization in which three distinguished by using genetic markers,
embryos had been transferred into the mother'S uterus. and are color-coded here (ignoring
recombinatio n). (B) The pathogenic
mutation could be detected in 1112 by gel
electrophoresis of blood DNA (red arrows;
the black arrow shows the band due to
the normal, wild·type all ele). For 112 the gel
trace of her blood DNA shows the mutant
(A)
•• •• bands, but only very weakly, showing that
genetic she is a somatic mosaic for the mutation.
fertilization change
---+ •
The mutation is absent in III", even though
) marker studies showed that he inherited the
• high-risk (blue in this figure) chromosome
from his mother. Individ ual 112 must therefore
be both a ge rm· line and a somatic mosaic
for the mutation. (Cou rtesy of Bert Bakker,
Leiden University Medical Center.)
mosaic
(BI
• -----+ • ---+ • •
-
l.
fusion or •
• ,~1
r
,- eXChange •
fer tIlizatIon Flgu... ).23 Mosaics and chimeras.
•
• • •• • • ~iI (Al Mosaics have two or more genetically
different cell lines derived from a single
/ .. -----+ • ---+ ••
•
• ••
;:. ..
zygote. The genetic change indicated may
be a gene mutation, a numerical or structural
chromosomal change, or in the special case
of Iyonization, X-inactivation. (B) A chimera is
derived from two zygotes, which are usually
chm'le{a both normal but genetically distinct.
GENETICS OF MULTIFACTORIAL CHARACTERS:THE POLYGENIC THRESHOLD THEORY 79
.1: .1 I
!
strength of pull and of squeeze , force of blow, reaction time, keenness of sight and 70 80 90 100 110 120 130
hearing, color discrimination, and judgments of length. In one of the first appli (8) two loci
cations of statistics, he compared physical attributes of parents and children, and
established the degree of correlation between relatives. By 1900 he had built up a MOb
large body of knowledge about the inheritance of such attributes, and a tradition "BO
g aaBB
I::
(biometrics) of their investigation. ~
When Mendel's work was rediscovered, a controversy arose. Biornetricians ,
~
cdl~b
:: i
Mendel described, but they pointed out that most of the characters likely to be
important in evolution (fertility, body size, strength, and skill in catching prey or
finding food) were continuous or quantitative characters and not amenable to
Mendelian analysis. We aU have these characters, only to different degrees, so you
of independent Mendelian factors (polyge nic characters) would display precisely ~ aaBbCc AaBbCe AaBacc
Hidden assumptions
In the simple model of Figure 3.25 there is a hidden assumption: that there is
random mating. For each class of mothers, the average IQ of their husbands is
assumed to be 100. Thus, the average IQ of the children is actually the mid-paren
tal IQ, as common sense would suggest. In the real world, highly intelligent
women tend to marry men of above average intelligence (assortative mating).
The regression would therefore be less than halfway to the population mean,
even ifIQ were a purely genetic character.
A second assumption of our simplified model is that there is no dominance.
Each person's phenotype is assumed to be the sum of the contribution of each
allele at each relevant locus. Ifwe allow dominance, the effect of some of a par
ent's genes will be masked by dominant alleles and invisible in their phenotype,
but they can still be passed on and can affect the child's phenotype. Given domi
nance, the expectation for the child is no longer the mid-parental value. Our best
guess about the likely phenotypic effect of the masked recessive alleles is obtained
by looking at the rest of the population. Therefore, the child's expected pheno
type will be displaced from the mid-parental value toward the population mean.
How far it will be displaced depends on how important dominance is in deter
mining the phenotype.
Misunderstanding heritability
The term heritability is often misunderstood. Heritability is quite different from
the mode of inheritance. The mode of inheritance (autosomal dominant, poly
genic, etc.) is a fixed property of a trait, but heritability is not. Heritability ofIQ is
shorthand for heritability of variations in IQ. Contrast the following two
questions:
To what extent is IQ genetic? This is a meaningless question.
How much of the differences in IQ between people in a particular country at
a particular time is caused by their genetic differences, and how much by their
different environments and life histories? This is a meaningful question, even
if difficult to answer.
In different social circumstances, the heritability of IQ will differ. The more Figure 3.26 Multifactorial determination
equal a society is, the higher the heritability of IQ should be. If everybody has of a disease or malformation. The angels
equal opportunities, several of the environmental differences between people and devils can represent any combination of
genetic and environmental factors. Adding
have been removed. Therefore more of the remaining differences in IQ will be
an extra devil or removing an angel can tip
due to the genetic differences between people. the balance, without that particular factor
being the cause ofthe disease in any general
The threshold model extended polygenic theory to cover sense. (Courtesy of the late Professor RSW
dichotomous characters Smithells.)
Congenital pyloric stenosis, for example, is five times more common in boys than
girls. The threshold must be higher for girls than boys. Therefo re, to be affected,
a girl has on average to have a higher liability than a boy. Relatives of an affected
girl therefore have a higher average susceptibility than relatives of an affected boy
(Figure 3.W ] . The recurrence risk is correspondingly higher, although in each
case a baby's risk of being affected is five times higher if it is a boy because a less
extreme liability is sufficient to cause a boy to be affected (Table 3.2).
Male proband 19/296 (6.42%) 7/274 (2.55%) 5/230 (2. 17%) 5/242 (2.07%)
Fema le proband 14/61 (22.95%) 7/62('1.48%) 11/101 (10.89%) 91101 (8.9 1%)
More boys than girls are affected, but the recurrence risk is higher for relatives of an affected girl. The data fit a polygenic threshold model with
sex-specific thresholds (Figure 3.28). [Data from Fuhrmann and Vogel (1976) Genetic Counselling. Springer.]
that the first was A, and the second Aj is qp. Overall, the chance of pick
Consider two alleles at a locus. Allele A, has f requency P, and allele A2 has frequency q.
X-linked locus
t-
Autoso mal locus Males Females
Note that these genotype frequencies will be seen whether or notA, and A2 are the only alleles at
the locus.
Example 1
An autosomal recessive condition affects 1 newborn in 10,000. What is the
expected frequency of carriers?
Hardy-Weinberg relationships give the following relationships:
Phenotypes: Unaffected Affected
Genotypes: AA Aa ao
Frequencies: p2 2pq q' = 1110,000
q' is 1 in 10,000 or 10-", and therefore q = 10-' or 1 in 100. One in 100 genes
at the A locus are a, 99 in 100 are A. The carrier frequency, 2pq, is therefore
2 x 991100 x 11100, wh ich is very nearly 1 in 50.
This calculation assumes that the frequency of the condition has not been
increased by inbreeding. This is an important proviso, which is considered in
more detail below.
Example 2
If a parent of a child affected by the above condition remarries, what is the risk of
producing an affected child in the new marriage?
To produce an affected child. both parents must be carriers, and the risk is
then 1 in 4. We know that the parent of the affected child is a carrier. If the new
spouse is, from the genetic point of view, a random member of the population,
his or her carrier risk is 2pq, which we have calculated as 1 in 50. The overall risk
is then 1150 x 1/4 = 1/200.
This assumes that there is no family history of the same disease in the new
spouse's family.
Example 3
X-linked red-green color bl indness affects 1 in 12 Swiss males. What proportion
of Swiss females will be carriers and what proportion will be affected?
Assuming that we are dealing with a single X-linked recessive phenotype, the
Hardy-Weinberg relationships are as follows:
Males Females
Genotypes: Al A, ALAI ALA, A,A,
Frequencies: p q = 1112 p' 2pq q'
= = =
q 1112; therefore p 1 - q 11112. The frequency of carriers, 2pq, is then
2 x 1112 x 11112 = 221144, and the frequencies of homozygous affected fe males is
q' = 1 in 144. Thus, this single-locus model predicts that 15% of females will be
carriers and 0.7% will be affected.
In fact, the freq uency of affected females is rather lower tha n this calculation
predicts. This is part of the evidence that there are two forms of red-green blind
ness, protan and deutan. These are caused by variants in different but adjacent
genes at Xq28. The mathematics and genetics are d iscussed in OMIM entry
303900.
Inbreeding
The simple Hardy-Weinberg calculations are invalid if the underlying assump
tion, that a person's two alleles are picked independently from the gene pool, is
violated. In particular, there is a problem if there has not been random mating.
86 Chapter 3: Genes in Pedigrees and Populations
Assortative mating can take several forms, but the most generally important is
inbreeding. A marriage in which there is a previous genetic relationship between
the partners is called a consanguineous marriage.
If you marry a relative you are marrying somebody whose genes resemble
your own:
• First-degree relatives (parents, children, full sibs) share half their genes
(always for parents and children; on average for sibs).
• Second-degree relatives (grandparents, grandchildren, uncles, aunts, neph
ews, nieces, half-sibs) share one-quarter of their genes, on average.
• Third-degree relatives (first cousins etc.) share one-eighth of their genes on
average.
In each case the sharing refers to the proportion of genes that are id entical by
descent, as a result of that relationship. We all share many genes simply by virtue
of all being human. The coefficient ofrelationship between two people is defined
as the proportion of their genes thatthey share by virtue oftheirrelationship. The
coefficient of inbreeding of a person can be defined as the probability that they
receive at a given locus two alleles that are identical by descent. It is equal to half
the coefficient of relationship of the parents. Popular myth would have it that
many rural or isolated populations are heavily inbred and backward because of
that. In fact, average inbreeding coefficients in most populations are well below
0.01, and almost never exceed 0.03 even in the most isolated groups.
Although inbreeding is seldom significant at the popul ation level, it can have
majo r consequences for individuals. The calculations in Box 3,6 show that the
rarer a recessive condition is, the greater will be the proportion of all cases that
are the product of cousin or other consanguineous marriages. For most Western
societies only 1% or less of all marriages would be between first cousins.
Nevertheless, for a recessive condition with gene frequency q = 0.01 , the form ula
shows that 6% of all affected children would be born in that 1% of marriages. The
converse also holds. For example, among white northern Europeans there is little
increased consanguinity among parents of children with cystic fibrosis, because
the condition is so common. Tay-Sachs disease (OMIM 272800) is strongly asso
ciated with parental co nsanguinity among non-Jews, in whom it is rare, but much
less so among Ash.kenazi Jews, in whom it is rather common.
For rare autosomal recessive conditions, a basic Hardy- Weinberg calculation
that ignores inbreeding will badly overestimate the carrier frequency in the pop
ulation at large. For a couple who have already had an affected child, the recur
rence risk is I in 4, regardless of whether they are cousins or unrelated. But for
situations in which the risk of an affected child depends on estimating the fre
quency of carriers in the population, a correct calculation requires a knowledge
of populatio n genetic theory, and of the genetics of the particular populatio n in
question, that goes beyond the scope of this book.
Other causes of departures from the Hardy-Weinberg relationship
Inbreeding is by far the most important cause of deviations from the Hardy
Weinberg relationship between gene frequencies and genotype frequencies.
Other possibilities include selective migration and mortality. [fpeople with a cer
tain genotype are more likely to be removed from the population, by migration or
death, then those remaining will be depleted of that genotype, in compariso n
"~th Hardy-Weinberg predictions. Large-scale immigration of people from a
population with a different gene frequency would equally upset the relationship.
One generation of random mating would suffice to reestablish a Hardy- Weinb erg
relationship, albeit with a new gene frequency. However, if the population
remained stratified into gro ups that did not interbreed freely, the Hardy- Weinberg
relationship might apply within each group, but not to the overall population.
For genetic marker studies, systematic laboratory error is an important pos
sible cause of an appa rent deviation from the expected Hardy- Weinberg ratios.
For example, the assay used might score heterozygotes unreJiably. Observed pro
portions can be compared with the Hardy-Weinberg prediction by a simple X2
test with k(k - 1) 12 degrees of freedom , where there are k alleles at the locus being
tested. More sophisticated statistical treatments are needed if the numbers are
small.
FACTORS AFFECTING GENE FREQUENCIES 87
Suppose a man marries his first cousin. Consider the risk of their having a ba by affected with an 1,4 1,4
autosomal recessive condition that has gene frequency q. lf we know nothing about him, there is
a chance 2pq that he is a carrier of the condition . If he is a carrier, there is a , in 8 chance that his
cousin also is. by virtue of their common ancestry ll=igu..-e 1). The overall risk of an affected child
is 2pq x 1/8 x 1/4 = pq/16. Had he married an unrelated woma n the risk would have been q2. the
same as for any other couple. The ratio of risk is
112
pq/16:q' = p: 16q
For a rare recessive condition, q will be small; p will therefore be almost 1 and the ratio simplifies to
U6q
Suppose a proportion c of all marriages in a popu lation are between first cousins. Let each
marriage produce one child . Suppose the ri sk that child would be affected by a certain rare
autosomal recessive condition. when the parents are first cousin s, Is x. The risk for a ch ild from
/
an outbred marriage is 16qx. The ccousin marriages produce xc affected children. and the (1 - c) Figure 1 Relatives share genes. Figures
outbred marriages produce 16qx{1 - c) affected children . A proportion eI[c + , 6q(1 - e)) of all show the proportion of their genes that
affected chitdren will be born to first cousins. The table shows some examples. relatives share with a proband (arrow) by
virtu e of [heir relationship.
Proportion of all marriages Frequency of disease Percentage of all affected
that are between first allele children whose parents are
cousins
0.01 J. 001
first cousins
6
j
0.01 0.005 11
0.01
--,--
I 0.001 39
0.05 0.01 25
0.05 0.005 40
0.05 0.001 77
Note that we have used several simplifying assumptions:
In deriving the 1:16q ratio of risks we used the approximation p =: 1. which would be
reasonable only for rare recessive conditions.
We ignored the possibility that the cou si n might carry the mutant allele inherited
independently, and not as a result of her relationship to the carrier man. Again, that is
reasonable for a rare condition, but not for a very common one.
Finally, we assumed that all unions were either between first cousins or fully outbred .
Despite these approximations, it remains generally true that consanguineous unions contribute
disproportionately to the incidence of rare recessive diseases.
A final point to note is that unrecognized locus heterogeneity can upset the
use of the Hardy-Weinberg relation to calculate genetiC risks. Suppose a recessive
disease wi th frequency Fis actually caused by homozygosity at anyone of 10 loci.
The example of Usher syndrome, quoted earli er, shows that this is not an unreal·
istic scenario. To keep the mathematics simple, we will suppose that each of the
10 loci contributes equally to the incidence of the disease, so each separate locus
contributes FIlO to the incidence. We wish to advise a known carrier ofhis risk of
having an affected child if his partner is unrelated and has no family history of
the disease. As we calculated previously, the risk is 2pq x 1/4. For each locus, the
disease allele has frequ ency V(FIlO), and the carrier frequency is approximately
2-1(FlIO) . [fwe were unaware of the locus heterogeneity, we would assume the
frequency of the disease allele to be -I F, and the carrier frequency to be approxi
2v
mately F. For equally frequent disease alleles at 11 loci, the single· locus calcula·
tion overstates the risk by a factor -111.
I
For CF. the disease frequency in Denmark is about one in 2000 births.
= =
p/q 0.978/ 0.022 43.72 =,,;,,.
The present CF gene frequency w ill be mainta ined, even without fresh mutations, if Aa
CONCLUSION
[n this chapter we have considered genes in a way that Gregor Mendel would
have recognized. Genes, as treated here, are recognized through the pedigree
patterns they give rise to as they segregate thro ugh multigeneration families.
Characters that do no t give Mendehan pedigree patterns are explained with
mathematical tools that stem fro m the work of Mendel's contemporary, Francis
Galton. Further mathematical tools developed by Hardy and Weinberg 100 years
ago can be used to infer aspects of population genetics. All this can be done with
out ever asking what the physical nature of genes is. Even when the relationship
of genes to chromosomes was worked out early in the rwentieth century, this was
done without any knowledge oftheir physical nature, or any need to understand
it, as we will see in Chap ter 14. But of course we are interested in understanding
what genes do and what they are, and for this we need to get physical-first with
cells, then with DNA. This is the subject of the fo llOwing chapters.
FURTHER READING
Introduction: some basic definitions Read AP & Donnai D (2007) The New Clini ca l Genetics. Scion.
(A basic undergraduate tex tbook illu strating clinical aspects of
Druery CT & Bate so n W (translators, revised by Blumberg R) (1901) the top ics covered in this chapter.]
Mendel's paper in English. http://www. mendelweb.org/ University of Kansas Genetics Edu cation Center.
Mendel.html (An Engli sh translati on of Mendel's original 1865 http://www.kumc.edu/g ec/ (Aportaltoa large ran ge oFWeb
paper, with notes.] resou rces covering basic genetics.]
Chapter 4
KEY CONCEPTS
Cells show extraordinary diversity in size, form, and function. Histology recognizes more than
200 different cell types in adult humans, but this is a massive underestimate- the number of
different neurons alone is likely to be in the thousands.
Cells make connections with each oth er (cell adhesion); transient connections allow cells to
perform various functions, while stable connections allow functionally similar cells to form
tissues.
Tissues are composed of cells of one or more types, plus an extracellular matrix (ECM), a complex
network of secreted macromolecules that supports cells and interacts with them to regulate
many aspects of their behavior.
Cell adhesion is regulated by a limited number of different types of cell adhesion molecule that
link ceUs to neighboring cells or to ECM. Neighboring ceUs also make contact through different
types of cell jWlction.
Many cell functions require that cells sig nal to each other over both short and long distances.
Signals transmitted by one cell change the behavior of responding cells by altering the activity of
proteins, notably transcription factors, that bring about a change in gene expression.
Transmitting cells often secrete a ligand molecule that binds to a receptor on the surface of
responding cells to initiate a downstream signal transduction pathway.
Other small signaling molecules pass through the plasma membranes of responding cells to bind
to intracellular receptors, or are anchored in the membrane of the transmitting cell and interact
with receptors on the surfaces of adjacent cells.
Cell proliferation is regulated at different stages in the cell cycle. During development, cell
proliferation causes rapid growth of an organism . At marurity, cell proliferation is limited to
certain cell types that need to be renewed. and there is an equilibrium between cell birth and cell
death.
Programmed cell death is functionally important in development and throughout life. There are
different pathways for getting rid of cells that are unwanted, unnecessary, or potentially
dangerous.
During development, cells become progressively more restricted in the range of cell types that
they give rise to. At maturity, most cells are terminally differentiated.
Stem cells are comparatively unsp ecialized cells that can give rise to differen tiated cells and yet
are able to renew themselves, often by asymmetric cell division.
Stem cells derived from the early embryo can give rise to all the cells of the body. Those derived
from cells later in development have less differentiation potential.
Immune system cells are diverse and highly specialized. Those of the innate immune system
work in the nonspecific recognition of foreign or altered host molecules known as antigens.
Band T1ymphocyres are the core of the adaptive immune system, which mounts strong, highly
specific immune responses.
Unique DNA rearrangements in Band T cells means that different immunoglobulins are made in
different B cells and that different T-cell receptors are made in different T cells. As a result, a
single individual can generate huge ranges of immunoglobulins and T-cell receptors.
92 Chapter 4: Cells and Cell-Cell Communication
Cells can be classified into broad taxonomic groups according to differences in plants
their internal organization and function s (see Figure 4. t). The major division of protophyta metaphyta
organisms into prokaryotes and eukaryotes is founded on fundam ental differ
ences in cell architecture. animals
Prokaryotes have a simple internal organization, with a single. membrane
protozoa metazoa
enclosed compartment that is not subdivided by any internal membranes, as
occurs in eukaryotes (see below). Under the electron microscope, prokaryotic
cells appear relatively featureless. However, prokaryotes are far from primitive: UNtC8..LULAR MULTICELLULAR
tlley have been through many more generations of evolution than humans. All
prokaryotes are unicellular, and they comprise two kingdoms of life: Flgu... ~ . l Classiftcation of unicellular
and multicellular organisms. The first two
Bacteria (formerly called e"bacteria to distinguish them fro m archaebacteria) domains of life-bacte ri a and archaea - are
are found in many environments. Some cause disease; others perform tasks also known as prokaryotes beca use they
that are useful or essential to human survival. Huge numbers of bacteria lack internal membranes. Eukaryotes, whic h
inhabit our bodies. Comprising about 500-1000 different species, the vast form the third domain, have membrane
majority of such commensal bacteria live in the gut, and many are beneficial: encl osed organelles. Thisdomain is further
they ferment complex indigestible carbohydrates and synthesize vario us vita subdivided into unicellular and multicellular
mins, including folic acid, vitamin K, and biotin. fungi, plants, and animals. The name protist
has been used as a generiC term for any
Archaea are a poorly understood group of organisms that superficially resem unicellular eukaryote but has also been used
ble bacteria and so were formerly termed archaebaeceria. They are often found to describe unicellular eukaryotes that can
in extreme environments (such as hot acid springs) , but some species are not easily be classified as animals, plants,
found in more convivial locations such as in the gutS of cows. or fungi.
CELL STRUCTURE AND DIVERSITY 93
Prokaryotes have no defined nucleus and their chromosomal DNA does not
seem (0 be highly organized; instead it exists as a nucleoprotein complex known
as the nucleoid (Box 4.1). The typical prokaryote genome is represented as a sin
gle' circular chro mosome containing less than 10 Mb of DNA. However, prokary
otes have recently been identified that possess multipl e circular chromosomes,
multiple linear chromosomes, mixtures of circular and linear chromosomes. and
genomes up to 30 Mb in size in the case of Bacillus megaterium.
Eukaryotes are thought to have first appeared about 1.5 billion years ago.
They have a much more complex organization than their prokaryote counter
parts, with internal membranes and membrane-enclosed organelles including a
nucleus (see Box 4.1). There is only one kingdom of eukaryotic organisms-the
Eukarya-b ut it includes both unicellular organisms (yeasts, protozoans, and so
on) and multicellular fungi, plants, and animals.
In the vast majority of eukaryotes, the genome comprises two or more linear
chromosomes contained within the nucleus. Each chromosome is a single, very
long DNA molecule packaged with histones and other proteins in an elaborate
and highly organized manner. The number and DNA content of the chromo
somes vary greatly between species (see also below).
The membranes surrounding and dividing the cell are selectively permeable,
regulating the transport of a variety of ions and small molecules into and out of
the cell and between compartments.
The soluble portions of both the cytoplasm and the nucleoplasm are highly
organized. In the nucleus, a variety of subnuclear structures have been identified
and are thought to be arranged in the context of a nuclear matrix (see Box 4.1). In
the cytoplasm, the cytoskeleton is an internal scaffold of protein filaments that
provides stability, generates the forces needed for movement and changes in cell
shape, facilitates the intracellular transport of organelles, and allows communi
cation betvveen the cell and its environment.
Eukaryotic cells have many different types of membrane-enclosed are ~rst assembled. Human rRNA genes are clustered on the short
structures within their cytopla sm. Not all the structures li sted below arms of ch rom osomes 13, 14, 15, 21, and 22, and ate brought close
are present in every cell type; some human cells are so specialized to together within the nucleolus.
perform a single function that the nucleus and other o rgane lles are Cajal bodies (or coiled bodies) are thought to be sites where small
d iscarded and the cells rely o n pre-synthesized gene products. nuclear/n ucleolar ribonucleoprotein (snRNP/snoRNP) particles are
The plasma membrane provides a protective barrier around assembled, and m ay also be involved in gene regulation.
the cell. It is based o n a dou ble layer of phosp holipids (Figure 1). Speckles (or interchromatin granu les) are believed to be regions
Hydrophobic lipid'tails' are sa ndwiched between hydrophilic phosphate where fully mat ure snRNPs assemble in readiness for splicing
groups that are in contact with pola r aqueous environments, the pre-mRNA.
cytopla sm and extracellular environment. The plasma membrane is PML bodies appear as rings and are composed predominantly o f
selectively permeable, regulating the transport of a variety of io ns and the premyelocytic leukemia (PML) protei n, but their functions are
small molecules into and o ut of the cell. unknown .
The cytosol, the aq ueous component of the cytop lasm, makes Perichromatin fibrils <I re sites where nascent RNA accumulates.
u p about ha lf the volume of the cell and is the site of majo r metabolic Cleavage bodies are sites of polyadenylatio n and cleavage.
ac tivity. includ ing most protein synthesis. The cytosol is very highly Mitochondria are sites of oxidative phospho rylation, by which
o rganized by a series of protein fil aments, collectively called the organic nutrients are oxidized to generate ATP th at is then used to
cytoskeleton, that have a major role in cell m ovem ent, cell shape, and power the different functions of a cell. They have two membranes:
intracellu lar t ransport. There are three types of cytoskeletal fil ament: a comparatively smooth o uter m embrane. and a complex. h ighly
Microfifaments are polymers of t he protein actin (and so are also folded in ner mitoc hondrial membrane. The inner compartm ent,. the
known as actin filament s). They provide mechanica l support to mitochondrial morrix, contai ns enzymes and chemical intermediates
the cell, allow controlled changes to cell shape, and facilitate cell involved in energy metabolism. Mitochond ria, and also the chloroplasts
movement by form ing structures suc h as filopodia and fameflipodia of plants cells, contain DNA. They also have their own ribosomes t hat
(extensions to the ce ll that all ow it to crawl along surfaces). are dedicated to t ransla ting mRNA transcribed fr om mitochondrial
Microtubules are much more rigid than actin filaments and are DNA. However, m ost m itochondrial protein s are encoded by nuclear
polymers o f tubulin proteins. They are important constituents of the genes, synthesized on cytop lasmic ribosomes, and then imported into
centrosome and mito tic spindle (see Box 2.1) and also form t he core mitochondria.
of cilia and flagella. Peroxisomes (microbodies) are smal l si ngle·membrane vesicles
In termediate filaments have p redominantly structural roles. containing enzym es that use m o lecular oxygen to oxidize their
Examples include the neurotilaments of nervous system celis and su bstrates and generate hydrogen peroxide.
keratins in epithelial cells. Lysosomes are small mem brane-enclosed vesicles containing
The endoplasmic reticulum (ER) consists of flatte ned, sing le · hydrolytic enzymes that digest materials broug ht into the cell by
membrane vesicles whose inner compartments (cisterna e) are phagocytosis or pinocytosis. Lysosomes also help in the degradation of
interconnected to form cha nnels throughout the cytoplasm. The ER cell com ponents after cell death.
has tWO key functions: the intracellular sto rage of ci+, which is widely Cilia are sm all stru ctures containing microtub ule filaments that
used in cell signaling; and the synthesis, folding, and modification of extend from the plasma membrane and beat backwa rd and forward, o r
pro teins and lipids destined for the cell membrane or for secretion . The rotate. In vertebrates, a very few specialized cell ty pes have multiple ci lia
rough endoplasmic reticulum is studded w ith ribosomes that synthesize that are used to g enerate movement. They include epithelial cells lining
proteins that will cross the membrane into the intracisternal space the lu ngs and oviduct where the cilia bea t together to move mucus
before being transpo rted to the periphery of the cell. Here. they can be away from the lungs o r the egg toward the uterus. Most vertebrate
incorporated into the plasma m emb rane, retrieved to t he ER. or sec reted cells, however, have a single cilium. known as t he primary cilium, w hose
from t he cel l. The attachment of the sugar residues (glycosylation) that f unction is not involved in generating movement- instead, it is packed
adorn many human proteins begins in the ER. w ith many di fferent kinds of receptor molecule and acts as a sensor of
The Goigi complex consists of flattened single-membrane vesicles, the cell's envIronment. Sperm cells have a si ng le, rather large and much
which are often stacked. Its primary function is to secrete cell produc ts, longer version of a cilium, known as a flagellum . The flage llum moves in
such as proteins, t o the exterior and to hel p form the plasma membrane a wh ip-like fas hio n to propel the sperm cell forward,
and the membranes o f Iysosomes. Some of the small vesicl es that
arise peripherally by a pinching-o ff process contain secre tory products
(secretQlyvacuoJes). Glycoproteins arriv ing from the ER are further
modified in t he Goigi complex.
pnmary cil ium __
cytosol
\
, ~.~
I peroxisome
:/
endoplasmiC
reticulum
.'
The nucleus contains t he ch romosomes and
th e vast majority of the DNA of an anima l cell. It
is su rrounded by a nuclear envelope, composed
nucleoid wall III • ..... 0-0
~
I) I) ....-';__
.
nucleus
Lymphocytes (> 10) B celis, T celis, natural killer cell (Figu re 4.17)
Blood g ranulocytes (3) basophil, neut roph il, eosinophil (Figure 4 .17)
Mast cells
Heart muscle cells (3) myoblast, syncytial muscle fiber cell (Figure 4.28)
Exocrine secretion specialists (>27) g oblet (mucus-secreting) and Paneth (lysozyme-secreting) cells, of
intestine (Figure 4.18B)
Primary barrier fu nction or involved in wet stratified barrier (11) coll ecting duct cell of kidney
Absorptive function in gut, exocrine glands, and urogenital tract (8) intestina l brush bord er cell (with microvi ll i) (Figures 4.3, 4.4)
Connective tissue (many) fibroblasts, incl uding chondrocyte (cartilage) and osteoblast/osteocyte
(bone) (Figures 4.4, 4.5)
Photo receptors and cells involved in perception of acceleration and gravity, rod cell
hearing, ta ste, touch, temperature, blood pH, pain, etc. (many)
Supporting cells of sense organs and periphera l neurons (12) Schwann cell (Figure 4.6)
Metabo lism and storage specia lists (4) liver hepatocyte and li pocyte, brown and w hite fat celts
CNS, central nervous system. Note that some cells in early development are not represented in adults. A full list of the human adult cell types
recognized by histology is available on the Garland Scientific Web site at: http://www.gariandscience.com/textbooks/0815341059
.asp?type=supplements
96 Chapter 4: Cells and Cell-Cell Communication
UNICELLULAR
MULTICELLULAR
Note that gene numbers are best current estimates, partly because of the difficulty in identifying
genes encoding functional RNA products. For genome sizes on a wide range of organisms, see the
database of genome sizes at http://www.cbs.dtu.dkldatabases/DOGS/index.php
CELL ADHESION AND TISSUE FORMATION 97
(A) (8)
,
blood vessel (blood sinus)
""'- - .:::::::::>
fusion
<: - :-. ~
~>
~
myoblasts skeletal muscle fiber cell 50!lm
There are also very small differences in the DNA sequence of cells from a sin Rgul"e 4.2 Examples of polyploid
gle individual. Such differences can arise in three ways: cells arising from endomitosis or cell
fusion. (A) The megakaryocyte is a giant
• Programmed differences in specialized cells. Sperm cells have either an X or a polyploid (16C-64C) bone marrow cell
Y chromosome. Other examples are mature Band T lymphocytes, in which that is responsible for producing the
cell-specific DNA rearrangements occur so that different B cells and different thrombocytes (platelets) needed for blood
T cells have differences in the arrangement of the DNA segments that will clotting. It has a large multi-lobed nucleus
encode immunoglobulins arT-cell receptors, respectively. This is described in as a result of undergoing multiple rounds
more detail in Section 4.6. of DNA replication without cell division
(endomitosis). Multiple platelets are formed
• Random mutation and instability of DNA. The DNA of all cells is constantly
by budding from cytoplasmic processes of
mutating because of environmental damage, chemical degradation, genome the megakaryocyte and so have 1")0 nucleus.
instability, and small but significant errors in DNA replication and DNA repair. (B) Skeletal muscle fiber cells are polyploid
During development, each cell builds up a unique profile of mutations. because they are formed by the fusion of
Chimerism and colonization. Very occasionally, an individual may naturally large numbers of myoblast cells to produce
have two or more clones of cells with very different DNA sequences. Fraternal extremely long multinucleated cells.
(non-identical twin) embryos can spontaneously fuse in early development, A multinucleated cell is known as a
syncytium.
or one such embryo can be colonized by cells derived from its twin.
Cell adhesion molecules (CAMs) are typically transmemb rane receptors with
three domains: an intracellular domain that interacts with the cytoskeleton, a
transmembran e domain that spans the width ofthe phospholipid bilayer, and an
extracellular domain that interacts either with identical CAMs on the surface of
other cells (homophilic bindinltJ or with different CAMs (heterophilic binding) or
the ECM. There are four major classes of cell adhesion molecule:
• Cadherins are the only class to participate in homophilic binding. Binding
typically requires the presence of calcium ions.
• integrins are adhesion heterodimers that usually medjate cell-matrixinterac*
lions, but certain leukocyte integrins are also involved in cell--;;ell adhesion.
They are also calcium-dependent.
Selectinsmediate transient cell-cell interactions in the bloodstream. They are
important in binding leukocytes (white blood cells) to the endothelial cells
that line blood vessels so that blood cells can migrate out ofthe bloodstream
into a tissue (exEravasation).
Jg-CAMs (immunoglobulin superfamily cell adhesion molecules) are calcium
independent and possess immunoglobulin-like domains (see Section 4.6).
~ muscle cells
fluid containing a complex network of
secreted macromolecules; see Figure
connective
4.5). The inner epithelial layer (top) is
~:1l~i~~~;:§:~~~~:i~~~=~~
a semi-permeable barrier, keeping the
epithelium/C gut contents wi thin the gut cavity (the
tissue
lumen) while transporting selected
epithelial cell
nutrients from the lumen through into
the extracellular fluid of the adjacent
layer of connective tissue. [From Alberts
B, Johnson A, Lewis J et al. (2002)
biological reservoirs by storing active molecules such as growth factors, and pro Molecular Biology of the Cell, 4th ed.
teoglycans may be essential for the diffusion of certain signaling molecules. Garland SciencefTaylor & Francis LLC.]
The ECM macromolecules have different functional roles. There are structural
proteins, such as collagens, and also elastin, which allows tissues to regain their
shape after being deformed. Various proteins are involved in adhesion; for exam
ple, fibronectin and vitrinectin facilitate cell-matrix adhesion, whereas laminins
facilitate the adhesion of cells to the basal lamina of epithelial tissue (see below).
Proteoglycans also mediate cell adhesion and can bind growth factors and other
bioactive molecules. Hyaluronic acid facilitates cell migration, particularly dur
ing development and tissue repair, and the tenascin protein also controls cell
migration.
asymmetry (polarity) in a plane that is at right angles to the cell sheet, with a
basolateral part of the cell adjacent to and interacting with the basal lamina, and
an apical part at the opposing end that faces the exterior or the lumen of a cylin
drical tube.
Connective tissue
Connective tissue is largely composed of ECM that is rich in fibrous polymers,
notably collagen. Sparsely distributed within connective tissue is a remarkable
variety of specialized cells, including both indigenous cells and also some immi
grant cells, notably immune-system and blood cells. The indigenous cells com
prise primitive mesenchymal cells (undifferentiated multipotent stem cells) and
various differentiated cells that they give rise to, notably fibroblasts that synthe
size and secrete most ofthe ECM macromolecules (see Figure 4.5).
Cells are sparsely distributed in the supporting ECM, and it is the ECM rather
than the cells it contains that bears most of the mechanical stress on cOImective
tissue (see Figure 4.5). Loose connective tissue has fibroblasts surrounded by a
flexible collagen fiber matrix; it is found beneath the epithelium in skin and many
internal organs and also forms a protective layer over muscle, nerves, and blood
vessels. In fibrous connective tissue the collagen fibers are densely packed, pro
viding strengili to tendons and ligaments. Cartilage and bone are rigid forms of
connective tissue.
Muscle tissue
Muscle tissue is composed of contractile cells that have the special ability to
shorten or contract so as to produce movement ofthe body parts. Skeletal muscle
fibers are cylindrical, striated, under voluntary control, and multinucleated
because they arise by the fusion of precursor cells called myoblasts (see Figure
4.2B). Smooth muscle cells are spindle-shaped, have a single, centrally located
nucleus, lack striations, and are under involuntary control (see Figure 4.4).
Cardiac muscle has branching fibers, striations, and intercalated disks; the com
ponent cells, cardiomyocytes, each have a single nucleus, and contraction is not
under voluntary control.
Nervous tissue
Nervous tissue is limited to the brain, spinal cord, and nerves. Neurons are elec
trically excitable cells that process and transmit information via electrical signals
(impulses) and secreted neurotransmitters. They have three principal parts: the
cell body (the main part of the cell, performing general functions); a network of
dendrites (eXlensions of the cytoplasm that carry incoming impulses to the cell
body); and a single long axon that carries impulses away from the cell body to the
end ofthe axon (FIgure 4.6).
I)
;>. 1\
.__ /
~
/ ,.- I I ,
-..
layers of
r -, myelin Schwann ~ myelin
't"~-<::
nucleus 10• .. _._ sheath cell
Flgu... 4..6 Neurons and myelination. (Al Neuron structure. Each neuron has a single long axon with multiple axon termini (dendrites) that are
connected to other neurons or to an effector cell such as a muscle cell. Neurons are insulated by certain glial cells such as Schwann cells that form a
myelin sheath. (B) Myelination of an axon from a peripheral nerve. Each Schwann cell wraps its plasma membrane concentrically around the axon,
forming a myelin sheath covering 1 mm of the axon. [From Alberts B, Johnson A, Lewis J et al. (2008) Molecular Biology of the Ceil, 5th ed. Garland
ScienceJTaylor & Francis LLC.]
102 Chapter 4: Cells and Cell- Cell Communication
Neurons account for less than 10% of cells in the nervous system; the other
90% are glial cells. Glial cells do not transmit impulses, but instead support the
activities of the neurons in a variety of ways. The axons of neurons have an insu
lating sheath of a phospholipid, myelin, that is produced by cenain glial cells:
oligoden.drocytes (in the central nervous system) and Schwan.n cells (in the
(A)
peripheral nervous system). Astrocytes are small star-shaped glial cells that
ensheath synapses and regu late neuronal function. Microglial cells are phago
cytic and protect against bacterial invasion. Other glial cells provide nutrients by
® ®
binding blood vessels to the neurons. !. --~_-1:
Neurons communicate with each other at two types of synapse, as detailed in transmitting I
the next section. cell "'"
(esponding
4.3 PRINCIPLES OF CELL SIGNALING cell
All cells receive and respond to signals from their environment, such as changes
in the extracellular concentrations of certain ions and nutrient molecules, or
temperature shocks. Whereas single-celled organisms largely function independ
ently, the cells of multicellular organisms must cooperate with each other for the
benefit of the organism. To coordinate and regulate physiological and biochemi (8)
cal functions they must communicate effectively by sending and receiving sig
nals. The role of cell signaling during early development will be explained in
Chapter 5. In this section, we give some of the principles underlying cell
"'• .
signaling.
Signaling molecules bind to specific receptors in responding cells
J,
to trigger altered cell behavior J,
In a multicellular organism virtuaUy all aspects of cell behavior-metabolism,
movement, proliferation, differentiation-are regulated by cell signaling.
.....- •
--.......
••
Transmitting cells produce Signaling molecules that are recognized by respond
ing cells, causing them to change their behavior. Intercellular signaling can take
place over long distances, as in the endocrine signaling used to transmit hor
mones, or over short distances by several different mechanisms: •
• Pamerine signaling involves signaling between neighboring cells. A cell sends
a secreted signaling molecule that diffuses over a short distance to respond
ing cells in tlle local neighborhood.
• Synaptic signaling is a specialized form of signaling that involves signaling
across an extremely narrow gap, the synaptic cleft, between the terminus of (C)
an axon and the cell body of a communicating neuron (see below).
• Juxtacrine signalingoccuIs when the transmitting cell is in direct contact with
the responding cell; the signaling molecule is tethered to the surface of the
transmitting cell and is bound by a receptor on the surface of the responding
cell.
Cell signaling is initiated when transmitting cells produce signal ing molecules
that are recognized and bound by specific receptors in the responding cells. The
transmitting cells and the res ponding cells are usually different cell types, but in
FI9ure 4.7 Three types of relationship
autocrine signaling a cell produces a signaling molecule that can bind to a recep
between ligand and receptor in cell
tor on its own cell surface or on identical neighbor cells. Autocrine signaling can signaling_ (A) Soluble ligand and cell surface
be used to reinforce a signaling decision or to coordinate decisions by groups of receptor. The ligand is often a protein that
cells. binds to a transmembrane protein receptor,
Some small hydrop hobic signaling molecules can pass directly through me activating its cytoplasmiC tail; see Figures
plasma membrane of the responding cell and bind to intracellular receptors 4.10 and 4.15 for examples. (8) Small so luble
{Figure 4.78 1. In many examples of cell Signaling, however, the signaling mole ligand and intracellular receptor. The
cule cannot cross the cell membrane-it may be a soluble molecule (Figure 4.7A) ligand may be a gas (such as nitric oxide),
or be anchored in tlle plasma membrane ofthe transmitting cell (Figure 4.7C) or a small protein or steroid that ca n freely
and so binds a receptor spanning the plasma membrane of the responding cell. pass through membranes. The intracellular
Most vertebrate cell signaling pathways involve the migration of signaling mole receptor is often in the nucleus but may be
in the cytoplasm, as in the example shown
cules that then bind to receptors on the surface of responding cells. Tallie 4.3
in Figure 4.9. (C) Ligand and cell surface
provides examples of the different cell signaling systems in vertebrates, and we receptor anchored in the plasma membrane
consider detatls of some mechanisms in the sections below. of adjacent cells; see the interaction between
The endpoint of most cell Signaling is altered gene expression, producing the Fas receptor and its ligand in Figure 4.16
behavioral changes in the responding cells. The altered gene expression is usuaUy for an example.
PRINCIPLES OF CELL SIGNAliNG 103
TABLE 4.3IMPORTA!!'" CLASSES OF VERTEBRATE CELL SIGNALING MOLECULES AND THEIR RECEPTORS 9
Signaling molecule Receptor Examples/comments
5mallsignaling molecules that cross the cell Internal receptors See Agure4.7A
membrane
Steroid hormones, retinoids can reside in cytoplasm or nucleus; are converted signaling using nuclear hormone receptors
to transcription factors when bound by their (Fig ures 4.8 and 4.9)
ligand
Signaling molecules that migrate to reach Cell surface receptors See Figure 4.78
cell surface
Some growth factors (FGFs, EGFs), ephrins, receptors with intrinsic kinase activity many; FGF, EGF, and ephrin signaling are
and some hormones (insulin) widely used during embryonic development
Cytokines (lnterleukins, interferons, etc.); recepto rs with associated tyrosine kinase activity; many
some hormones and growth factors (growth internal domains have associated JAK protein
hormone, prolactin, erythropoietin, etc,) (Fig ure 4.10)
Various hormones (epinephrine, serotonin, G-protein-coupled recepto rs (Figure 4. J 1); they signaling involving olfactory receptors, taste
gl ucagon, FSH, etc.), histamine, opioids, span the cell membrane seven times and their receptors, rhodopsin receptors, etc.
neurokinins, etc. internal domains have associated GTP-binding
proteins
TGF-J! family receptors w ith associated serine/threonine kinase many examples in signaling during embryonic
activity, internal domain associated with SMAD development
proteins
Signaling molecules immobilized on cell Cell surface receptors See Figure 4.7C
surface
EGF, epidermal growth factor; FGF, fibrob last growth fa ctor; FSH, follicle·stimulating hormone; TGF-J!, transforming growth factor·~.
Genes are regulated by a variety of protein transcription factors that The helix-loop-helix (HlH) motif also consists of two a-helices,
recognize and bind a short nucleotide sequence in DNA. Eukaryotic but this time connected by a flexible loop that, unlike the short turn
transc ription factors generally have two distinct functions located in in the HTH motif, is flexible enough to permit folding back so that the
different parts of the protein: two helices can pack against each other (that is, the two helices lie in
a DNA-binding domain that allows the transcription factor to planes that are parallel to each other). The HLH motif mediates both
bind to a specific sequence element in a target gene; DNA bind ing and protein dimer formation. Heterodimers comprising
an activation domain that stimulates transcription of the target a full-length HLH protein and a truncated HLH protein that lacks the
gene, probably by interacting with basal transcription factors in full length of the O'.-helix necessary to bind to the DNA are unable to
the transcription complex on the promoter. bind DNA tightly. As a result, HLH heterodimers are thought to act as
Some proteins, such as steroid hormone receptors, have only a a control mechanism, by enabling the inactivation of specific gene
DNA-binding domain but, after binding a ligand, they can cooperate regulatory proteins.
with proteins called co-activators to perform the activities of a The leucine zipper is a helical stretch of amino acids rich in
transcription factor. See the example of the glucocorticoid receptor in hydrophobic leucine residues, aligned on one side of the helix. These
Figure 4.9. hydrophobic patches allow two individual a-helical monomers to join
Common DNA-binding motifs in transcription factors together over a short distance to form a coiled coil. Beyond th is reg ion,
Several common protein structural motifs have been identified, most the two a -helices separate, so that the overall dimer is a V-shaped
of which use a-helices (or occasionally p-sheets) to bind to the major structure. The dimer is thought to grip the double helix much like
groove of DNA. Although such structural motifs provide the bas is a clothes peg grips a clothes line. Leucine zipper proteins normally
for DNA binding, the precise sequence of the DNA-binding domain form homodimers but can occasionally form heterodimers. The latter
determines the DNA sequence-specific recognition. Most transcription provides an important combinatorial control mechanism in gene
factors bind to DNA as dimers (often homodimers), and the DNA regulation.
2
binding region is often distinct from the region specifying dimer The zinc finger motif involves the binding of a Zn + ion by four
formation. conserved amino acids (norma!iy either histidine or cysteine) so as to
The helix-tum-helix (HTH) motif (Flgu.r. , ) is a common form a loop (finger), which is often tandemly repeated. The so-called
motif found in transcription factors. It cons ists of two short a.-helices OH2 (Cys2/His2) zinc finger typically comprises about 23 amino acids,
separated by a short amino acid sequence that induces a turn, so that with neighboring fingers separated by a stretch of about seven or
the two a-helices are orientated in different planes. Structura l studies eight amino acids. The structure of a zinc finger may consist of an
have suggested that the C-termina l helix (shown to the right) acts as a.-helix and a p-sheet held together by coordination with the
a specific recognition helix because it fits into the major groove of the Zn 2+ ion, or of two a-helices, as shown in Figure 1. In either case,
DNA controll ing the precise DNA sequence that is recognized. the primary contact with the DNA is made by an O'.-helix binding to
the major groove.
HTH motif HTH motif binding to DNA HLH monomer HLH dimer binding to DNA
loop
a-helix
eOOH
l eu l L
LL
II
leu II
((-helix
Leu
leu
FJgllN t Structural motifs commonly found in transcription factors and DNA-binding proteins. HTH, hel ix- tum-helix motif; HLH, helix-Ioop
helix motif.
PRINCIPLES OF CELL SI GNALING 105
(A) nuclear receptor (8) response element Figure 4.8 The nuclear receptor
1 777 superfamily. (A) Members of the nuclear
GR Iill!lJ 4:li ' TCT
;'G~.. nnn
nnn I~TT~i
AA ·, receptor superfamily all have a similar
structure, with a central DN A-binding
1 553
domain (D80) and a ( -terminal ligand
ER Em i:l, ~ggI~~~ g 'i~¥aM binding domai n (LBO). Numbers refer to
1 432 the protein size in amino acid residues.
},SG~n n n nn
RAR m!iI : ':Ii r ell , nnnnn ~8~8~~ GR, glucocorticoid recepto r; ER, estrogen
receptor; RAR, retinoic acid recep tor; TA,
1 408
~G~,!GACrr thyroxine recepto r; VDR, vita min 0 receptor.
TR IiliD ':!i .... CAG . ACT Ga.;
(8) The response elements recognized by
1 427 the nuclear receptors also have a conserved
YDR IIill!lI "II ~g~lIggg ~~~r~~ struc ture, with two hexanudeotide
recogn ition sequ ences typically separated
by either three o r five nucleotides. The
hexanuc\eotide sequences have the gen eral
The receptors for steroid hormones and also those for the signaling molecules consensus of AGNNCA with the two central
thyroxine and retinoic acid belong to a common nuclear receptor superfamily. nucleotides (pale shading) co nferring
Each receptor in this superfamily contains a centrally located DNA-binding specificity and belonging to one of three
domain of about 68 amino acids, and a ligand-binding domain of about 240 classes; AA, AC. o r GT,
amino acids located close to the C terminus (FIgure 4.8A.). The DNA-binding
domain contains structural motifs known as zinc fingers (see Box 4.2) a nd binds
as a dimer, with each lnonomer recognizing one of wo hexanucleotides in the
response element. The two h exanucleotides are either inverted repeats or direct
repeats that are usually separated by three or five nucleotides (Figure 4.8B).
The nuclear hormone receptors are normally found in the cytoplasm in an
inactive state. Either the ligand-binding domain directly represses the DNA
binding domain or the receptor is bound to an inhibitory protein , as in the gluco
corticoid receptor [Agure 4.9). Upon ligand binding, the inhibition is relieved
.-GC
and the activated ligand-recep to r complex migrates to the nucleus.
GCR
Signaling through cell surface receptors often involves kinase cascades
The plasma membranes of a nimal cells positively bristle with transmembrane
M
• to+
receptors for signaling molecules, and the primary cilium, which protrudes into
th e external environment, is studded with such receptors. This previously myste J.
rious organelle is now thought to have a major role as a sensor, using the multiple
different receptors to sense, and respond to, alterations in the extracellular
....
,
-,"
-
environment. (1 -V
When a receptor binds its signaling molecule on the external surface of the 'CJ:)
cell, a conformational change is induced in its internal (cytoplasmic) domain.
This change alters the properties of the receptor, perhaps affecting its contact
with another protein, or stimulating some latent enzyme activity. For many sig
r.~~
r .·), I
'-J\.._
activation
Hsp90
naling receptors, either the receptor or a n associated protein has integral kinase
activity. The activated kinase often caUSes the receptor to phosphorylate itself, ,I.',
_ 'r" ~ , , __
which then allows it to phosphorylate other proteins inside the cell, thereby acti
vating them. These target protein s are themselves often kinases. They, in turn, -- \....-<i
"'111"
can phosp horylate and thereby activate proteins further down a signal transduc dimeriZ3tlon and
I
tion pathway, resulting in a kinase cascade. translocation to
Eventually, a transcription factor is activa ted (or inhibited as appropriate), nucleus
resulting in a change in gene expression. The length of the signaling cascade can
be short (as in the cytokine-regulated TAlC-STAT pathway; See FIgure 4. 10) or nucleus
have many steps (such as the MAP kinase pathway) . Pathways in which receptors
do not have ki nase activity often have several downstream components with
either kin ase or phosphorylase activity.
e· T· '}
'"
.("
I j t ~ \
Flgure 4.9 Cell signaling by ligand activation of an intracellular receptor. target genes
A glucoco rticoid (Gel, like other hydrophobk hormones, can pass through
the plasma membrane and bind to a specific intracellular receptor. The
glucocorticoid receptor (GCR) is normaJly bound to an Hsp90 inhi bito ry protein
complex and is found w ithin the cytopla sm, After binding to glucocorticoid.
howeve r, the inhi bitor complex is released and the now act ivated receptor
I~ta':et l
forms dimers and trans locates to the nucleus. Here, it works as a transcription
factor by specifically binding to a particula r response eJemenr sequence (lower
panel; see Figure 4.8B) in target genes at. and with the cooperation of, specific
. gene
co-activator p roteins. activating the target genes.
106 Chapter 4: Cells and Cell-Cell Communication
•
prolactin prolactin receptors
Figure 4.10 ligand activation of a
receptors as dimerize
plasma membrane receptor. JAK- STAT
monomers
signaling involves JAK kinases that are
STATS
=:J=
/,r'
of the receptors activa tes a dormant
receptor kinase activity that then ta rgets
a specific STAT protein, in th is case STATS.
phosphorylated d1meriz8tion and
l
Once activated by phosphorylation, STATS
translocation
dim erizes, translocates to the nucleus, and
activates th e transcriptiOn of various target
genes.
/'
I (-'.r L I
Lll_J
lr"~et1
gene
J_ __Jr
I
Signal transduction pathways often use small intermediate
intracellular signaling molecules
Signal transduction pathways initiated by the binding of a signaling molecule to
a cell surface recepto r are often complex. In addition to the kinases and other
enzymes mentioned above, the cascade of interacting molecules involved in
communicating the signal within the cell can include various small, diffusible
intracellular signaling molecules that act as intermediates in signal transduction.
These are known as second messengers (Table 4.4), with the first messenger
being the extracellular signaling molecule.
G proteins are one common type of second messenger, which bind to
G-p rotei n-col1pled receptors (GPeRs). Mammalian genomes typically have more
than 1000 genes that encode GPCRs, many of them serving as receptors for pros
tagland ins and related lipids, vario us neurotransmitters, neuropeptides, and
peptide hormones; others are responsible for relaying the sensations of sight,
smell, and taste. GPCRs characteristically have a transmembrane domain that
passes through the plasma membrane seven times and a cytoplasmic domain
that is bound to a G protein, a GTP- binding protein with three SUbunits-a, ~,
and y (Figure 4.11 ).
Hydrophilic Cyclic AMP (cAMP) prod uced from ATP by adenylate cyclase effects are usually mediated through protein kinase
(cytosolic)
Cyclic GMP (cGMP) produced from GTP by guanylate cyclase besHha racterized role is in visua l reception in the
vertebrate eye
HydrophobiC phospha tidyli nositol PIP2 is a phospholi pid that is enriched IP J binds to receptors in the endo~lasmic reticulum,
(membrane 4,5-bisphosphare (P1P21 at the plasma membrane and can be thereby causing the release of Ca + into th e cytosol;
associated) and its cleavage products; cleaved to generate DAG on the inner Ca 2+-dependent protein kinase C is thereby recruited
diacylglycerol (DAG); inositol layer of th e plasma membrane, releasing to the plasma membrane (see Figure 4.12)
trisphosphate (lp)) [P3 into the cytoplasm
PRINCIPLES OF CELL SIGNALING 107
I t
Binding Of ligand (ll to fue extracellular
domain of a GPCR activates its cytoplasmic
do main and causes exchange of GTP for GOP
@> GI) on the G protein . As a result. the Ga subunit
activateCI activated is activated and dissociates from the GPy
GCI subunit GPrsubunil
dimer. which in turn becomes ac tiva ted.
Different subtypes of G-protein subunits ca n
In its inactive form, the G protein has GDP bound to its a subunit. On binding perform different functions
of a ligand to the GPCR, the G protein is stimulated and the a subunit releases (see Figure 4. 12).
GDP and binds GTP instead. Binding of GTP causes the activated a subunit to
dissociate fro m the ~y dimer, which is then activated. Each of the activated a and
~y units can then interact with proteins downstream in the signal transduction
pathway.
There are several different types of G protein, each of which associates with
different seco nd messengers. Some G protein s stimulate or inhibit the
membrane-bound enzyme adenylate cyclase, causing a change in intracellular
cAMP levels. Other G proteins stimulate the production oflipids, such as inositol
I,4,5-trisphosphate and diacylglycerol, and the release of calcium ions (Figure
4.12). In turn, the second messengers activate downstream protein kinases such
as cAMP- dependent protein kinase A and calcium-dependent protein kinase C,
which go on to phosphorylate (and change the activity of) particular transcrip
tion factors.
There is extensive crosstalk between different Signaling pathways. At any
moment, the response given by a particular cell depends on the sum of all signals
that it receives and the nature of the receptors available to it.
+.· ~)•....
endoplasmic reticulum. With the help of
• •· caf • •• Ca 2+, the membrane-bou nd DAG causes the
~
2 activation of protein kinase C, which is then
• to • : ••
•
Ca2-+relea sed
recruited to the pl asma membrane, where
it phosphorylates target proteins that differ
into cyto sol according to ce ll type.
108 Chapter 4: Cells and Cell-Cell Communication
that spreads as a traveling wave along the plasma membrane of the axon. The synaptic vesicle
containing
change in electrical potential at the axon terminus causes a neurotransmitter neurotransmitters
to be released that diffuses across the synaptic cleft to bind to receptors on \
dendrites of interconnected neurons, causing local depolarization of the >~
plasma membrane (see Figure 4.13). In a similar fashion, neurons also trans D't"
• .,L ____ neurotransmitter
mit signals to muscles (at neuromuscularjunctions) and glands. • . 1::!r""" receptor
Electrical synapses are less common. Here, the gap between the two connect ":Ii,
ing neurons is very small, only about 3.5 nm, and is known as a gap junction.
Neurotransmitters are not involved; instead, gap junction channels allow ions
axon terminus -, I l dendrite
synaptiC
to cross from the cytoplasm of one neuron into another, causing rapid depo cleft
larization of the membrane.
Figw. 4.13 Synaptic signaling.
At chemical synapses, a depolarized axon
4.4 CELL PROLIFERATION, SENESCENCE, AND terminu s of a transmitting (presyna ptic) cell
releases a neurotransmitter that is normally
PROGRAMMED CELL DEATH stored in vesicles. The release occurs by
exocytosis: the vesicles containing the
Cells are formed usually by cell division. Cell division may be common in certain neurotransmitte rs fuse with the plasma
tissues but especially so in early development, when rapid cell proliferatio n membrane, releaSing their contents into the
underlies the growth of multicellular organisms and the progression toward narrow (about 20-40 nm) synaptic cleft. The
maturity. However, throughout development, cell death is common, and by the neurotransmitters then bind to transmitter
time of maturity an equilibrium is reached between cell proliferation and cell gated ion channel receptors on the surface
death. of the dendrites of a communicating neuron,
Although some cell loss is accidental and is the result of injury or disease, causi ng local depolarization of the plasma
planned or programmed cell death is very common and is functionally impor membrane.
tant. Like the organisms that contain them, cells age, and th e aging proce ss-cell
senescenc1>-is related to various factors, including the frequency of cell division.
Most of the cells in mature animals are non -dividing cells. but
some tissues and cells turn over rapidly
The number of cells in a multicellular organism is determined by the balance
between the rates of cell proliferation (which depends in turn on continued cell
division) and cell death. Tracking the birth and death of mammalian cells in vivo
is problematic. Most of our knowledge about rates of mammalian cell prolifera
tion has therefore come from cultured cells, where, under normal circumstances,
one turn of the mammalian cell cycle lasts approximately 20-30 hou rs.
M phase (mitosis and cytokinesis) lasts only about I hour, and cells spend the
great majority of their time (and do most of their work) in interphase. Interphase
comprises three cell cycle phases: 5 phase (DNA synthesis) and two intervening,
or gap, phases that separate it from M phase: G1 phase (cell growth, centrosome
duplication, and so on) and G2 phase (preparation offactors needed for mitosis).
Of the two gap phases, Gj is particularly important and its length can vary greatly,
under the control of several regulatory factors. Furthermore, when the supply of
nutrients is poor, progress through the Gj phase of rhe cycle may be delayed.
If the cells receive an antiproliferative stimulus they may exit from the cell
cycle altogether to enter a modified Gj phase called Go phase (Figure4,HA ). Cells
in Go phase are in a prolonged non-diviiling state. But they are not dormant: they
can become terminally differentiated; that is, irreversibly committed to serve a
specialized function. Most cells in the body are in this state, but they often actively
synthesize and secrete proteins and may be highly motile.
Go cells can also continue to grow. For example, afrer withdrawing from the
cell cycle, neurons become progressively larger as they project long axons that
continue to lengthen until growth stops at maturity. For some neurons the cyto
plasm-nucleus ratio increases more than lOO,OOO-fold during this period. Some
Go cells do not become terminally differentiated but are quiescent. In response to
certain external stimuli they can rejoin the cell cycle and start dividing again, to
replace cells lost through accidental cell death or tissue injury.
Mature multicellular animals do contain some dividing cells that are needed
to replace cells that naturally undergo a high turnover. Sperm cells are continu
ously being manufactured in mammals. There is also a high turnover of blood
cells, and gut and skin epithelial cells are highly proliferative to compensate for
the continuous shedding of cells from these organs. Even the adult mammalian
CELL PROLIFERATION, SENESCENCE, AND PROGRAMMED CELL DEATH 109
G,
\ metaphase to
division. Cells in Gl ph ase may enter either
a quiescent Go pha se, from whic h they
anaphase transition ca n re-enter the cell cycle, or a terminally
differentiated state. (B) Each che<kpoint is
are aU chromosomes
attached to spindle? G, regulated by a spedfic eye lin-dependent
kinase (Cdk) bound to a cyelin protein.
G:;/ M checkpoint Cyelins are synthesized and degraded at
specific times within the cell cyc le, limiting
is all DNA replicated?
Is environmen t favorable? the availability of each Cdk- cyc lin com plex.
q=
(S)
S<:yclin (A)
~
M<YCS7/.S\
s G, M r G1
... ~.
M·Cdk
$-Cdk
-+ -+ ~ Cdk ~
Cdk S
~ M-Cdk
S-Cdk
mactive G1/$.Cdk inactive
S-Cdk M-Cdk
mitogen
nucle us
Myc gene
-- Flgu,. ~ . 1.!i Mitogens promote cell
proliferation through MAP kinase
pathways. Mitogens bind to tyrosine
receptor kinases, causing the monomers
to dimerize and transphosphorylate each
other. The activated receptor relays the
signal through an accessory protein, leading
ultimately to activation of a MAP kinase and
subsequently activation of transcription of
target genes, such as the gene encoding
the Myc transcription factor. Myc in turn
activates various genes, including some
that lead to increased G,-Cdk activity,
which in turn causes phosphorylation
of the retinoblastoma protein Rb 1.
Un phosphorylated Rb1 normally binds the
transcription factor E2F and keeps it in an
inactive state, but phosphorylation of Rbl
causes a conformational change so that it
releases E2F. The activated E2F transcription
factor then activates the transcription of
activation of genes genes that promote 5 phase, notably the
promoting S phase gene encoding cyciin A. Black arrows signify
transcriptional activation.
CELL PROLIFERATION, SENESCENCE, AND PROGRAMMED CEll DEATH 111
cell breaks apart into several membrane-lined vesicles called apoptotic bodies
that are phagocytosed_
Another form of PCD is autophagy, a catabolic process in wh ich the cell's
components are degraded as a way of coping with adverse conditions such as
nutrient starvation or infection by certain intracellular pathogens. Intracellular
double-membrane structures engulf large sections of the cytop lasm and fuse
with lysosomes, thereby targeting the enclose d proteins and organelles for degra
darion. Taken to an extreme, autophagy can lead to cell death. Other forms of
PCD are known but are poorly characterized.
Killing of defective cells du ring removal of defective immature natural mistakes occur in the DNA rearrangements that give T and B
development lymphocytes lymphocytes their specific receptors; up to 95% of immature T cells have
defective receptors and die by apoptosis
Killing of harmful cells removal of harmful T lymphocytes some immature T cellscontain receptors that recogn ize self antigens instead
of foreign antigens and need to be eliminated by apoptosis before they
cause normal host cells to die
removal of viTally infected and a class ofT lymphocytes known as killerT cells is responsible for inducing
tumor cells apoptosis in host cells that express viral antigens or altered antigens/
neoa ntigens produced by tumor cells
remova l of cells with damaged celis with damaged DNA tend to accumulate harmful mutations and are
DNA potentially harmful; DNA damage often induces celi suicide pathways
Ki lling of excess, obsolete, or removal of surplus neurons during in the embryo, many more neurons are produced than are needed; those
unnecessary celis development neurons that fail to make the right connections, for example to muscle celis,
undergo apoptosis
removal of interdigital cells duri ng development, fingers and toes are sculpted from spade-like
hand plates and footplates by causing apoptosis of the unnecessary
interdigital cells (see Figure 5.7}
CEll PROLIFERATION, SENESCENCE, AND PROGRAMMED CEll DEATH 113
key immune system cells, allowing disease progression toward AIDS. Many can
cer cells have devised strategies to oppose the immunosurveillance systems that
normally induce apoptosis of cancer cells. Successful chemotherapy often relies
on using chemicals that help induce cancer cells to apoptose. Other forms of
PCD are important in neurodegenerative disease, such as in Huntington disease
and Alzheimer disease, as well as in myocardial infarction and in stroke, in which
secondary PCD caused by oxygen deprivation in the area surrounding the ini
tially affected cells greatly increases the size of the affected area.
and can undergo auto activation. Their job is to initiate cell death pathways
after they have been activated by signals transmitted through cell surface
/ chemical or ionizing
radiation, etc.
apoptosis pathways involve the activation of
a cell surface death receptor by a ligand On
a neighboring cell. The Fas death receptor is
normally found as a monomer but is induced
to form a trimer by its trimeric ligand, FasL.
.- FasL ulmeric ligand
The resulting clustering of Fas receptors
extrinsic
pathway
r ~' ji'1;-- recruits the FADD adapter protein. FADD
acts as a scaffold to recruit procaspase 8,
intrin sic
'~ • ~ ~ ~~
components are damaged or stressed, for
)
..
cytochmme c \---. •
••:.~j~:':::----
....
- \
mitochondrlon
example in response to harmful radiation,
chemicals, or hypoxia . These pathways
are activated from within cells by using
mitochondrial and endoplasmic reticulum
n n ,r-- ~ 'pa'1
components. In the mitochondrial pathways,
pro-apoptosis factors, such as 8ax shown
U--~ procaspase 9 here, form oligomers in the mitochondrial
outer membrane, forming pores that allow
~ the release of cytochrome c. ln the cytosol.
initiator
caspases lnil
inJ caspase 8 [Y}- caspase 9
cytochrome c binds and activates the Apafl
protein and induces the forma tion of an
apoptosome that activates procaspase 9
and ultimately the same effeCior caspases
I
~ 3,6,7
p,ocaspaS'JS
(caspase 3, caspase 6, and caspase 7) as the
death receptor pathways.
~"U
effector
caspases
'~
caspases
3,6.7
Totipotent the capacity to differentiate into all possible cells af the the zygote and its immed iate descendants
organism, including the extra-embryonic membranes
Pluri potent can give rise to all of the cells of the embryo, and therefore of cells within the in ner cell mass of a blastocyst (Figure 4.1 9A)
a whole animal, but are no long er capable of giving ri se to the
extra-embryonic structures
Multipotent can give rise to multiple li neages but has lost the ability to give hematopoietic stem cell (Fig ure 4.17)
ris e to al l body cells
Oligopotent can give rise to a few types of differentiated ce ll neural stem cell that can create a subset of neu ron s in brain
Unipotent can give ri se to one type of differentiated celt only spermatogonial stem cell
[n adults, tissue stem cells are typically rare comp onents of the tissue that
dley renew, and the lack of convincing stem cell-specific markers makes dlem
generally difficult to purify. The ones dlat we know most about in mammals come
from easily accessible tissues with rapidly proliferating cells. Two of the best
characterized systems are blood and epidermis, both of which are known to be
maintained by a single type of progenitor cell. Blood cells are initially made in
certain embryonic structures before the fetal liver takes over production. From
fetal week 20 onward, blood cells originate from dle bone marrOw where two
types of stem cell have long been known:
• Hematopoietic stem cells (HSCs). HSCs are anchored to fibroblast-like osteo
blasts ofdle spongy inner marrow ofdle long bones. Grafting studies in mice
provide especially persuasive evidence that a single type of HSC is responsible
for making all the different blood cells. When a mouse that has been irradi
ated, to ensure destruction of the bone marrow, receives a graft of purified
HSCs from another mouse strain, the incoming cells are able to differentiate
and re-populate the blood. Only about 1 in 10,000 to 1 in 15,000 bone marrow
ceUs is an HSC; the frequency in blood is about 1 in 100,000. In comparison
widl adler stem cells, however, HSCs have been reasonably highly pllJified.
They are multipotent, but dle cell lineages to which they give rise become
increasingly specialized, ultimately producing all of the terminally differenti
ated blood cells (Figure 4.17) .
• Mesenchymal stem cells (MSCs). MSCs are primitive connective tissue stem
cells. Bone marrow MSCs (also known as marrow stromal cells) are poorly
defined, and so are heterogeneous. Unlike HSCs, dley are not constandy self
renewing but are relatively long-lived. They are multipotent and give rise to a
variety of cell types, including bone, cartilage, fat, and fibrous connective
tissue. Figure 4.1 1 A single type of self-renewing
MSCs are also found in umbilical cord blood, where dley seem to have a wider multipotent hematopoietic stem cell gives
differentiation potential dlan in the bone marrow. [n addition, MSCs are known rise to all blood cell lineages.
hematopoietic
stem cell
lj
1 ,/.1. 1@1
U I~
1 ,j.
B lymphocyte
.1.
reticulocyte
1
-
megakaryocyte
4-4-H4-4-4
monocyte neutrophil eOSinophil
C)
plasma T-ce ll subclasses NKcelis
erythrocyte
platelets
~
dendritiC macrophage basophil
cell cells
lymphocytes granulocytes
SUM CELLS AND DIFFERENTIATION 117
(8 )
,~""'!..
-'-1 ~ r-"
..
] epidermis
,
i
Intestinal
slem cells
Paneth
cells
to occur in a wide variety of non-hematopoietic tissues. Recent evidence sug Figure •. 18 Locations of skin, hair follicle,
gests that the MSCs in diverse tissue types may originate from progenitor cells and intestinal stem cells. (A) location of
found in the walls of blood vessels. epidermal stem cells. Epidermal stem cells
can be found in three types of niche located
Stem cell niches in the bulge region of the hair follicle,
the ba sa l layer of the epidermis, and the
A variety of other adult tissues are also known to contain stem cells, including sebaceou s gland. Stem cells in the epidermal
brain, skin, intestine, and skeletal muscle. Germ ceUs are sustained by dedicated basal layer give rise to proliferating
male and female germ-line stem ceUs. Such stem cells exist in special supportive progenitor cells known as rransit amplifying
microenvironments, known as stem cell niches, where they are kept active and cells, which in turn give rise to increasingl y
alive by substances secreted by neighboring cells. differentiated upper cell layers. (B) Location
The location of some stem ceUs has been studied, notably in model organ of intestinal stem celis. Intestinal stem
isms. Epidermal stem cells are known to occur in three locations (Figure 4_18A), cells are loca ted within narrow bands in
of which those in the bulge region of th e hair follicles give rise to both the hair intestinal crypts, They move downward
follicle and the epidermis. The stem ceUs th at will form epidermis give rise to to differentiate into lysozyme-secreting
Paneth cells, o r u pwa rd to generate tran sit
keratinocytes that progressively differentiate as they move toward upper layers.
amplifyin g (TA) cells. The nearby intestinal
In the epithelium of the smaU intestine, stem cells are located near the base of villi are regenerated every 3-5 days from the
Clypts, which are small pits interspersed between projec tions called villi (Figure cryptTA celis and consist o f three types of
4.18B). differentiated cell : mucus-secreting goblet
One characteristic ofstem cells that helps in identifying where they are located cells, absorptive enterocytes (by far the
is that they divide only rarely. They do not undergo cell division very frequently, most prevalent villu s cell), and occasi onal
so as to conserve their proliferative potential and to minimize DNA replication hormone-secreting enteroendocrine cell s.
errors. As a resul t, differentiation pathways generally begin with stem ceUs giving [(A) courtesy of Th e Graduate School of
rise to proliferating progenitor ceUs known as transit amplifylng (TA) cells. Biomedical SCiences, University of Medicine
and Dentistry of New Jersey. (B) adapted
Unlike stem cells, TA cells are committed to differe ntiation and divid e a finite
from Blanpain (, HorsleyV, & Fuchs E (2 007)
number oftimes until they become differentiated.
Cell, 128, 445-458. With permission from
Stem cell renewal versus differentiation Elsevier.]
Some progress has been made in identi!')~ng factors that are important in stem
cell renewal and differentiation potential, but much of the detail has yet to be
wo rked out. Stem cells can in principle renew themselves by normal (symmetric)
cell divisions. To produce cells that are more specialized than they are, and so
launch differentiation pathways, stem cells need to undergo asymmetric cell
divisions (Box 4.3). Initially, the stem cell gives rise to a cell that is committed to
a different fate and that typically gives rise in turn to a series of progressively
more specialized cells.
o
Cell division is described as asymmetric (A) (8) (C,
when two daughter cells differ in
their behaviors and/or fates. The cell
division may be intrinsically symmetric
A
I different U ~ , 0
to produce two identical da ughter mlc;oenvironment
1 A 1
•A
cells that nevertheless are exposed to ,
different m icroenvironments, causing
them to become different (Figure lA).
Alternatively, the cell division m ay be
intrinsically asymmetric to produce two
o o c() 0 ~
different daughters. The asymmetry
1
• •
may be in troduced by positioning the
spin dle toward one end of the cell so
that cell division gives rise to one large
and one small daughter cell (Figure 1B). flgur.' Origins of cell asymmetry. (A) Exposure to different microenvironments can cause origina lly
Alternatively, asymmetric localization identical daughter cells to become different. (8) Eccentric positioning of the mitotic spindle results in
of cell components or upstream cell one large and one small daughter cell. (C) Asymmetric localization of cell products such as of regu latory
polarity regulators dictate how cells proteins (shown in small green filled circles) can resu lt in asymmetric daugh ter cells. As show n on the
develop. Th e spindle is deliberately rig ht, this ca n happen after asymmetric localization of upS(ream polarity regulators (pink sem icircles).
oriented so that the daughter cells will
differ in their content of cell fa te determinants (Figu re 1C).
Meiotic divisions can be intrinsically asymmetric. The two are needed in early development or in response to depl etion through
successive meiotic divisio ns in female mammals each result in one injury, disease, or interve ntions such as chemotherapy, and there is
small cell (a polar body) and one large cell (ultimately to become the evidence that stem ce ll s proliferate rapidly in vivo by symmetric division.
egg). The large size difference ensures that most of the maternal stores Some mammalian stem cells seem to be programmed to switch
are retained within the egg, and is due to positioning of the spindle between symmetric and asymmetric cell divisions during development.
away f rom the center of the cell. For examp le, both neural and epidermal progenitors unde rgo
Various mitotic d ivisions are intrinsically asymmetric. In mammals, symmetric division to expand stem cell numbers during embryonic
the zygote undergoes success ive symmetric divisions, but later in development. The daug hter cells are in t he same microenvironment as
development and adulthood variou s asymmetric cell divisions are the origi nal parental ce ll (Figure 28). In mid to late gestation, however,
important in a va riety of circumstances, for example in the adaptive asymmetric division is predom inant. In both cases, thiS seems to be
immune system to ensure the balanced production of effector and achieved by reorienting the mitotic spindle through 90° so that the
memory T cells. two identical daughters now receive different signa ls. One of them
retains the position of the parenta l stem cell in the stem cell niche and
Asymmetric and symmetric stem cell division so rece ives signals to ma intain stem cell identity by direct contact wit h
Asymmetric cell division is often undertaken by stem cells, which need signa li ng cells. The other daughter cell is geographically remote, does
to both self-renew and give rise to differentiated cells. Each stem cell not receive the appropriate signals, and differentiates (ora nge cells in
could achieve this econom ically by an asym metric division to produce Figure 2C).
one daughter cell identica l to the parent and one that is committed Cultured embryon ic stem cells rely on symmetric divisions so that
to a differentiation path (Agure2A). However, t his mechanism alone the daughter cells have the same potential to differentiate, dependi ng
would be inefficient at generating the large number of stem cells that on the experimental circumstances.
(B) (C)
epithelial ce li s
A..
.tE~
i
""'~
basal layer
! 1
Flgur.l Asymmetric and symmetric stem cell divisions. (A) Asymmetric division of stem cells.
(8) Symmetric division of stem cells. (C) Asymmetric d ivision o f stem cells produced by reorienting the
differentiation spindle so t hat one of the two daughter cells is in a different microenvironment.
shown to be able to give rise to healthy adult mice afrer being injected into blas ,(10;..1_ _ _ _ _ _ _ _ _ _ _- .
tocYSts from a different mouse suain that were then transferred into the oviduct r •
of a foster mother (the technology and some applications are described in
Chapter 20).
Human ES cells were first cultured in the late 1990s. They have traditionally
been derived from embryos that develop from eggs fertilized in vitro in an IVF (ill
vitro fertilization) clinic. The IVF procedure is intended to help couples who have
difficulty in conceiving children; it normally results in an excess of fertilized eggs
that may then be donated for research purposes.
ES cells can be established in culture by transferring ICM cells into a dish con
taining culrure medium. Culture dishes have traditionally been coated on their
inner surfaces with a layer of feeder cells (such as embryonic fibrobl as ts). The
overlyi ng ICM cells attach to the sticky feeder cells and are stimulated to grow by
nutrients released into the medium from the feeder cells (Figure 4.196). Afterthe
[CM cells have proliferated, they are gently removed and subcultured by re- IBI
plating into several fresh culture dishes. After 6 months or so of repeated subcul
turing, the original small number of[CM cells can yield large numbers ofES cells.
ES cells that have proliferated in cell culture for 6 months or more without dif
ferentiating, and that are demonstrably pluripotent while appearing genetically
normal, are referred to as an embryollic stem (ESI eel/line.
Pluripotency tests
To test whether ES cells are likely to be pluripotent, their differentiation potential
is initially analyzed in culture. When cultured so as to favor differentiation, they
first adhere to each other to form cell clumps known as embryoid bodies, then
differentiate spontaneously to form cells of different types, but in a rather unpre
dictable fashion. By manipulating the growth media, they can also be tested for Figur. 4.19 Origins and potential of
embryonic stem cells_ (A) A 6-day-old
their ability to differentiate along particular pathways.
human blastocyst. The blastocyst is a nuid·
A more rigorous pi uri potency test involves assaying the ability to differentiate
filled hollow ball of cells containing abou t
ill vivo. ES cells are injected into an immunosuppressed mouse to test for their lOO cells. Most cells are located on the
ability to form a benign tumor known as a teratoma. Naturally occurring and outside and will form t rophoblast. The
experimentally induced teratomas typically contain a mixture of many differenti cells in the interior, the inner cell mass ({eM)
ated or partly differentiated cell types from all three germ layers. If ES cells can will form all the cells of the embryo.
form a teratoma they are considered to be pluripotent. (8) Cultured human ES (embryonic stem)
cell s. Cells surgically removed from the
Embryonic germ cells ICM can be used to g ive rise to a cultu red
An alternative strategy for establishing pluripotent embryonic cell lines has human ES cell line. Human ES cell lines
were traditionally established by growing
involved isolating primordial germ celis, the cells of the gonadal ridge that will
the human ICM cells on a supporting layer
normally develop into mature gametes. When cultured ill vitro primordial germ
of mouse embryonic fibrobl ast feeder cells.
cells give rise to pluripotent embryonic germ (EG) cells. Human EG cells, derived Human ES cells fill most of the image shown
from primordial germ cells or embryos and fetuses from 5 to 10 weeks old, were here, and the feeder cells are located at the
first cultured in the late 1990s. extremities. (Courtesy of M Herbert (A) &
EG cells behave in a very similar fashion to ES cells and share many marke r M l ako (8), Newcastle University.)
proteins that are believed to be important in ES cell pluripotency and self
renewal. Currently, however, the molecular basis of plunpotency and of the
capacity for self-renewal is imperfectly unde rstood.
The exact relationship between cultured ES cells and pluripotent [CM cells is
also not understood. [CM cells are no t stem cells: they persist for only a limited
number of cell divisions in the intact embryo before differentiating into other cell
types with more restricted developmental potential . Even at the earliest stages
ICM cells are hete rogeneous, and some evidence suggests that culrured ES cells
more closely resemble primitive ectoderm cells, whereas other evidence suggests
that ES cells are of germ cell origin.
B cells express immuno· originate in bone marrow and humoral (antibody' immunity; response to foreign
globulins (lgs) and when mature migrate to blood antigens; important antigen-presenting cells
class II major histo and lymphoid organs
compatibility complex
(MHC) proteins
Plasma cell secrete IgM, IgG, IgA, or prod uced by antigen -stimu lated secrete solu ble antibodies
IgE antibodies B cells in lymph nodes
Memory B cells clonally expanded a-cell produced by ant igen-stimulated respond in secondary immune responses
populations B cells in lymph nodes
T cells express T-cell receptors originate in bone marrow, cell-mediated immunity; important in antiviral and
ITCRs) carried by the blood to the anti-tumor responses
thymus where they mature
HeJ perTh cell CD4+ receptors derived from a~ T cells recognize exogenous antigens bound to class II MHC protein;
Signal to other Immune system cells, stimulating them to
proliferate and be activated
Cytotoxic Tc cell COS+- receptors derived from a~ T cells recognize endogenous proteins bound to class I MH C; kill
virally infected host cell s and tumor cells
Regulatory (Treg) C04+ and C025 derived from a~ T cells suppress autoimmune responses
cells (suppressor
T cells)
MemoryT ce ll s clonally expanded derived by clonal expansion of stimulated to expand rapidly in secondary immune
populations of etI) T ce ll s Tc, Th, orTreg cell s responses
NKT cells TCRs interact wit h CO l recognize glycolipid antigens; important regulatory celts
and not MHC proteins
)'6T cells distinctive TeR has yand recogn ize glycol ipid antigens; important reg ulatory cells
ochains
Natural killer C016+- and COS6+- originate from bone marrow innate immunity; first line of defense against many
INK) cells Fc receptors for IgG; and are present in blood and different viral infections; produce interferon--y and TNF ·a.
receptors that detect spleen which can stimulate dendritic cell maturation, activate
dass lMHC macrophages. and regulate Th cells; recognize and kill
stressed/diseased cells expressing reduced class I MHC
proteJ~5 or stress proteins
IMMUNE SYSTEM CELLS: FUNCTION THROUGH DIVERSITY 121
Monocytes in blood
Tissue-specific macrophages alveolar macro phages (lung), histocytes (connective tissue), pote nt phagocytes
intestinal macrophages, Kupffer cells (liver), mesangial cells
(kidney), microg lial cells (bra in ), osteoclasts (bone)
DENDRITIC CELLS (DCs) in tissues in contact with external environment (mainly In dedicated antigen-presenting cells; limited
skin and inner linings of nose, lungs, stomach, intestines) phagocytic activity; key coordinators of innate
and adaptive immunity
Interstitial Des move from interstitial spaces in most organs to lymph nodes/
spleen
BLOOD GRANULOCYTES in blood; migrate to sites of infection/inflammation during dedicated antigen-presenting cells
immune response
Neutrophil l ysosome~ l i ke granules active motile phagocytes; often the first cells to
arrive at a site of inflammation after in fection
Eosi nophil Fe receptors for IgE important in killing antibody-coated large paras ites
Basophil Fe receptors for IgE; hi stamine-containing gra nul es; are important in mounting allergic responses and in
equ ivalents of th e mast cells in tissues responses to large para sites
MAST CELLS Fe receptors for JgE; histamine-containing granules; ti ssue dedicated antigen·presenting cells; Important in
equivalents of the basophils in blood mounting allergic responses and in responses to
large parasites
filter blood). They are each packed with mature immune cells, predominantly
lymphocytes but also including mac rophages, dendritic cells, and other cells
whose fun ctions will be detailed below.
Rather less organized mucosa-associated lymphoid tissue is found in various
sites in the body. Gut-associated lymphoid tissue collectively constitutes the larg
est lym phoid organ in the body, which is consistent with its need to interact with
a huge load of antigens from food and commensal bacteria. It comprises the ton
sils, adenoids, Peyer's patches in the small intestine, and lymphoid aggregates in
the appendix and large intestine. Epithelium-associated lymphoid tissue is also
found in the sldn and in the mucouS membranes lining the upper airways, bron
chi, and genitourinary tract.
The first line of defense in vertebrate immune systems is physical. The epithe
lial linings of the skin, gut, and lungs act both as barriers (epithelial cells are firmly
anached to each other by tight junctions) and as surfaces for trapping and killing
pathogens. The epithelia can produce thick mucous secretions to tra p pathogens,
and can then kill them with lysozyme (in sweat) and antimicrobial peptides
known as defensins (in mucus).
Vertebrates mount an immune response based on cooperation between the
innate and adap tive immune systems. Cell signaling is a key element in this coop·
eration. Various cell surface molecules and secreted molecules called cytoidnes
orchestrate the interactions between immune system cells that are needed for
the proper fun ctioning and regulation of the overall immune response. The most
prevalent cytoldnes are members of the inte rleuldn and interferon protein
families.
of a species, and are rapid, developing within minutes. The principal effector - C1 complement
complex
cells perform two major functions:
• Various cells engulf and kill bacteria and fungi by phagocytosis, a form of
endocytosis. The major phagocytic cells are neutrophils (a white blood cell
with many lysosomal-like cytoplasmic granules), tissue macrophages (ubiq
uitous, but concentrated in sites vulnerable to infection such as lungs and
gut), and blood monocytes (precursors to macrophages).
NK(natllral killer) cells induce virus-infected cells and tumor cells to undergo
apoptosis. Two principal pathways are used. First, NK cells bind to diseased
cells, and small cytoplasmic granules are released by exocytosis into the inter • microbe •
•
cellular space. Some of the released proteins (perforins) are inserted into the
•
1
·
stranded RNA, methylated DNA, and zymosan (in fungal cell walls). TLRs are
• • •
abundant on macrophages and neutrophils, and on epithelial cells lining the
lung and gut. Once stimulated, TLRs induce the surface expression of co
stimulatory molecules that are essential for initiating adaptive immune responses
(see below) and stimulate the secretion of pharmacologically active molecules
(prostaglandins and cytokines) that both initiate an inflammatory response and
also help induce an adaptive immune response.
• •• •
The complement system comprises 20 or so proteins that circulate in the F1gunt-4.20 Activating the complement
blood and extracellular fluid. Complement proteins are made mostly by the liver, system. Antibodies bound to the surface of
but blood monocytes, tissue macrophages, and epithelial cells of the gastrointes a microbe can be bound by complement via
tinal and genitourinary tracts can also make significant amounts. The proteins a component (Clq) of the Cl complement
are proteases that interact in a proteolytic cascade on the smface of invading complex, activating the classical complement
microbes. Eventually a membrane attack complex is formed that can punch holes pathway. Eventually a membrane attack
complex is formed: a ring of complement C9
through the outer membranes of the microbes, causing them to swell up and
proteins surrounding a core of complement
bmst (Figure 4.20). Activation of complement is possible by three different proteins CS to C8 integrates into the cell
pathways: membrane of the microbe. The membrane
The classical pathway involves interaction with the adaptive immune system, attack complex then punches a hole in
particularly invariant domains of antibodies, which bind to antigens on the the microbe's cell membrane, and the cell
surface of a microbe. contents are released, killing the microbe.
The lectin pathway uses nonspecific pattern recognition alone. The trigger is
innate illlmune system when the latter fails to provide adequate protection
against an invading pathogen. It has three key characteristics:
• Exquisite specificity: it can discriminate between tiny differences in molecu
lar structure.
• Extraordinary adaptability: it can respond to an unlimited number of
molecules.
• Memory: it can remember a previous encounter with a foreign molecule and
respond more rapidly and more effectively on a second occasion.
Whereas innate immune responses are the same in all members of a species,
adaptive immune responses vary between individuals: one individual may mount
a strong reaction to a particular antigen that another individual may never
respond to. There are two major arms to the adaptive immune response:
• Humoral (antibody) immunity is mediated by B lymphocytes (also called
B cells).
• Cell-mediated immunity is effected by T lymphocytes (T cells) and is an
important antiviral defense system.
Both Band T cells have dedicated cell surface receptors tbat can specifically
recognize individual antigens.
What makes the adaptive immune system so proficient at recognizing anti
gens is tbat during its development in the primary lymphoid organs each naive B
orT cell acquires a cell surface antigen receptor of a unique specificity. Binding of
this receptor to its specific antigen activates the cell and causes it to proliferate
into a clone of cells wlth the same immunological specificity as the parent cell
(clonal selection). As a result, the nwnber oflymphocytes that can recognize the
specific antigen can be rapidly expanded.
Immunologicalmemoryis another important feature ofthe adaptive immune
system. During tbe massive clonal expansion of antigen-specific lymphocytes
that occurs after an initial encounter with antigen, some oftbe expanding daugh
ter cells differentiate into memmy cells that are able to respond to tbe antigen
more rapicUy or more effectively. Memory cells are endowed wltb higher-affinity
antigen receptors, and also combinations of adhesion molecules, homing recep
tors, and cytokine receptors tbat direct the lymphocytes to migrate efficiently
into specific tissues (extravasation).
variable
regions
IgM pentamer/monomer B-cell receptor (monomer) and first Ig to be produced (see Figure 4.26A); activates
secreted antibody (pentamer) complement
IgD monomer B-cell receptor only second Ig to be produced (see Figure 4.26A); function
unknown
IgA often dimer or 8-cell receptor, but predominantly as predominant Ig in external secretions such as breast milk,
tetra mer antibodies secreted by plasma cells saliva, tea rs, and mucus of the bronchial, genitourinary,
and digestive tracts
H H
a a a
In cell-mediated immunity, T cells recognize cells containing Figure 4.22 Some representatives of
the immunoglobulin superfamily, The
fragments of foreign proteins immunogl obulin domain is a barrel-like
Microorganisms such as viruses that penetrate cells and multiply within them stru cture held together by a disulfide bridge
(pin k dot). Mul tiple copies of (his structural
are out of reach of antibodies. T cells evolved to deal with this need; they carry on
domain are found in ma ny proteins with
their surfaces dedi cated T-cell receptors (TCRs) that recognize sequences from important immune system functio ns, but
foreign antigens. TCRs are heterodimers that are evolutionarily related to both other members of the imm unoglobulin
immunoglobulins and other cell receptors of the immunoglobLllin superfamily super family can have different funClions,
(see Figure 4.22). such as Ig·CAM cell adhes ion molecules
There are two types of TCR, composed either of a and ~ chains or ofy and 8 (Section 4.2). Shaded boxes show the
chains. Unlike the immunoglobulins on B cells, most a~ TCRs and the cells that variable do mains of immune system
bear them are limited to recognizing foreign proteins, and only after they have proteins. 111M, 112·microglobulin.
been degraded intracellularly. Peptides of an approp riate size and sequence
derived ftom this degradation bind to major histocompatibility complex (MHC)
proteins [Box 4.4) that are then transported to the surface of the cell. With the
peptide held in a cleft on its outer surface, the MHC protein presents the peptide
to a T cell carrying an appropriate TCRap recepto r (antigen presentation), thereby
activating that T cell.
IAI (C)
-Ab +Ab
h~-
,,"'r
~
docking
...
I
1*
protein
fl.gUN 4 .23 Aspects of antibody functi o.(l..
_~ ~~cePtoc
antibody
{} (A) Inhibiting vi ral infec tion. Viruses lnfe<
___ cell receptor cells by first using docking proteins to
bind to certain receptors on the plasma
effector cell
memb rane of host cells. An tibodies (Ab .:.an
host cell bind to the viral docking proteins to prt·
e.g. macrophage
them from binding to host cell receptcY'i.
activation
(8) Neutralizin g toxins. Antibodies bil"l-c rJ
lSI toxins released from inva ding micro~ GI1d
-Ab +Ab so stop them h om binding to cell re(-e~o rs.
(e) Activating effector cell s. Antibo ol€ ~ G n
~ bind and coat the surfaces of microb!os >lnd
large target cell s. Various immune system
•
• • •
• •
-t:: . V
effector cells (notab ly macrophages. r\a w ral
killer cells, neutrophils, eosinophils and
\/
bacterial • ce ll
receptor
. ·1: ~ mast cells) ca rr y Fc receptors t hat ~ Ila b le
them to bind to the Fe reg ion on IgA.lgG,
,...,
t OXinS
or IgE antibodies. Antibody bi r1 ding to an Fc
receptor activates the effector cel l and can
host cell lead to cell killing by phagocytosI s, or release
o( lytic enzymes, death signals, and so on .
126 Chapter 4: Cells and Cell-Cell Communication
BOX 4.4 THE MAJOR HISTOCOMPATIBILITY COMPLEX, MHC PROTEINS. AND ANTIGEN PRESENTAnON
The major histocompatibility complex (MHC) class II region clas s III re gion class I region
is a large gene duster that conta ins genes most ly
various complement
associat ed with cell-mediated immune responses and,
in particular, with antigen processing and present~tion.
~. genes and other genes
(see Figure 9.7A)
Among the protein products of the MHC are the class U
I and II MHC antigens, which are members ofthe
immunoglobulin superfamily of cell surface proteins U
(see Figure 4.22). These transmembrane proteins are
.,~r.:~ ~
~5!:!9~;;
U 8•
'U
II . " •• • • • • , II ' ,
" , L,
;::
heterodimers. The MHC contains genes for the highly
"'ultamatilte genes on
polymorphic a (heavy) chain of class I and both a and
dJffarent haplotypes
B chains of class II MHC proteins. The lighter cha in
FIgure 1 Organization of HLA genes in the human MHC.
of class I MHC protein, Brmicroglobu li n, is not polymorphic and is
encoded by a gene elsewhere in the genome.
other cell types (including fibroblasts in skin, vascular endothelial cells,
The human MHC (known as the HLA complex; HLA stands for
human leukocyte antigen) spans several mega bases at 6p2 1.3. The
P
glial cells, and pancreatic cells) can be induced to express class II
MHC proteins.
class I MHC a cha in genes (HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, and
HLA-G ) are located at the telomeric side; class II MHC genes are tightly From antigen presentation to T-cell activation
clustered at the centromeric end. An intervening region, sometimes Co-stimulatory molecules are su rface molecules that interact
known as the cla ~ III MHC gene region, contains various complement with speci fic receptors on the T cell (Figu,e 2). The co-stimulat ory
and other genes (Figure 1)_ signal delivered by this interaction is necessary to induce the Th
The MHC class I and class II proteins are among the most cell to synthesize the T-cell growth factor interle ukin-2 (l L-2J and to
polymorphk vertebrate prote ins known. Most ind ividuals are express high -affinity receptors for IL-2. The secreted IL-2 stimulates
heterozygous at the corresponding loci, and both allelic forms the proliferation ofT cells expressing the C04 or COB cell surface
are f unctional and co-expressed at the cell surface. This extreme receptors. Th is complex way of ind ucing T-cell responses is presumably
polymorphism is presumably driven by intense natural selection necessary to avoid the inadvertent activation ofT celis, which would
pressure from viruses and other parasites, which favors indiViduals be potentially damaging, and to allow the detailed regulation of their
with a greater capacity for recognizing infected host cells. proliferation and functiona l differentiation.
Because of their very high degree of polymorphism, class I and Co-stimulatory molecules are of such critical importance to the
II MHC proteins are important in graft rejection after organ or tissue initiation and regulation of immune responses that they are not
transplantation from one individual to another. If the donor and constitutively expressed, even by professional antigen-presenting
recipient (host) are unrelated, major differences could be expected in cells. Their expression can be induced via the TCR-triggered transient
the MHC proteins expressed by transplanted cells and host cells. The expression of (040 ligand on T cells and the subsequent inte raction
graft will be rejected because it will be recognized as foreign and will with and crosslinking of this ligand to C040 on profeSSional an tigen
be attacked by the immune system. To inc rease transplant success presenting cells. The expression of co-stimulators is also triggered by
rates, tissue typing is performed to identify alleles at important signaling arising from the recognition of the conserved features of
MHC loci (in easily accessible blood cells) so that potential donors pathogens that is part of the innate immune system, most importantly
can be sought with a closely matching MHC profile, and drugs are the signaling through Toll-like receptors expressed on dendritic cells,
administered that can suppress immune responses. However, MHC macrophages, and certain other cells.
polymorphism rema ins a challenge for successful cell therapy.
Antigen presentation T cell
The function of classical MHC proteins is to bind and transport
pep tides to the cell surface for antigen presentation to T cells with
TCRlll~ receptors.
Class I MHC proteins bind peptides predominantly derived
from endogenous antigens and present them to cytotoxicT (Tc) n ("I
lymphocytes. -----1~'\ \~\ C~4 - ,~ \
Class II MHC proteins bind peptides predominantly derived CD28 \ ) (j '\...) CD8 LJR\) - LFAl
from exogenous antigens and present them to helperT (Th) '. \ ("<
\\ _---T~..__- 0
I
lymphocytes.
All cells with MHC proteins on their surface can present pept ides
to T cells and in this context can be referred to as antigen-presenting 'M;i (:,
cells. However, when such cells are recognized and potentially killed 87.1 (CD80) protein ICAM
byTc cells they are often known as target cells. Almost all nucleated or 87.2 (CD86)
cells express class I MHC proteins and so can function as a target cell
presenting endogenou s an ti gens to Tc cells. Target cells are often cells antigen·present ing
cell
infected w ith a virus or some other intracellular microorganism. In
addition, altered self cells such as cancer cells and aging body cells can Flgurlt:2 Presentation of peptide antigens to T cells. Cell adhesion
serve as target celi s, as can allogeneic cells introduced via skin grafts or (mediated by molecules such as ICAMl and LAF1) brings the cells
orga n transplantation (between genetically non-identical individuals). together. The peptides (about eight or nine amino acids long for
By contrast, relatively few cells express class II MHC molecules. class I MHC, and somewhat larger for class II MHC) are held in an
Among the most important are dendritic cells, macrophages, and outward-facing cleft in the MHC molecule. COB or C04 proteins bind
B cells. These cells are also among the few that can express the to non-polymorphic components of class I and class II MHC proteins,
co-stimulatory molecules that are essential for initiating immune respectively. In addition to TCR-antigen recognition, a co-stimulatory
responses in both Tc and Th cells. For both these reasons they are signal is needed, such as a member of the B7 family that is recognized
sometimes referred to as professional antigen-presenting cells. Several by C028 receptors on the T~cell surface.
IMMUNE SYSTEM CELLS: FUNCTION THROUGH DIVERSITY 127
This process, in which most ap T ceUs recognize antigens only after they have
been degraded and become associated with MHC molecules, is often referred to
as MHC restriction. It allows T cells to survey a peptide library derived from the
entire set ofproteins that is contained within the cell but is presented on the sur
face of the cell by MHC molecules, thereby providing a remarkable evolutionary
solution to the problem of how to detect intracellular pathogens. At the same
time, it restricts T cells to recognition of only those antigens that are associated
with cell-surface MHC molecules and are derived from intracellular spaces. T
cells therefore complement B cells and antibodies that can recognize only extra
cellular antigens and pathogens.
Cell-associated antigens that are synthesized within the cell are referred to as
endogenous antigens. They derive from proteins produced within the cells
(including normal cellular proteins, tumor proteins, or viral and bacterial pro
teins produced within infected cells). Such antigens are degraded within the
cytosol by proteasomes, large complexes containing enzymes that cleave peptide
bonds, converting proteins to peptides. These antigens are bound by class I MHC
proteins. Antigens that are captured by endocytosis or phagocytosis are described
as exogenous antigens. The internalized antigens move through several increas
ingly acidic compartments, where they encounter hydrolytic enzymes including
proteases. These antigens are presented by class II MHC proteins.
An important aspect of this process is that MHCmolecules cannot distinguish
self from nonself antigens, and therefore. except on the rare occasions when a
cell does indeed become infected or captures a microbial protein, the MHC mol
ecules on its surface are presenting pep tides derived from the degradation of self
proteins. A key event during the development of ap T cells in the thymus is that
only those T cells that have receptors that potentially recognize foreign peptides
in association with self MHC molecules are allowed to mature and be released
into the periphery (positive selection). Those that recognize self peptides are
induced to commit suicide by apoptosis (negative selection).
I-cell activation
Of the major classes ofT cells (see Table 4.7), there are three main ones that par
• He/per T (Th) cells (which usually carry a CD4 cell surface receptor) recognize
peptides that are predominantly produced from exogenous protein antigens
and bound to a class II MHC protein (see Box 4.4). Only a few rypes of cell
dendritic celis, macrophages, and B cells- express class II MHC proteins and
act to present antigens to Th cells. Once activated, Th cells secrete cytokines
to mohilize or regulate other immune system cells so as to generate an effec
tive immune response. There are three rypes: Thl cells activate macrophages
and Tc cells (see below); Th2 cells stimulate activation of eosinophils and pro
mote antibody production by B cells; Th 17 cells promote autoimmunity and
inflammation.
• Cytotoxic T (Tc) cells (which usually carry a CDS cell surface receptor) recog
nize peptides that are predominantly produced from endogenous protein
antigens and are bound to a class I MHC protein. Almost all nucleated cells
express class I MHC proteins and so can present antigen toTc cells. Once acti
vated, and in association with cytokines secreted by Th cells, they differenti
ate into effector cells called cytotoxic T lymphocytes (CTLs).
CTLs are particularly important in recognizing and killing virally infected
cells. Like NK cells, they induce apoptosis, either by the Fas pathway (see
Some viruses seek to avoid heing killed by CTLs by downregulating the expres
sion of class I MHC molecules on host cells. However, NK cells have receptors
that screen host cells for the presence of class I MHC antigens and are acti
vated when they sense host cells lacking class I MHC proteins.
• RegulatOlY T (neg) cells. These cells generally express CD4 but are distin
guished from helper T cells by their constitutive surface expression of CD25
(one of the chains of the interleukin-2 receptor) and intracellular expression
of the transcription factor Foxp3. They suppress autoimmunity (immune
responses to self antigens; Box 4_5)
128 Chapter 4: Cells and Cell-Cell Communication
The central characteristic of immunity is the capacity to disting uish efficient because autoimmune diseases do occur. For examp le, in
between self and nonself.lg and TCR gene loci are activated by DNA rheumatoid arthritis, self- reactive T cells attack the tissue in joints,
rearrangements that produce a random array of different a-cell and ca using an inflammatory response that results in swelling and tissue
T-cell receptors. Inevitably, self-reactive lymphocytes are produced, cells destruction. Peripheral tolerance is a back-up form of tolerance that is
that carry high-affinity receptors against antigens present on host designed to inactivate, or otherwise regulate, auto reactive T or B cells
cells. These potentially harmful self-reactive lymphocytes need to be in secondary lymphoid tissues.
eliminated or inactivated during the maturation process. For B cells One mechanism for achieving peripheral tolerance is anergy, in
this process is less critical, because their activation is often dependent which self-reactive T cells that encounter antigen will generally do
on T cells and they lack cytotoxic activity. ForT cel ls it is of central so without receiving co-stimulatory signals. In these circumstances,
importance, and multiple mechanisms for preventing autoimmune a permanent state of unresponsiveness is induced in which the cells
damage by self-reactive T cells have been identified. are unable to respond even if they subsequently encounter antigen
During their development in the thymus, immature T cells with in the presence of co-stimulatory signals. Another mechanism is via
high-affinity receptors for self-peptide:MHC complexes are eliminated. regulatoryT cells (Treg cells) that act in secondary lymphoid tissues
This is a central form of self tolerance. However, it is clearly not 100% and at sites of inflammation to down regulate autoimmune processes.
Not all a~ T cells recognize peptides presented by MHC class I and class II
molecules. In particular, a subclass called NKT cells can react with glycolipid
antigens, which are presented by members of the CD! protein family rather tban
by MHC proteins. CD! proteins are non-polymorphic and structurally closely
resemble class I MHC proteins; however, like class II MHC proteins, their expres
sion is restricted to professional antigen-presenting cells.
Other T-cell populations express a y8 T-cell receptor and are found in a wide
variety oflymphoid tissues. Despite the abundance and presumed importance of
tbese cells, very little is known about their antigen recognition and function
except that in many cases it seems to be MHC- unrestricted. POSSibly, y8 T cells
recognize antigens presented by non-MHC-encoded molecules, or perhaps like
B cells tbey can recognize soluble antigens. Some human y8 T cells resemble NKT
cells in being able to recognize bacterial glycolipids bound to CD! molecules.
Otber populations of y8 T cells, such as tbose found in the skin, have essentially
invariant TCRs and presumably recognize a common microbial antigen or a self
molecule, but tbe nature of this is unknown.
%;;i~~ ~
~ ": .j "';:: ';'
~""')~
~::i ~ ""'" '":-
V n N
....~~ R:l: ~~;:::;
$::,-"'%. .... )8<0;2
Fig ure 4.l4 Multiple gene segments in
the IGH Ig heavy-chain gene locu s at
s' ---te»t '" ... ...,,.., .., ""..: oIo ei!~:o\~
200 kb
14q32. Th e IGH gene locus spans
" .~
. <
Telomeric 1250 kb of 14q32.3 and, fo r cla rification,
~
~ is shown as six successive segments of
,!
•
.
~
segments span the fi rst 900 kb. In addition
to functional segments (g reen), t here are
many nonfunctiona l pseudoge ne segments
~ ~ ",,j-
~ ~ ~...
..
M,j- (red) and some others whose functional
-QQ--O-()--{)
status is uncertai n (ye ll ow). Next are
- ;;
-:;.)
-~
' ~
":; ",
"" . "
27 ~ ge ne segment s (dark blue within
pink box) followed immediately by
19 !~~ ~ ~~ ~;~~g~~~~~~ ~ ~2 ~ ~~ 9 JH ge ne seg ments (within pale blue box),
~~~~fltHB~~~HG.u~
i ~HH~"-D~~~-
1; ~ :::
"'" " and finally various t ra nscription units
~ <:> ~
cell' Co. CrJ, and so on), each of which
~ ~~ ~ ;~~~~~~ ~~~~;: ~ ~ . : ~ ~~~~~
-(}---£t--<f--<l~ItI'-o-IH:Hi:J+o-o----D-I-I· 80(1 kb
contains several exoos. The number of
gene segments shows some variation, as
I represented by the variable presence of
additional gene segments such as the five
;
_ ... ..,-=-- ... "";: ~ ~~e:
i> S ~~ ~~ :! &; ~~~ ~ ~~ :g;;
~ ~~~
."
:; ::i VH gene segments at about the 500 kb
J I l U ll J !!.I~ II I I I.!, I! I It l ll f position. (Reproduced w ith permission
" ~~ t
-If. s ~~ r--
from The International Immunogenetics
ot':'~
" J I wo
1 000 lib
Information System at hctp:l!www.imgl.org)
v I)
1250 kb
G3 GI ,., 1 02 C4 E ... 2
-o--~J----Cr---------------G--Er--c~>
y3 y1 a1 12 )'I E a2
Centromeric
exon that potentially encodes the va riable domain of a receptor chain (Figure
4.25). A rypical receptor gene locus may contain 100 V, 5 D, and 10 J elements.
Rando m recombinatio n of tlJese gives 5000 permutatio ns, and because each
imm une receptor contains two chains, each formed from rando mly rearranged
receptor genes, the rearrangement process by itself creates a potential basic
germ·line repertoire of (5000)' or about 25 million specificities.
heavy chain
! D-J joining
! V- D-J joining
,J,
C. C" C" C" C. , C
---•
C" C., C, C. 2
DNA of
early 8 ce lls
recombination. (Al Early (partial) switch
to IgO. The constant region of a human Ig
1
c"
co, and r..joln
c._
different strands
C, c", occurs by intrachromatid recombina tion, in
which the same VOJ exon is brought close
to the initially more distal constant region
transcription units (COl ( y. or CJ by a looping
out of the intervening segments. The new
VDJ-C combinatjon is expressed to give
C" C" Ig A, IgG, or Ig£ Th e combination for 19G is
illustrated here.
c,
DNA of
mature B cells
producing IgG
spliced to give
dmRNA --7 IgG
Large as the combinatorial diversity above may seem, it represents only the
tip of the iceberg of receptor diversity. A much bigger contribution to V domain
variability is provided by mutational events that take place before the joinlng of
V, D, and) elements and which create junctional diversity. First, exonucleases
nibble away at the ends of each element, removing a variable number of nucle
otides. Second, a template-independent DNA polymerase adds a variable num
ber ofrandom nucleotides to the nibbled ends, creating N regions. Combinatorial
and junctional di versity together create an almost limitless repertoire of variable
domains in immune receptors.
The genetic mechanisms leading to the production of functional VJ and VDJ
exons often involve large-scale deletions of the sequences separating the selected
gene segments (most probably by intrachromatid recombination; Figure 4.26B).
Conserved recombin.ation signal sequences fl ank the 3' ends of each V and) seg
ment and both the 5' and 3' ends of each D gene segment. They are recognized by
RAG proteins that initiate the recombination process, and they are the only lym
phocyte-specific proteins involved in the entire recombination process. The
recombination signal sequences also specify the rules for recombination,
enabling joining o rv to L or D to) followed byVto DL but neverV to Vor D to D.
Principles of
Development
KEY CONCEPTS
The zygote and each cell in very early stage vertebrate embryos (up to the 16-cell stage in
mouse) can give rise to every type of adult cell; as development proceeds, the choice of cell
fate narrows.
• In the early development of vertebrate embryos, the choice between alternati ve cell fates
primarily depends on the position of a cell and its interactions with other cells rather than
cell lineage.
Many proteins that act as master regulators of early development are transcription factors;
others are components of pathways that mediate signaling between cells.
The vertebrate body plan is dependent on the specification ofthree orthogonal axes and the
polarization of individual cells; there are very significant differences in the way that these
axes are specified in different vertebrates.
• The fates of cells at different positions along an axis are dictated by master regulatory genes
such as Hox genes.
Cells become aware of their positions along an axis (and respond accordingly) as a result of
differential exposure to signaling molecules known as morphogens.
Major changes in the form of the emb ryo (morphogenesis) are driven by changes in cell
shape, selective cell proliferation, and differences in cell affinity.
Gastrulation is a key developmental stage in early animal embryos. Rapid cell migrations
cause drastic restructuring of the embryo to form three germ layers- ectoderm, mesoderm,
and endoderm-that will be precursors of defined tissues of the body.
Mammals are unusual in that only a small fraction of the cells of the early embryo give rise
to the mature animal; the rest of the cells are involved in establishi.ng extraemb ryonic
tissues that act as a life support.
Developmental control genes are often highly conserved but there are imponant species
differences i.n gastrulation and in many earlier developmental processes.
134 Chapter 5: Principles of Development
pathways. Until quite recently, the gene products that were known to be involved
in modulating transcription and in cell signaling were exclusively proteins, but
now it is clear that many genes make noncoding RNA products that have impor
tant developmental functions. In this chapter we will focu s on vertebrate, and
particularly mammalian, development. Our knowledge of early human develop
ment is fragmentary, because access to samples for study is often restricted for
ethical or practical reasons. As a result, much of the available information is
derived from animal models of development.
Invertebra tes Coenorhabditis elegans easy to breed and maintain (GT::: 3-5 days); fate body plan different from lhat of vertebrates;
(roundworm) of every single cell is known difficult to do targeted mutagenesis
Drosophila melanogoster easy to breed (GT = 12-14 days); sophist ica ted body plan different from that of vertebrates;
(fruit fly) genetics; large numbers of mutants available cannot be stored frozen
Fish Donia rerlo (zebrafish) relatively easy to breed (GT = 3 months) and small embryo size makes manip ulation difficult
maintain large populations; good genetics
Frogs Xenopus /oevls transparent large embryo that is easy to genetics is difficult because of tetraploid
(large-clawed frog) manipulate genome; nol so easy to breed (GT = 12 months)
Xenopus tropica/is diplOid genome makes it genetical ly more smaller embryo than X, /aevis
(smal klawed frog) ame nabl e than X./aevis; easier to breed than
X. faevis (GT < 5 months)
Birds Gallus gal/us (chick) accessible embryo that is easy to observe and genetics is difficult; moderately lo ng generation
manipulate time (5 months)
Mamma ls Mus musculus (mouse) relatively easy to breed (GT = 2 months); small embryo size makes manipulation difficult;
sophisticated genetics; many different strains and implantation of embryo makes It difficu lt to
mutants access
Popio homadryas extremely similar physiology to humans very expensive to maintai n even small colonies;
(baboon) difficult to breed (GT = 60 months); major ethical
concerns about using primates for research
investigations
r'1------>· ~
apparent at the blastocyst stage, w hen there
l
villi, and syncytiotrophoblast, which wi ll
morula embryonic
epiblast
ingress into uterine tissue. The inner cells
l "; :.";'
(primitive
of blastocysts also give rise to different
1
blastocyst ectoderm) extraembryonic tissues plu s the embryonic
----? Inner
cell mass ~ .,.
) epiblast - '
epiblast thatwlll give rise to the embryo
proper. Some embryo nic epi b last cells are
induced early on to form prim ordial germ
cells, the precursors of germ celis, th at will
later migrate to the gonads (as detailed in
Section 5.7). The other cells of the emb ryonic
amniotic
ectoderm
epiblast will later give rise to the three
germ layers-ectoderm, mesoderm, and
hypoblast extra- olk endoderm-that will be precursors of our
(visceral ~ embryonic - - - - . ~ac ~ -. ~ extra
, somatic tissues (as detailed in Figure 5.13).
endoderm) endoderm embryonic
mesoderm The dashed line indicates a possible dual
origin of the extraembryoniC mesoderm.
outer
cells
_ _ _.,)~ trophoblast -+ trophoblast
cyto- --+ trophoblast
syncytio
The three germ layers, as they are known, are ectoderm, mesoderm, and endo
derm, and they will give rise to all the somatic tissues.
The constituent cells of the three germ layers are multipotent, and their
differentiation potential is restricted. The ectoderm cells of the embryo, for
example, give rise to epidermis, neural tissue, and neural crest, but they cannot
normally give rise to kidney cells (mesoderm-derived) or liver cells (endoderm
derived). Cells from each of the three germ layers undergo a series of sequential
differentiation steps. Eventually, unipotentprogenitor cells give rise to terminally
differentiated cells with specialized functions.
(AI (8) transplanl ce lls from figure 5..2 Ectoderm cells In the early
ventral to dorsal surface Xenopus embryo become committed to
r:. ~
I. -l epidermal or neural cell fates according to
their position. {AI Cells in the dorsal midline
""Hal
sUrface ~ !) do"al
surface
ec toderm (red) give ri se to the neural plate
that is formed along the anteroposterior
(A-P) axis; other ectodermal cells such
as ventral ectoderm (green) give rise to
!
A
!
A
! A
epidermis. (8) If the ven tral ec toderm is
grafted onto the dorsal side of the embryo,
it is re·specified and now forms the neural
plate instead of epidermis. This shows that
the fate of early ectod erm cells is fle xi bl e
and does not de pend on their lineage but
on their position. The fate of dorsal midline
ectoderm cells is speCified by signals from
celi s of an underlying mesoderm structure,
p p p the notochord, that forms along the
anteroposterior axis.
The ectodermal cells ale therefore flexible. If positioned on the front (ventral)
surface of the embryo they will normally give rise to epidermis, but if surgically
grafted to the dorsal midline surface they give rise to neural plate (Figure 5.2).
Similarly, if dorsal midline ectoderm cells are grafted to the ventral or lateral
regions of th e embryo they form epidermis. The dorsal ectoderm cells are induced
to form the neural plate along the midline because they receive specialized sig
nals from underlying mesoderm cells. These signals are described in Section 5.3.
The fate of the ectoderm-epidermal or neural-depends on the position of
the cells, not their lineage, and is initially reversible. At this point cell fate is said
to be specified, which means it can still be altered by changing the environment
of the cell. Later on, the fate of the ectoderm becomes fixed and can no longer be
alte red by grafting. At this stage, the cells are said to be determined, irreversibly
committed to their fate because they have initiated some molecular process that
inevitably leads to differentiation. A new transcription factor may be synthesized
that cannot be inactivated, or a particular gene expression pattern is locked in
place through chromatin modifications. There m ay also be a loss of competence
for induction. For example, ectoderm cells that are committed to becoming epi
dermis may stop synthesizing the receptor that responds to signals transmitted
from the underlying mesoderm.
~.n~
of neural stem cells depends on the
plane of cell division. (A) Neural stem
cells, occurring in neuroepithelium, have
cytoplasmic processes at both the apical
i: f
pole and the basal pole, with the latter
Notch-l
providing an anchor that COnnects them to
~~ljA
.'~~
the underlying basal lamina. The Notch-l cell
Numb
surface receptor is concentrated at the apical
pole, and its intracellular antagonist Numb is
concentrated at the basal pole. (8) Symmetric
divisions of neural stem cells occurring in
basal lamina the plane of the neuroepithelium result
! 1 1- '
in an even distribution of Notch-l and
Numb in the daughter cells. (C) However,
l,n J~
asymmetric divisions perpendicular to the
il)I neuroepithelium result in the formation of a
J .•.,. ?
I
stemct n J t replac ement stem cell that remains anchored
in the basal lamina and a committed
.f" ' .A
\0\
J
.. ..
A
."
.. .."
..
A
. (A::\ ~)
• A. )
• (
comm;tted
cell
neuronal progenitor that contains the
Notch-l receptor but not Numb. The latter
cell is not anchored to the basal lamina and
so can migrate and follow a pathway of
neural differentiation.
\
1..I,r _ .
(A) A (8 )
right lung
(3 looe s)
Flgut. 5.4 The three axes of bilaterally symmetric animals. (Al The three axes. The anteroposterior (A-P) axis
is sometimes known as the craniocaudal or rostrocaudal axis (from the Latin cauda = tail and rostrum = bea k). In
vertebrates the A-P and dorsoventra l (D-V) axes are clearly lines of asymmetry. By contrast, the left-right (L-R) axis
is superficiall y symmetriC (a plane through the A-P axis and at right angles to the L- R axis seems to divide the body
into tlNo equal halves). (B) Internal left- rig ht asymmetry in humans. Vertebrate embryos are initially symmetric with
respect to the L- R axis. The breaking of this axis of symmetry is evolutionarily conserved, so that the organization and
placement of internal organs shows l-R asymmetry. In humans, the left lung has two lobes, the right lung has three.
The heart, stomac h, and spleen are placed to the left; the liver is to the right, as shown in the left paneL In about 1 in
10,000 individuals the pattern is reversed (situs inversus) without harm ful consequen ces . Failure to brea k symmetry
leads to isomerism: an individua l may have two right halves (the liver and stomach become centralized but no
spleen develops) or two left halves (resulting in two spleens). In some cases, assignment of left and right is internally
inconsistent (h etero raxia), leading to heart defects and other problems. ((6) adapted from McGrath and Brueckner
(2003) Curro Opin. Genet. Dev. 13, 385-392. With permission from Elsevier.)
The establishment of polarity in the early embryo and the mechanisms of axis
specification can vary significantly between animals. In various types of animal,
asymmetry is present in the egg. During differentiation to produce the egg, cer
tain molecules known collectively as determinants are deposited asymmetrically
within the egg, endowing it with polarity. When the fertilized egg divides, the
determinants are segregated unequally into different daughter celis, causing the
embryo to become polarized.
In adler animals, the egg is symmetric but symmetry is broken by an external
cue from the environment. In chickens, for example. the anteroposterior axis of
the embryo is defined by gravity as the egg rotates on its way down the oviduct. In
frogs, both mechanisms are used: there is pre-existing asymme try in the egg
defined by the distribution of maternal gene products, while the site of sperm
entry provides another positional coordinate.
In mammals the symmetry-breaking mechanism is unclear. There is no evi
dence for dete rminants in the zygote, and the cells of early mammalian embryos
show considerable developmental flexibility when compared wit h those of other
vertebrates (Box 5. l) .
BOX 5.1 AXIS SPECIFICATION ANO POLARITY IN THE EARLY MAMMALIAN EMBRYO
In many vertebrates, there is clear evidence of axes bei ng set up in ce lls, the innercefi mass (reM), whIch are located at one end of the
the egg or in the very early embryo. Mammalian embryos seem to be embryo, the embryonic pole. The opposite pole of the embryo is t he
different. There Is no clea r sign of polarity in the mouse egg, and initi al abemb ryon ic pole. The embryon ic face of the leM is in co ntact w ith
claims that the first cleavage of the mouse zygote is related to an axis the t rophectoderm, whereas the abembryonic face is open to the
of the em bryo have not been substantiated. fluid-filled cavity, the blastocele. This difference in environm ent is
The sperm entry point defines one early opportunity for sufficient to spec ify the first two distinct cell layers in the leM: prim itive
establishing asymmetry, Fertil ization induces the second meiotic ectoderm (epiblast) at the embryonic pole, and primitive endoderm
division of the egg, and a second polar body (Figure 1) is generally (hypoblast) at the abembryonic pole (Fig ure 1B, left panel). This, in
extruded op posite the sperm entry site. This defines the anima/ turn, defines the dorsoventral (O- V) axis of the embryo. It is not clear
vegeta l axis of the zygote, with the polar body at the animal pole. how the reM becomes positioned asymmetrica lly in the blastocyst in
Sub sequent cleavage divisions result in a blastocyst that shows the first place, but it is interesting to note that when the site of sperm
bilateral symmetry aligned with the former an imal-vegetal axis of the entry is tracked w ith fluoresce nt beads, it is consistently localized to
zygote (see Figure 1A). trophectoderm cells at t he em bryonic-abembryoniC border.
The first clear sign of polarity in the early mouse embryo Is at the It is sti ll unclear how the anteroposterior (A- P) axis is specified, but
blastocyst stage, when the cells are organized into two layers- an the posItion of sperm entry may have a defining role. The first overt
outer cel l layer, known as trophectoderm, and an inner group of indication of the A-P axis is the primirive streak, a linear structure that
PATTERN FORMATION IN DEVELOPMENT 141
develops in mouse embryos at about 6.5 days after fertilization, and in The left- right (L-R) axis is the last of the three axes to form. The
human embryos at about 14 days after fertilization. The d ecision as to maj or step In determining L- R asymmetry occurs during gastrulation,
which end should form the head and which should form the ta il rests when rotat ion of cilia at the embryonic node results In a unidirectional
with one of two major signaling centers in the early embryo, a region flow of peri nOdal fluid that is required to specify the L-R axis.
of extraembryonic tissue called the anterior visceral endoderm (AVE). In Somehow genes are activated specifically on the embryo's left-hand
mice, this is initially located at the tip of the egg cylinder but it rotates side t o produce Nodal and Left y-2, initiati ng Signaling pathways
toward the future cranial pole of the A - P axis just before gastrulation that activate genes encoding left· hand·specific transcription factors
(Figure' e). The second major signaling center, t he node, is established (such as Pirx2) and inhibiting those t hat produce right-hand-specific
at the opposite extreme of the epiblast. transcription factors.
~ -f
just after fertilization. (B) The embryonic
abembryonk axis and development of the
anteroposterior axis in t he mouse embryo. At
the blastocyst stage, at about embryonic day 4
(E4: 4 days after fertilization) in mice, the inner
Q,m vegetal
sperm entry po s ltlon
vegetal
~. / /
V
vegetal
Pet al. (2007) Principles of Development, 3rd ed. (5.5 days) of AVE
of primitive streak
How do cells become aware of their position along an axis and therefore
behave accordingly? This question is pertinent because functionally equivalent
cells at different positions are often required in the production of different struc
tures. Examples include the formation of different fingers from the same cell
types in the developing hand, and the formation of different vertebrae (some
with ribs and some without) from the same cell types in the mesodermal struc
tures known as somites.
[n many developmental systems, the regionally specific behavior of cells has
been shown to depend on a signal gradient that has different effects on equiva
lent target cells at different concentrations. Signaling molecules that work in this
way are known as morphogens. In vertebrate embryos, this mechanism is used
to pattern the main ante roposterior axis of the body, and both the anteroposte
rior and proximodistal axes of the limbs (the proximodistal axis ofthe limbs runs
from adjacent to the trunk to the tips of the fingers or toes).
(A)
HOMe [
"'" pb Dfd Scr ArUp
1 2 3 4 5 6 7 8 9 10 11 12 13
Hoxa 3'
Hoxb
Hoxc
Figure 5.S Hox genes are expressed
Hoxd at different positions along the
anteroposterior axis according to their
position within Hox clusters.
(6) (Al Conservation of gene function within
Hox clusters. In Drosophila, an evolutionary
A ( , p
rearrangement of what was a single
cerVical thoracic lumbar sactal caudal
Hox cluster resulted in two subc lusters
collectively called HOM-C Mammals have
four HOI( gene clusters, suc h as the mouse
Hoxa, Hoxb, Hoxc, and Hoxd clus ters shown
here, that arose by the duplication of a
single ancestral HOI( cluster. Colors indicate
sets of paralogous genes that have similar
functions/expression patterns, such as
Drosophila labial (lab) and Hoxo 1, Hoxb 1, and
Hoxdl. (8) Mouse HOI( genes show graded
expression along the anteroposterior al(is.
Genes at the 3' end, such as Hoxo 1(A 1) and
Hoxbl (81), show e)(pression at anterior (A)
parts of the embryo (in the head and neck
regions); those at the 5' end are expressed
at posterior (Pl regions toward the tail.
(Adapted from Twyma n (2000) Instant Notes
In Developmental Biology. Taylor & Francis .}
PATTERN FORMATION IN DEVELOPMENT 143
{AI limits of HoxD orga nization Figwe 5.6 Digit formation in the chick
gene expression of digits limb bud is specified by the zone of
polarizing activity. (A) Normal specification
z;;
of digits. The zone of polarizing activity (ZPA)
"
~ _
D10
~. -i]. ) '"
II is located in the posterior margin of the
_z~J
D11 developing limb bud (left panel). ZPA cells
{:o?
OC O 0
1
O$l'O~~;'><'»
D12
013 -
~ &:;::/ IV
produce the motphogen Sonic hedgehog
(S hh), and the ensuing gradient of Shh levels
generates a nested overlapping pattern of
expressio n patterns for the five differen t
HoxD genes (middle panel), w hich is
(S)
important for the specificatio n of digits (right
,
Shh Shh
/ panel). (8) Duplication of ZPA signal s causes
,
RA RA mirror-image duplication of digits. Grafting
1~
/
of a second ZPA to the amerior margin of
A'>
~ 'v
ZPA "" C>
,-'> o.
p~~~ =
-""J
.v ,,"
the limb bud (or placement of a bead coated
...J 1\1
~
with Shh or retinoic acid, RA) establishes
'
I>')
, II an oppoSing morphogen gradient and
~ II results in a mirror-image reversal o f digit
~fi<,A fates. [From Twyman (2000) Instant Notes In
ZPA _ .0 0 00
0<.9 ....0~~v>u> §;.;;::? :~ Developmental Biology. Taylor & Franci s.)
The source of the m orphogen gradient that guides Hox gene expression along
the major anteroposterior axis of the embryo is thought to be a transient embry
onic structure known as the node. which is one of two major signaling centers in
the early embryo (see Box 5.1). and the morphogen itself is thought to be retinoic
acid. The nod e secretes increasing amounts of retinoic acid as it regresses, such
that posterior cells are exposed to larger amounts of the chemical than anterior
cells. resulting in the progressive activation of more Hox genes in the posterior
regions of the embryo.
5.4 MORPHOGENES IS
Cell division. with progressive pattern formation and cell differentiation. would
eventually yield an embryo with organized cell types. but that embryo wo uld be
a static ball of cells. Real embryos are dynamic structures. with cells and tissues
undergoing constant interactions and rearrangements to generate structures
and shapes. Cells form sheets. tubes. loose reticular masses. and dense clumps.
Cells migrate either individually or en masse. In so me cases. such behavior is in
response to the developmental program. In other cases. these processes drive
development. bringing groups of cells together that would otherwise never come
into contact. Several different mechanisms underlying morphogenesis ate sum
marized in Table 5.2 and discussed in more detail below.
Change in ce ll shape change from columnar to wedge-shaped cells during neura l tube closu re in birds and mammals
Change in cell size expansion of adipocytes (fat cells) as they accumulate lipid droplets
Gain of cell- cell adhesion condensation of ce lls of the cartilage mese nchyme in vertebrate limb bud
Loss of cell-cell adhesion delamination of cells from epiblast during gastrulation in mammal s
Ceil- matrix interaction migration of neura l crest cells and germ cells
Loss of cell matrrx adhesion delamination of cells from basal layer of the epidermis
Differential rates of cell proliferation selective outgrowth of vertebrate limb buds by proliferatio n of ce ll sin the progress zon e, the
undifferentiated population of mesenchyme cell sfrom which successive parts of the li mb are laid down
Alternative pOS itioning and/or different embryonic cleavage patterns in animal s; stereotyped cell division sin nematodes
orientation of mitotic spindle
Apoptosis separation of digitsin verte brate limb bud (Figu re 5.7); selection of functional synapses in the
mammalian nervous system
Migration The movement of an individual cel! with re spect to other cells in the embryo. Some cells, notably the neu ral crest cells (Box 5.4)
and germ cells (Section 5.7), migrate far from their original locations during development
Ingression The movement of a cell from the surface of an embryo into its interior (Figure 5.13)
Egression The movement of a cell from the interior of an embryo to the external surface
Delamination The movement of cells out of an epithelial sheet, often to convert a single layer of cells into multiple layers. This is one of the major
processes that underlie gastru lation (Figure 5.13) in mammalian embryos. Cells can also delaminate from a basement membrane,
as occurs in the development of the skin
Intercalation The opposite of delamination: cells from mUltiple cell layers merge into a single epithelial sheet
Condensation The conversion of loosely packed mesenchyme cells into an epithelial structure; sometimes called a mesenchymal-la-epithelial
transition
Dispersal The opposite of condensation: conversion of an epithelial structure Into loose ly packed mesenchyme cells; an epithefiaJ-to
mesenchymal transition
IAI
are created by the death of interdigital cells in the hand and foot plates beginning
at about 45 days of gestation (Figure 5.7'). In the mammalian nervous system,
apoptosis is used to prune out the neurons with nonproductive connections,
allowing the neuronal circuitry to be progressively refined. Remarkably, up
to 50% of neurons are disposed of in this manner, and in the retina this can
approach 80%.
(A) (8) Figur • .5.8 The specialized sex cells. (AI The
plasma perivitelline zona pellucida head [ J:N;- acrosomal vesicle egg. Th e mammalian egg (oocyte) is a large
membrane
\
space
I
\ I cortical
5 ~m ,.~ - nucleus
]
midpiece
5Vm
cell, 120 Ilm in diameter, surrounded by an
extracellular env.,elope. the zona pellucida.
which contains three glycoproteins. Zp-l .
ZP-2. and ZP-3. that polymerize to form
a gel. Th e first polar body <the product of
meiosis I; not show n here) lies under the zona
within the perivitelline space. At ovulation.
plasma membrane OO{ytes are in metaphase II. Meiosis II is not
50~m completed until after fertilization. (8) The
flagellum sperm. This cell is much smaller than the
egg, with as!Jm head containing highly
compacted DNA. a S.um cylindrical body (the
midpiece, conta ining many mitochondria),
and a SO pm tail. At the franc, the acrosomal
vesicle contains enzymes that hel p the sperm
120 )lm to make a hole in the zona pellucid a, allowing
it to access and fertilize the egg. Fertili zati on
triggers secretion of cortical granules by the
The male gamete, the sperm cell, is a small cell with a greatly reduced cyto egg that effectively Inhibit further sperm from
plasm and a haploid nucleus. The nucleus is highly condensed and transcrip passing through the zon a pellucida. [From
tionally inactive because the normal histones are replaced by a special class of Alberts B, Johnson A, Lewis J et al. (2002)
packaging proteins known as protamines. A long flagellum at the posterior end Molecular Biology of the Cell, 4th ed. Garland
provides propulsion. The acrosomal vesicle (or acrosome) at the anterior end ScienceITaylor & Francis LLC.]
(Figure 5.8B) contains cligestive enzymes. Human sperm cells have to migrate
very considerable distances, and out of the 280 million or so ejaculated into the
vagina only about 200 reach the required part of the oviduct where fertilization
takes place.
Fertilization begins with attachment of a sperm to the zona pellucida followed
by the release of enzymes from the acrosomal vesicle, causing local digestion of
the zo na pellucida. The head of the sperm then fuses with the plasma membrane
of the oocyte and the sperm nucleus passes into the cytoplasm. Within the oocyte,
the haplo id sets of sperm and egg chromosomes are initially separated from each
other and constitute, respectively, the male and female pronuclei. They subse
quently fuse to form a diploid nucleus. The fertilized oocyte is known as the
zygote.
Figure $.9 Early development of the mammalian embryo, from zygote to zygote
blastocyst. In mammals, the first cleavage is a normal meridional division (along
the vertical axis), but in the second cleavage one of the two cells (blastomeres)
divides meridionally while the other divides at right angles. equatorially (roto rionol
cleavage). At the eight-cell stage. the mouse morula undergoes compaction, when
the blastomeres huddle together to form a compact ball of cells. The tight packaging,
stabilized by tight junctions formed between the outer cells of the ball, seals off the
inside of the sphere.The inner cells form gap junctions. enabling small molecules
and ions to pass between them. The morula does not have an internal cavity, but 2·ceUstage
the outer cells secrete fluid so that the subsequent blastocyst becomes a hollow bal l
of celis with a fluid-filled internal cavity, the blastocele .The blastocyst has an outer
layer of cells, the trophoblast. that will give rise to an extraembryonic membrane,
the chorion, plus an inner cell mass (ICM) Ioc.ated at one end of the embryo, the
embryonic pole. Cells from the ICM will give rise to the other extracellular membranes
as weJl as the embryo proper and the subsequent fetus . For convenience, the zona
pellucida that surrounds the early embryo (see Figure 5.10) is not shown. [Adapted
from Twyman (2000) Instant Notes In Developmental Biology. Taylor & Francis.]
as maternal genome regulation, and the maternal gene products are often
referred to as maternal determinants.
4cell stage
~9'
species. As a result, the cleavage divisions are controlled by rhe zygotic genome
rather than by maternally inherited gene products. The divisions are slow
because cell cycles include G1 and G2 phases and are asynchronous.
• The cleavage mechanism is unique. The first cleavage plane is vertical, but in
the second round of cell division one of the cells cleaves vertically and the
other horizontally (rotational cleavage-see Figure 5.9). Additionally, cells do
not always divide at the same time to produce two -cell then four-cell then
eight-cell stages, but can sometimes divide at different times to produce
1
morula
embryos with odd numbers of cells, such as three-cell or five-cell embryos.
• The embryo undergoes compaction. The loosely associated blastomeres of
c8
bd
the eight-cell embryo flatten against each other to maximize their contacts
and form a tightly packed morula. Compaction does occur in many non
mammalian embryos, but it is much more obvious in mammals.
Compaction has the effect of introducing a degree of cell polarity Before com
paction tlle blastomeres are rounded cells with uniformly distributed microvilli,
and the cell adhesion molecule E-cadherin is found whereve r the cells are in con
1
compacted morula
tact with each other. After compaction, the microvilli become restricted to tl,e
apical surface, while E-cadherin becomes <listributed over the basolateral sur
~f0)
faces. Now the cells form tight junctions with their neighbors and the cytoskeletal
elements are reorganized to form an apical band.
A! about the 16-cell morula stage in mammals, it becomes possible to dis
criminate between two types of cell: external polarized cells and internal non
polar cells. As the population of nonpolar cells increases, the cells begin to ',,'~.--::J< tight junclion
communicate witll each other tluough gap junctions. The distinction between
the two types of cells is a fundamental one and underlies the more overt distinc
tion between the outer and inner cell layers of tlle subsequent blastocyst (see
1
blastocyst
Figure 5.9).
embryonic pole inner ce ll mass
~/
Only a small percentage of the cells in the early mammalian
embryo give rise to the mature organism
~lJO-'
[n many animal models of development, the organism is formed from cells that
have descended from all the cells of the early embryo. Mammals are rather differ
ent: only a small minorityofthe cells of the early embryo give rise to the organism
proper. This is so because much of early mammalian development is concerned
with establishing the exlraembryonic membranes-tissues that act as a life sup
port but mostly do not contribute to the final organism (Bo~ 5.2). trophoblast cells
EARLY HUMAN DEVELOPMENT: FERTILIZATION TO GASTRULATION 149
_~ ~LA 'A
the uterus and then into the oviducts
_---"i'"S' ' ~lr 3"~ (= fallopian tubes). During ovula tion, an egg
9W1"~~~":C \~.'
,_piC'--;;P~' ~" ~"- I:l~
is released from an ovary into the oviduct,
__ _ I =. . -00--. where it may be fertilized by a sperm. The
Ac'.""r!'~/ , 1-t1-
~& /
~ \ ""iduct
I fertilized eg9 is slowly propelled along the
.~>--
ii
~,_ ~""o'
"v
~~
I
I: I
oviduct. During its journey, the zygote goes
1'; "\~ '1. -
,:;>®@>@ @;'--. ' ''L' through various cleavage diviSions, bur
I(ff{I,Y~
·.,
w (
J r:; "j .
","- -),! ." ,,,,,,,,
the zona pellucida usually prevents it from
adhering to the oviduct walls (although this
.~ iertilization occasionally happens in humans, causing a
dangerous condition, an eclOpicpregnanCf).
0fI
~.--1-
,>zo
(.,,";; '
\
ea,'y stage of
end
Implantation
.
ometnum
o"' ati~n
-,..,..,-
Once in the uterus, the zona pellucida
is partly deg raded and the blastocyst is
relea sed to allow implantation in the wall
of the uterus (endometrium). [From Gilbert
(2006) Developmental Biology, 8th ed. With
permission from Sinauer ASSOC iates, Inc.)
As the blastocele forms (at about the 32-cell stage in humans), the inner non
polar cells congregate at one end of the blastacele, the embryonic pole, to form
an off·center inner cell mass ([CM). The outer trophoblast cells give rise to the
chorion, the outermost extraembryonic membrane, while the cells of the ICM
will give rise to all the cells of the organism plus the other three extra·embryonic
membranes. At any time until the late blastocyst stage, splitti ng can lead to the
production of ide ntical (monozygotic) twins (Box 5.3).
Early mammalian developmen t is unusual in that it is concerned trophoblast cells, wh ich produce the enzymes that erode the lining
primarily w ith the formati on of tissues that mostly do not contribute of the uterus, helping the embryo to implant into the uterin e wall.
to the final organism. These tissues are the four extraembryo nic The chorion is also a source of hormones (chorion ic gonadotropin)
membranes: yolk sac, amnion, chorion, and allantois. The chorion that influence the uterus as well as other systems. In all these cases,
combines with maternal [issue to form the placenta. As well as the chorion serves as a surface for respiratory eXChange. In placental
protecting the embryo (and later th e fetus), these life support systems mamma ls, the chorion provides the fetal component of the placenta
are required to provide for its nutrition, respiration, and excretion. (see below).
Yolk sac Allantois
The most primitive of the four extraembryonic membranes, the yolk The most evolutionarily recent of the extraembryonic membranes, the
sac is found in all amniotes (mammals, birds, and reptiles) and also allantois develops from the posterior part of the alimenta ry canal in
in sharks, bony fishes, and some amphibians. In bird em bryos, the the embryos of reptiles, birds, and mammals. It arises from an outward
yolk sac surrounds a nutritive yolk mass (the yellow part of the egg, bu lging of the floor of the hindgut and so is composed of endoderm
consisting mostly of phospholipids). In many mammals, incl uding and splanchniC lateral plate mesoderm. In most amniotes it acts as a
humans and mice, the yolk sac does not contain any yolk. The yolk sac waste (uri ne) storage system, but not in placental mammals (including
is generally important because: humans). Althoug h the allantois of placental mammals is vestigial and
the primordial germ cells pass through the yolk sac on their may regress, its blood vessels g ive rise to the umb ilical cord vessels.
migration from the epiblast to the genital ridge (for more deta ils,
Placenta
see Section 5.7);
The placenta is found only in some mammals and is derived partly
it is the source of the first b lood cells of the conceptus and most of
from the conceptus and partly from the uterine wall. It develops after
the first blood vessels, some of which extend themselves into the
implantation, when th e embryo induces a response in the neighboring
developing embryo.
maternal endometrium, changing it to become a nutrient-packed
The yolk sac originates from splanchniC (= visceral) lateral p late
high ly vascular tissue called the decidua. During the second and third
mesoderm and endoderm.
weeks of d evelopmen t, the trophoblast tissue becomes v acuolated,
Amnion and these vacuoles connect to nearby maternal capillaries, rapidly
red ucing the effects of gravity on the body), and acting as a hydraulic
differentiated fu lly and contains a vascular system that is connected
cushion to protect the embryo fro m mechanical jolting. The amnion to the embryo. Exchange of nutrients and waste products occurs over
derives fro m ectoderm and somatic lateral plate mesoderm. the chorionic villi. Initially, the embryo is completely surrounded by
Chorion the decidua, but as it grows and expa nds into the uterus, the overlying
The chorion is also derived from ectoderm and somatic lateral plate deCidual t issue (decidua capsu laris) thins out and then disintegrates.
mesoderm. In the embryos of birds, the chorion is pressed against The mature placenta is derived completely from the underlying
the shell membrane, but in mammalian embryos it is composed of decidua basalis.
150 Chapter 5: Principles of Devel op ment
Approximately one in every 200 human pregnancies gives rise to stage. As a result, two separate blasrocysts are formed. giving rise
twins. There are two distinct types: to embryos shrouded by independent sets of extraembryonic
Fraterna l (dizygotic) twins result from the independent membranes. In the remaining two-thirds, twinning occurs at the
fertil ization of two eggs and are no more closely related t han blastocyst stage and involves divi sion of the inner cell mass.
any other siblings. Although developi ng in the same womb, the The nature of the twinning reflects the exact stage at which
embryos have separate and independent sets of extraembryonic the d iviSion occurs and how complete the division is. In most cases,
membranes (Fig ....' . 1). the d ivision occurs before day 9 of gesta tion, which is when the
Identical (monozygotic) twins arise from the same fertilization amnion is formed. Such twins share a common chorionic cavity hut
event, and are produced by the division of the embryo while the are surrounded by individual amnions. In a very sm all proportion of
cells are still totipotent or pluripotent (see Figure 1l. births, t he division occurs after day 9 and the developing embryos
About one-third of monozygotic twins are produced by an are enclosed within a com mon amnion. Either through incomplete
early diviSion of the embryo, occurring before or during the morula separation or subsequent fusion, these twins are occasionally conjoined.
Figura 1 Monozygotic ilnd dizygotic twinning . (Adapted from Larsen (2001 ) Human Embryology, 3rd ed . With permi ssion from Elsevier.)
Implantation
At about day 5 of human development, an enzyme is released that bores a hole
through the zona pellucida, causing it to partly degrade and releasing the blasto
cyst (see Figure 5.10). The hatched blastocyst is now free to interact directly with
the cells lining the uterus, the endometrium. Very soon after arriving in the uterus
(day 6 of human development), the blastocyst attaches tightly to the uterine epi
theliwn (implantation). Trophoblast cells proliferate rapidly and differentiate
into an inner layer of cytotrophoblast and an outer multinucleated cell layer. the
syncytiotrophoblast, that starts to invade the connective tissue of the uterus
(Figure S. II ).
EARLY HUMAN DEVELOPMENT: FERTILIZATION TO GASTRULATION 1S
(AI lSI
maternal
\ ",1
~~) \ ~ syncytiotrophoblast
'",- ,t(w.~/·,~J':·:S primar,' ~:,: -,- ~' 1I1i
l.J.~~?J:: .-:-"{i),....
J~f.~' #i'1j::'~ . - --1
~~
"e
' ~~
' -='~')2',"t- '\,F - :~~~:~~al
~-~
amniotic cavity ~ ~-t,..""I1>_ I
,. ,..- ~11J I
J"j±":;~
cytot(ophOblast
i.i
hypobla st
maternal blood ~
(primitive blastocele
vesse ls ij.f
endoderm)
O.lmm
Cytotrophoblast __
"---. ~- -- y<:lt :sac
-*-.
~+~~ . ~it!"'
..." r
·" .'.., l' ~
~.~~)~ Ai%.,\,jl':..";':· '
;r.... . •
.
~
Even before implantation occurs, cens of the ICM begin to differentiate into Figure 5.1 1 Detail of human embryo
two distinct layers. An external cell layer, the epiblast (or primitive ectoderm) win implantation. (A) Human embryo
implanta tion begins at about 6 days
give rise to ectoderm, endoderm, and mesoderm (and hence all of the tissues of
after fertilization, wh en the hatched
the embryo), and also the amnion, yolk sac, and allantois. An internal cell layer,
blastocyst adheres to the wall of the
the hypoblast (also called primitive endoderm or visceral endoderm) will give rise uterus (endometrium). Trophob last
to the extraembryonic mesoderm that lines the primary yolk sac and the cells differentiate into an inner layer of
blastocele. cytorrophoblast and an OUler multinucl ~J,(~
A fluid· filled cavity, the amniotic cavity, forms within the inner cell mass, cell layer, the syncyriotrophoblast, that
enclosed by the amnion. The embryo now consists of distinct epiblast and hypo invades the co nnective tissue of the utefU$,.
blast layers that will come together to form a flat bilnminar germ disk located The inner cell mass of the blastocyst has
between two fluid·filled cavities, the amniotic cavity on one side and the yolk sac given rise to twO distinct cell layers. the
on the other (Figure 5.11B and Figure 5.12). epiblast and the hypob last. (B) By about
11 days after fertilization, the primary
yolk sac detaches from the surro unding
Gastrulation Is a dynamic process by which cells of the epiblast cytotrophoblast. l oose endod ermal cells ': ' 0:
give rise to the three germ layers scattered round the yolk sac. Villi composeo
of cytotrophoblast cells begin to extend into
Gastrulation takes place during the third week of human development and is the the syncytiotrophoblast. (Reproduced from
first major morphogenetic process in development. During gastrulation, the ori Mclachlan (994) Medical Embryolog y. Will'!:
entation of the body is laid down, and the embryo is converted from a bilaminar permission from Addison-Wesley.)
structure into a structure ofthree germ layers: ectodenn, endoderm, and _m eso
derm. It should be noted that whereas the bilaminar germ disk is a flat disk in
humans, rabb its, and some other mammals, it is cup·shaped in the mouse (co m·
pare Figure lB in Box 5.1).
In mammals, birds, and reptiles, the major structure characterizing gastrula
tion is a linear one, th e primitive streak. This transient structure appears, at
abo ut day 15 in human development, as a faint groove along the longitudinal
midline of the now oval-shaped bilarninar germ disk. Over the course ofthe next
day, the primitive streak develops a groove (primitive groove) , which deepens and
elongates to occupy close to half the length of the embryo. By day 16 a deep
depression (the primitive pit), surrounded by a slight mound of epiblast (the
primitive node), is evident at the end of the groove, near the center of the germ
disk (see Figure 5.12).
The process of gastrulation is extremely dynamic, involving very rapid cell
movements. At day 14-15 in human development, the epiblast ceUs near the
primitive streak begin to proliferate, flatten, and lose their connections with one
another. These flattened cells develop pseudopodia that allow them to migrate
downward through the primitive streak into the space between the epiblast and
152 Chapter 5: Principles of Development
extraembryonic mesoderm
the hypoblast (Figure 5.J3B). Some of the ingressing epiblast cells invade th e
hypoblast and displace its ceUs, leading eventually to complete replacement of
the hypoblast by a new layer of epiblast-derived cells, the definitive endoderm.
Starting on day 16, some of the migrating epiblast cells diverge into the space
between the epiblast and the newly formed definitive endoderm to form a third
layer, the intraembryonic mesoderm (Figure 5.13C). When the intraembryonic
mesoderm and definitive endoderm have formed, the residual epiblast is now
described as the ectoderm, and th e new three-layered structure is referred to as
the trilaminar ge,m disk (see Figure 5.13C).
The ingressing mesoderm cells migrate in different directions, some laterally
and others toward the anterior, and others are deposited on the midline (FIgure
5.14).
The cells that migrate through the primitive pit in the center of the primitive
node and come to rest on the midline form two structures:
• The prechordal (also called prochordal) plate is a compact mass of mesoderm
to the anterior of the primitive pit. The prechordal plate will induce important
cranial midline structures such as the brain.
• The notochordal process is a hollow tube that sprouts from the primitive pit
and grows in length as cells proliferating in the region of the primitive node
add on to its proximal end. The notochordal process and adjacent mesoderm
IAI (9) day 14- 15 (C) day 16
ej)jblasl
(primitive
hypoblast
(primitive
endoderm)
fii~±::!!!!!I!!'ii"~3!!!!!!t:'iF{' ecloderm
.. embryoniC
Figure 5.11-During human gastrulation, a flat bilaminar germ disk transforms into a trilaminar embryo. The outcome of gastrulation is similar in
all mammals, but major differences may occur in the details of morphogenesis, particularly in how extraembryonic structures are formed and used. (A) In
humans, the epiblast and hypoblast come into contact to form a flat bilaminargerm disk. The epiblast cel ls within this disk are described as the primitive
ectoderm but will give rise to alt three germ layers- ectoderm, endoderm, and mesoderm-as described in panels (8) and (C). The epiblast celts that are
not in contact with the hypoblast w ill give rise to the ectoderm of the amnion. The hypoblast will give rise to the extraembryonic endoderm that lines the
YOlk sac. (8) The bflaminar germ di sk at 14- 15 da ys of human development. Along the length of the primitive streak. epiblast cells migrate downward to
invade the hypoblast, and in so doing they become converted t o embryonic endoderm and displace the celt s of the hypoblas t. (C) The bflaminar germ
disk at 16 days of human development. A secon d wave of in gressing epiblast cells diverge into the space between the epiblas t and the newly formed
emb ryonic endoderm to form embryonic mesoderm . The remai ning epiblast cells are now known as the em bryonic ectoderm. [From larsen (2001 )
Hum an Embryology, 3rd ed. With permission from Elsevier.1
EARLY HUMAN DEVELOPMENT: fERTILIZATION TO GASTRULATION 153
induce the overlying embryonic ectoderm to form the neural plate, the pre
cursor of the central nervous system (Figure 5. 15). At the same time as the
notochordal process extends, the primitive streak regresses. By day 20 of
human development, the notocbordal process is completely form ed, but it
then transforms from a hollow tube to a solid rod, the notochord. The noto
chord will later induce the formation of components of the nervous system
(as described in Section 5.6).
The ingressing mesoderm cells that migrate laterally condense into rodlike
and sheetlike structures on either side of the notochord. There are three main ~"~e::
endoderm
-t
head
mesenchyme ~ ::=============~)~ outer layers of eye
head muscles
axia l mesoderm
(notochord) Ffgu... 5. 17 Principal derivatives of
- + axial ske leton
-t
scleroto me
the three germ layers. The three germ
paraxial myotome ) trunk skeletal muscles layers formed during gas{(u!ation w ill
mesoderm
dermatome ~ connective ti ssue layers of skin eventually form all tissues of the embryo.
The connective tissue o f the head and
intermediate._..r-+ Mullerian ducts ~ oviducts, uterus the cartilage o f the skull and of structures
~ mesoderm -y mesonephros ~ kidney, ovary, testi s, ureter derived from branchial arches are a mixture
~.~
~
~-:~ proliferation of ectoderm at t iof-~ g jns of
<t~
.'
(, " . ~
the neural plate pushing the <.C.::>:j. together.
~
~ n '::R
j J ~ neu,al As the neural folds come t ~"'le;. neural
U J:;; crest crest cellsdelaminate and beg:1 to migrate
ectoderm notochOrd neural tuM ' . cells
away to diverse locations, .n d@.scribed latfT
1 S6 Chapter 5: Principles of Development
The vertebrate neural crest is quite extraordinary and, although Tilbte, Major classes of neural crest derivatives
derived from ectoderm, its importance has ted to suggestions that
it be recognized as a fourth germ layer. Neura l crest cell s originate Tissue/ region Cell types/structure
during neurulation from dorsa l ectoderm cell s at the edges of the
neural foldS (Figure 5.18). The neural crest is a transient structure - the Peripheral nervous sensory ganglia; sympathetic ganglia;
cells disperse soon after the neu ral tube closes. Th ey und ergo an system (PNS) ne urons parasympathetic ganglia
epithelial-ta-mesenchymal transition and they migrate away from the
PNS glial cells Schwann ce ll s; non-mye!inating glial cells
midline (Figure 1).
Neural crest cells migrate to peri pheral locations, where they
Endocrine/ adrenal medulla; calcitonin-secreting
assume a variety of different ectodermal and mesodermal fates, giving paraendocrine cells; carotid body type I cells
rise to a quite prodigious number of differentiated cell types (Table 1). deri vatives
Most of our information comes from fate-mapping studies in the chick,
which have been aided by the ability to transplant corresponding Head and neck corneal endothelium and stroma; dermis;
domains of the do rsa l neural tube between chick and quail embryos smooth muscle. and adipose tissue;
and to identify the cellular derivatives of such domains. facial and anterior ventral skull cartilage
Neural crest (NC) cells form at all levels along the anteroposterior and bones; lachrymal gland, connective
axis of the neural tube; however, four main, but overlapping, regions tissue; pituitary gla nd, connective tissue;
are recognized with characteristic derivatives and functions as listed salivary gland, connective tissue; thyroid,
below. connective tissue; tooth pa pillae
sympathetic. notochord
L ganglion R
aorta Figure' Main migratory pathways during neural crest
adrenal migration. Schematic cross section through the middle part of the
gland trunk of a chick embryo. Cells taking the pathway just beneath the
coelomic ectoderm (outer red arrows) w il l form melanocytes; those that take
cavity
enteric the deep pathway via the somites (i nner red arrows) w ill form the
ganglia neurons and glial cells of sensory and sympathetic ganglia. and parts
gut tube
of the adrenal gland. The neurons and glia of the enteric ganglia, in
the wal! of the gut, are formed from neural crest cells that migrate
along the length of the body from either the neck or sacral regions.
[From Alberts 8, Johnson A, Lewis J et al. (2002) Molecular Biology of
v
the Cell. 4th ed. Garland Science/Taylor & FranCis LLC,]
factors of the Emx and Otx families are expressed. In the hindb rain and spinal
cord, positional identities are controlled by the Hox genes. In the hindbrain, it
seems that the positional values of cells are fixed at the neural plate stage and
that migrating neural crest cells carry this information with them and impose
positional identities on the tissues surrounding their eventual resting place.
NEURAL DEVELOPMENT 157
.
~~:<~;?
/----- ~
. TGf'./l D~ 2.----::'-:::V/
.~----.--~. / • Shh vO
~ ... V1
}/ V2
M
V3
Conversely, positional identity in the spinal cord is imposed by signals from the FlguTe 1.19 Pattern formation of the
surrounding paraxial mesoderm, which can be shown by transplanting cells to nervous system. Dorsoventral pattern
different parts of the axis. Cranial neural crest cells behave according to their lin formation involves a competition between
secreted signa ls from the notochord (SoniC
eage when moved to a new position; that is, they do the same things as they would
hedgehog, Shh) and from the ectoderm
in their original position. Trunk crest cells, in contrast, behave according to their
(initially BMP4, but later other members of
new position-they do the same things as their neighbors. the TG F ·~ superfami ly). As the neural plate
The developing nervous system is a good model of pattern form ation and dif fold s up. the signals are expressed in the
ferentiation, because various cell types (different classes of neurons and glial neural tube itself. The result is the activation
arise along the do rsoventral axis. This process is controlled by a hierarchy of of different tran scription fac tors in different
genetic regulators (Figure 5.19). Dorsoventral polarity is generated by opposing zones resu lting in the regional specification
sets of signals originating from the notochord and the adjacent ectoderm. The of different ne uron al dorsal (D) and ventra l
notochord secretes the signaling molecule Sonic hedgehog (Shh), which has ven (V) subtypes and motor (M) neurons.
tralizing activity, whereas the ectoderm secretes several members of the TGF-~
(transforming growth factor) superfamily, including BMP4, BMP?, and a protein
appropriately named Dorsalin.
As the neural plate begins to fold, these same signals begin to be expressed in
the extreme ventral and dorsal regions of the neural tube itself- the floor plate
and roof plate, respectively. The opposing signals have opposite effects on the
activation of various homeodomain class transcription factors (such as DbxI,
lnG, and NKx6.I) . Expression of specific transcription factors divides the neural
tube into discrete dorsovenu·al zones, which later become the regional centers of
different classes of neuron (see Figure 5.19). For example, the zone defined by the
expression of Nkx6.1 alone becomes the region populated by mo tor neurons,
which are residen ts of the ventral third of the neural tube.
~
6-'~
- -- L1eUrOgenin .~ , \\ax;alm"SOle
• Shh
~ Pax6 MATH!
IE~~. .......:::.-<'" .
~
• Pax3 MASH1 ----< "'.2 , Lom1
Neurogenin ,
~
V dorsal 11mb
Pax?
-
¥ '
MASH1
and anteroposterior axes of the nervous system, which is defined by the combi
nation of transcription factors discussed in the previous section.
Further diversification is controlled by refinements in the expression patterns
of the above transcription factors. For example, all motor neurons initially express
two transcription factors of the LIM homeodomain family: Islet-l and [slet-2.
Later, only those motor neurons projecting their axons to the ventral limb mus
cles express just these two LIM transcription factors. Neurons expressing lsI-I,
Isl-2, and a third transcription factor, Lim-3, project their axons to the axial mus
cles of the body wall. Neurons expressing Isl-2 and Lim-l projecttheir axons to
dorsal limb muscles, whereas those expressing Isl-l alone project their axons to
sympathetic ganglia (see Figure 5.20). The transcription factors determine the
particular combinations of receptors expressed on the axon growth cones and
therefore determine the response of growing axons to different physical and
chemical cues (such as chemoattractants). Once the first axons have reached
their targets, further axons can find their way by growing along existing axon
paths-a process termed fasciculation.
The PGCs then migrate away from the posterior region of the primitive streak
into the endoderm and enter the developing hindgut for a short period. At the
end of gastrulation the PGCs are determined and migrate into the genital ridge, a
mesoderm structure that is part of the developing gonad. At this stage, the gonad
is bipotential, being capable of developing into either testis or ovary.
~ ligand
I
~-- BMP4 Dpp
conserved in several species, th e pathway
has diverse roles in 'lertebrates, flies
(D. melonogasrer), and worms (c. e/egans).
plasma
- membrane
(8)
./""-= .
1-- G protein Ras1
bd--
! Ras LET-60
r- plasma
membrane
FURTHER READING
General Noncoding RNA in development
Gilbert SF (2006) Developmental Biology, 8th ed. Sinauer Amaral PP & Mattick JS (2008) Noncoding RNA in development.
Associates. Mamm. Genome 19, 4S4-492.
Hill M (2008) UNSW Embryology. http://embryology.med.unsw Stefani G & Slack FJ (2008) Small non-coding RNAs in anima l
.edu.aul (An educational resource for learning concepts in development. Nat. Rev. Mol. Cell8iol. 9, 219-230.
embryological development hosted by the University of New
South Wales, Sydney.l Animal models of development
Chapter 6
KEY CONCEPTS
• DNA molecules are velY long and complex and are prone to shearing when isolated from
other cell constituents, making their purification difficult.
• DNA cloning means making multiple identical copies (clones) of a DNA sequence of
interest, the target DNA; the increase in copy number is called amplification. The process
requires a DNA polymerase to replicate a target DNA sequence repeatedly, either within
cells or ill vitro.
• In cell-based DNA cloning, target DNA is fractionated by transfer into bacterial or yeast cells
(transformation). Each transformed cell typically takes up just one target DNA molecule and
replicates it using the host cell's DNA polymerase.
• Before transfer into cells, a target DNA is joined to vector DNA molecules, a plas mid, or a
ph age DNA capable of self-replication inside cells. The target-vector DNA complexes are
known as recombinant DNA.
• Restriction endonucleases cut DNA molecules at defined short recognition sequences and
are needed to prepare target and vector DNA for joining together.
After having been transferred into ceils, recombinant DNA typically replicates
• Large target DNAs are more stable in cells if carried by vectors whose copy number is
constrained to one or two per ceil.
Megabases of DNA can be cloned in yeast by adding very short sequence elements
necessary for chromosome function to form a linear artificial chromosome.
• In expression cloning the target DNA lS designed to be expressed. It is converted to RNA and
in many cases translated into a recombinant protein.
• Cell-based DNA cloning is slow and laborious; however, it can produce very large amounts
of cloned DNA and is the first choice fo r cloning large DNA sequences, expressing genes,
and making comprehensive sets of DNA clones (DNA libraries).
• DNA cloning can also be performed rapidly in vitro with the polymerase chain reaction
(PCR). A purified DNA polymerase is usually used to replicate a specific DNA sequence
selectively within a complex sample DNA. Oligonucleotides are designed to bind specifically
to sites in the sample DNA that flank the target sequence, allowing the polymerase to
initiate DNA replication at these sites.
• PCR can also be used to amplify many different target DNA fragments simultaneously, for
example by using complex mixtures of numerous different primer sequences. This
indiscriminate amplification can replenish many of the sequences in precious samples
where there is little starting DNA.
• As well as being a DNA cloning p rocedure, PCR is also widely used as an assay to quantitate
DNA and RNA after it has been converted by reverse transcriptase to complementary DNA
(cDNA). Real-time PCR is the most efficient method, and quantitates PCR products while
th e peR reacti on is occurring.
164 Chapter 6: Amplifying DNA: Cell-ba sed DNA Cloning and PCR
The fundamentals of current DNA techno logy are velY largely based on two quite
different approaches to studying specific DNA sequences within a complex DNA
population, as listed below (see also FIgure 6.1):
• DNA cloning. DNA sequences of interest are selectively replicated in some way
to produce very large numbers of identical copies (clones). The resulting huge
increase in copy number (amplification) means that DNA cloning is effec
tivelya way of purifying a desired DNA sequence so that it can be comprehen
sively studied.
• Molecular hybridization. The fragment of interest is not amplified or purified
in any way; instead, it is specifically detected within a complex mixture of
many different sequences. This will be covered in Chapter 7.
Before DNA cloning, our knowledge of DNA was extremely limited. DNA clon
ing teChnology changed all that and revolutionized the study of genetics. In
comparison with protein sequences, DNA molecules are extremely large and
complex-individ ual nuclear DNA molecules often contain hundreds ofmiUions
of nucleotides. When DNA is isolated from cells by standard methods, these huge
molecules are fragmented by shear forces, generating complex mixtures of still
very large DNA fragments (abou t 50-100 kb in length with standard DNA isola
tion methods). Give n the complexity of the DNA isolated from the cells of a typi
cal eukaryote, or even prokaryote, the challenge was how to analyze it.
For many eukaryotes, one early approach had been to separate different pop
ulations of DNA sequence by centrifugation. Ultracentrifugation in equilibrium
density gradients (e.g. CsCI density gradients) typically fractionates the DNA
from a eukaryotic cell into a maj or band representing the bulk of the DNA plus
several minor satellite bands that are composed of classes of repetitive DNA. The
DNA molecules of the sateliite bands (satellite DNA) have different buoyant den
sities from that of the bulk DNA (and from each other) because they consist of
very long arrays of short tandem repeats whose base composition is significantly
different from that of the bulk DNA. The satellite DNA sequences were found to
be involved in specific aspects of chromosome structure and function. Although
valuable and interesting, o1e purified satellite DNAs were a minor componen t of
the genome and did not can tain genes.
What was needed were more general methods that allowed any DNA sequence
to be purified, not just DNA sequences whose base compositions deviated sig
nificantly from that of the total DNA. TWo major DNA cloning methodologies
were developed that made this possible. Both use DNA polymerases to make
multiple replica copies of DNA sequences (DNA clones) , permitting in principle
the purification of any DNA sequence within a complex starting DNA
Figure 6. 1 General approaches for
population. studying specific DNA sequences in
complex DNA populations. DNA molecules
are very long and complex, and any gene
or exon or other DNA sequence of interest
typically represents a tiny fraction of the
complex DNA population
sta rting DN A that we ca n isolate from cells.
Two quite different approaches ca n be
1
used to study a particular DNA sequence
detection of a o f interest. One way is to purify a specific
r
GEN ERAL APPROAC HES amplification of a
FOR STUDYING SP ECIFIC specific DNA DNA sequence by selectively replicating
specifiC DNA
DNA SEQUENCES (DNA cloning) its sequence in some way (DNA cloning),
METHODS
cell-based cell-free clon ing
!
molecular
thereby increasing its copy number
(amplification). The specific DNA can either
be cloned within bac terial or yeast cells
Cloning IPC R) hybrid ization
by using cellular DNA polymerases, or by
ESSENTIAL
!
suitable host cell
!
thermostab le DNA
1
labeled DNA, RNA,
using purified DNA polymerases in vitro, as
in the polyme rase chai n reac tion (peR). An
alternative ap proach makes no attempt to
REQUIREMENTS replicon (vector) polymerase or oligonucleotide pu ri fy the DNA seq uence, but instead seeks
meaf'1S of attaching • sequence in formation probe
to detect it specifica lly. It involves labeling
foreign DNA to vector to enable synthesis of • means of detecting
and transferring to specific oligonucleotide DNA fragments to a p re viously isolated nucleic acid sequence
host cell primers which probe binds that is related in seq uence to the desired
• selection/screening to
DNA sequence so that it can specifically
identify transformed
recognize it in a molecular hybridization
cells/recombinants
reac tion.
PRINCIPLES OF CEll-BASED DNA CLONING 165
Restriction endonucleases (also called restriction enzymes) are a class the major groove of the DNA at the binding site and so prevent the
ofbacteriai endonucleases that ,leave double-stranded DNA on both restrictio n enzyme from acting on It). The fcoRI endonuclease will,
strands, after first recognizing spec ific short seq uence elements in however, cut at unmethylated GAATIC seq uences such as in the DNA
the DNA. Wherever it encounters itsspecific recogni tion sequence, a of invading phage that has not previously been modified by the fcoRI
restriction enzyme can cleave the DNA, ofte n within the recognition methylase.
sequence or close nearby_ Together, a restriction enzyme and its cognate modification
Different bacterial strains produce different restriction enzymes, enzyme(s) form a restriction- modi fication (R·M) system. In some
and more than 250 different sequence specificities have been R-M systems, the two enzyme activities occur as se parate subu nits or
iden ti fied (see REBASE, the Restriction Enzyme Database, at http:// as separate doma ins of a large combined restriction-and-modification
rebase.neb.com/rebaselrebase.html). The convention for naming enzyme. Restriction enzymes have been classified into vario us broad
restriction enzymes is to begin with three letters, a capital for the fi m grou pings, according to the nature of their R-M system and other
lette r of th e bacterial genus followed by the first two letters of the cha racteristics. The most defined groups are listed below.
species.Therea fter, one letter is used to denote the bacterial strain Type I restriction enzymes are multisubun it, combined restriction
and finall y a roman numeral is used to distinguish different restriction and·modification enzymes tha t cut far away from their recogniti on
enzymes from the sa me bacterial strain. For example, EcaRI, which sequences at variable locations.
recog nizes the sequence GAAnc, denotes the first restriction enzyme Type I! restriction enzymes are widely used in manipula ting
to be reported for Escherich ia coli strai n RY13. and analyzing DNA. They are typically sing le-subunit restriction
The natural functio n of restriction enzymes is to protect bacteria endonucleases whose recognition sequences are often
from pathoge ns, nota bly bacteriophages (viruses that kill bacte ria 4-8 bp long, and they cut at defined positions either within the
known as phages for short). They do th is by selectively cleaving the recognition sequence or very cl ose to it. Type IIR enzymes cut
DNA of invading pathogens; the bacteria l DNA is protected from within their recog nition seq uences, whic h are often palindromes
cleavage because it has undergone site·s pecific methylation by (the sequence in the 5'---+3' direction is the same on both strands).
sequence-specifi c host cell DNA methyltransferases (modification). Type liS enzymes cut outside asymm etric recognition seq uences.
A particular strain of bacteria will produce a DNA Type liB enzymes have bipartite recognition sequences. See
methyltransferase with the same sequence specificity as a restriction Table 6.1 for examples.
enzyme produced by that strain. For example, E. co li strain RY13 Type 111 restriction enzymes are mu lti subunit combined
produces the fcaRI methyl transferase that specifically recognizes the restriction·and-modification enzymes that recognize two separate
seq uence GAATIC and methylates the central adenosine on both nonpali ndromic sequences that are inversely orientated.
strands. As a result the host cell's DN A is protected from cleavage They cut the DNA about 20- 30 bp outside their recog nition
by the feoRI restric tion enzyme (the methyl groups protrude into sequence.
To do this, the mixture of transformed cells is spread out on an agar surface and
allowed to grow to form well-separated cell colonies, each arising from a single
transformed cell. Individual cell colonies can then be picked and allowed to
undergo a second growth phase in liquid culture. resulting in very large numbers
of cell clones (Figure 6.2C).
The final step is to harvest the expanded cell cultures and purify the recombi
nant DNA (Figure 6.2D). The sections below consider these steps in so me detail.
~ (.)
purify vector and
cell-based DNA cloning_ (A) Formation of
.---. cut with restriction
bactenat cells
eozymes )
JI' t with a restriction endonuclease are mixed
with a homogeneous population of vector
molecules that have been cut wit h a similar
containing extra homogeneous
~ restriction endonuclease. The target DNA
vector DNA
chromosomal
replican (vector)
') and vector are joined by DNA ligase to form
recombinant DNA. Each vector contains an
purify DNA and
cD
cells containing
target DNA to be
enzymes +
heterogeneous
target DNA
~
(B) to undergo repeated cell division to
give a colony of identical cell clones
( '" containing one type of recombina nt DNA
0 + ( ~~
transformation that are kept physica lly separate fro m
)
,...---..... other colonies containing cell clones with
differen t recombinant DNA molecules.
~-
For convenience. we show the example
recombinant DNA host cells
of a vector whose copy number per cell is
highly restrained but many plasmid vectors
transformants can reach quite high copy numbers per
cell. constituting an additional type of
amplification. (0) Isolation of recombinant
IC) _ "'-:;;;' /1
DNA clones. after separation of the
.Jj" .....~ /I , - I ">I
recombinant DNA from the host cell DNA.
( ~ ,-
----...... /I . / ">I 11> ..... j /I
~ ">I
">I (If," /I
(If>.....\ /1 . ~
~
">I
I .J ">I (If, .... /I
- ">I
r ........
DNA fragments of a manageable size. When DNA is isolated from tissues and cul·
tured cells, the unpackaging of the huge DNA molecules and inevitable physical
shearing results in heterogeneous populations of DNA fragments. The fragments
are not easily studied because their average size is still very large and they are of
randomly different lengths. With the use of restrictio n endonucleases it became
possible to convert this hugely heterogeneous population of broken DNA frag
ments into sets of smaUer restriction fragments of d efined lengths.
Type II restriction endonucleases also paved the way for artificially recombin
ing DNA molecules. By cutting a vector molecule and the target DNA with restric
tion endonuc!ease(s) producing the same type of sticky end, vector-target asso
ciation is promoted by base pairing between their sticky ends (Figure 6,3). The
hydrogen bonding between their termini facilitates subsequent covalent joining
by enzymes known as DNA ligases. Intramolecular base pairing can also happen,
as when the two overhanging ends of a vector molecule associate to form a circle
168 Chapter 6: Amplifying DNA: Cell-based DNA Cloning and PCR
For other enzymes see the REBASE database of restriction nucleases at httpj/~ebase.neb.com/rebase/rebase.html. N = any nucleotide.
(cyclization), and ligation reactions are designed to promote the joining of target
DNA to vector DNA and to minimize vector cyclization and other unwanted
products such as linear concatemers produced by sequential end-to-end joining
(see Figure 6.3).
,
target DNA cut with Mool vector DNA cut wllh BamHI
~
Figure 6.3 Formation of recombinant
DNA. Heterogeneous target DNA sequences
,
C'l'AG
cut with a type II restriction enzyme (Mbor
in this example) are mixed with vector
DNA cut with another restriction enzyme
such as BamHI that generales the same
,
CTJI,.G type of sticky ends (here, a 5' overhanging
sequence of GATC). Ligation conditio~s are
chosen that will allow target and vector
1 molecules to combine, to give recombin ant
DNA. However, other li gation products
CTAG !
are possible. Intermol ecu lar target-target
concatemers are shown on the left, and
intramolecular vector cycliz<ltio n on the
-:r "fI. G right. Other possible ligatio n products not
shown here include intermolecular vector
concatemers, cyclization of multi mers, and
co-ligation events that involve two different
target sequences being included with a
vec[Or molecule in the same recombinant
C1AG( C:AG DNA molecule.
intramolecular
vector cycllzatlon
intermolecular vector-target
recombinant DNA
Escherichia coli isolate, for example, might have three different small plasmids
present in multiple copies and one large single-copy plasmid. Natural examples
of bacterial plasmids include plasm ids that carry the sex factor (F) and those that
carry drug-reSistance genes. Some plasmids sometimes insert their DNA into the
bacterial chromosome (integration). Such plasmids, which can exist in two
forms-extrachromosomal replicons or integrated plasmids-are known as
episomes.
A bacteriophage (or phage, for shon) is a virus that infects b acterial cells.
Unlike plasmids, phages can exist extracellularly but, like other viruses, they must
invade a host cell to reproduce, and they can only infect and reproduce within
cena in types of host cell. Phages may have DNA or RNA genomes, and in DNA
phages the genome may consist of do uble-stranded DNA or single-stranded DNA
and may be linear or circular. There is considerable variation in genome sizes,
from a few \dlobases to a few hundred \dlobases.
[n the mature phage particle, the genome is encased in a protein coat that can
aid in binding to specific receptor proteins on the surface of a new host cell and
entry into the cell. To escape from one cell to infect a new one, bacteriophages
produce enzymes that cause the cell to burst (cell lysis). The ability to infect new
cells depends on the number of phage particles produced, the burst size, and so
high copy nwnbers are usually attained. Like episomal plasmids, some bacterio
phages, such as phage ie, can also integrate into the host chromosome, where
upon the integrated phage sequence is known as a prophage.
-------
cgt,gaattc,GAG CTCfGTACCfGGfGA TCCrCTAGAfTCGACPT GCA~G[ATG CfAGC TTGG Cg t aa tc a t g gtcat (
(X-Gal)) that is converted to a non-toxic
l EcoRI Kpnl 8amHI Sail Spht reac tion product w ith an intense blue color. (
A poly ti nker inserted near the sta rt of the
vector'S lad' coding sequence provides a r
multiple cloning site (M(S} region , a series of
r
different unique restriction sites that allows
alternative possibilities for cloning restriction
fragments. The polylinker is small and
in serted so that the 10cZ' reading frame is
preserved and functional. However, cloning
ori
of a comparatively large insert into the ••
M(S typically ca uses insertional inactivation
and an absence of p-galactosidase activity,
and so cells with recombinant DNA will o
be colorless.
PRINCIPLES OF CELL-BASED DNA CLONING 171
G )~
(Ai
~ :2 ~ " "" ,
:JO
cel! ' 2 _ 2'r ...
'2)i"?:.. dM"O:" 2
,2 , , '}
o
replication J 3 3 ,.· 2 ,2
,.....
" '7' + ~ \2:i ",~
.. A:O -;r
~ 2 '
~
~:2 ) .,)I"::l"::l O ~~
!'! 2,
3
heterogeneous
:- 2
j'
::f
'."'' ....'
,
@-'2',-:o.
,.;t - .2 '} :i.' - " ;. ..
z
,.
recombinant DNA competent cells
multiple
ofonereco
d cells with
CO~bi"nt
transForme 'es (clones)
DNA
A
":3l ":l& ......,
'"
0'2;"",' _' '2''~. ' .'
,
.2
."
, ' .
. .
[\
=
plating out pipertlng and primary
amplification pick si ngle
seconda ry
very large numbers of
spreading transformed cells amplifi callon ~
onto an agar surface within ~ colony into cell clones with identical
liquid cu lture recombinant DNA
a Culture dish
to produce hlue colonies. However, recombinants with an insert in the polylinker 'FlgUNe 6 .5 Transformation, the critical
can cause insertional inactivation so that no functional ~-galactosidase is DNA fractionation step, and large-scale
produced. amplification of recombinant DNA.
(A) Cell-based DNA cloning typically starts
with a complex population oftarget DNA
Transformation is the key DNA fractionation step in cell -based
fragment s that are joined to a single
DNA cloning type of vector molecule, producing a
heterogeneous collection of recombinant
The plasma membrane of cells is selectively permeable so that only certain small DNA molecules. During transfo rmation, the
molecules are allowed to pass into and out of cells; large molecules such as long recombinant DNA molecules are t ransferred
DNA fragments are normally unable to cross the membrane. However, cells can into competent cells, so that a cell typically
be treated in different ways so that the selective permeability of the plasma mem takes up ju st one type of recombinant DNA
branes is temporarily distorted. One way, for example, is to use electroporation, molecule. As a result, the celt population
in which cells are exposed to a short high-voltage pulse of electricity. As a result acts as a fractionation syst em, allocating
of the temporary alteration to the cell membrane, some of the treated cells individual recombinant DNAs to individual
become competent, meaning that they are capable of taking up foreign DNA from cells. Within each cell, a recombinant
DNA can replicate extrachromosomally,
the extracellular environment.
often producing many identical DNA
Only a small percentage of competent cells will take up recombinant DNA to
clones. Thereahe r each transformed cell
become transformed. However, those that do will often take up just a single undergoes repeated divisions to generate
recombinant DNA molecule. This is the basis of the critical fractionation step in populations of identical cell clones that
cell-based DNA cloning. The population of transformed cells can be thought of will form cell colonies. (8) The cell colonies
as a sorting office in which the complex mixture of DNA fragments is sorted by can be physically separated by plaring
depositing indiVidual DNA molecules il1to individual recipient cells. Extra out, a procedure that involves growth of
chromosomal replication of the recombinant DNA often allows many identical transformed cells on a nutrient agar surface
copies of a single type of recombinant DNA in a cell (Figure 6.5). so as to allow separation of cell colonies.
Because circular DNA transforms much more efficiently than linear DNA, Individual colonies are picked and added to
most of the cell transformants will contain cyclized products rather than linear liquid culture in flasks to allow secondary
amplification. «(ourtesy of James Stock,
recombinant DNA concatemers. Usually vector molecules are treated to reduce
King's College. Used under the Creat ive
the chance of cyclizing and so most of the transformants will contain recombi Commons Attribution ShareAlike 3.0 license.)
nant DNA Note. however, that co-transformation, the occurrence of more than
one type of introduced DNA molecule within a cell clone, can sometimes occur.
The transformed cells are allowed to multiply. [n cloning using plasmid vec
tors and a bacterial cell host, a solution containing the transformed cells is sim
ply spread over the surface of nutrient agar in a Petri dish and the cells are allowed
to grow (a process called platingout-Figuxe 6.5). This usually results in the for
mation of bacterial colonies that consist of cell clones (all the cells in a single
colony are identical because they are all descended from a single founder cell).
'72 Chapter 6: Amplifying DNA: Cell-based DNA Cloning and peR
TABLE 6.2 PROPERTIES OF CELL ·BASED DNA CLONING SYSTEMS ARRANGED ACCORDING TO INSERT CAPACITY
PLASMID VECTORS
Plasmid vectors Standard plasmid and Usually up often high Very versatile. Routinely used in subcloning of fragments amplified
allowing phagemid vectors to 5-10 kb by PCR. Also widely used in making eDNA libraries and for expressing
replication to genes to give RNA or prote in products. Phagemids can produce
high-copy single-stranded recombinant DNA
numbers
Cosmid vectors 30-44 kb high Used for making genomic DNA libraries with modest-sized inserts.
However, high copy number makes inserts comparatively unstable
Plasmid vectors Fosmid vectors 30- 44 kb 1-2 An alternative to cosmid cloning, providing much greater insert
whose copy stability at the expense of lower yields of recombinant DNA
number is
constrained PAC (Pl artificial 130- 150 kb 1-2 Used for making genomiC DNA libraries, often as a supplementary
chromosome) vectors aid to BAC cloning
BAC (bacterial artificial up to 1-2 Used extensively to make genomic libraries for use in genome
chromosome) vectors 300kb projects
YAC (yeast artificial 0.2-2.0 Mb low Used as an intermediate cloning system for some genome projects
chromosome) vectors (e.g. Human Genome Project) but inserts are often unstable. Also
useful in studying function of large genes or gene clusters
PHAGE VECTORS
MI3 sma ll lowto Useful in the past for making single-stranded recombinant DNA for
medium sequencing templates and site-directed mutagenesis. Superseded
by phagemids
A insertion vectors e.g. Agt11 up to 10 kb low Useful for making cDNA libraries because they can accept quite large
inserts and because large numbers of recombinants can be made
Areplacement vectors 9- 23 kb' low Used for making genomiC DNA libraries in the past; rarely used now
P1 70- 100 kb low Used for making genomic libraries in the past; rarely used now
alnsert size is increased compared to insertion vectors by removing a central non-essential part of the A. genome and replacing it by the DNA to be
cloned. Note that plasmid vectors are generally preferred; phage vectors are comparatively difficult to work with.
important tools in the Genome Projects and in working out gene structure and
the relationship between genes and their transcripts. We will consider them in
detail in Chapter 8.
Vertebrate genes are often very large with a high content of repetitive DNA,
making them less amenable to cloning. Repetitive DNA is extremely rare in bac
terial genomes, and large amounts of introduced recombinant DNA that is rich in
repetitive DNA sequences are not well tolerated. As a result, standard plasmids
can rarely be used to clone large (> 10 kb) fragments of animal DNA; instead, the
inserts are usually less than 5 kb. Plasmids continue to be the workhorses of DNA
cloning for many research purposes, however. Vectors such as pUC19 (see Figure
6.4) permit high-copy-number replication and large amounts of recombinant
DNA, and, as we will see in Section 6.3, plasmid vectors are widely used to express
genes of interest. To clone larger DNA fragments, different DNA cloning systems
\I'ere needed (see Table 6.2 for an overview). In addition. some applications for
studying recombinant DNA, notably DNA sequencing and in vitro site- specific
mutagenesis, required single-stranded DNA templates. As a result, cloning vec
lOrs were also developed that could be used to generate single-stranded recom
binant DNA.
Early large insert cloning vectors exploited properties of
bacteriophage },
Bacteriophage A (lambda) has a linear 48.5 kb double-stranded DNA genome.
Within an E. coli host, A DNA can replicate extrachromosomally or it can inte
grate into the host chromosome and be replicated as a chromosome component.
174 Chapter 6: Amplifying DNA: Ceil-based DNA Cloning and peR
replicon, and so only low yields of recombinant DNA can be recovered from the
host cells. However, because of their great insert stability they were the most
widely used clones that we re selected for sequencing in the Human Genome
Project.
Cosmid-like vectors that have been adapted for low-copy-number replication
by using F-factor genes are known as fosmid vectors. The inserts have the same
size range as in cosmid cloning but are much more stable.
Bacteriophage Pl vectors and Pl artificial chromosomes
An alternative approach to cloning large inserts in E. coli was to develop vectors
based on bacteriophages that have naturally large genomes. For example, bacte
riophage PI has a 110-115 kb linear double-stranded DNA genome that, like
phage A, is packaged into a protein coat. Bacteriophage PI cloning vectors have
therefore been designed in which components of PI are included in a circular
plasmid. Subsequently, features of the PI and F-factor systems were combined in
another plasmid-based cloning system in which recombinants are known as a PI
artificial chromosomes (PACs). As in BACs, the inserts of PACs are very stable
and they have also been used as direct templates for sequencing in some genome
projects.
,
£coRI compensates for a mutation in the yeast
host cell that causes the accumulation of red
pigment. Th e host cells are normally red, and
~l
ARSl
those transformed with just the vector will
TRPl
form colorless colonies. Cloning of a foreign
,JURA3 DNA fragment into the SUP4 gene causes
insertional inactivation, restoring
AmpA
or1 ..
TEL.!.
~
~ TEL
target DNA
the mutant (red) phenotype. Target DNA
is pa rtly digested with the restriction
-TEl - CEN
ARS - --TEL
described above can be used to identify
transformed cells and those containing
recombinant DNA.
'76 Chapter 6: Amplifying DNA: Cell-based DNA Cloning and PCR
• •• -
~ 5~
.47
single-stranded
M13
mutagenic primer
C c_
CGG .~c
"G
cT G A
C G
A o A
mutant
57"ence
0 - - -......(
OT
GO
double-stranded
heteroduplex
mutant
yeast cells; instead, the external cell walls must first be removed. The res ulting homoduplex
yeast spheroplasts can accept exogenous fragments but are osmotically unstable
and need to be embedded in agar. The overall transformation efficiency is very
low, and the yield of cloned DNA is low (about one copy per cell) . Nevertheless, F1guAl6.8 Oligonucleotide mismatch
the capacity to clone large DNA fragments (up to 2 Mb) made YACs a vital tool in mutagenesis. Many different meth ods of
oligonucleotide mismatch mutagenesis ca n
the physical mapping of some complex genomes, notably the human genome.
be used to create a desired point mutation at
a unique predetermined site within a cloned
Producing single-stranded DNA for use in DNA sequencing and DNA molecule. The example illustrates
In vitro site-specific mutagenesis the use of a mutageniC oligonucleotide to
direct a single nucleotide substitution in
Single-stranded DNA clones have been preferred templates for DNA sequencing: a gene. The gene is cloned into a suitable
the sequences obtained are clearer and easier to read than when double-stranded vector to generate a single· stranded
DNA is used. We will consider aspects of DNA sequencing in Chapter 8. recombinant DNA, such as an M13 vector
Single-stranded DNA templates are also ideal substrates for site-specific (as shown here) or a phagemid vector. An
in vitro mutagenesis, an invalu able experimental tool in dissecting the func oligonucleotide primer is desi gned to be
tional contributions made by short sequences of nucleotides or amino acids. For comp lementary in sequence to a portion
example, the suspected contribution of a specific amino acid to the function of of the gene sequence encompassing the
its protein can be tested by deleting the amino acid or mutating it to give a differ nucleotide to be mutated (in this case an
adenine, A) and containing the desired
ent amino acid with a rathe r rllfferent side chain. To do this, a single- stranded
non-complementary base at that positio n
cloned DNA with the wild-type sequence is allowed to base-pair with a synthetic
(C, noe T). Despite the internal mismatch,
oligonucleotide that is designed to be perfectly complementary in base sequence annealing of the mutagenic primer is
except for som e changes that will correspond to the desired mutation. The oligo possible, and second-stran d synthesis can
nucleotide is allowed to prime new DNA synthesis to create a complementary be extended by DNA polymerase and the
full-length sequence containing the desired mutation. The newly formed wild gap sealed by DNA ligase. The resulting
type-mutant heteroduplex is used to transform cells, and the desired mutant heteroduplex can be transformed into
sequence can be identified by screening for the mutation (FIgure 6.8). E. coli. whe reupon two populations of
Single-stranded recombinant DNA is often obtained by using cloning systems recombinants can be recovered: w ild-type
based on filamentous bacteriophages s uch as phages M13, fd, and fl, which nat and mutant homod uplexes. The latter can
be identified by PCR-based allele-specific
urally adopt a Single- stranded DNA form at some stage in their life cycle.
amplification methods (see Figure 6. 17).
Alternatively, plasmid vectors are used that contain filamentous phage DNA
sequences that regulate the production of single-stranded DNA.
M13 vectors
Bacteriophage M13 has a 6.4 kb circular Single-stranded DNA genome enclosed
in a protein coat, forming a long filamentous structure. It infects certain strai ns of
E. coli. After the M13 phage binds to the wall of a bacterial cell, it injects its DNA
genome into the cell . Shortly after the single-stranded M13 DNA enters the cell it
is converted to a double·stranded DNA, the replicative form that serves as a tem
plate for making numerous copies of the ge nome. After a short time, a phage
encoded product switches DNA synthesis toward the production ofsingle strands
that migrate to the cell membrane. Here they are enclosed in a protein coat, and
hundreds of mature phage particles are extruded from the infected cell without
cell lysis.
M13 vectors are based on the double-stranded replicative form, which has
been modified to contain a multiple cloning site and a lacZ selection system for
screening for recombinants. Double-stranded M13 recombinants can be trans
fected into a suitable E. coli strain; after a certain period, phage particles are h ar
vested and used to retrieve Single-stranded recom binant DNA (Plgure 6.9).
LARGE INSERT CLONING AND CLONING SYSTEMS FOR PRODUCING SINGLE-STRANDED DNA 177
MCS
ori
lacl V ~Z'
M13 vector
-
single
stranded
-7.3kb recombinant
DNA
=1
insert
target [)fIJA
fragments
double
•
•• I
••• i
L....- i solation of DNA
recombinant phage
stranded particles exit
recombinant from cells
DNA
...,
+
)'I
-------+)
switch to single
~
)
replication of
stranded DNA
+ double-stranded
synthesis of
recombinant DNA
+ strand only
~ ,... )'I
+ -------+) +
~ transcription
).t
• • .
phage coat proteins
Flgln'. 6.9 Producing single-stranded recombinant DNA. M13 vectors use the same facZ assay fo r recombinant screening as
pUC1 9 (see Figure 6.4). Target DNA is inserted in the multiple cloning site (MCS). The double-stranded Ml 3 re combi n ant DNA
enters the normal cycle of phage replication to generate numerous copies of the genome, before switching to production of
single-stranded DNA (+ strand only). After packaging with M13 phage coat protein, mature recombina nt phage panicles exit from
the cell without lysis. Single-stranded recombinant DNA is recovered by precipitating the extruded phage particles and chem ica lly
removing the protein.
Phagemid vectors
A small segment of a single-stranded filamentous bacteriophage genome can be
inserted into a plasmid to form a hybrid vector known as a phagemid; Figure
6_10 shows the pBluescript vector as an example. The chosen segment of the
phage genome contains all the cis-acting elements required for DNA replication
and assembly into phage particles. They permit successful cloning of inserts sev
eral kilo bases long, unlike M13 vectors, in which inserts of this size tend to be
unstable. After transformation of a suitable E. coli strain with a recombinant
phagemid, the bacterial cells are superinfected with a filamentous helper phage,
such as phage fl , which is required for providing the coat protein. Phage particles
secreted from the superinfected cells will be a mixture of helper phage and
recombinant phagemids. The mixed single-stranded DNA population can be MCS
~'a'i'
Plac
used directly for DNA sequencing because the primer for initiating DNA strand
rlf
synthesis is designed to bind specifically to a sequence of the phagemid vector
adjacent to the cloning site.
Figure 6.10 pBluescript, a phagemid vector. The pBluescript series of
origin
phagemid vectors each contain two origins of replication, a standard plasmid pBluescript vector
' or;
one (or!) and a second one from the fi la mentous phage fl. The f1 origin of 2 .9 kb
replication allows th e production of single-stranded DN A, but the vector lacks
the genes for phage coat proteins. Target DNA is inserted into the phagemid
vector, which is then u sed to transform host E. coli cells. Superinfection of
transformed cells w ith a helper M13 phage allows the recombinant phagemid
DNA to be packaged within phage protein coats, and the recombinant
Single-stranded DNA can be recovered as in Figure 6.9. AmpA
'78 Chapter 6: Amplifying DNA: Cell-based DNA Cloning and peR
for example, large quantities of a specific protein . It is common to express the pfT-3
4.6kb
product in well-studied cells, but sometimes it may be sufficient to express
the product in vitro.
Large amounts of protein can be produced by expression cloning
in bacterial cells
Cloning of eukaryotic cDNA in an expression vector is often required for the pro
duction of proteins in large quantities for research purposes such as structural
studies, or for biotechnology purposes, or as medically relevant compounds such
as thera peutic proteins. Usually, a cDNA providing the genetic information spec
ifying the protein sequences is inserted into a vector along with separate e<pres
sion Signals such as suitably strong promoters and other regulatory elements.
Because the expression system is based on recombinant DNA, the resulting pro
tero s are sometimes described, rather inaccurately, as recombinant proteins.
Bacterial cells have the advantage that they grow rapidly and can be expanded
easily in culture to very large culture volumes. Escherichia coli has been a favorite
host cell for the expression of introduced proteins.
1
)
lit' Pro
l'
Gly Ser
'I
Thr
l'
__
k& ,\Is
' I
AI..
=_=~
Ala
)
Sel
SToP
for the affi nity tag glutathione S-transferase
(GST) followed by a multiple cloning site
(MCS). The object is to clone a target coding
8amHI feoRI Sma l Sail Xhol NOli
sequence for a protein into the MCS so
pGEX-4T-3
that a fusion protein is produced, with
thrombin an N-termi nal GST seq uence fused to the
Glu v..,- Pre Arg ' Cly Serl Pro J\s,.. Se1 ,o,rg Vill ~p set set Gl.v Arg lie Val Thr Asp protein encoded by the ta rget eDNA. Three
====_==W===_=_~=m GTG Acr GAt lGA alte rnative vectors, pGEX-4T-l , pGEX-4T-2,
!! Lf-=---:-tJ I ----
L------1
BamHI
I
EeoRI Smal Sail XhOl
I
Norl
!
STOP srop STOj and pGEX-4T-3, have Slight ly different MCS
seq uences that differ in the translatIonal
reading frame. By using all three alternative
vectors. cloned insertscan be expressed m
glutathione --, r each of the three amino acid readi ng frames
S-transferase I I so that in at least one of the three cases, the
target DNA should be in frame with the GST
Ptac
AmpR sequence. The expressed fusion protem can
be purified easily on a glutathione affinity
purification column such as glutathione
Sepharose 4B. Because the MCS is
engineered to contain a thrombin cleavage
site, the desired protein can be purified
after cleavage at the thrombin cleavage site.
AmpR, ampicillin resistance gene; ori, origin
~ri
of replication; STOP. termination codon.
- --
s~.....{otein lll
lr 'r transfect
E.coli
h.,,,"s'
assembled
".~/ p<o,O," '"
I
2 Phage )
--- + )
r )
mixture of cONAs
(B)
) )
phage library in
f
biolin- e
selective binding of p/1age
aqueous solution antibody ......
biotin-conjugated
eXJ)(esslng protein 1
antibody specific
with streptavld in (5)
selective bInding of
for protein 1
biotin-conJugaled phage
fusion protein that is incorporated into the phage's protein coa t, so that it is dis FIgure 6. 11 Phage display. (AI Library
played on the surface of the phage without affecting the phage's ability to infect constructiOn . A heterogeneous mixture of
cells. If an antibody is available for the expressed protein, phage displaying that target cO NAs is cloned into a pha ge vec to r,
such as one based on phage M 13 or fl ,
protein can be selected by preferential binding to the antibody: affinity purifica
to express forei g n proteins on [he phage
tion of virus particles bearing such a protein can be ac hieved from a 108 -fold
su rface. In thi s example, DNA is inserted
excess of phage not displaying the protein, using even minute quantities of the into a cloning site loca ted within the ini tia l
relevant antibody (Agure 6.13). part of the coding sequence for the gene
In protein engineering, phage display is a powerful adjunct to random muta from phag e (1 that makes one of the phage
genesis programs as a way of selecting for desired va riants from a library of (oat proteins (protein III). The idea is that
mutants. It has also proved a powerful alternative source of constructing anti recomb inants should expre ss a fu sion
bodies, bypassing normal immunization techniques and even hybridoma tech protein with the target protein fu sed to this
nology. Phage libraries can also be used to identify proteins that interact with a phage coat protein, thereby enabling the
specific protein. 1n the same way in which antibodies can be used in affinity p hage particles to display the foreign protein
screening, a known protein (or any other molecule to which a protein can bind) on their surfaces. After transformation of
host E. coli cells, the phage DNA replicates,
can be used as a bait to select phages that display any other proteins that bind 10 and pha ge particles are assembled,
the bait protein. ex truded from the host cell, and harvested.
The mixtu re of recombinant phages is
Eukaryotic gene expression is performed with greater fidelity in known as a phage expression library.
eukaryotic cell lines (8) library screening with a specific antibody.
An antibody speC ific for just one rype of
Bacterial expression systems offer the great advantage of protein expression at protein will bind only to phage particles
very high levels. However, the biological properties of many eukaryotic proteins that di splay that protein . If t he anti body
synthesized in bacteria may not be representative of the native molecules. When is complexed with biotin, phage particles
produced in bacte rial cells, they do not undergo their normal post-translational carrying the desired pro tein ca n be purified
processing, and the folding of the eukaryotic protein may be incorrect or ineffi by using screptavidin-coated Petri dishes: lhe
desired recombinants wHl have an antibody
cient. As an alternative, eukaryotic cells, including insect and mammalian cells,
co mplexed to a biotin group and the biotin
are widely used for expressing recombinant proteins. The express ion vectors are group will bind strong ly to streptavidin.
typically designed to be able to be replicated both in the desired eukaryotic cells
and in E coli, where the intention is simply to produce large amounts ofrecom
binant DNA that ca n subsequently be used for expression studies in eukaryotic
cells.
Once transfected into animal cells, the recombinant D A can be expressed
for short or long periods, according to the expression vector used. For some pur
poses, all that is required is tran.sient expression. Here, the recombinant DNA
remains as an independent extrachromosomal genetic element (an episome)
within the transfected cells. Expression of the introduced gene reac hes a maxi
mal level about two or three days after transfection of the expression vector into
th e mammalian ce ll line; thereafter, expression can diminish rapidly as a result of
ce ll death or loss of the expression construct.
The alternative is to use stable express ion systems in which the recombinant
DNA can integrate into a host cell's chromosome. The advantage here is that if
CLONING SYSTEMS DESIGNED FOR GENE EXPRESSION 181
the recombinant DNA has integrated so that the introduced gene is expressed, all
descendant cells will contain this gene and expression can be maintained over
many cell generations. Some viruses are highly efficient at integrating into chro
mosomal DNA, and we will consider them in the context of gene therapy in a later
chapter. But plasmid expression vectors can also randomly integrate at low fre
quencies, and stable cell lines can be developed with an integrated gene of
interest.
Transient expression in insect cells by using baculovirus
Baculovirus gene expression is a popular method for producing large quantities
of recombinant proteins in insect host· cells: the protein yields are higher than in
mammalian expression systems, and the costs are lower. In most cases, post
translational processing of eukaryotic proteins expressed in insect cells is similar
to the protein processing that occurs in mammalian cells, and the expression of
very large proteins is possible. The proteins produced in insect cells have compa
rable biological activities and immunological reactivities to those of proteins
expressed in mammalian cells.
Autographa califomica nuclear polyhedrosis virus (AcMNPV) is a baculovirus
that can be propagated in certain insect cell lines and is often used as a cloning
vector for protein expression systems. The viral polyhedrin protein is transcribed
at high rates and, altho ugh essential for viral propagation in its normal habitat, it
is not needed in culture. As a result, its coding sequence can be replaced by a cod
ing DNA for a eukaryotic protein. The cloning vector is designed to express the
inserted protein from the powerful polyhedrin promoter, resulti ng in expression
levels accounting for more than 30% of the total cell protein.
PRPP -+1'J
APRT . . ... ~
t'r+ PRPP
,r-......... H~RT
g',,,m,,. ~1£ * t 1
substrates. De noyo synthesis of purine
nud eotides involves the eventual synthesis
of IMP (inosi ne monophosphate), which can
~t"'-+~-'-+r IMP ----+ inosine ~ hypoxanthine be converted sepa rately to AMP and GMP. De
PRPP J novo synthesis of pyrimidine nudeotides is
carbamoyl tetrahydrofolate ~
dependent on synthesizing UMP and dUMP,
phosphate
which is converted to TMP (thym idylate) by
~~444 dUMP . TMP ~ thym;d;n. _ inhibition by thymidylate synthetase; UTP is deaminated
antifolate agents, to give CTP (not shown). An act ivated form
aspartate : e.g. aminopterin
of tetrahydrofolate is crUCially required to
Ihymidylate TK provide a methyl or form yl group at three
synthetase key steps that can be blocked by anti fol ate
agents such as aminopterin, shutting down
the de novo pathways. However. purine
In functi onal complementati on, host cells have a genetic deficiency that pre
nucleotides and thymidylate can also be
ve nts a particular function; the original function can be restored if a functional synthesized from preformed bases by
copy of the defective gene is supplied by transformation by a vector. The trans sa lvage pathways that are not blocked by
gene and the marker can be transferred as separate molecules by a process known an tifolate drug s such as ami nopterin. The
as co-transformation. An example is the use of a Tk gene marker with cells that three sa lvage pathway enzymes are shown
are genetically d eficient in thymidine kinase (TIc-). Thymidine kinase is required in pale pink bo)(es: thymidine kinase (TK),
to convert thymidine into thymidine monophosphate (TMP) , butTMP can also hypoxa nthine-guanine phosphoribosyl
be synthesized by enzymatic conversion from dUMP. The drug aminopterin transferase (HG PRT), and adenine
blocks the dUMP ....TMP reaction, so cells cannot grow in the presence of this phosphoribosyjtransferase (APRT). Th e use
drug unless they have a source of thymidine and a fun ctional Tk gene. Selection of HAT medium (containing hypo)(anthine,
for Tk+ cells is usually achieved in HAT (hypoxanthine, aminopterin, thymidine) aminopterin, and thym idine) therefore
se lects for cells with active salvage pathway
medium (Figure 6.14).
enzymes. Mutant Tk- cells are gen etically
The major disadvantage of complementation markers is that they are specific deficient in thymidine kinase and will not
for a particular mutant host cell line in which the corresponding gene is non survive if grown in HAT mediu m unless
functional. As a result, they have largely been superseded by dominant they contain vector molecules containing a
selectable markers, which confer a phenotype that is entirely novel to the cell functiona l thymidine kinase gene.
and hence can be used in any cell type. Markers of this type are usually drug-re PRPp, 5-phosphoribosyl-l -pyrophos phate.
sistance genes of bacterial origin that can confer resistance to drugs known to
affect eukaryotic and bacterial cells. For example, the aminoglycosid e antibiotics
(which include neomycin and G418) are inllibitors of protein synthesis in both
bacterial and eukaryotic cells. The neomycin phosphotransferase (neo) gene
confers resistance to aminoglycoside antibiotics, so cells that have been trans
formed by the neo gene can be selected by growth in G4l8. Figure 6. 15 shows a
typi cal mammalian expressio n vector that co ntains a neo marker, along with sev
eral other features that facilitate selection and purification.
medium that makes it difficult to isolate DNA by the standard m ethods. Successful multiple cloning s ite
peR amplification is possible from small amounts of degraded DNA extracted
from decomposed tissues at archaeological or historical sites and from formalin Hindlll
Kpnl
fixed tissue samples. BamHI
PC,,"1Y
Spel
BstXI
PCR can be used to amplify a rare target DNA selectively from pcDNA3.1/myc-His
EcoRt
within a complex DNA population pUC ori BGH pA Pstl
EcoRV
BstXI
Usually, peR is designed to permit the selective replication of one or more specific f10 ri
Nott
target DNA sequences within a heterogeneous collection of DNA sequences, .., . . .. w ~
Xhol
enabling it to increase vastly in copy number (amplification). Often the starting SV400ri BstEIl
Xbal
DNA is total genomic DNA from a particular tissue or cultured cells, and the target and
enhancer Apal
DNA is typically a tiny fraction of the starting DNA. For example, if we want to Sadl
amplify the 1.6 kb J3-globin gene from human genomic DNA (with a haploid 85t81
genome size of3200 Mb), the ratio oftarget DNA to starting DNA is 1.6 kb:3200 Mb, Pmel
or 1:2,000,000. Many peR reactions involve the amplification of even smaller tar
gets. For example, single human exons are on average about 290 bp long, and
thus represent less than 0.000001 % of a starting population of genomic DNA. Figure lS.l S A mammalian expression
As an alternative to genomic DNA, the starting DNA may be total cDNA pre vector, pcDNA3.1Imyc-His. The Invitrogen
pared by isolating RNA from a suitable tissue or cell line and then converting it pcDNA series of plasmid expression
into DNA with the enzyme reverse transcriptase. This is known as reverse tran vectors offer high-level constitutive protein
scriptase-peR or RT-peR for short. lfthe original RNA population has many cop expression in mammalian cells. Shown
ies of the transcript of the target DNA, there may be significant enrichment of the here is the pcDNA3.1/myc-His expression
target sequence in comparison with its representation in genomic DNA. Under vector. Cloned eDNA inserts can be
optimal conditions, RT-PCR can also sometimes be used to amplify target transcribed from the strong cytomegalovirus
(PCMV) promoter, which ensures high-
sequences from tissues in which they are expressed at only a basal transcription
level expression in mammalian celis, and
level.
transcripts are polyadenylated with the help
of a polyadenylation sequence element
The cyclical nature of PCR leads to exponential amplification of from the bovine grow th hormone gene
the target DNA (BGH pA). The mu ltiple clon ing site region
is followed by two short coding sequences
peR depends on synthesizing Oligonucleotide sequences to act as primers for that can be expressed to give peptide tags.
new DNA synthesis at defined target sequences within the starting DNA; the idea The my' epitope tag allows recombinant
is ro selectively replicate only the target sequencers). It consists of a series of screening with an antibody specific for
cycles of three successive reactions conducted at different temperatures, so the the Myc peptide, and an affinity tag of six
process is so metimes described as thermocycling. First, the starting DNA is heated consec utive histidine residues (6 x Hi s)
to a temperature that is high enough to break the hydrogen bonds holding the facilitates purification of the recombinant
two complementary DNA strands together. As a result, the double strands sepa protein by affinity chromatography with a
rate to give single-stranded DNA (denaturation). For hnman genomic DNA, the nickel-nitrilotriacetic acid matrix. Translation
ends at a defined termination codon (STOP) .
reaction mixture is generally heated to about 93-95°C for the denatura tion step.
A neo gene marker (regulated by an SV40
After cooling, the synthetic oligonucleotide primers are allowed to bind by base
promoter/enhancer and an SV40 poly{A}
pairing to a complementary sequence on the single-stranded DNA (annealing!. sequence) permits selection by the g ro'Nth
Next, in the presence of the fo ur deoxynucleoside triphosphates dATp, dCTP, of transformed host cells on the ant ibio tic
dGTP, and dTTp, a purified DNA polymerase initiates the synthesis of new DNA G4 18. Components fo r propagation In E. coli
strands that are complementary to the individual DNA strands of the target DNA include a permissive o rigin of replicatio n
segment. fro m plasmid ColEl (pUC ori) and an
The orientation of the primers is deliberately chosen so that the direction of ampicillin resistance gene (Amp!!) p lus an
synthesis of new DNA strands is toward the other primer-binding site. As a result, origin o f replication from phage (1 (11 ori),
the newly synthesized strands can, in turn, serve as templates for new DNA syn which provides an option for producing
single-stranded recombinant DNA.
thesis, causing a chain reaction with an exponential increase in product (Figure
6.16). The three successive reactions of denaturation, primer annealing, and
DNA synthesis are repeated up to about 30- 40 times in a standard PCR reaction.
The DNA synthesis step ofPCR typically occurs at about 70-75°C, but the DNA
polymerase used also needs to withstand the much higher temperature of the
denaturation step. The requirement fot a heat-stable enzyme drove researchers
to isolate DNA polymerases from microorganisms whose natural habitat is hot
springs. An early and widely used example is Taq polymerase from Thermus
aqu.aticus. However, Taq polymerase lacks an associated 3'->5' exonuclease activ
ity to provide a proofreading activity. All polymerases make occasional errors by
inserting the wrong base, but such co pying errors can be rectified by a proofread
ing 3'->5' exonuclease activity. As a result, o ther thermophilic bacteria were
sought as sources of heat- stable polymerases with a 3'->5' proofreading activity,
such as Pfu polymerase from Pyrocaccu.s furiasLis.
184 Chapter 6: Amplifying DNA: Cell-based DNA Cloning and PCR
l ONA s~thesis
cycle results in two new DNA strands whose
5' ends are fixed by the position of the
oligonucleotide primer but whose
•• U l l i n·
3' ends extend past the other primer
(indicated bydoued lines). (8) Afterthe
second cycle, the four new strands consist
first-cycle products
of two more products with variable 3' ends
but now two strands of fixed length with
both 5' and 3' ends defined by the primer
(8)
sequences. After the third cycle, six out of
the eight new strands consist o f ju st the
target and fl anking primer sequences; after
u .. 30 cycles or so, with an exponential increase,
.Ullit this will be the predominant product.
..... n
second-cycle products
4 ....
• 1 II,
• II
.......
.......
t ......
thlrd-cyc le products
30th cycle
its binding site. A useful measure of the stability of any nucleic acid duplex is the
melting temperature (Tm ), the temperature corresponrling to the mid-point in
observed transitions from the double-stranded form to the single-stranded form.
To ensure a high degree of specificity in primer binding, the primer annealing
temperature is typically set to be about 5°C below the calculated melting tem
perature; if the annealing temperature we re to be much lower, primers wo uld
tend to bind to other regions in the DNA that had partly complementary
sequences.
The melting temperature, and so the temperature for primer annealing,
depends on base co mposition. This is because GC base pairs have three hydro
gen bonds and AT base pairs only two. Strands with a high percentage ofGC base
pairs are therefore more difficult to separate than those with a low composition
of GC base pairs. Optimal priming uses primers with a GC content between 40%
and 60% and with an even distribution of all four nucleotides. Additionally, the
calculated Tm values for two primers used together should not differ by greater
than 5°C, and the Tm of the anlplified target DNA should not differ from those of
the primers by greater than 10°C.
Primer design is aided by various commercial software programs and certain
freeware programs that are accessible through the web at compilation sites such
as http://www.humgen.n1/primer_design.htm!.Aswewillseebelow.itis par
ticularly important that the 3' ends of primers are perfectly matched to their
intended sequences. Various modifications are often used to reduce the chances
of nonspeCific primer binding, including hot-start PCR, touch-down PCR, and
the use of nested primers (Box 6.2).
peR 15 disadvantaged as a DNA cloning method by short lengths
and comparatively low yields of product
PCR is the method of choice for selectively amplifying minute amounts oftarget
DNA, quickly and accurately. It is ideally suited to conducting rapid assays on
genomic or cDNA sequences. However, it has disadvantages as a DNA cloning
method. In particular, the size range of the amplification products in a standard
PCR reaction are rarely more than 5 kb, whereas cell-based DNA cloning makes it
possible to clone fragments up to 2 Mb long. As the desired product length
increases. it becomes increasingly difficult to obtain efficient amplification with
PCR. To overcome this problem, long-range PCR protocols have been developed
that use a mixture of twO types of heat-stable polymerase in an effort to provide
optimal levels of DNA polymerase and a proofreading 3' -->5' exonuclease activity.
Using the mod ified protocols. PCR products can sometimes be obtained that are
tens of kilo bases in length: larger cloned DNA can be obtained only by cell-based
DNA cloning.
PCR typically involves just 3~0 cycles, by which time the reaction reaches a
plateau phase as reagents become depleted and inhibitors accumulate.
Microgram quantities may be obtained ofthe desired DNA product, but it is time
consuming and expensive to scale the reaction up to achieve much larger DNA
quantities. In addition. the PCR product may not be in a suitable form that will
permit some subsequent studies. Thus, when particularly large quantities ofDNA
are required, it is often convenient to clone the PCR product in a cell-based clon
ing system. Plasmid cloning systems are used to propagate PCR-cloned DNA in
bacterial cells. Once cloned, the insert can be cut out with suitable restriction
endonucleases and transferred into other specialized plasmids that, for exanlple,
permit expression to give an RNA or protein product.
Several thermostable polymerases routinely used for PCR have a terminal
deoxynucleotidyl transferase activity that selectively modifies PCR-generated
fragments by adding a single nucleotide, generally deoxyadenosine, to the 3' ends
of amplified DNA fragments. The resulting 3' dA overhangs can sometimes make
it difficultto clone PCR products in cells, and various methods are used to increase
cloning efficiency. Specialized vectors such as pGEM-T-easy can be treated so as
to have complementary 3' T overhangs in their cloning site that will encourage
base pairing with the 3' dA overhangs of the PCR product to be cloned (TA clon
ing). Alternatively, the overhanging nucleotides on the PCR products are removed
with polishing enzymes such as T4 polymerase or Pfu polymerase. PCR primers
can also be modified by designing a roughly 10-nucleotide extension containing
186 Chapter 6: Amplifying DNA: Cell-based DNA Cloning and PCR
Allele-specific peR. Designed to amplify a DNA sequence while linker-primed PCR. A form o f indiscriminate amplification that
excluding the possibillty of amplifying other alleles. Based on the involves attac hing oligon ucleotide linkers to bo th end s of all DNA
requirement for precise base matc hing between the 3' end of a peR fragments in a starting DNA and amplifying all fragments by using a
primer and the target DNA. See Figure 6.17. linker-specific p rimer (see Figure 6.18).
Anchored PCR. Uses a sequence-specific primer and a universal Nested primer PCR. A way of increasing the specificity of a PCR
primer for amplifying sequences adjacent to a known seq uence, The reaction . The products of an initial amplification reaction are diluted
universal primer recognizes and binds to a common sequ ence that is and used as the starting DNA source for a seco nd reaction in w hich a
artifi cia lly attached to all of the different DNA mol ecules. different set of primers is used, corresponding to sequences located
oop-peR (degenerate oligonucleotide-primed peR). Uses partly close, but internal, to those used in the first reaction.
degenerate oligonucleotide primers (sets of oligonucleotide sequences RACE-PCR. A form of anchor-primed PCR (see above) for rapid
that have been synthesized in parallel to have the same ba se at certain amplification of cDNA ends. This w ill be described in a later chapter.
nucleotide positions, w h ile differing at other positions) to amplify a Real-time PCR (also called quantitative PCR or qPCR). Uses
variety of related target DNAs. a fluorescence-detecting thermocycler machine to amplify
Hot-start peR. A way of increasing the specifici ty of a PCA reaction. specific nucleiC acid sequences and simu ltaneously measu re their
Mixing all PCR reagents before an initial heat denaturation step allows concentrations. There are two major research application s: (1) to
more opportunity for nonspecific bind ing of primer sequences. qua nti tate g ene expression (and to confirm differential expression of
To reduce this possibility, one or more com ponents of the peA are genes detected by microarray hybridization analyses) and (2) to screen
physically separated until the first denaturation step. for mutations and si ngle n ucleotide polymorph isms. In analytical labs
Inverse PCR. A way of accessing DNA that is immediately adjacent to a it is also used to measure the abu ndance of DNA or RNA sequences in
known sequence. In this case, the starting DNA population is digested clinical and ind ustrial sampl es.
with a restriction endonuclease, diluted to low DNA concentration, RT-PCR (reverse transcriptase PCR). peR in which t he starting
and the n treated with DNA ligase to encourage the formation of population is total RNA or pu rified poly(A)+ mRNA and an initial
circular DNA molecules by intramolecular ligation. The peR primers reverse transcriptase step to produce cDNA is requi red.
are positioned so as to b ind to a known DNA sequence and then Touch-down peR. A way of increasing the specificity of a PCR
initiate new DNA synthesis in a d irect ion lead ing away from the known rea ction. Most th ermal cycl ers can be programmed to perform runs
sequence and toward the unknown adjacent sequence, lead ing to in wh ich the annealing temperat ure is lowe red incrementally during
amplification of the unknown sequence. See Figure 1 (X and Ya re the PCR cycli ng from an in itial val ue above t he expected Tm to a value
uncharacterized sequences flanking a known sequ ence). below the Trn. 8y keeping the stringency of hybridization initia lly very
- -
hig h, the formation of spurious products is disco uraged, allowing the
x y
expected sequence to predominate.
R R
Whole genome peR. Indiscriminate peR. Formerly, this was
i performed by using comprehensively degenerate primers (see
known DNA degenerate PCR) or by attach ing oligonucleotide linkers to a complex
m digest with restriction endonuclease at R
1
DNA popu lation and then using linker-specific oligonucleotide primers
(Ii) circularize using DNA ligase to amplify all seq~ences (linker-primed peR). However, these methods
(Iii) denature, anneal primers-?PCR
do not amplify all sequences because seco nda ry DNA structures pose
a problem for the standard polymerases used in peR, ca using enzyme
slippage or dissocia tion of the enzyme from the template, resulting in
nonspecific amplification artifacts and incomplete coverage of loci. To
offset these difficul ties. a non-PCR isothermal amplification procedure
is now commonly u sed, known as multiple displacement amplification
(MDA). ln the MDA method a strand-displacing DNA polymerase from
phage phi29 is used in a rolling-circle form of DNA amplifi cation at a
. constant temperature of about 30°C (see Further Reading) .
.
.~~ .............. ~..
x ! y
a suitable restriction site at its 5' end. The nucleotide extension does not base
pair to target DNA during amplification, but afterward the amplified produci can
be digested with the appropriate restriction enzyme to generate overhanging
ends for cloning into a suitable vector.
Allele-specific PCR
Many applications require the ability to distinguish between two alleles that may
differ by only a single nucleotide. Diagnostic applications include the ability to
distinguish between a disease-causing point mu tation and the normal aUele.
Allele-spe,cific peR takes advantage of the crucial depende nce of correct base
pairing at the extreme 3' end of bound primers. [n the popular ARMS (amplifica
tion refractory mutation system) method, allele -specific primers are designed
with their 3' end nucleotides designed to base-pair with the va riable nucleotide
that distinguishes alleles. Under suitable experimental conditions amplification
will not take place when the 3' end nucleotide is not perfectly base paired, thereby
distinguishing the two alleles (Figure 6.1 7).
Multiple target amplification and whole genome PCR methods
Standard peR is based on the need for very specific primer binding, to allow the
selective amplification of a desired known target sequence. However, peR can
also be deliberately designed to simultaneously amplify multiple different target
DNA sequences. Sometimes there may be a need to amplify multiple members of
a family of sequences that have some common sequence characteristic. Some
highly repetitive elements such as the human Alu sequence occur at such a high
frequency that it is possible to design primers that will bind to multiple Alu
sequences. If two neighboring A1u sequences are in opposite orientations, a sin ·
gle type of A1u-specific primer can bind to both sequences, enabling amplifica ·
tion ofthe sequence between the two A1 u repeats (A1u-peR).
For some purposes, all sequences in a starting DNA population need to be
amplified. This approach can be used to replenish a precious source of DNA that
is present in very limiting quantities (whole genome amplification). As we will see
in a later chapter, it has also been used for indiscriminate amplification to pro
duce templates for DNA sequencing. One way of performing multiple target
amplification is to attach a common double·stranded oligonucleotide linker
covalently to the extremities of all of the DNA sequences in the starting
(A)
allele 1 5' A
L
C "5'
allele 2
"
A
allele 2 specific
primer (ASP2)
Figur. 6.17 Allele-specific peR is
dependent on perfect base-pairing of the
(8) aUele 1 3' end nucleotide of primers. (Al Alleles
1 and 2 differ by a single base (A~C). Two
digest targel DNA with restriction endonuclease. syntheslZe two oligonucleotides so that th Figure 6 ,'8 Using linker oligonucleotides
e.g. Mool (which gives 5' GATe overhangs) have complementary sequences except tt to amplify multiple target DNAs
each has the same palindromic seQuenc
5' (;.'.IC
! 3' 5'
population and then use linker-specific primers to amplify all the DNA sequences
(Figure 6. 18), An alternative to using linker oligonucleotides is to use compre
hensively degenerate oligonucleotides as primers, That is, when oligonucleotides
are synthesized one nucleotide at a time, all four nucleotides can be inserted at
specific positions instead of a single one, to give a parallel set of many different
oligonucleotides, As a result, very large numbers of different oligonucleotides are
synthesized and can bind to numerous different sequences in the genome, ena
bling peR amplification of much ofthe genomic DNA, Note, however, that whole
genome amplification is often now not performed by peR but by an alternative
form of in vitro DNA cloning (see Box 6,2),
peR mutagenesis
peR can be used to engineer various types of predetermined base substitutions,
deletions, and insertions into a target DNA, In 5' add-on mutagenesis a desired
sequence or chemical group is added in much the same way as can be achieved
by ligating an oligonucleotide linker, A mutagenic primer is designed that is com
plementary to the target sequence at its 3' end, and the 5' end contains a desired
novel sequence or a sequence with an anached chemical group, The extra 5'
sequence does not participate in the first annealing step of the peR reaction but
subsequently becomes incorpora ted into the amplified product, thereby gener
ating a recombinant product (Figure 6. 19Aj , The additional 5' sequence may
contain one or more of the following: a su itable restriction site that may facilitate
subsequent cell-based DNA cloning; a modified nucleotide containing a reporter
group or labeled group such as a biotinylated nucleotide or fluorophore; or a
phage promoter to drive gene expression,
In mismatched primer mlllagenesis, the primer is designed to be only partly
complementary to the target site, but in such a way that it will still bind specifi
cally to the target, Inevitably this means that tlle mutation is introduced close to
the extreme end of the peR product, This approach may be exploited to intro
duce an artificial diagnostic restriction enzyme site that permits screening for a
FURTHER READING 189
1
-- at the 5' end to introduce a desired novel
3'
s' 1_ sequence or chemical group (red bars) that
l
2M does not take part in the initial annealing
peR reaction 1 peR ""tion 2
with primers 1 with p rime~ 2 to the larget DNA but will be copied in
and 1M and2M subsequent cyc les. The products of the
second and final PCR cycles are shown
product 1 in this example. (B) Mismatched primer
~ mutagenesis. A specific predetermined
Ji&N product 2 mutation located in a central segment can be
introdu ced by two separate PCR reactions,
known mutation. Mutations can also be introduced at any point within a chosen
sequence by using mismatched primers. Two mutagenic reactions are designed
in which the two separate PCR products have partly overlapping sequences con
taining the mutation. The denatured products are combined to generate a larger
product with the mutation in a more central location (Figure 6.19B).
Real-time PCR (qPCR)
Real-time PCR (or qPCR, an abbreviation of quantitative PCR) is a widely used
method to quantitate DNA or RNA. In the latter case, RNA is copied by using
reverse transcriptase to make complementary DNA strands. Standard PCR reac
tions, in which the amplification products are analyzed at the end of the proce
dure, are not well suited to quantitation. Samples are removed from the reaction
tubes and are usually size -fractionated on agarose gels and detected by binding
ethidium bromide, which fluoresces under ultraviolet radiation. This procedure
is both time-consuming and not very sensitive. In real-time PCR (qPCR), how
ever, quantitation is done automatically, while the reaction is actually occurring,
in specially designed PCR equipment. The reaction products are analyzed at an
early stage in the amplification process when the reaction is still exponential,
allowing more precise quantitation th an at the end of the reaction. There are
many applications, both in absolute quantitation of a nucleic acid and in com
parative quantitation. One of the most important is in tracking gene expression,
and we will describe in detail in Chapter 8 how qPCR works and is used in gene
expression profiling.
FURTHER READING
General cell-based DNA cloning 5ambrookJ & Russell OW (2001) Molecular Cloning. A Laboratory
Manual. Co ld Spring Harbor Laboratory Press.
Primrose 5B, Twyman RM, Old RW & Bertola G (2006) Principles
Vector databa ses: fo r compilation of websites see http://mybio
Of Gene Manipulation A nd Genomics, 7th ed. Blackwell
.wikia.com /wikiN ecror_ databases
Publi shing Professional.
Watson JO, Caudy AA, Myers RM & Witkowski JA (2007)
Roberts RJ (2009) Official REBA5E Homepage. http://rebase
Reco mbinant DNA: Genes And Genomes, 3rd ed . Freeman .
.neb.com/re base/ rebase.html [database of restriction
endonucleases.]
Chapter 7
KEY CONCEPTS
DNA molecules are very large and break easily when isolated from cells, making
them difficultto study. Chapter 6 looked atthe way in which DNA cloning, either
in cells or in vitro, allows individual DNA molecules to be selectively replicated to
very high copy numbers. When amplified in this way, the DNA is effectively puri
fied, and DNA cloning enables individual DNA sequences in genomic DNA to be
studied, and also individual RNA molecules after they have been copied to give a
eDNA. In this chapter, we consider an altogether different approach. Instead of
trying to purify individual nucleic acid sequences, the object is to track them spe
cifically within a complex population that represents a sample of biological or
medical interest.
At the basic research level, nucleic acid hybridization is often used to track
RNA transcripts and so obtain information on how genes are expressed. It is also
used to identify relationships between DNA sequences from different sources.
For example, a starting DNA sequence can be used to identify other closely related
sequences from different organisms or from different individuals within the same
species, or even related sequences from the same genome as the starting
sequence. Nucleic acid hybridization is also often used as a way of identifying
disease alleles and aberra nt transcripts associated with disease.
Anneal. To allow hydrogen bonds to form between two sing le single-stranded and used to search for complementary target
strands. If two single-stranded nucleic acids share sufficient base sequences within an imperfectly or poorly understood popu lation of
complememarity, they will form a double-stranded nucleic acid duplex. nucleic acid sequences by annealing to form heteroduplexes.
An nea l has [he opposite mean ing to that of denature. Hybridization stringency. The degree to which hybridization
Antisense RNA. An RNA sequence that is complementary in sequence conditions tolerate base mismatc hes in heterodup lexes. At high
to a transcribed RNA, enabling base pairing between the two hybridization stringency, only the most perfectly matched sequences
sequences. can base pair, but heteroduplexes with sig nifica nt mismatches can
Base complementarity. The degree to which the sequences of two be stable when the stringency is lowered by reduc ing the annealing
sing le-stranded nucleic acids ca n form a double-stranded duplex by temperature or by increasing the sa lt concentration.
Watson-Crick base pairing (A binds to T or U; C binds to G). In situ hybridization. A hybridization reaction in which a labeled
Denature. To separate the individual strands of a double-stranded probe is hybridized to nucleic acids so that their morphological
DNA duplex by breaking the hydrogen bonds between them. This can location ca n be detected within fixed ce ll s or chromosomes.
be achieved by heating, or by exposing the DNA to alkali or to a hig hly Melting temperature (Tm)' The tempera ture co rresponding to the
polar solvent such as urea or formamide. Denature has the opposite mid-point in the observed transitIon from double-stranded to single
meaning to that of anneal. strand ed forms of nucleic acids.
DNA chip. Any very-h igh-density gridded array (microarray) of DNA Microarray. A solid surface to wh ich molecules of interest can be
clones or oligonucleotides that is used in a hybridization assay. fixed at specific coordinates In a hIgh-density grid format for use In
Feature (of a DNA or oligonucleotide microarray). The larg e number of some assay. An oligonucleotide or DNA microarray has numerous
identical DNA or oligonucleotide molecules at anyone pos ition in the unlabeled DNA or oligonucleotide molecules affixed at precise
microarray. positions on the array, to act as probes in a hybridization assay.
Heteroduplex. A double-stranded nucleic acid formed by base pai ring Each specific position has many thousands of identical copies
between two single-stranded nucleic acids that do not originate from of a particular type of oligonucleotide or DNA molecule, constituting
the same allele. The re may be 100% base matching between the two a feature .
sequences of a heteroduplex, notably when one of the hybridizing Probe. A known nucleic acid or oligonucleotide popu lation that is
sequences is an oligonucleotide probe or when the sequences are used in a hybridization assay to query an often complex nucleic acid
from different alleles of the same gene or are evolutionarily closely population so as to identify related Targer sequences by forming
related. heteroduplexes.
Homoduplex. A doub le-st randed DNA formed when two single Riboprobe. An RNA probe.
stranded sequences originating from the same allele are allowed to Sequence similarity. The degree to which two sequences are identical
re-anneal. A homoduplex has 100% base matching and so is normally in sequence or in base complementarity.
more stable than a heteroduplex. Target. Nucleic acid seque nce that shows sufficient sequence
Hybridization assay. An assay in which a population of well Sim ilarity to a probe that it can base-pair with it in a hybridization
characterized nucleic acids or oligonucleotides (the probe) is made assay to form a stable heteroduplex.
PRINCIPLES OF NUCLEIC ACID HYBRIDIZATION 193
l-~
sequences are both made single-stranded,
then mixed and allowed to anneal.
Sequencesthat had previously bee n base
paired in the test sample and probe will
! denature re-anneal to form homoduplexes (bonom
left and bottom rig ht). In addition, new
heteroduplexes will be formed between
probe and target sequences that have
complementary or partially complem enta ry
sequences (bonom center). The conditions
of hybridization can be adjusted to favor
the formation of heteroduplexes. In this
way, probes selectively bind to and identify
related nucleic acids within a complex
nucleic acid populatio n.
mix probe and sample
annea l
probe-target re-annealed
re-annealed sample nucleic acids heteroduplexes ""obe
nucleotide
assays typically involve binding either the
===l1.li.
:
I
test sample population to the surface of a
solid support (as shown here) or fixing the
probe population to the solid support. The
! wash
probe-target heteroduplexes, and can lead
to identification of the target sequence.
I
11
I
DNA- RNA or RNA-RNA 79.8 + 18.5(1og 10[Na+]a) + 0.58(%GC b) + 11.8 (%GC b)2 8201L'
~ Or for other monovalent cation, but only accurate in the range 0.01-0.4 M. bOnl y acc urate for
%GC from 30% to 75% Gc. cL, length of d uplex in base pairs. dOlig o, oligonucleotide; In. effective
length of primer == 2)( (no. of G + C) + (no. of A + TJ. For each 1% formamide, Tm is decreased by
length of the duplex. There are more hydrogen bonds in longer nucleic acid mol·
ecules and so more energy is required to break them. The increase in duplex sta
bility is not linearly proportional to leng1h, and the effect of changing the leng1h
is particularly n oticeable at shorter leng1h ranges. As will be described in the neX1
section, base mismatching reduces duplex stability.
Base composition and the chemical envi ronment are also important fa ctors
in duplex stability. A high percentage of GC base pairs means greater difficulty in
separating the strands of a duplex because GC base pairs have three hydrogen
bonds a nd AT base pairs have only two. The presence of monovalent cations (e.g.
Na+) stabilizes the hydrogen bonds in double-stranded molecules, whereas
stro ngly polar molecules (such as formamide and urea) disrupt hydrogen bonds
and thus act as chemical denatu rants.
A progressive increase in te mperature also makes hydrogen bonds unstable,
eventually disrupting them . The temperature corresponding to the mid-point in
observed transitions from double-stranded form to single-stranded form is
known as the melting temperature (TOI ) . For mammalian genomes, with a base
composition of about 40% GC, the DNA denatures with a Tm of about 87°C in
buffers whose pH and salt concentration approximate physiological condition s.
The Tm of perfect hybrids formed by DNA, RNA, or oligonucleotide probes can be
determined by using standard formulae Cfable 7. 1).
(A) nucleic acid probe (B) oligonucleotide probe Fig"'. 7.1 Exploiting probe-target
mismatches during nucleic acid
probe hybridization. (A) Longer nucleic acid
"----- - perfect
match
with target
mismatch
of distantly related members of a gene
stable family. (B) Short oligonucleotide probes
unstable at high
hybridization stnngency are less tolerant of sequence mismatches.
By using a high-stringency hybridization, it
is possible to select for perfectly matched
significant
mismatching
duplexes only, and so distinguish between
allelic sequences that differ by just a single
stable at reduced hybridization
long (more than 100 bp). Conversely, if the region of base complementarity is
short, as when oligonucleotides are involved (typically 15-20 nucleotides),
hybridization conditions can be chosen such that a single mismatch renders a
heteroduplex unstable (Figure 7.3 ).
5' pApGpCpTpApCp Gp ApCpGpC p'rp.....p TpT ... .. 3 ' Flgur. 7.-4 DNA labeling by nick
3 ' p l' pCpGpApTpGpCpTpGpCpGpApTpApA. . .. 5' translation. Pa ncreatic DNase I introduces
Single-strand ni cks by cleaving inte rnal
nick
I
activities: a 5' ~ 3' exonuclease (exo) attacks
E. coli DNA polymerase I
(5' ~ 3' exo and 5' ~ 3' pol) the exposed 5' terminus of a nick and
cCTP ~ dAlP, dGTp, dTIP sequentially removes nucleotides in the
001 ----, ..-exo 5'4 3' direction; and a ONA polym erase (pol)
HO , ,P adds new nucleotides to the exposed 3' OH
pApG CpTpApCpGpApCpGpCpTpApTp T.
group, continu ing in the 5' ~3' direction,
pTpCpGp.....pTpGpCpTpGpC pGpApTpApA.
thereby replac ing nucleotides removed
, , "
1
rHq
nick
by the exonuclease and cau sing lateral
displacement of the nick. At least one of
the dinucleotide triphosphates (dNTPs) is
labeled (red circle on CTP in this example),
p.....p GpCp'I'p A?C p Gp.....pCpGpcp T ApTpT ... . and thus label is in corporated at each C in
p TpCpGp ApTpGpC pTpGpCpG p ApTpApA. the new ly synthesized strand.
length of the newly synthesized strand. Typically a DNA or RNA polynlerase uses
a suitable cloned DNA to make DNA or RNA copies that are labeled by using a
labeling mix in which at least one of the four dNTPs or rNTPs is labeled.
Labeling DNA by nick translation
In nick translation. single-strand breaks (nicks) are introduced in the DNA to
leave exposed 3' hydroxyl termini and 5' phosphate termini. The nicking can be
achieved by adding a suitable endonuclease such as pancreatic deoxyribonu
clease I (DNase I). The exp osed nick serves as a starting point for introducing new
nucleotides with E. coli DNA polymerase L a mul tisubunit enzyme that has both
DNA polymerase and 5'-->3' exonuclease activities. At the same time as the DNA
polynlerase activity adds new nucleotides at the 3' hydroxyl side of the nick. the
5'-->3' exonuclease activity removes existing nucleotides from the other side of
the nick. Thus. the nick will be moved progtessively along the DNA in the 5'-->3'
direction (FIgure 7.4). If the reaction is conducted at a relatively low temperarure
(about 15°C). the reaction proceeds no further than one complete renewal of the
existing nucleotide sequence. Although there is no net DNA synthesis at these
temperatures, the synthesis reaction allows the incorporation of labeled nucle
otides in place of a previously existing unlabeled nucleotide.
Random primed DNA labeling
The random primed DNA labeling method uses a mixture of many different
hexanucleotides to bind randomly to complementary hexanucleotide sequences
in a DNA template and p rime new DNA strand synthesis (figure 7.5). Synthesis
of new complementary DNA strands is catalyzed by the Klenow suburrit of
E. coli DNA polymerase I (containing the polymerase activity but not the associ
ated 5'-->3' exonuclease activity).
PeR-based strand synthesis labeling
The standard PCR reaction can be modified to include one or more labeled nucle
otide precursors that become incorporated into the PCR product throughout its
length.
RNA labeling
RNA probes (riboprobes) can be obtained by in vitro transcription of DNA cloned
in a suitable plasmid expression vector. Expression vectors contain a promoter
sequence imm ediately adjacent to the insertion site for foreign DNA. allowing
any inserted DNA sequence to be transcribed. Strong phage promoters are com
monly used. such as those from phages SP6, T3, and T7. and the corresponding
phage polymerase is provided to e nsure specific transcription of the cloned DNA.
The inc/usion oflabeled ribonucleotides enSLUes that the newly synthesized RNA
is labeled (Plgure 7.6).
LABELING OF NUCLEIC ACIDS AND OLIGONUCLEOTIDES 199
1 denatuce DNA
primed labeling. Double-stranded DNA is
denatured and then a mixture of a nearly
random set of hexanucleotides is added,
The hexanucleotides will bind to the
S 3' single-stranded DNA at complementary
sites, serving as primers for synthesis of new
5' labeled DNA strands. The Klenow subunit
3'
-i'
of E. coli DNA polymerase I is used and has
.... miX with random 5'--73' polymerase but not exonuclease
~ hexanucleotides activity.
~ and anneal
- 5'
5' _ 5' _
1
Klenow subunit of E. coli
DNA polymerase I (5'---; 3' pol)
dATP, dGTp, dTTp, cCTP
U ",'1 .
-
provides a two-dimensional representation of the distribution of the radiolabel
clone DNA into MCS
1
Mes Pvull
sP6promot~
pSP64
/
, ori
vector
Flgurtill7.6 RNA probes are usually made
by transcribing cloned DNA inserts with
a phage RNA polymerase. The plasmid
vector pSP64 contains a promoter sequence
for phage SP6 RNA polymerase linked to the
I MWi,hl';UII
multiple cloning site (MCS) together with
an origin of replication Cori) and ampicillin
resistance gene (AmpR). A suitable DNA
fragment is cloned into the MCS and then
the purified recombinant DNA is linearized
-
by cutting with the restriction enzyme Pvull.
SP6 RNA polymerase and a mixture of NTPs,
at least one of which is labeled, are added,
initiating transcription from a specific site
within the SP6 promoter and continuing
through the inserted DNA. Multiple labeled
RNA transcripts of the inserted DNA are
! t! produced. Many similar expression vectors
multiple labeled
RNA transcripts
offer two different phage promoters flanking
the multiple cloning system, such as those
1..ll for phages T3 and T7, and the respective
phage polymerases are used for generating
riboprobes from either of the two DNA
strands.
200 Chapter 7: Nucleic Acid Hybridization: Principles and Applications
Autoradiography records the position of a radioac tively labeled unaltered silver halide crystals are then removed by the fixing process.
compound within a solid sample by producing an image in The dark areas on the photographic film provide a twO-dimensional
a photogra phic emulsion. In molecular genetic applications, the representation of the distribution of the radiolabel in the original
radlolabeled compounds are often DNA molecules or prOteins, and the sample.
solid sample may be fixed chromatin or tissue samples mounted on Direct autoradiography is best suited to detection of weak
a glass slide. Alternatively, DNA or protein samples ca n be embedded to medium-strength !l-emitting radionuclides (such as 3H or 355).
within a dried electrophoresis gel or fixed to the surface of a dried However, high-energy p-partides (such as those from 32p) w ill pass
nylo n membrane or nitrocellulose filter. through the film, wasting most of the energy (Tlbl. 1). For samples
The solid sample is placed in intimate contact with an X-ray emitting high-energy radiation, a modification is needed in which
film, a plastic sheet with a coating of photographic emulsion. The the emitted energy is converted to light by a suitable chemica l
photographic emulsion consists of silver halide crysta ls in suspension (scintillator or fluor). Indirect autoradiography uses sheets of a
in a clear gelatinous phase. Radioactive emissions from the sample solid inorganic scintillator as intensifying screens, which are placed
travel through the emulsion, converting Ag+ ions to Ag atoms. behind the photographiC film .Those emission sthat pass through the
The position of altered silver halide crystals can be revea led by photographic emulsion are absorbed by the screen and converted
development, an amplification process in which the rest ofthe Ag+ to light, which then also redu ces the Ag+and inten sifies the direct
fons in the altered crystals are reduced to give metallic silver. Any autoradiographic image.
---
I TABLE 1 CHARACTERISTICS OF RADIOISOTOPES COMMONLY USED FOR LABELING DNA AND RNA PROBES
Isotope Half-life Decay type Emission energy (MeV) Exposure time Suitability for high-
resolution studies
Fluorescence labeling of nucleic acids was developed in the 19805 and figUA:1 Structure of two
has proved to be extremely valuable in many different applications, common fluorophores_
including chromosome in siru hybridization, tissue in situ hybridization, TRITe and a variety of other
an d automated DNA sequencing. fluorop hores have been derived
A fluorophore is a chemical grou p that absorbs energy when from rhodamine.
exposed to a specific wavelength of light (excitation) and then
re-emits it at a specific but longer wavelength (fable 1 and
~19ure 1). Direct labeling of nucleic acids with fluoropho res is achieved o
by incorporating a modified nucleotide (often 2' deoxyuridine
5' triphosphate) containing an appropriate fluorophore. flUorescein
Indirect labeling systems are also used. In this technique, the
tluorophore is used as a marker and is attached to an affinity molecule
(such as streptavidin or a digoxigenin-speciAc antibody) that binds
specific ally to mod ifi ed nucleotid es contai ning a reporter group (such
as biotin or digoxigeninJ (see Figure 7.7).
AMCA
DAPI
350
358
I 450
461 slig htly longer wavelength, the emission wavelength. The light emitted
by the fluorop hore pa sses back up and straight through the dichroic
Green mirror, through an appropriate barrier filter and into the microscope
eyepiece. A second beam-splining device can also permit the light to
FIlC 492 520 be recorded in a CCO (charge-coupled device) camera.
Fluorescein (see Figure 1) 494 523
Red 4
CV3 550 I 570
~
TRiTe 554 575
barner filter
Rhodamine (see Figure 1) 570 590 excitation filter
objective
Detection of fluorophore-Iabeled nucleic acids --===- i _ labeled sample
00 slide
Fluoropho re-Iabeled nucleic acids ca n be detected with laser scanners
or by fluorescence microscopy. The fluoroph ores are detected by
passing a beam of light from a suitable light source (an argon laser is FlguI"IIl Fluorescence microscopy. The excitation filter allows light
used in automated DNA sequencing, a mercury vapor lamp is used of only an appropriate wavelength (blue in this example) to pass
in fl uo rescence microscopy) through an appropriate color fllter. The through. The transmitted blue light is re flected by the dichroic (beam
fil ter is d esigned to transmit light at the desired excitation wavelength . splitting) mirror onto the labeled sample, which then fluoresces
In fluorescence microscopy systems, this light is reflected onto the and emits light of a longer wavelength (green light in this case).
fluorescently labeled sample on a microscope slide by using a dichroic The em itted green light passes straight through the dichroic mirror
mirror that reflects light of certain wavelengths while allowing light of and then through a second barrier filter, which blocks unwanted
other wavelengths to pass straight through (fig·l,.lre 2). The light then fluores cen t signals, leaving the desired green fluorescence emission
excites the fluorophore to fluoresce; as it does so, it emits light at a to pass through to the eyepiece of the microscope.
There are two widely used indirect label detection systems. The biotin- strep
Ia,~din system depends on the extremely high affinity between two naturally
occurring ligands. Biotin (a vitamin) acts as the reponer, and is specifically bound
by the bacterial protein streptavidin with an affinity constant (also known as the
jjssociation constant) of 10- 14 , one of the strongest known in biology. Biotinylated
202 Chapter 7: Nucleic Acid Hybridization: Principles and Applications
1
be labeled with chemical groups that are not
1abel DNA with
CD nucleotldes carrying
detected directly. Instead, the incorporated
reporter ( ) groups serve as reporter groups that are
bound with high speciticity by an affinity
molecule carrying a detectable marker. The
marker can be detected in various ways. If
it carries a specitic fluorescent dye, it can
be detected by fluorescence microscopy.
A common alternative involves using an
enzyme such as alkaline phosphatase to
binding of affinity convert a substrate to give a colored product
molecule to reporter
that is measured colorimetrically.
1
detection of marker
a
biotin-16-dUTP n
C
C16 spacer HN.... ...... f\,H
1
0 0 0 0 1 \1
II II "C - <;" II II
W..... C 'c""'" CH=CH-CH -NH-C-{CH) -NH-C-{CH) -NH-C-(CH) - 6H S ....... Cti
2 23
23 2~ ....
II
~C, ....... CH biotin
o N
~r~o;; I
I'c-~~
I I a
0" H
o
digoxigenin-l1<lUTP
Cl l spacer
dlgoxigenin
Figur. 7.8 Structure of biotin- and digoxigenin-modified nucleotides. The biotin and digoxigenin reporter groups shown here are linked to the
5' carbon atom of the uridine of dUTP by spacer groups consisting, respectively, of a total of 16 carbon atoms (biotin-16-dUTP) or 11 carbon atoms
(digoxigenin-l1-dUTP). The spacer groups are needed to ensure physical separation of the reporter group from the nucleic acid backbone, so that the
reporter group protrudes suffiCiently far to allow the affinity molecule to bind to it.
HYBRIDIZATION TO IMMOBILIZED TARGET NUCLEIC ACID S 203
dot-blols
pA·ASO
probe
~ & .AS O
• • normal individu als and for heterozygotes,
but negative for sic kle-cell homozygotes
(un filled circle). With the mutant probe
probe (~s-ASO), the results are positive for the
sickle-cell homozygotes and heterozygotes,
genotype pApA pAp'" f}S~ !I
but negative for norma l individu als.
204 Chapter 7: Nucleic Acid Hybridization: Principles and Applications
The phosphodiester backbone of nucleic acids means that they Very large DNA fragments are very poorly separated by
carry numerous negatively charged phosphate groups, so that w hen conventional agarose gel electrophoresis. Instead, specialized
present in an electric field they will migrate in the direction of the equipment is used in which a discontinuous electric field is used.
positive electrode. By migrating through a porous gel, nucleic acid During electrophoresis the direction or polarity of the electric field
molecules can be separated according to size. This happens because is intermit(ently changed. The electric field may be pulsed so that a
the porous gel acts as a sieve, and small nucleic acid molecules pass particular negative-positive orientation is establ ished for a short time
easily through the pores or the gel but larger nucleic acid mol ecu les before being reversed for another short interval and the switching of
are impeded by frictional forces. the field Is performed recurrently. In pulsed-field gel electrophoresis,
Traditionally, q uite small to moderately large nuc leic acid the intermittent switching of the electric field forces the DNA to
molecules have been separated by using agarose slab gels with reorient to be able to migrate in the new electriC field; the time
individual samples loaded into cut-out wells so that different lanes are taken to reorient depends on the length ofthe molecule. As a result,
used by the migrating sa mple ~ (Figure 1). After electrophoresis, the it is possible to separate fragments of DNA by size up to several
gels are stained with chemicals, such as ethidium bromide or SYBR megabases in length.
green, that bind to nucleic acids and that can fluoresce w hen exposed
to ultraviolet radiation.
samples loaded into wells
For fragments between 0.1 kb and 30 kb long, the migration
speed depends on fragment length, but scarcely at all on the base 1 1 1 1 r-
composition. However, conventional agarose gel electrophoresis is
not well suited to sepa rating rather small or very large nucleic acid slab gel
1
small fragments
molecules. The percentage of agarose can be varied to increase
migrate faster
reSOlution-decreas ing the percentage helps to separate larger
fragm ents. and increasing it ma kes it easier to separate smaller
fragments. To get superio r resolution of smaller nucleic acids, however,
polyacrylamide gels are normally used, and they are used in DNA
sequencing to separate fragments that differ in length by just a single
nucleotide. Fi!Jur'@ 1 Gel electrophoresis.
HYBRIDIZATION TO IMMOBILIZED TARGET NUCLEIC ACIDS 205
Figur. 7.10 Southern blQt hybridization. A complex DNA sam ple is digested
with restriction endonucleases. The resulting frag ments are applied to an
agarose gel and separated by size using electrophoresis (see Box 7.4) , The gel
is treated w ith alkali to denature the ONA frag ments and is then p laced against
- -
l
a nitrocellulose o r nylon m embrane. DNA will be transferred from the gel to digesl ON A w ith
rest riction endonucleases
the membrane, which is then soaked in a solutio n contai nin g a radlolabeled
and separate fragments
sing le-stranded DNA probe. After hybridization, the membrane is washed to by size on an agarose gel
rem ove excess probe and the n dried. The membrane is placed aga inst an X-ray
film and the position of the labeled probe will ca use a latent Image on the film
tha t can be revealed by development of the autoradiog raph (see Box 7.2) as a
hybridizatio n band.
in situ.
Chromosome in situ hybridization
,~, ' -,:-'
..... simple procedure for mapping genes and other DNA sequences is to hybridize
a suitable labeled DNA probe against chromosomal DNA that has been dena· --, ,:':'
tured in situ. To do this, an air· dried microscope slide chromosome preparation
is made, typically using metaphase or prometaphase chro mosomes from periph·
eral blood lymphocytes or lymphoblastoid ceil lines. RNA and protein are hYbfidiZe labeled probe 10
re moved fro m the sa mple by treatment with RNase and proteinase K, and the
remaining chromosomal DNA is dena tured by exposure to formam ide. The dena
Hued DNA is then available for in situ hybridization with an added solution con·
taining a labeled nucleic acid probe, overlaid with a coverslip.
.....--.
step. As a result, the Signal obtained after the removal of excess prob e can be cor ..---.. ~
related with the chromosome band pattern to identify a map location for the
.!!!
Ji
~
waSh
'"
.....~ ~ ____
to remove
unhybridlzed probe
/! '" ~
~.tq;&8 ~ and apply X·ray film
~~#:Flf. ,&
= .. .
','
->
-.' .-:>
.;.:> .:::>
l
deVelOPfilm to revea l
autoradiograph of
labeled target DNA
figure 7.1 1 Northern blot hybridization. No rthern blotting uses samples of total RNA o r purified
poty(A)+ mRNA from tissues or ce lls of interest. The RNA is size-fractionated by electropho resis,
~ nsferred to a membrane, and hy bridized with a suitable labeled nu cleic acid probe. In the example
Sflo wn here, the probe was a labeled cDNA from th e FMRI (fragile X mental retardatio n syn drome)
ge'le, Th e results show comparative expression of FMRJ in different tissu es: stron gest expression was
~ n d in the brain and testis (4.4 kb), with decrea sing gene expressio n in the placenta, lung, and kidney,
--espectively, and almost undetectab le expression in live r, skel etal muscle, and panc reas. Although no
4:..4 kb transcript was evident in heart tissue, smaller transcripts (a bout'.4 kb) were evident that could
~,:!'.'e been generated by alternative sp licing o r aberrant transcripti on. [From Hinds Hl, Ash ley (T,
5-t...t cliffe JS et al. (1993) Nar. Genet. 3, 36-43. With permiSSion fro m Macmillan Publi shers ltd.)
206 Chapter 7: Nucleic Acid Hybridization: Principles and Applications
'.
,
, zona limitans
..,:...r, ,_
,. l
alkaline phosphatase-coupled blue color
. tongue palate
assay. Strong expression was seen in very
':'< specific region s of the developing central
nervous system, as indicated, and also in the
secondary palate. (Courtesy of Mitsushiro
Nakatomi & Heiko Peters, Newcastle
University.)
MICROARRAY-BASEO HYBRIOIZATION ASSAYS 207
place membrane on aga~l \.. . .- ----.---~./ .~~ on top of discrete bacleri al colonieS on an
sunace of master plate 2 1: peel 0ff ~ ~-- ~---
agar plate, some cells from the colonies
adhere to the membrane. The m embrane
membrane
10 - __ 3 Invert membrane is peeled off and then uansferred to a new
n
~
.,"'lfY ~-~~
~_ _.'._ and place on top wlture dish to allow the transferred cells
ofnew aga,p'ate to grow into colonies. The resulting replica
. positiOn of _______ ~- (replica plate)
membrane will have the same physica l
• positIVe clone master agar plate _.' _
(
\
00 mast« ptate containing ,;epa,a'ed
Metenal colonies
~~
_ _ _
representation of bacteria ! colonies as the
original cu lture dish (the master agar plate).
"--~
J 11
I pld< colo~ and grow
allow colonies
The replica membrane is treated to denature
1
the attached DNA and fix it to the surface
in liQuid culture 4 to regrow on
lower surtace o f the membrane. The dried membrane
develop 't 9 of membrane is then hybridized with a labeled nucleic
,T\
aut()(sdiograph \
acid p robe. After a wash to remove any
~- -~
~::,--) unbound labeled probe, X-ray film is used
l;~~~: .: ,~;J
.
to create an autoradiograph that reveals the
position of labeled colonies on the replica
k~)" replica membrane
5 I peeloff membrane that can then be used lO identify
place membrane "'-
against X-ray film
8
'\
with bacteria!
colonies on surface
tt!' membrane the equivalent colony on the o riginal master
(~~~~-. - ..-) ;,;------:--:::-._
( .... plate. The positive colonies can be picked
and grown individually in liquid culture.
---------~ 6 '- - - - )
~--- --~
hybridize with ~enature and
labeled probe fiX bacterial DNA.
to membrane
After hybridization, the probe solution is removed, and the filter is washed
extensively, dried, and processed to detect the bound labeled probe. Th e position
of strong signals from bound probe is related back to a master plate contai ning
the original pattern of colonies, to identify colonies containing DNA related to
the probe. These can then be individually picked and grown in liquid culture to
obtain sufficient quantities for extraction and purification of the recombinant
O:-iA.
Once it became possible to create complex DNA libraries, more efficient auto
m ated methods of clone screening were required. Now robotic gridding devices
call be used to pipette from clones arranged in micrOliter dishes into predeter
mined linear coordinates on a membrane. The resulting high-density clone fil
letS can permit rapid and efficient libraty screening, as described in Chapter 8.
hybrld lzation'......~
signal
strong
hybridization
signal
spot on the array is analyzed with digital imaging software that converts the sig
nal into one of a palette of colors according to its intensity.
We will consider practical examples of microarray use in subsequent chap
ters, such as in Chapter 8 where we consider how gene expression is analyzed.
Here we are concerned with the basic principles and technologies. The object of
microarray hybridizatio n is for each of the thousands of different probes to fo rm
heteroduplexes with target sequences in the labeled nucleic acids of the test sam
ple. At anyone position on the array nu mero us identical copies of just one type of
oligonucleotide or DNA will be available for hybridization (FIgure 7.14).
Target sequences will be present at different amoun ts in the sample
under stud y. The amount of specific target DNAbound by each probe will depend
on the frequency of that specific target DNAin the test sample. A probe that binds
a relatively rare target nucleic acid will have less fluorescent label bound to it,
resulting in a weak hybridization signal, whereas positions on the array that are
strongly fluo rescent after washing indicate probes that have bound an abundant
target nucleic acid. Quantitating the amou nt of fluorescent label bound across all
the positions in a microarray gives a huge data set that can reflect the relative
frequencies of numerous target sequences in the sample population.
High-density oligonucleotide mlcroarrays offer enormously
powerful tools for analyzing complex RNA and DNA samples
The construction of first-generation DNA microarrays involved robotic spotting
of DNA clones or synthetic oligonucleotides onto chemically treated glass micro
scope slides. However. modern DNA microarrays often use oligodeoxynucleotide
probes becaLlse they are versatile. Oligonucleotides can be designed to be spe
cific fo r just one target, and the synthesis of oligonucleotides to order means that
high- density microarrays can be produced rapidly and efficiently. Data reliability
can also be more effiCiently assessed in many cases; for example, the expression
MICROARRAY-BASED HYBRIDIZATION ASSAYS 209
of a sin gle gene can be followed simultaneously with many different oligonucle
otide probes, each of which is specific for that gene.
Most modern applications use microarrays produced by alternative ingen
ious teclmologies developed by two commercial companies. Affymetrix's
approach was modeled on silicon chip manufacture, prompting the term DNA
chip (although the term DNA chip can now mean any high-density DNA or oligo
nucleotide microarray). Illumina's technology depends on a random assembly of
arrays of beads in wells.
Affymetrix oligonucleotide microarrays
Affymetrix chip technology involves light -directed in situ syn thesis of oligonucle
otide probes on the microarray, and can produce more than 6 million different
oligonucleotide populations on a!. 7 cm' surface. Each type of oligonucleotide is
assembled by adding mononucleo tides sequentially to a linker molecule that ter
minates with a photo labile protecting group. A photomask (photolithograph) is
used to determine which positions on the microarray will be exposed to an exter
nallight source, resulting in the destruction of photolabile groups. A chemical
coupling reaction is then used to add a particular type of nucleotide to newly
deprotected sites. The process is repeated sequentially with different masks for
each of the other three nucleotides to achieve the addition of a single nucle
otide-A, C, G, or T-to each linker molecule. Then the process is repeated to
extend oligonucleotide synthesis by a further 24 nucleotides or so. Depending on
the arrangement of holes cut in the masks used at each step, specific predeter
mined oligonucleotide sequences can thus be assembled at predetermined loca
tions on the micro array (Figure 7.15).
ilium ina oligonucleotide microarrays
Illumina's BeadArrayT" technology depends on a random assembly of arrays of
beads in wells. The beads are silica microspheres 3 ~m in diameter and form the
array elements. The oligonucleotides are pre-synthesized to be more than 70
nucleotides long (consis ting of a 25-nucleotide address code and a 50-nucleotide
gene-specific sequence). Each oligonucleotide is coupled to a particular batch of
beads, and individual beads carry more than 100,000 identical oligonucleotides.
The same principle underlies a new sequencing technology marketed by the
Illumina company as will be described in Chapter 8.
Beads carrying different oligonucleotides are pooled and then spread
across prefabricated micro arrays with regularly spaced microwells designed to
accept the bead size. The beads are immobilized within the cavities, and the
25-nucleotide addresses are decoded, allowing each bead to be identified. Thus,
the arrangement of probes is random and positions are determined later.
Hybridization oflabeled target nucleic acids to complementary oligonucleotides
ill the bead arrays can then be tracked by scanning the array for the presence of
bound labeL
Microarray hybridization is used mostly in transcript profiling or
assaying DNA variation
..>Jthough the technology for establishing DNA micro arrays was developed only
very recently, numerous important applications have already been developed,
and their impact on future biomedical research and diagnostiC approaches is
e"'Pected to be profound. Most applications involve tracking gene transcripts or
assaying DNA variation.
The first application-and still a very important one-was in simultaneously
assessing the abundance of transcripts of multiple genes. Initially, small subsets
of the total number of genes in complex genomes, such as the human genome,
were assayed, but whole genome gene sets can now be assayed. This application
will be explored furth er in Chapter 8. Transcript profiling has found many appli
cations, no tably in comparing expression profiles of closely related cell types,
ceUs that are otherwise identical but differ in the presence or absence of a patho
genic mutation, and identical cells that are differently exposed to defined envi
ronments, such as the presence of a drug. Other as pects of transcript expression
can also be assayed, such as the use of oligonucleotide probes to assay splice
variants.
210 Chapter 7: Nucleic Acid Hybridization: Principles and Applications
(A)
phOtOlithograph
(maSk)
'"',
quartz wafer
mlcroarray
(8)
STEP l -e c- X
I
nucleotide
J, addition
J,
J,
J,
J,
J,
STEP 25-T ) )
addition
Flgu.r. ".15 Construction of oligonucleotide microarrays by using light-directed oligonucleotide synthesis in situ.
(A) Affy metrix oligonucleotide microarrays are constructed by synthesizing thousa nds of oligonucleo tides at predetermined
positions on a quartz wafer microarra y, using a photoli thograph (mask) to determine the position of each added nucleotide.
(B) Each oligonucleotide is assembled starting from a linker molecule that is fixed to the array and whose free end carries a
protective photolabile group. A mask w ith holes cut into it at specific positiOns is placed above the array and light from an external
source is shone through the holes. At the POS itions exposed to light the photolabile groups are destroyed (deprotection). A spe<ific
mononucleotide coupled to a pho to labile group is added to the array and w ill com bine with deprote«ed stra nds. The process
is repeated three times w ith another three masks w ith holes cut in different positions, introducing in tu rn one o f the other three
nucleotides. The net result is that each linker molec ule will have a sing le nucleotide attached, either A, C, G, orT, in a specific
pre-programmed way. The sequential use of four different masks to allow coupling of the four different nucleotides is repeated for a
further 24 cycles. The end result is an assem bly of oligonucleotides 2S nucleotides long, each with a precisely determined sequence
at a specific coordinate in the microa rray.
markers. As we will describe later, large-scale DNA variation analyses often use
micro arrays that have bacterial artificial chromosome DNA clones as probes that
seek to identify large subchromosomal deletions and duplications.
In addition to micro array hybridization, it should be noted that micro arrays
have many other applications. Applications include the use of protein arrays,
arrays that assay binding between nucleic acids and proteins, and many others.
FURTHER READING
General Affymetrix and IIlumina oligonucleotide microarray
Sam brookJ & Russell DW (2001) Molecular Cloning. A Laboratory technology
Manual. Cold Spring Harbor Laboratory Press. Affymetrix GeneChip·... http://www.affymetrix.co m/tech nologyl
index.affx
Fluorescence labeling
Gunderson KL, Kruglyak 5, Graige MS et al. (2004) Decoding
Invitrog en Molecular Probes-The Ha ndbook: A Guide To rand om ly ordered DNA arrays. Genome Res. 14, 870-877.
Fluorescent Probes And Labeling Techn ologies. http://www lIIumina BeadArrayTM. http://www.illumina .comJpages.ilmn ?10=5
.invitrogen,com/site/us/en/ home/Referen ces/Molecular
-Probes·The-Handbook.htm l Microarray hybridization applications
Falciani F (ed.) (2007) MicroarrayTechnologyThrough
In situ hybridization App lications. Taylor and Franci s.
Wilkin son 0 (1998) In Situ Hybridization: A Practi cal Approach, Hoheisel)O (2006) Microarray technology : beyond transcript
2nd ed.IRL Press. profiling and genotype analysis. Nat. Rev. Genet. 7, 200-21 0.
KEY CONCEPTS
• DNA libraries are collections of cloned DNA fragments that collectively represent the
genome of an organism (genomic DNA libraries) or genome subfractions.
• cDNA libraries are derived from reverse-transcribed RNA and so contain only the expressed
part of the genome.
• DNA sequencing usually relies on synthesizing new DNA strands from single-stranded
templates. The most modern methods sequence huge numbers of DNA samples in parallel
without electrop horesis.
• Genome resequencing entails using an initial genome sequence as a reference for rapid
resequencing of the genomes or genome subfractions of other individuals from the same
species.
• DNA markers are DNA sequences that have a unique subchromosomallocation and can be
conveniently assayed. Marker maps provided important framewo rk maps for the
sequencing of complex genomes that have numerous repetitive DNA elements.
• Polymorphic DNA markers are needed to make genetic maps. The different markers are
genotyped in individuals spanning different generations of a common pedigree.
Both no n-polymorphic and polymorphic DNA markers can be used to make physical maps
of chromosom es, often by genotyping panels of cells that have different fragments of
individual chromosomes.
• A clone contig is a series of cloned genomic DNA fragments, arranged in the same linear
order as the subchromosomal region from which they were derived. Overlaps between each
DNA fragment allow them to be placed in order, providing important framework maps for
genome sequencing.
• Computer-based gene prediction often relies on identifying Significant sequence Similarity
between a test DNA sequence (or derived protein sequence) and a known gene sequence or
protein.
• The terms transcriptomeand proteome describe, respectively, the complete set of RNA
transcripts and the complete set of proteins produced by a cell. Whereas the different
nucleated cells of an orga nism have stable, almost identical genomes, transcriptomes and
pro teo mes are dynamic and each can vary very significantly from cell to cell.
High-resolution expression analyses are designed to obtain detailed gene expression
patterns in cells or tissues but are limited to analyzing one or a few genes at a time.
• Highly parallel expression analyses provide gene expression inform ation for many
thousands of genes at a time in two or more cellular sources.
214 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
The advent of DNA cloning and DNA sequencing technologies in the 1970s ush
ered in a new era in which it became possible to characterize genes in a system
atic way. By sequencing a gene and its transcr;p t(s) it became possible to work
out exon-intron organization and to predict the sequence of any likely protein
products.
Initial progress was slow but quickly picked up as technologies improved. By
the early 1980s, researchers began to make plans for sequencing all the different
DNA molecules in a cell (collectively known as a genome), and by the start of the
1990s, internationally coordinated projects had been launched to determine the
complete sequence of the human genome and that of several model organisms.
The genome sequences paved the way for comprehensive analyses of gene struc
ture and gene expression.
.. ~'
DNA LIBRARIES 215
"n
-J
-- ,
(i) cell lysis
(ii) DNA extraction
(iii) partial digestion
with restriction
endonuclease ~
III t 1.1 ' II
IIIII
I II
i:i 111111
'111'.11 111
;k..
IU II ' I *' 1111 II
III A2
' * "1 II A3
Flgu.r. 8. 1 Making a genomic DNA library.
All nucleated cells of an individual will have
essentially the same genomic DNA content
so that any easily accessible cells (such as
blood cells) can be used as source material.
Because DNA is extracted from numerous
1
of identical DNA sequences, such as AI, A~,
A), and A4 shown here. However, partial
digestion with a restriction endonuclease
1 2 3 _ _ _4, - _
A, will cleave the DNA at only a fraction of
5 6 7 the available restriction sites (short vertical
A, blue bars indicate all restriction sites the
8 9 10 11 enzyme could cut if the enzyme were used
--A,
under normal conditions; vertical red arrows
12 13 14 15 ~ mark the positions of the small number of
restriction sites that are cut). As a result, the
@ et"
pattern will differ between identical copies
of the same chromosomal DNA molecules,
resulting in almost random cleavage.
This will generate a series of restriction
fragments that, if they derive from the same
genomic DNA library
subchromosomal region, may share some
common DNA sequence (e.g. fragment 6
another, The starting material is usually total RNA from a desired tissue, from a from original molecule A2 partly overlaps
cell line, or from an embryo at a specific developmental stage, fragments 2 and 3 from AI, fragments 9
and 10 from A3, and fragments 13 and 14
Until recently, the polypeptide-encoding mRNA fraction was considered to be
from~).
virtually the only RNA class of interest, and because almost all mRNA is poly
adenylated, poly(A)+ mRNA would be selected by specific binding to a comple
mentary oligo(dT) or poly(U) sequence connected to a solid matrix, Reverse
transcriptase would then be used to convert the isolated poly(A)+ mRNA to a
double-stranded cDNA copy (figure 8.2), More recently, it has become clear that
a substantial fraction of functional RNA is not polyadenylated, and so total cel
lular RNA is prepared and random hexanucleotide primers are used with reverse
rranscriptase to make cDNA
An initial rationale for cDNA libraries was to provide a quick route toward
gene clones (human coding DNA sequences typically account for only just over
1% of the DNA in a human cell), Because of differences in gene expression pat cells f rom specific organ,
ti ssue. or developmental
terns, a wide variety of cDNA libraries have been constructed from different tis stage (e.g . fetal brai n cell s)
1
(1 ) cell lysis
To be useful, DNA libraries need to be conveniently screened and (I i) RNA extraction
(iIi) oligo(dT) cellulose
disseminated column chromatography
Like any other cell-based DNA cloning system, DNA library screening requires
~AAAAAA
the initial selection of transformed cells, often using antibiotic resistance con
'VV\JVV'\A,.AMA.<A
ferred by the vector molecule, As described in Section 6,1, recombinants are ofren "\..f\.J"\f\..AAAM·'
the 51 nuclease, which specifically cleaves regions of DNA that are single-stranded. cDNA library
216 Chapter 8: Analyzing the Structure and Expre ssion of Gene s and Genomes
- +1 5
1 12 20
DNA sample analyZed by PCR
1
subsets of the 864 YACs in a positive master
pool, in this case master pool 12. Three
dimensional screening of each of nine plate
pools (96 YA(s each), eight row pools (106
-+ ABCD EF G HI
YA(s each), and 12 column pools (72 YACs
each) identified a positive YAC in plate 12G
-
(top panel), row E (middle panel), column S
- +ABCDEFGH (bottom panel), [Courtesy of Sandie Jones,
Newcastle University and adapted from
Jones MH, Khwaja as, Briggs H et al. (1994)
Genomics 24, 266-27S. With permission from
Academic Press Inc.]
SEQUENCING DNA 217
DW'i ng the amplification stage, different colonies may grow at different rates,
however. For examp le, the growth of particular cells may be retarded because
they contain a recombinant DNA that affects their metabolism in some way_
Amplification may the refore result in a distortion of the original representation
of cell clones in the unamplified library.
continues, but occasionally the dideoxynucleotide will be incorporated in the phosphate group 5' He,
II N "'" C'0
growing chain, ending polymerization and so causing chain termination. Each ~O - CH, O~ I
reaction is therefore a partial reactio n beca use chain termination VJilI occur ran I-{ H C,.
d o my at one of the possible choices for a specific type of base in anyone DNA .l'T" c/ I
H C:T"'7 H
~
strand. 1
H
Beca use the DNA sample is a population of identical molecules, each of the
fo ur base -specific reactio ns will generate a collection of labeled DNA fragments Figure 8.5 Structure of a
of different lengths. Each of the fragments in one reaction will have a common 5' dideoxynucleotide: 2', 3'-dideoxy CTP
end (defined by the sequencing primer). However, the 3' ends are variable because (ddCTP). The hydroxyl group attac hed to
th e insertion of the selected dideoxyn uc!eotide occurs randomly at one of the carbon 3' in normal nucle otides is re placed
many different positions that will accept that specific base (Figure 8.6). by a hydrogen atom (shown by shadi ng).
218 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
, length of
(6) A reaction lttrnl ln8leS wI th (J rP') synthesized (C) A C G
ONA
n
- - --
: ~ 11 ... )
®GTA.
~ l CC OG TA O T AQ 11 .. 13
---
~ G A G1 CC GG TA G TA G r1 16 • II:.
, . 11
--
r - 15
ACGAO"'CCGO TAGTAG (\ .. 19
,[
n
each w ith aU four dNTPs plu s one ddNTP
0<0 that competes with its dNTP counterpart
[J\OTAO
!!)cCOO TAO l AG for insertion into the growing chain. For
!}\CGAG.....CCGOTAGTAG. example, the A reactiOn uses dATp, dCTp,
(jJrACOAO ccce f AG f AG dGTP, dTIP, and also ddATp, which competes
with dATP and causes chain termination
when it is incorporated. Because there are
many identical copies of the template, a
Fragments that differ in size by even a single nucleotide can be size-fraction
range of different fragment s is produced,
ated on a denaturing polyacrylamide gel. a gel that contains a high concentration depending on the point at w hich the
of urea or other denaturing agent so that the migrating DNA remains ddATP has been inserted.The fragments
single-stranded. wi ll have a comm o n S' end (defined by the
sequencing primer) but variable 3' ends,
Automation of dideoxy DNA sequencing increased its efficiency depending on where the ddA (shown as
A enclosed within a box) has been inserted.
Automated DNA sequencing machines that used fluorescence labeling of DNA For example, in the A-specific reaction chain,
were developed in the early 1990s. Four different fluorescent dyes are used in the extension occurs until a ddA nucleotide is
four base-specific reactions. By selecting dyes with different emission wave incorporated. Th is wilt lead to a population
lengths, all four reactions could be loaded into a single sample well on the gel. of DNA rragments of leng ths n + 2, n +
During electrophoresis, the DNA fragments pass by an excitation source such as 5, n + 13. n + 16 nucleotides, etc. (C) Size
a laser, while a monitor detects and records the fluore scence signal as the DNA fractio nation on a po lyacrylamide gel
passes through a fixed point in the gel (Figure 8,1). This allows an output in the enables the sequence to be ascertained,
giving for n + 1 to n + 20 the sequence
form of intensity profiles for each of the differently colored fluorophores while
GATGATGGCCTGAGCAnCG, the reverse
simultaneously storing the informatio n elecuonically.
complement of the sequence shown in (A).
Early automated DNA sequencers used slab polyacrylamide gels. but greater
DNA sequencing capacity became possible with capillary sequencing. In this
technique, DNA samples migrate through long and extremely thin (0. 1 mm in
diameter) glass capillary tubes containing a polyacrylamide gel. Just like the slab
gel machines, capillary machines read the base sequence as DNA moves through
the gel. but a higher degree of automation can be achieved by avoiding the need
to cast large slab gels.
SEQUENCING DNA 219
~IJI i\i VI/~NI I~III I~ laser causes the dyes to fluoresce. Maximum
fluorescence occurs at different wavelengths
for the four dyes; the information is recorded
I electronically and the interpreted sequence
is stored in a computer database.
(e) Example of DNA sequence output,
Iterative pyrosequencing records DNA sequences while DNA showing a succession of dye-specific (and
molecules are being synthesized therefore base-specific) intensity profiles. The
example illustrated shows a cDNA sequence
With minor changes, the dideoxysequencing method underpinned molecular from the human polyhomeotic gene PHO.
genetics for three decades. The method is, however, disadvantaged by relying on
gel electrophoresis to fractionate newly synthesized DNA fragments. Not only
does this make the method a laborious one, but more importantly it makes it dif
ficult to sequence large numbers of DNA fragments at a time-the most advanced
dideoxy DNA sequencing machines could sequence only up to 96 samples at a
time. As a result, sequencing output is limited to -30-60 kb of sequence per 3-4
hour electrophoresis run.
Fundamentally different DNA sequencing technologies that did not require
gel electrophoresis were not developed until the early to mid-2000s. An impor
tant breakthrough was to develop methods that could record the DNA sequence
while a DNA strand was being synthesized by a DNA polymerase from a single
stranded DNA template. That is, the sequencing method was able to monitor the
incorporation of each nucleotide in the growing DNA chain and to identify which
nucleotide was being incorporated at each step.
At the heart of the first such approach was a method known as pyrosequenc
ing, which had been developed initially to assay single nucleotide polymor
phisms. For use in sequencing, a series of successive pyrosequencing reactions
(iterative pyrosequencing) are performed to record the DNA sequence as it is syn
thesized. DNA chains are synthesized from dNTP (deoxynucleoside triphos
phate) precursors and the DNA polymerase reaction naturally causes cleavage
between the a and p phosphates so that a dNMP (containing the a phosphate) is
incorporated into DNA, leaving behind a pyrophosphate (PPi) residue made up
of the p and r phosphates. Pyrosequencing exploits the release of pyrophosphate
each time a nucleotide is successfully incorporated into a growing DNA chain.
Sequential enzyme reactions are used to detect released PPi. First, ATP sulfu
rylase quantitatively converts PPi to ATP in the presence of adenosine 5' phos
phosulfate. Then, the released ATP drives a reaction in which luciferase converts
luciferin to oxyluciferin, a product that generates visible light in amounts that are
proportional to the amount of ATP. Thus, each time a nucleotide is incorporated,
a light signal is detected.
In iterative pyrosequencing the individual dNTPs are provided sequentially
(unlike in dideoxysequencing, in which a mix of dNTPs is used). If the selected
dNTP is the one that can provide the required dNMP to continue chain elonga
tion, PPi is released, and light is produced and recorded by a CCD camera. Any
unused dNTPs and excess ATP are degraded by the enzyme apyrase, which is
included in the reaction mixture. Thus, if the selected dNTP is not the one needed
for the n ext synthesis step, no light is produced and apyrase will degrade the
dNTP (Figure 8.8).
220 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
were developed to sequence single DNA molecules that had not been amplified 11 + dn p ~ degraded
in any way, and the first commercial machine for this purpose became available 21 + dATP ~ degraded
in 2008. 3) + (jCTP ~ degraded
4) + d O I P ~ PP ~ ~ LIGHT
Massively parallel DNA sequencing is transforming molecular genetics. Such 1
methods are being extensively used in rapid genome resequencing for various da MP
purposes, including personal genome sequencing and the identification of
mutations. In addition, they are providing comprehensive analyses of the !
transcriptome, the collection of different RNA molecules in ceUs. Furthermore, ~
as we will see in Chapter 12, they are providing rapid endpoints for various analy TTTn
ses tl,at seek to identify DNA-protein interactions and binding sites for regula . GTCA
tory proteins. 11 + dTTP ~ PP, ~ ~ L1GHT
Massively parallel sequencing of amplified DNA dTMP
The first massively parallel sequencing methods to be developed used PCR
amplification and involved sequ encing-by-synthesis. Massively parallel pyro !
sequencing was developed in the 2000s by the 454 Life Sciences company and
subsequently commercialized by Roche. It involves breaking the sample DNA
.. C A G T
11'1 II
>
into short fragments (300-500 bpi and preparing single-stranded templates. Two GTCA
different types of oligonucleotide adaptor are ligated to the ends of the DNA frag
ments to provide universal priming sequences for amplification. After denatura tBI
tion, those single-stranded fragments that have the different adaptors ligated at
the different ends are selected. The next step is to attach the DNA molecules to
beads (Figure B.9A) and takes advantage of biotin-streptavidin binding. As
described in Section 7.2, the bacterial protein streptavidin has an extraordinarily
PP, 7b\ 7b\'
sulfurylase
ATP
luciferase
tiGHT
. ~ifli
~. .rpi~[" ~
load enzyme
+
beads and
centrifuge
1
.-- r
~ k ... ,.\
" ;~.
- "
Figure 8.9 Sample preparation for massively parallel pyrosequencing. (A) Genomic DN Ais isolated, fragmented, ligated to
oligonucleotide adaptors, and separated into single strand s. (6) Fragments are bound to beads under conditions that favor one
fragment per bead. Thereafter the beads are captured in the droplets of a micra reactor (an emu lsion containing a peR reaction
mixture in oil). Clonal pe R amplification occurs w ith in each droplet, resulting in beads each carrying 10 million copies of a unique
DNA template. The emulsion is broken to release the beads, and the DNA strands are denatured, Single beads car rying single
stranded DNA clones are deposited into individual wells of a fiber-<>ptic slide (PicoTite rPlate''"'j that contains'.6 million wells.
Each well w ill conta in a picoliter sequencing reaction. Smaller beads carrying immObilized enzymes required fo r pyrophosphate
sequencing (Figure 8.8B) are deposited into each well to enable the pyrosequ encing reaction to take place. [Adapted from Margulies
M, Egholm M, Altman WE et at. (2005) Nature 437, 376-380. With permi ss ion from Ma cmillan Publishers Ltd.]
Sequencing-by-synthesis RocheGS >300 >0.45 Gb per run of long read lengths but hig h reagent costs
of peR-amplified DNA FlX sequencer 7 ho urs and limited output
Illum ina Genome -100 18 Gb for a run of 4 days Cu rrently most widely used but high
Analyzer Ilx reagents cost
2x-1ooa 35 Gb for a run of 9 days
Sequencing-by-ligation Applied Blosystems -SO Up to 30 Gb for a run of Short read lengths, long runs and high
o f PCR- amplified DNA SOliD 3 7 days reagent costs
2 x - 50 a Up to SO Gb for a run of
14 days
Single molecule Helicos Biosciences -30 -40 Gb per run of 8 days Low reagent cos ts but sho rt read lengths
sequenCing Heli Scope and comparatively hig h error rates
Pacific - 1000 Not currently available Not expected to be released until late
Biosclences 2010/2011 . Great potential (see text)
3When usin g a mate-pair run that involves ob taining sequences from both ends of the DNA molecules.
222 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
,
(A)
• •
_'1""-"1
SMRT
chip
•• , . •
~
aluminium -
detection
~./eXCltat
~I on\..~iSSion
system
SMRT Chip glass - -
...
(B)
1 2
.
t 3
~ .,~
4
t ...
T v r .....
..'" · C" pulse "A" pulse
Figur. 8.1 1 Principles of SMRT (single-molecule real-time sequencing) technology. (Al Th e SM RT chip and Zfo(IW nanophotonic
visua lization cha mbe rs. The SMRT chip shown on the left is an aluminum film 100 nm thick deposited on a silicon dioxide substrate that
extends just over 40 !-1m x 30).lm in size. The metal film contains thousands of perforations known as zero-mode waveguides (ZMWs)
that can act as tiny visua lizati on chambers, in each of which a DNA polym erase molecule performs DNA seq uencing with a single DNA
molecule template, The chip is mounted next to a fluorescence detection system. By viewing through the glass support it can visually
record real-time DNA sequencing reactions in each of the individual ZMWs (rig ht). (8) Th e phosphollnked dNTP incorporation cycle plus
the corresponding expected time trace of detected fluorescence intensity. SMRT is a sequ encing-by-synthesis method that uses dNTPs
labeled w ith one o f four different f1uorop hores, d epend ing on which base is present. However, beca use the fluorophore is attached to
the y-phosphate of dNTPs (see Figure 8.10), it does no t get incorporated into the growing ONA chain. Never theless, a DNA polymerase
molecule will hold an incoming dNTP steady at its active site for a brief time before the triphosphate group is cleaved and the dNMP is
inserted. During this time, the engaged f1uorophore emits fluorescent light w hose color corresponds to the base identity. The steps shown
are: (1) the p hospholinked nucleotide begins to associate w ith the temp late in the active site of the polymerase, (2) causing an el evation of
the fluorescence output on the corresponding color channel. (3) Phosphodiester bond formation liberates the d ye-linker-pyrophosphate
product, whic h diffuses out of the ZMW, thus ending the fluorescence pulse. (4) The polymerase translo cates to the next position, and (5)
the next nucleotide to be inserted binds the active site beginning the subsequent pulse. (Adapted from Eid J, Fehr A, Gray J et at (2009)
Science 323, 133- 138. Wi th permission from the America n Associa tion for the Ad vancement of Science.]
change in the current represents a direct reading of the DNA sequence. See http://
www.nanoporetech.com/sequences for a video explanation of one such method.
This is a fast moving area, but like SMRT has great potential for offering fast very
high throughput DNA sequencing at low cost.
(A)
target
target
targe t
region C .-Jj ...!t
(8)
.-Jj
-
region A
region B
r, ' fragmentation _a--.O\.."Y/ ' linker ,~~~ f\.
~ '1 fJ-.-"U and polishing
)
ligation
~
)
~~ "
-V
~~~ ~
," V /"
(E) (C)
l denaturing and
hybridization
( ( ( ( (
( .
(
amplification
<'<
target fragment
elution
<</
(
< (
(0)
array
washing
( ( (
- -
- '" - problematic. Large amo unts of repetitive
1111
contlg of large Insert clones
DNA in complex genomes makes it difficult
to locate sequences unambiguously to
J. .J, J. .J, spe<:ilic subc hromosom al regions. (B) For
~ .
first-time seq uencing of complex genomes
'~'-'I'
-1",.' ~G-::~ 'S~"_'~ ~I~~~
I{'!.l~="'=- ,... ~~t, !.\~,\:
random
. ~ "I.J.
~/ tj
it is more efficient to assemble contigs of
fragmentation
;i~?,...t_'_ \~:' ~:~ ~'!.I~: ,( ;;, large insert clones for each ch romosome
and then to fragment individual clones into
J. .J, J.
1.......
1 1. 1 pieces that are sequenced to reconstruct the
ifmilii
sequencing sequence of the parent clone. [Adapted from
and assembly Waterston RH, Lander ES & Su lston JE (2002)
..,. ' . .aII-: Proe. Nat! Acad. Sci. USA 99, 37 12-3716. With
anchoring
permiSSion from the National Academy of
genome assembly Sciences, USA.]
226 Chapter 8: An alyzing the Structure and Expression of Genes and Genomes
BOX S.l FRAMEWORK MAPS BASED DN DNA MARKERS AND CLONE CONTIGS
First time sequencing of complex genomes (which invariably have DNA markers mayor may not be polymorph ic (Table 1), but to
large amounts of repetitive DNA) is assisted by first constructing be useful for mapping purposes a marker needs to have a unique
framework maps for each chromosomal DNA molecule. The chromosomallocation. Different methods have been used to map
framework maps best suited for DNA sequenci ng are based on a human DNA marker to a specifi c su bchromoso malloca liza tion. A
clone contigs, linearly organized series of clo ned overlappi ng DNA common approach is to type pane ls of artificially co nstructed hybrid
fragments that collectively represent chromosomal DNA sequences cells that contain a fu ll set of rodent chromosomes pl us one or more
(Fluu'.l ). specific human chromosomes or multipl e human ch romoso me
Assembling clone contigs de pends on be ing able to identify fragments. The latter are generated by exposing human cells to
overlaps between cloned DNA fragments. The re are various ways controlled radiation, causing chromosome breakage, and then fusing
in which th is can be done but often clones are assayed for the the human ce lls with rodent cells so that variable sets of human
presence of certain DNA markers known to map in the approximate chromosome fragments can in se rt into the rode nt ch romosomes
su bchromosoma l region . A DNA marker is any short DNA sequence (radiation hybrids). Alternative ly, DNA markers have been mapped on
that can be assayed in some way, either by a hybridization assay or, chromosomes by labeling clones known to contain the marker with
more conveniently, by a peR assay. a fluorophore and hyb ridizing them to suitably treated fixed human
.--_------.... BL....
_ _ __ FJgure 1 A schematic clone contig. The
chromosomal DNA sequence from positions A to
~ ~ B is represented by overlapping DNA inserts in a
linear series of genomic DNA clones. Clones w ith
overlapping inserts are generated by the ra ndom
1"2=- a-2......-W- lo.ll....---"Ts fragmentation of DNA when a genomic DNA
34 = - 6~ l l U = - 14 16 - 19 ~_
5----- 13 ----- 21 ---- libra ry is constructed (see Figure 8. 1).
GENOME STRUCTURE ANALYSIS AND GENOME PROJECTS 227
fragmentation (see Figure B.1). Identical copies of the same chromosomal DNA
molecules will be cleaved at different locations on the DNA, and so any unique
short sequence will be represented by a series of overlapping DNA fragments of
different sizes produced by differential DNA cleavage. The different fragments
are then attached to identical vector m olecules and randomly introdu ced into
cells.
Early attempts to assemble a series of human genomic DNA clones with over
lapping inserts wo uld proceed from specific starting clones. A hybridization
probe prepared from the insert DNA of a chosen genomic clone of interest would
be used to screen all the cells of the DNA library to identify those that contained
a strongly hybridizing DNA fragment, which could indicate an overlapping DNA
fragm ent. Newly identified clones could tben be used to prepare new hybridiza
tion probes to identify more distally overlapping clones. This p rocedure, known
as chromosorne walking, was extremely laborious.
More efficient, genome-wide methods of identifyjng clones with overlapping
inserts were subsequently developed using clone fingerprinting. The idea was to
submit all clones in the library to some assay that could help identify clones with
chromosome preparations o n a m icroscope slide (chromoso m e FISH; it is possible to sequence the ends of numerous large insert clones
see Figures 2.16A and 2.17A). from a genomic DNA librar y, thereby generating thousands
DNA marker maps can be bu ilt in different ways. Polymorph ic of markers.
markers can simply be assayed in members of multigenerationaf If large n umbers of STS markers are available, they can be used
pedigrees to identify groups of linked markers that must map to the to build framework marker maps that allow clo nes to be assembled
same ch rom osome. Non-polymo rphic markers can be placed on into clone contig ma ps. An early example of a short clone cOntig fTiap
marker m aps by using peR to assay the markers in panels of radiatio n that was assembled from 5TS m arker mapping is shown in Fi gure 2.
hybrid celi s, or in panels ofYAC clones or other genomic DNA clones The idea is to arrange clones in a linear order according to whether
with large inserts. Common markers are sequence-tagged site they type positively or negatively for STS markers. Yeast artificial
(STS) marker s originating from clones that have been at least partly chromosome clones were ini tially a popular choice for assemb lin g
sequenced, includ ing previously identified and seq uenced DNA clone conti9 maps, but YACs containing large human inserts were
clones, such as gene clones or other DNA clones of interest that have found to be unstab le, being prone to internal rearrangem ents and
already been well cha racterized and mapped . More genera lly, deletion s (see Figure 2).
TABLE 1 COMMON MARKERS USED IN CONSTRUCTING FRAMEWORK DNA MAPS OF COMPLEX GENOMES
Microsat ell ite A sequence contai ning several (usually 10 or more) tandem repeats of a sequence of
1-4 nucleotides. Frequently found in complex ge nomes and often highly polymorphic
Single nucleotide A single nucleotide cha nge that is polymorphic, The polymorphism is usually limited
polymorphism (SNP) (typically, there are just two significantly f requen t alleles), but SNPs are very suited to
automated genotyping such as the use of pyros~quenci n g (see Figure 8.8)
No n-polymorphic Sequence-tagged site (STS) Any DNA sequence that has been mapped to a specific subchromosomallocation and
ca n be assayed by peR amphficatlon
Expressed sequence tag (EST) A subset of sequence-tagged sites that hap pen to be located within DNA sequences
known to be t ranscribed
228 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
overlapping DNA inserts. For example, clone insert DNAs were submitted to
restriction mapping and assays ofrepetitive DNA content to look for characteris
tic patterns of spacing of sites for specific restriction endonucleases or of particu
lar repetitive sequences.
Clone fingerprinting based on restriction mapping and/or repetitive DNA
content is still quite laborious. A yet more efficient method depends on the avail
ability of a high-density DNA marker map. Various DNA markers can be used.
They may be polymorphic (and may be localized on a genetic map) or non
polymorphic. The onl y requirements are that they should have a unique subchro
mosomallocalization and be able to be assayed in some way (see Box8.1).
The first human genetic maps were of low resolution and were
constructed with mostly anonymous DNA markers
In genetic mapping different polymorphic markers are assayed in all members of
a variety of multi generation families. The resulting genotypes are then analyzed
by computer to work out how alleles of the different markers segregate at each
meiotic event connecting parents to their offspring, and to identify markers
where specific alleles co-segregate. The discovery of appare ntly random DNA
polymorph isms in the human genome prompted the idea of constructing a com
prehensive, nonclassical human genetic map, one that was based predominantly
on anonymous DNA markers rather than on genes.
The first genetic linkage map of the human genome was published in 1987
and was based on restriction fragment length polymorphisms (RFLPs). An
GENOME STRUCTURE ANALYSIS AND GENOME PROJECTS 229
~-
figure 8.14 Restriction fragment length
allele 1 b a D
b polymorphism (RFlP). An RFLP is any DNA
polymorphism that causes a change in
the recognition sequence for a restriction
~
endonuclease. Often it results from a
allele 2 b a
~ b
single nucleotide change such as the T----7C
transition shown here, which causes allele
1 to have an additional Mbol recognition
sequence (GATC) that is lacking in allele 2.
• • probe
I
• The hybridization probe shown would detect
sequences a and b, and so if genomic DNA
hybridization .- b
a
a+b
is digested with Mbol before hybridization,
the probe can distinguish allele 1 (cutting
of probe to produces two fragments a and b) and
digested DNA
allele 2 (a and b are on the same fragment).
allele 1 allele 2
Alternatively, but not shown here, the alleles
can be conveniently typed by PCR by using
RFLP is any DNA polymorphism, often a single nucleotide change, that results in an upstream primer derived from the a
sequence and a downstream primer from
the creation or destruction of one recognition sequence for a specific restriction
the b sequence. Amplification products are
endonuclease (Figure 8.14). It can be assayed by digesting DNA with the restric· digested with Mbol and size-fractionated to
tion endonuclease and using a hybridization probe spanning the variable distinguish between the alleles.
sequence (or immediately adjacent to it), or more conveniently by using primers
flanking the variable sequence to amplify the required sequence and then digest·
ing with the restriction endonuclease.
Although an outstanding achievement, the first human genetic map was of
limited use. There were too few markers-only 393 RFLPs were used. In addition,
the markers were not very polymorphic because RFLPs have only two alleles
either the restriction site is present or it is absent. To create a useful genetic map,
a higher· density map was needed with markers that were more polymorphic.
Second-generation human genetic maps were based on a different class of
DNA marker. Microsatellite DNA is the name given to short tandemly repeated
DNA sequences, typically with a repeat unit one to four nucleotides long. They
are common in the human genome. Individual micro satellites quite often have a
large number of repeats, making them unstable so that the number of repeat
units varies.
Microsatellite instability arises because during DNA replication the DNA
polymerase often makes mistakes in copying long tandem repeats. Sometimes
the polymerase skips past an individual repeat unit and fails to copy it; it can also
go back by mistake to copy the same individual repeat unit twice. As a result,
microsatellites are often highly polymorphic and the multiple alleles can be dis·
tinguished by using a peR· based assay (Figure 8.15). The first human linkage
map based on micro satellite markers was published in 1992 using 814 markers
(corresponding to roughly one marker per 3.5 Mb). By 1994 an integrated genetic
map (mostly based on microsatellites but containing some other markers) had a
marker density of close to one per megabase.
r---40bp~
,
allele 1 == {CAh6
~
!
'~
C'CA~CAC'CACAChC'C'ChCA'AChtACA
GT GT GTGT GTOij'(jTGTG"lm'GT r... m 'tiTGTGT
,1.
L-
40bP J
peR product == 80 + 32 = 112 bp Figur.S.1S Designing a peR assay for
a microsatellite DNA polymorphism.
A microsatellite DNA marker is a type of
, - 40 DNA polymorphism in which variation
!
in the number of short tandem repeats
i..E4
bP j [in this case, a (CA)I(TG) dinucleotide
repeat) produces length polymorphism.
peR product == 108 bp In this example the 5' ends of upstream
primer Pl and downstream primer P2 are
imagined to be each located 40 bp from
r--40 bp ~, the microsatellite array, and so the length
I.£4 40bp~
I of amplified fragment will be 80 + the array
length (32, 28, or 22 bp in this example).
The different alleles can be separated by
peR product = 102 bp polyacrylamide gel electrophoresis.
230 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
The 1994 map was of sufficiently high resolution to achieve the HGP's goals
for genetic mapping, and from then onward the major fo cus was on developing
and refining physical maps leading to the ultimate physical map, the complete
DNA sequence of each chromosome. However, genetic mapping of the human
genome has continued , in tvv'o maj or ways. First, microsatellite maps of even
higher resolution have been developed. Second, high-density maps have been
developed for single nucleotide polymorphisms (SNPs) by the International
HapMap (haplotype mapping) Consortium. The major motivation has been to
aid the identification of common disease genes. SNPs are not particularly poly
morphic (they mostly have two alleles) but they are well suited to automated typ
ing, by methods such as pyrosequencing (see Figure 8.8). The most recent such
SNP map has over 3 million SNP markers, or roughly one SNP per kilobase.
Chromosome somatic cell hybrid panels with human chromosome fragments distance between adjacent chromosomal breakpoints
breakpoint maps derived from natural, translocation. or deletion chromosomes on a chromosome is usually several Mb
Restriction maps created with restric tion endonucleases that cut only rarely several hundred kb
Clone contig maps overl apping YAC clones average YAC Insert has several hundred kb
STS maps requi res previous seq uence information from clones that have less than 1 kb is poss ible, but standard STS maps have
been mapped to subchromosornal regions resol utionsin tensof kb
EST maps req uires cDNA sequencing, then mapping of cDNAs back to other ave rage resolution in the human nuclear genome is
physical maps -90 kb
DNA seq uen ce maps complete nucleotide seq uence of DNA of each chromosome , bp
markers with an average spacing ofjust less than 200 kb. In this landmark achieve
ment, the chromosomal locations ofthe non-polymorphic STS markers had been
obtained by two approaches. One approach used STS content mapping. YAC
clones from the CEPH YAC library were assayed for large numbers of STS markers
to identify clones sharing certain markers, and to assign them to YAC contigs with
a known subchromosomallocation. The second approach used hybrid cell map
ping. Different human STS markers were assayed in panels of human-rodent
hybrid cells. The hybrid cells each contained a full complement of rodent chro
mosomes but differed in that they contained different whole human chromo
somes or different sets of fragments from human chromosomes that had inte
grated into the rodent chromosomes.
Although the YAC mapping achievements were impressive, there was a prob
lem. YACs containing large human DNA inserts (which will contain many repeti
tive DNA sequences) proved to be prone to rearrangements and deletions (see
Figure 2 in Box 8.1). Because the YAC inserts were often not faithful representa
tions of the original starting human DNA, YAC clones could not be the template
for the final genome sequencing effort.
Alternative large insert cloning systems were developed. Bacterial artificial
chromosome (BAC) and phage PI artificial chromosome (PAC) libraries have
smaller inserts (80-250 kb) than YACs but, crucially, human inserts are compara
tively stable in BACs and PACs. To provide even more STS markers, clones in the
BAC libraries were routinely subjected to end-sequencing in which a few hun
dred nucleotides were sequenced at either end of the insert. Eventually, large
BAC/PAC clone contigs were established for each human chromosome, p3\~ng
the way for the final genome sequencing phase (see overview in Figure 8.16).
~ ~
L
f----»
phYSical
map
J
genetic
m,p
mouse
on mapping and identifying human disease
genes. The data produced were channeled
I I map into mapping and sequence databases,
:j. ,J; permitting rapid electronic access and data
~
human genome map analysis. Ancillary projects (not shown here)
included studying genetic variation, genome
L - . seqUenCing . - J , L gene
identification
projects for model organisms, and research
on ethical, legal, and social implications.
EST, expressed sequence tag; CEPH, Centre
d'ttudes du Polymorphisme Humaine; lods,
physical methods o inputs
lod (logarithm of the odds) scores; STS,
sequence-tagged site; FISH, fluorescence in
genetic methods black cloud
situ hybridization.
232 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
BOX 8.2 MAJOR MILESTONES IN MAPPING AND SEQUENCING THE HUMAN GENOME
1956 The first physical map of the human genome is determined. Using light microscopy o f stained tissue, JH Tjio and A Levan (Hereditas 42,
1-6) reveal that ou r cells normally contain 46 chromosomes and that t here are 24 different types of human chromosome.
1977 Fred Sanger and colleagues publish the did eoxy DNA sequencing method (Proc. Nat! Acod. Sci. USA 74,5463- 5467). With some further
re fi nement s (fluorescence labeling and automation) it will be the method used to sequence the human genome and ma ny other
genomes.
1980 David Botstein et al. (Am.). Hum. Genet. 32, 314-331) propose that a human genetic map can be constructed by using a set of random DNA
markers such as restriction fragment leng t h polymorph isms (RFLPs).
1981 Sanger and colleagues pu blish the complete sequence of h uman mitochondrial DNA (Anderson S et al. Nature 290, 457-465).
1984 A workshop is held in Alta, Utah, to evaluate methods of mutation detection and characterization and to project future technologies.
A principal conclusion is that an enormously large, complex, and expensive sequencing program is required to perm it high-efficiency
mutation detection.
1987 The US Department of Energy publishes a repo rt on a Human Genome Initiative, which is the fi rst of its kind.
1987 Helen Don is-Keller and colleagues report the first genetic linkage map ofthe human genome (Cell 51, 319-337). The map is based o n
1995 The first detailed phYSical map of the human genome is published, based on seq uence-tagged sites (Hudso n TJ et al. Science 270,
1945 - 1954).
1996 A high-density hu man BAC library is published (Kim UJ et al. Genomics 34, 213-218).
1998 GeneMap '98, the first reasonably comprehensive map of gene-based markers, is published (Delou kas P et al. Science 282, 744- 746).
1999 The first essentially complete DNA sequence for a human chromosome is reported, for chromosome 22 (Dunham I et al. Nature 402,
489-495).
2001 Draft sequences o f the human n uclear genome, comprising rough ly 90% of the total euchromatic component, are published by the
International Human Genome Sequencing Consortium (IHGSC) (Nature 409,860-921) and by Celera (Science 291, 1304-1351).
2001/2 Publication of a d raf t sequence of the mouse nuclear genome (Waterson B et al. Nature 420, 520-562). Human- mouse comparisons he lp
human gene identification/characterization.
2003/4 The essentially completed sequence (about 99%) o f the euchromatic componel1t of the human genome is reported, comprising 2.85 Gb of
DNA, or roughly gook> of the tota l of 3.2 Gb (a nalyses published in 2004 by the IHGSC, Nature 431, 93 1-945).
2004 Over 21 ,000 human genes are validated by full-length cDNA clones (Imanishi T et al. PLoS BioI. 2, e1 62).
2005-7 The International HapMap Consortiu m reports increasingly detailed single nucleotide polymorphism (SNP) maps for the hu man genome
in 2005 (Nature 437, 1299-1320) and 2007 (Nature 449,851 - 861). The latter map has more tha n 3.1 million SNPs, roughly one SNP per
kilobase.
2007-8 The age of personal genome sequencing beg ins with delivery of euchromatic gel10me sequences for James Watson and Craig Venter and
the launch of the 1000 Genomes Project (http://www.1000genomes.org/page.php).
For the publicly funded HGp, most of the sequence was contributed by large
genome centers, notably the Wellcome Trust Sanger Institute, Washington
University, Baylor College of Medicine, MIT/Harvard, the DoE Joint Genome
Institute, the Japanese RlKEN Institute, and the French Genoscope center. To
ensure efficiency, it was agreed that specific centers would take primary respon
sibility for assembling clone contigs and subsequent sequencing of individual
chromosomes: for example, the Sanger Institute for chromosome 1, and
Washington University for chromosome 2.
The publicly funded HGP met with aggressive competition from privately
funded genome sequencing. In 1999 the Celera company announced that it
intended to produce a draft human genome sequence in 2 years and that it would
do this by whole genome shotgun sequencing (see Figure 8.13A) rather than the
slower large conlig assembly approach of the HGR
The ensuing race between the publicly funded International Human Genome
Sequencing Consortium (lHGSC) and the privately funded Celera accelerated
the HGP timetable. In 2001, both sides published a draft sequence of the human
genome that covered about 90% of the euchromatic genome sequence. The
euchromatic component is about 90% of the total genome, and so the draft
sequences actually represented about 80% of the total genome but 90% of the
total gene sequence.
Although the race was perceived to have ended in a draw in 2001, it had not
been a fair race, and the finishing line had not been reached. The IH GSC had
made their data freely available, posting sequence data updates on the Web every
24 hours. Celera made use of huge blocks of the IHGSC's sequence data, re
processed it, and fed the data back into its own sequence compilation . The Celera
sequence was therefore notan independently obtained human genome sequence.
Unlike the IHGSC, Celera denied external access to their sequence data. Long
after they had published their analyses, Celera required expensive subscription
charges to view their sequence data .
To complete the sequence olthe human genome, the IHGSC continued alone
with the hard work of filling in the gaps in the sequence. By 2003, the lHGSC had
produced an essentially complete sequence of the euchromatic component of
the human genome that was available on the Web and followed up by publishing
analyses of the finalized sequence in 2004.
The2.85 Gb of sequence reported in 2003 was extremely accurate and arranged
in contigs of close to 9 Mb on average. The sequence was interrupted by 341 gaps
234 Chapter B: Analyzing the Structure and Express ion of Genes and Genomes
that remained. The gaps are of two types. Structural gaps result from problems
with ambiguous map assignment due to repetitive sequences (often duplications
with near-identical sequences), but can be solved by standard (clone-based)
DNA sequencing methods. No n- structural gaps are ones where the missing
sequence can not be cloned in bacterial cells. In such cases, the sequence alter
nates between pyrimidine and purine nucleotides, making the DNA twist in a
reverse orientation to form a left-handed helix that is toxic to bacteria. They can
be closed only by uSing sequencing-by-synthesis methods. The number of gaps
in the euchromatic genome has been progressively reduced to less than 200 in
the most current sequence (NCB! build 37; released 2009).
Because a variety of DNA libraries had been used to generate the sequence,
the final sequence was a composite one, representing various donor cells of dif
ferent genotypes. This sequence acted as a reference sequence that hugely facili
tated subsequent sequencing of the euchromatic genomes of individual people.
An early demonstration was a 2007 report of the sequencing of Jim Watson's
genome in a few months using massively parallel pyrosequencing. The age of
personal genome sequencing began to take off with the international 1000
Genomes Project that was launched in early 2008. The aim is to obtain a detailed
catalog of human genetic variation by sequen cing the genomes of at least 1000
individuals representing a variety of different ethnic groups (see Box 8.2).
Once the initial draft genome sequence had been obtained, another major
international effort focused on functionally important sequences with a priority
of identifying and characterizing all human genes and regulatory elements. By
2004 more than 21,000 human genes were reported to have been validated by
determining fuU-length cDNA sequences. However, as described in the sections
below, there are still considerable uncertainties about exactly how many genes
we have.
Genome projects have also been conducted for a variety of model
organisms
From the outset, the goals of the HGP included sequencing of five model organ
isms, partly as technology test beds. The genome sequences for four of the five
model organisms prioritized by the HGP-E. coli, S. cerevisiae, C. elegans, and
D. melanogaster-were obtained at an early stage in the project and were helpful
in guiding the mapping and sequencing strategies that would be applied to the
more complex human genome.
Draft mouse genome sequences were obtained in 2001 12002. Celera produced
a mouse genome sequence that was a composite of multiple mouse strains, and
the publicly funded Mouse Genome Sequencing Consortium derived the genome
sequence of the widely used C57BLl6J mouse strain. The latest C57BLl6J
sequence update represents more than 90% of the total mouse genome. As
detailed in Chapter 10, comparison of the human and mouse sequences was to
prove extremely important in identifying genes and in establishing exon-intron
organizations.
At an early stage, additional genome projects were launched for other organ
isms in addition to those that the HGP focu sed on. The first cellular genome was
sequenced in 1995 (from the bacterium Haemophilus influenzae), and since then
a large number of genomes from the three kingdoms oflife have been sequenced.
By late 2009, close to 1100 complete genome sequences had been published, and
an additional 4500 genome projects were ongoing. Various archaeal and bacterial
genomes have been sequenced to understand general and evolutionary aspects
of prokaryo tes. The principal motivation for genome sequencing of many bacte
ria has been to understand their involvement in pathogenesis or applications in
biotechnology. For eukaryotes, varied motivations prompted genome sequenc
ing: they include general research models, models of disease and development,
models for evolutionary and comparative genomic studies, farm animals and
crops, and pathogenic protozoa and nematodes. Relevant genome data can be
accessed at certain websites that provide compilations of genome databases (see
the next section).
Analysis of the sequences has shown that the number of genes in a genome is
correlated with complexity of an organism, but the correlation is a poor one. For
example it was a surprise that C. elegans, a 1 mm roundworm that has only
GENOME STRUCTURE ANALYSIS AND GENOME PROJECTS 23S
TABLI... MAJOULImotacDm'AlASliS1'HATSIRVE~GINlRALIIUCLI01'JDI
~ ... . .. . . . _. . . OIt4'IIOIIDfSfQlJlW!
". ..•.. '
Nucleotide sequence GenBank US Natio nal Center for Biotechnology Information (NCBI) hup:ll www.ncbLnlm.nih.gov
dbEST NCBI (division of GenBank that stores sing le-pass cDNAJESTs) http://www.ncbl.nlm.nih.gov/dbEST/
TRE MBl ESI (translations of coding sequences from the EMBL database that http://www.ebi .ac.uk/lrembl/
have not yet been deposited in SWISS-PROT) htl:p :11ca .expasy.org/sprot
Note that the databases have subsets that are devoted to particular sequence cla sses. For transcribed sequences and genes, the dbEST database
{http://www.ncbLnlm. nih.g ov/dbEST) a nd the UniGene database {http://www.ncbi.nlm.nih.gov/unigene) havebeenwidelyused.
959 cells or 1031 cells depending on its gender, should have lfiore than 20,000
protein-coding genes, about as many as a human. The correlation between gene
content and organism complexity will be considered more fully in Chapter 10.
Powerful genome databases and browsers help to store and
analyze genome data
From the earliest stages of genome and gene analysis, central computer reposi
tories were established for storing mapping data and sequence data produced in
laboratories throughout the world (Table 0.3). After major genome mapping and
sequencing centers developed, a parallel data storage effort began when the indi
vidual genome centers developed dedicated in-house databases to srore map
ping and sequencing data produced in their own laboratories. The data were
made freely available through the Web.
As genome data began to be produced in very large quantities, strenuous
efforts were devoted to developing new genome databases (TableS.4) and design
ing new software that would permit the huge amounts of mapping and sequence
data and associated information to be searched in a systematic and user-friendly
way. As we will see, a major new focus on in silica (computer-based) analyses
made vital contributions to our understanding of the structure of genes and
genomes.
An important advance was the development of genome browsers with graph
ical user interfaces to portray genome information for ind ividual chromosomes
and subchromosomal regions. Users of genome browsers can quickly navigate
the sequence ofa selected human chromosome moving from large scale to nucle
otide scale, identifying genes and associated exons, RNAs, and proteins (Figure
O.IS). As more and more information is obtained for genes and other functional
units, more informative and precise gene annotation will be available infrequent,
periodical updates of the genome browsers and databases.
Different wmputer programs are designed to predict and
annotate genes within genome sequences
Genome sequence data can be interrogated by a suite of computer programs to
seek novel genes by screening for particular characteristics of genes. For exam
ple, genes are transcribed into RNA, and they contain sequences that show strong
evolutionary conservation (because of the need to conserve important gene
functions). Genes are also associated with certain sequence motifs and sequence
characteristics.
236 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
GENOME BROWSERS
GENOME COMPILATIONS
NCBI Human Genome Resources US National Center for Biotechnology http://www.ncbi .nlm.nih.gov/ projectslgenome/ guide/
---
CO ruI~
I""" .",," 1'-"1_ f>.J~' " Q~" 'lH" ndl... 0 ..... ' ..".
..
1;oc.nin. II... ~~ i$p' i"Y'I
""'~
t;;I"qlllM.I!t.1 . nl.B01.2~11J~
_Jj
I OW !!' • .••••
eM W 8 M we .. H I _WM
Chwo.oW"l"l'~ ""'i'
o
~~~
~ 11""'1011
I-
......1
)f'ifO( ", . . . ., , : .
!,~,(I
<:""" II~ " ,, "" "
- ,,_ ....O"'j t')
.- ~~
Lo.i. O...
I ·.....,
.. .....
~ ...... ~<O""
eco-s ...
~
(cystic fibrosis transmembrane regulator)
gene; the overview at the top shows the
position of the CFTR gene (highlighted in
pale green) in relation to neighboring genes
~"
~.~ ..."' ... _ .. .....:>n.1 I · -tI'J\
"'-" . ................
and DNA markers on chromosome 7q31. The
detailed view at the bottom shows a few of
very many features that can be switched on
or off according to the user's needs. Central
to the navigation are click-over tools that
C.nfI"'tinCl1t>odl~
enable numerous Internet connections to
t·'""·~ ...... rl\... .. 'J'·.·.,~""i\~ .."'__ ~ .,~ ., . I'$ ,,~,j. ."' ..~ .-~_''''M<I~ 1 ."'.n~ I h< I ''' ~ . other databases and programs (includ ing
t"' ... ,~<pI. \''''l ',.. .:~ .. T.'"'fi~'''. 'I ... !'. !I . ~ 0/' 1"' 1f~
direct links to other assem bled genome
seq uences), allowing users to follow through
~ . "\h "'''" • on requested information.
GENOME STRUCTURE ANALYSIS AND GENOME PROJECTS 237
The predicted translation products can in turn be used to search for homol
ogy against all known protein sequences (using BLASTP), and against the pre
dicted translation products of all known nucleotide sequences (usingTBLASTN).
Suspected candid ate protein sequences can also be used as queries in homology
searching against databases of protein domains and short molifs, as detailed in
Chapter 10.
different and so they may give djfferent results. The BLAST and nucleotide sequence database, o r an amino acid
FASTA programs are widely available, for example from the US sequence against a protein sequence database
The detection of novel genes can also be aided by computer algorithms that BOX 8 .4 GENE ONTOLOGY AND
recognize certain elements commonly found in genes. Exon prediction programs THE GO CONSORTIUM
such as GENSCAN, for example, test for conserved consensus sequences at splice
junctions (see Figure 1.17) and assign high probability to a predicted exon if there A5 the trickle of genome sequence data
is a large open reading frame that differentiates the sequence from noncod ing began to become a flood, biologists
started to grapple with the difficult
DNA (where on average one of the three terminatio n codon sequences will occur
challenge of integrating sequence
every 60 bp or so). data with the vast and rapidly growing
Integrated gene-finding software packages have been developed that com body of data from gene function
bine programs designed to identify exons and gene-associated motifs with gen analyses. Sea rchi ng across the broad
eral sequence-homology-based database searching programs. One example is scientific Ilterature and databases has
the Genotator program (http://www.fruitfly.org/ - nomi/genotator/). been hampered by wide variations
Another characteristic that helps identify genes in vertebrates is base compo in terminology. What was needed
sition. Vertebrate genes are very often associated with small regions, often abo ut was a standardized system of gene
1 kb long, that are both GC-rich and have a significantly lUgher frequency of the ontology to represen t gene function
dinucleotide CpG than the bulk of the genome, where the CpG dinucleotide is across genomes and species. The Gene
Ontology (GO) Consortium was formed
statistically under-represented. Because these regions can be viewed as islands of
in 1998 and the GO project began to
normal CpG frequ ency in a sea of genomic DNA that is otherwise CpG-deficient, develop a controlled vocabulary to
they are referred to as CpG islands. describe the attributes of genes and
As we will see in Chapter 9, the CpG dinucleotide is a target for DNA methyla gene products in any organism, to
tion that can cause local condensation of chromatin and inhibit gene expression. allow searching across mul ti ple gene
The upstream regions and other important sites that regulate gene ex products and species for common
pressio n need to be able to adopt an open chromatin conformation for gene characteristics.
expression, and so high frequencies of CpG are not tolerated in these regions. To describe gene products, the GO
CpG islands can be identified both experimentally and by computer programs Consortium developed three separate
that scan the sequence for variation in base com position. ontologies-bio logi ca l process, cel lular
component, and molec ular function.
Obtaining accurate estimates for the number of human genes is These ontologies permit the annotation
of molecular characteristics across
surprisingly difficult species and may be broad or more
Analysis of the human genome sequence has led to the identification of many focused. For example, the biological
thousands of previously unstudied genes. As we will see in later chap ters, the process can be as broad as signal
transd uction or more restricted, for
functions of predicted genes are being intensively studied in the post-genome
example Cl~gl u coside transport. Each
sequencing era, and systematic and hierarchical vocabularies are being devel
vocabulary is structured so that any
oped to define gene function (Box 8.4). Surprisingly, however, many years after term may have more than one parent as
the complete euchromatic sequence was obtained and despite intense investiga well as zero, one, or more children. This
tions, the total number of human genes is still not accurately known, and it may makes attempts to describe biology
be some time before a precise figure can be obtained. much richer than would be possible
Before the human ge nome was sequenced, the number of human genes had with a hierarchica l graph. Currently t he
been expected to be in the range 70,000-100,000, or close to three to four times GO vocabulary co nsists of more than
the number of genes in C. elegans. By 2001, experimental evidence was available 17,000 terms that will, in time, al l have
for about 11,000 human genes, but analyses of the draft genome sequence strict definitions for their usage. See
reported in that year raised the predicted number to only about 30,000-40,000 http://www.geneonro!ogy.orgformore
information.
mostly protein-coding genes. The essentially complete sequence of the euchro
matic component of the human genome was obtained by 2003. After analysis of
the new sequence, the number of predicted protein-coding genes had fallen to
somewhat less than 25,000, and current estimates suggest close to 21,500 human
protein-coding genes.
The difficulty in establishing the precise number of genes is not confined to
the human genome but to all moderately complex genomes. As more and more
genomes were sequenced, however, comparative genomics has been extremely
helpful in gene identification. In particular, comparative genomics has helped in
the id entification of protein-coding genes, and the recent estimate of about
21,500 human protein-coding genes is likely to be quite accurate. RNA genes are,
however, much more difficult to identify for two major reasons. First, they lack
any open reading frame, and second, their sequences tend to be much less con
served than protein- coding sequences.
During the course of the human genome project, little attention was devoted
to RNA genes, and the various estimates of gene number were dominated by
expectations of the number of protein-coding genes. Remarkably, the 2001
Science paper in which Celera reported their analyses ofthe draft human genome
sequence does not mention RNA genes at all' Since then, tllere has been a revolu
tion in our understanding of the importance of RNA genes. As detailed in Chapter
9, there are close to 1000 human miRNA genes, and many entirely new classes of
BASIC GENE EXPRESSION ANALYSES 239
TAIL!e.s"IFI';IC~IN~""UMaalN(~MPLIIt~
RNA gene predictio n. Genes encoding un translated RNAs may be d ifficul t to identify in t he
absence of a sizeable o pen reading frame, especially if t hey are sm all. RN A gene sequences are
o ften comparatively poorly conserved d uring evo lutio n. It is often difficult to disti ng u ish between
fun ctional noncod ing RNA and related pseudogene sequences.
Genes that are expressed at low levels and/or <It un usual cellula r locations and stages of
development may not be w el l represented in avaita b le eDNA libraries, and may not be evident if
they have sma ll exan5.
Very large genes have widely d ispersed exons and may be misinterpreted as a clust er of two or
more sma ller genes. This type of overestim ate of gene number arises because m any of th e cDNA
li b raries used to val ida te g enes and their exon- intron o rga niza tion have com parative ly sho rt in sert s.
• Some nonfunctional gene co p ies (pseudogenes) are t ranscri bed and m ay initi ally be m istake n for
true genes.
RNA have recend y been identified, many with a regulatory function. It may take
some time before we have a clear id ea ofthe number of human RNA genes. Table
8.5 lists some of the difficulties in estimating ge ne numbers.
TAlLIL6bIFFlRENTUVlLSOF EXPIIISSIOIllMAl'PlNG
Study Resolution Gene expressio n Exa mples
material throughput'
RNA high low tissue in situ hybrid izatio n (Figu res 7.12
and 8.20)
low low to medium northern blot hyb ri d izat ion (Figure 7.11)
RT-PCRl qPCR
Nonspecific detection
.,.-------
SYBR Green I When (ree In solution, this dye shows relatively little fluorescence, but when it binds to double-stranded DNA, its
dye fluorescence increases more than 1OOO-foid. It is, however, nonspecific In its binding of double-wanded DNA and so can
bind to primer dimers or nonspecific amplifica tion products (which is why it is not used in standard PCR). To control for this, a
melting curve is conducted at the end of the run by increa sing the temperatu re slowly (rom 6Q°C to 95 c C while continuously
monitoring the fluorescence. At a certain temperature, the whole amplified product w ill dissociate fully, resulting in a
decrease in fluorescence as the dye dissociates from the DNA. The temperature of dissoc iation is dependent on the length
and compositio n of the amplicon, allowing different DNA fragments to be distinguished (s hould there be nonspecific
amplilication or significant primer dimers).
Molecular
~ --------------------------------------------------
Molecular Beacon probes are designed to have a stem-loop structure key:
o
Beacon probes that brings a f1uorophore at the 5' end in very close proximity to a fluorophore
quencher at the 3' end, severely limitin g the fluoresc ence emitted
by the fluorophore. However, in the presence of a complementary activated
fJuorophore
target sequence, the probe unfolds and hybridizes to the target,
causing the fluorophore to be displaced from t he quencher (i) quencher
( F l go~~ 1). As a resu lt, the quencher no longer absorbs the
photons emitted by the fluorophore and the probe starts to fluoresce. unbound probe
1
_. (i)
.,..,..-.;
Figure 3 The hybridized TaqMan probe is displaced by Taq
polymerase and degraded by the associated exonuclease,
•
thereby activating the probe's fluorophore.
244 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
Antibodies that can specifically detect a protein of imerest can be raised in different ways.
... _ _ ~ 400 kD
200
A comparatively recent approach has been to use phage display, an expression cloning system dystrophin
(blot) 100
in which recombinant phages are used to display heterologous proteins on bacterial cell
surfaces (see Section 6.3). However, the main method of obta inin g antibodies has been the
traditional route of repeatedly injecting animals (such as rodents, rabbits, or goats) with a suitable myosin
immunogen that represents a protein under investigation. One approach is to design a synthetic (gel) ~~""'-------
peptide, often 20-50 amino acids long, that is then conjugated to a suitable molecule (such as 1 2 C 3 4 5 C 6 7 8
(8)
keyhole limpet hemocyanin) that will help maximize the immunogeniciry. The hope is that the
~
peptide will adopt a conformation resembling that of the native polypepti de seq uence, but 400 kD
success is not guaranteed and several djfferent peptides may need to be designed. 200
dystrophln
An alternative type of immunogen is a fusion protein that contains most or a large portion (blol) 100
of the target protein sequence fused to another protein that confers some advantages, notably in
. . --- ---
assIsting protein purification. The fusion protein is synthesized by cloning a cDNA for the desired
target protein into a plasmid expression vector that also contains a cDNA for the companion myosin
protein and selecting for recombi na nts in which the two cDNAs are in frame (see Figure 6.12). (gel )
If the animal's immune system has responded, specific antibodies should be secreted into the
serum. The antibody-rich serum (antiserum) contains a h eterogeneous mixture of antibod ies, each Flgur.8.l1 Immunoblotting (western
produced by a d ifferent B lymphocyte. The different antibodies recognize different parts (epitopes) blotting).lmmunobloning involves
of the immu nogen and are known as polyclonal antibodies. the detection of polypeptides after size
A homogeneous preparation of antibodies with a defined specifici ty can be prepared,
fractionation in a polyacrylamide gel and
transfer (blotting) to a membrane. Here, the
however, by propagating a clone of cells (originally derived from a single B lymphocyte).
Because B cells have a limited lifespan in culture, it is preferable to establish an immortal cell
blots show dystrophin detection in muscle
line: antibody-producing cells are fused with cells derived from an immortal B-cell tumor. From
samples from individual patie nts with
the resulti ng heterogeneous mixture of hybrid cells, those hybrids that have both the ability to
Duchenne or Becker muscular dystrophy
make a particular antibody and the ability to mUltiply indefinitely in culture are selected. Such
(lanes 1-8) and in normal COntrol muscle
hybridomas are propagated as individual clones, each of which can provide a permanent and
(lanes C). (A) Blot using the Dy4!6D3
stab le sou rce of a sing le type of monoclonal antibody (mAb).
antibody, which is specitic for the dystrophin
rod domain. {B} Blot using the Dy6!C5
antibody, which is speCific far the C-terminal
region of dystrophin. MyOsin was used as
Protein expression in cultured cells is often analyzed by using a loading control to provide compa rative
quantitation of the amount of muscle
different types of fluorescence microscopy
samples in the different lanes. [Courtesy
The subcellular location of a protein of interest can also be tracked by antibodies of the late Loui se Anderson (formerly
in cultured cells with the use of different types of fluorescence microscopy. The Nicholson) from Nicholson LV, Joh nson MA,
protein may be followed directly by using a specific antibody raised against it. Bushby KM et al . (1993) J. Med. Genet. 30,
737-744. With permission from the BMJ
Alternatively, a peptide tag or other protein is coupled to the protein of interest,
Publishing Group.)
and expression of the protein is followed indirectly. In immunofluorescence
microscopy, antibodies track protein expression directly. The antibodies are
labeled with a suitable fluorescent dye, such as fluorescein or rhodamine, and
after antibody has bound to the protein of interest, the expression of the protein
is directly monitored within cells by using fluorescence mictoscopy (see Box 7.3
Figure 2 for the principle of fluorescence microscopy).
In epitope tagging, the relevant protein is tagged by attaching to its N- or
C·terminus an iU1munogenic marker peptide sequence for which a specific anti
body already exists. To achieve this, a cDNA for the protein of interest is cloned
into an expression vector at a site adjacent to, and in frame with, a coding
sequence for the relevant peptide. Recombinant plasm ids are transfected into
the appropriate cells, and the expressed protein is tracked with a fluorescently
labeled antibody specific for the peptide tag. Commonly used epitope tags are
shown in Table 8.7.
TMLIi.7~"EPITOPETAGsFORTRACKl... "ROTEIltt.Qct4J:tz'"
Sequence of tag Peptide origin linked to protein at Monoclonal antibody
More recently, fluorescence microscopy has been used to track protein expres
sion in cells without the use of antibodies. Instead, a naturally fluorescent pro
tein is coupled to the protein of interest to provide a marker of its expression
pattern. One such protein is the green fluorescent protein (GFP), a 238-amino
acid protein originally identified in the jellyfish Aequoria victoria. Similar pro
teins are expressed in many jellyfish and seem to be responsible for the green
light that they emit, being stimulated by energy obtained after the oxidation of
luciferin or another photoprotein.
When the gene encoding GFP was cloned and transfected into target ceils in
culture, expression of GFP in heterologous cells was also marked by emission of
the green fluorescent light. This means that GFP is an auto.f/uorescent protein.
Because it is a natural functional fluorophore, it can serve directly as a reporter
that can be readily foilowed by conventional and confocal fluorescence micros
copy, making it a popular tool for tracking gene expression in cell lines and even
in some whole organisms.
Various genetically engineered mutants of GFP have improved its efficiency
in different ways, and various color mutants have been generated, notably blue
fluorescent protein, cyan fluorescent protein, and yellow fluorescent protein.
Another autofluorescent protein, the red fluorescent protein, has been derived
from the Discosoma coral.
To track the expression of a protein within living cells, the GFP-coding DNA
sequence has been inserted at the beginning or end of the cDNA for the protein
of interest within a suitable expression vector. Transfection and expression of the
resulting hybrid cDNA sequence produces a fusion protein with GFP linked to
the N- or C-terminus of the protein of interest. In many cases, the fusion protein
is expressed in the same way as the original protein, and so helps to reveal its
location (Plgure 8.2.3).
25 nucleotides long are synthesized in situ on an array, and IIIumina micro arrays,
in which pre-synthesized oligonucleotides with a gene-specific component
about 50 nucleotides long are attached to beads. More recently, high-density
arrays of longer oligonucleotides up to 60 nucleotides long have also become
available.
Microarray-based expression analyses are typically organized so as to com
pare two Or more highly related cellular or tissue sources that differ in an inform
arive way. For example, gene expression can be tracked in specific embryonic
structures at a series of developmental time points, or expression can be moni
tored in cultured cells at various times after they have been exposed to a range of
drugs and other chemicals. The same type of tumor tissue can be profiled at dif
ferent stages of malignancy. Expression profiles for diseased and normal pheno
types can also be co mpared, and in mouse models the two cellular sources under
comparison may be genotyp ically identical except for the presence of a defined
pathogenic mutation.
To perform transcript profiling, the cellular RNA sample is normally reverse
transcribed en masse to form a representative complex cDNA population. The
cDNA can be labeled as it is synthesized by the inclusion of a fluorophore
conjugated nucleotide in the reaction mix. Alternatively, a two-step procedure is
used- first , unlabeled cDNA is made, and then it is converted into a labeled com
plementary RNA (cRNA) by the incorporation of biotin (which will later be
detected with fluorophore-conjugated streptavidin) .
The labeled target cDNA or cRNA is then applied to the array and allowed to
hybridize. Each individual feature or spot on the array contains millions to bil
lions of copies of the same DNA sequence and is therefore unlikely to be com
pletely saturated in the hybridization reaction. Under these conditions, the
intensity of the hybridizing signal at each feature on the array is proportional to
the relative abundance of that particular cDNA or cRNA in the target popUlation,
which in turn reflects the abundance of the corresponding mRNA in the original
source population. The relative abundance of thousands of different transcripts
can therefore be monitored in one experiment. Multiple oligonucleotides are
also used to help distinguish between closely related transcripts from individual
genes. It is possible to monitor splice variants, for example, and to design oligo
nucleotides specific for every single known exon.
The huge amo unt of expression data generated by the micro array data requires
careful statistical analyses (Dox 8.7). Stringent controls are also required to nor
malize expression data for cross-experiment variation. One way of avoiding such
problems is to hybridize cD NA populations labeled with different fluorophores
to the same array simultaneously. Under nonsaturating conditions, the signal at
each feature will represent the relative abundance of each transcript in the sam
ple. If two samples are used, the ratio of the signals from each fluorophore pro
vides a direct comparison of expression levels between samples fully normalized
for variations in signal -to-noise ratio even within the array. The array is scanned
at two emission wavele ngths, and a computer is used to combine the images and
render them in false color. Usually, one fluorophore is represented as green and
the other as red. Features representing differentially expressed genes show up as
either green Or red, whereas those representing equivalently expressed genes
show up as yellow- see the example of using spotted cDNA arrays in Agure
8_241\_
Expression analyses with microarrays that have short oligonucleotide probes
are similar in principle to those in which cDNA probes or long (more than 50
nucleotide) Oligonucleotides are used. However, when using oligonucleotides
that are only 25 nucleotides long, the hybridization specificity is not so great.
There is a greater tendency for probes to hybridize to other sequences in addition
to their expected target sequences, and so additional controls are needed.
Accordingly, in Affymetrix GeneChip arrays, each gene is represented by 20 or so
different oligonucleotide probes that are selected from different regions along
the transcribed sequence. In addition to 20 perfect match (PM) oligonucleotides
per gene, a corresponding series of 20 mismatch (MM) oligonucleotides are
designed to control for nonspeCific hybridization by changing a single base in
each ofthe PM sequences (Figure 8.24B). To determine the signal for a particular
gene, the signals of all 20 PM oligonucleotides are added together and the signals
from all 20 MM oligonucleotides are subtracted from the total.
HIGHLY PARALLEL ANALYSES OF GENE EXPRESSION 247
Microarray expression data are revo lutionizing biological and Nonhierarchical clustering
biomedical research, but data interpretation remains a challenge. The A disadvantage of hierarchical clustering is that it is time-consuming
raw expression data from microarray experiments are signal intensities and resource hungry. As an alternative, nonhierarchical methods
that must be corrected for background effects and inter-experimental partition the expression data into a certain predefined number of
variation (normalized) and checked for errors caused by contaminants clusters. As a result, the analysis is speeded up considerably, especially
and extreme outlying values. when the data set is very large. In the k-means clustering merhod,
The data are summarized as a table of norma lized signal several points known as cluster centers are defined at the beginning of
intensities: rows on the table represent ind ividual genes and the the analysis, and each gene is assigned to the most approp riate cJuster
columns represent different conditions under which gene expression center.
has been measured. In the simplest cases, the table has two columns On the basis of the membership of each cl uster, the means are
(e.g. control and disease samples) and these may represent the signal recalculated (the cluster centers are repositioned). The analysis is the n
intensities from two samples hybridized simultaneously to the array. repeated so that all the genes are assigned to the new cluster centers.
However, there is no theoretical limit to the number o f conditions that This process is reiterated until the membership of the various clusters
can be used. no longer changes. Self-organizing maps are similar in concept, but the
Next, genes with similar expression profiles are grouped. algorithm is refined through the use of a neural network.
Generally, the more conditions under which gene expression is
tested, the more rigorous is the analysis. Two types of algorithm are
used to mine the gene expression data, one in which simi lar data are
clustered in a hierarchy and one in wh ich the clusters are defined in a
nonh ierarchical manner. biological
samples
Hierarchical clustering
The general approach in hierarchical clustering is to establish a
distance matrix that lists the differences in expression levels between
each pair of features on the array. Those showing the smallest
differences, expressed as the distance function d, are then clustered
in a progressive manner. Agglomerative dustering methods begin with
the classification of each gene represented on t he array as a singleton
cluster (a cl uster containing one gene). The distance matrix is searched,
and the two genes with the most similar expression levels (the
smallest distance function) are defined as neighbors; these are then
merged into a sing le cluster. The process is repeated until there is only
one cluster left. There are variations in how the expression va lue of the
merged cluster is calculated for the purpose of further comparisons.
In the nearest-neighbor (single linkage) method, the distance is
minimized. That is, where two genes i andj are merged into a single
cluster ij, the distance between ij and the next nearest gene k is
defined as the lower of the two values dU,k) and dU,k). In the average
linkage method, the average between d{i,k) and d(j,k) is used. In the
farthest-neighbor (complete linkage) method, the distance is maximized.
These methods generate dendrograms with different structu res
(Figure 1). Less frequently, a divisive clustering algorithm may be
used in which a single cluster representing all the genes on the array is
progressively split into separate clusters.
Hiera rchica l clustering of microarray data is often represented as a
heatmap- Agu:re 2.
1 2 3 4
- 5 8 6 7
experiment
~5
RNA 1 + fluor
.~
RNA 2+ fluor
~?
RNA 1 + biotin RNA 2 + biotin
spotted cDNA arrays. Here, comparative
expression assays are usually performed by
differentially labeling two mRNA or cDNA
l J
1 1
samples with different fluorophores that
! hybridize
! elevated expression in sample 2 (Y, green),
and similar expression in samples 1 and 2
! hybrid ize ! wash
! (Z, yellow). (8) Using Affymetrix GeneChips.
Here RNA is labeled in a two-step process
1 wash
! stain
J. to produce biotinylated cRNA. After
1
! scan
! hybridization and washing, biotin-cRNA
bound to the array is stained by binding
1 .)
y
Z
are preferentially expressed in sample 1 (X),
preferentially expressed in sample 2 (Y), or
combined t"
image in
"
') 'I
l I showing equivalent expression in samples
software
" XY
0
0
Z
1
combined data In software
1 and 2 (Z). (Adapted from Harrington CA,
Rosenow C & Retief J (2000) Curro Opin.
Microbio/. 3, 285-291. With permission from
Elsevier.]
Modern globillgene expression profiling increasingly uses
sequencing to quantitate transcripts
Microarray hybridization provided an exciting new technology that in the late
1990s was the first to offer the ability to conduct large-scale and genome-wide
transcript profiling. Global gene expression profiling is, however, also possible by
direct sequencing of cDNA copies of the transcripts, so that transcript frequen
cies can be quantitated for individual genes.
cDNAlibraries have been characterized by partial sequencing of clone inserts,
generating numerous EST sequences, or sometimes full-length inserts have been
sequenced. But because of the labor involved in generating libraries and retriev
ing recombinant clones, and the small number of samples that could be
sequenced at a time, standard dideoxy sequencing of library clones did not seem
an attractive way of quantitating transcripts.
As a result of the above difficulties, various sequence sampling methods were
developed to reduce the effort involved in retrieving cDNA sequences. The meth
ods would obtain RNA from cellular sources, convert it into cDNA, and then
retrieve short sequences (sequence tags) from individual cDNAs that would be
representative of transcripts and whose frequencies would be a quantitative
measure of the expression of that transcript.
A popular, and ingenious, sequence sampling method is known as serial
analysis of gene expression (SAGE) .In this technique, sequence tags are collected
from numerous cDNA copies, and ligated in a long series to form concatemers
that are then sequenced. In the alternative method, massively parallel signature
sequencing (MPSS), very large numbers of cDNAs are attached to microbeads in
a flow cell and sequencing of short sequence tags is performed in parallel rather
than in series as in SAGE.
The recent advent of massively parallel DNA sequencing means that both
micro array-based expression profiling and serial sample sequencing (SAGE) will
HIGHLY PARALLEL ANALYSES OF GENE EXPRESSION 249
43
S05
PAGE
30
20
14
The sensitivity of 2D-PAGE is dependent on the detec tion limit fo r very scarce
proteins, but the class of SYPRO®dyes can detect protein spots in the nanogram
range. Sensitivity is also influenced by the resolution of the gel, because spo ts
representing scarce proteins can be obscured by those representing ab undant
proteins. Pre-fractionation can remove abundant proteins and so simplify the
initial sample loaded onto the gel, and the resolution can be improved by using
gels with a very narrow pH range (see Figure 8.25).
Important developments to improve the efficiency of 2D-PAGE include soft
ware for spot recognition and quantitation, algorithms that can compare protein
spots across multiple gels, and robots that can pick interesting spots and process
them for later identification by mass spectromet ry as described below. However,
a remaining major limitation of2D-PAGE is that it is not highly suited to automa
tion, making it difficult to perform high-throughput analyses of many samples.
Alternative separation methods based on multidimensional high-performance
liquid chromatography (HPLC) may eventually displace 2D-PAGE as the major
platform for protein separation.
Mass spectrometry
Mass spectrometry (MS) is used to determin e the accu rate masses of molecules
in a particular sample. In expression proteomics it helps identify the proteins in
selected spots that have been resolved by 2D-PAGE. It does this by accurate deter
mination of molecular mass, a procedure that can be completed more quickly
than direct sequencing of pro teins by Edman degradat ion. MS is also easy to
automate for high-throughput sample analysis, and so it can conveniently proc
ess thousands of spots from a two-dimensional gel or hund reds of HPLC
fra ctions.
Until recently, MS could not be applied to large molecules such as proteins
and nucleic acids because these were broken into random fragments during the
ionization process. This limitation has been overcome using soft-ionization
methods such as matrix- assisted laser desorption / ionization (MALDI) (Figure
0.26 and BoxB.8) and elecrrospray ionization (ES!).
HIGHLY PARALLEL ANALYSES OF GENE EXPRESSION 251
sample paths of ionize d peptides Fig ....... 8.26 Principle of MALOI-TOF mass
matrIX spectrometry. The anafyte (the sample
I
I I
under study-usually a collection of tryptic
peptide fragments) is mixed with a matrix
I low
compound and placed near the source of a
la ser. The laser heats up the analyre-matrix
~ mass·to-charge
peptide crystals, causing the analyte to expand
into the gas phase without significant
. high
fragm entation. Ions then travel down a
,< mass·to-charge
laser flight tube to a reflector, which focuses the
/ peptide
ions ontO a detec{Qr. The time of flight (the
time taken for ions to reach the detector) is
dependent on the mass/charge ratio and
detector allows the mass of each molecule in the
analyte to be recorded.
The masses of proteins, or more usually the peptide fragments derived from
them by digestion with proteases, can be used to identify the proteins by correlat
ing the experimentally determined masses with those predicted from database
sequences. As described below there are three differen t ways to annotate a pro
tein by MS (Flgure 8.27):
Peptide m ass fingerprinting (PMF) is best suited for simple proteomes. A
simple protein mixture (typically a single spot from a two-dim ensional gel) is
digested with trypsin to generate a collection of tryptic peptides. These are
subject to MALDI-MS using a time-of-flight (TOF) analyzer (see Figure 8.26
and Box 8.8), which returns a set of mass spectra. The spectra are used as a
search query against protein sequence databases. The search algorithm per
form s virtual trypsin digests of all the proteins in the database and calculates
the masses of the predicted tryptic peptides. It then attempts to match these
predicted masses against those determined experimentally.
A mass spectrometer has three components. An ionizer converts the sample to be analyzed (the
analyte) into gas-phase ions and accelerates them toward the mass analyzer. The latter separates
the ions according to their masS/cha rge ratio on their way to the ion detecwr, which records the
impact of individual ions, presenting these as a mass spectrum of the ana lyte.
The ion ization of la rge molecules without fragmentation and degradation is know n as soft
(e.g. the tryptic pep tides derived from a particular protein sample) with a light-absorbing
matri x com pou nd in an organic solvent. Evaporation of the solvent produces analyte/ matrix
crystals, which are heated by a short pulse of laser energy. The desorption of laser energy as
heat causes expansion of the matrix and analyte into the gas phase. The analyte is then ionized
and accelerated toward the detector.
In electrospray lonlution (ESI), the analyte is dissolved and the solution is pushed through
a narrow capillary. A potential difference, applied across the aperture, causes the analyte to
emerge as a fine spray of charged particles. The droplets evaporate as the ions enter the mass
analyzer.
Mass analyzers
A quadrupole mass analyzer comprises four metal rods, pairs of which are connected electrically
and carry opposing voltages that can be controlled by the operator. Mass spectra are obtained
by varying the potentia l difference applied across the ion stream, allowing ions of different
mass/charge ratios to be directed toward the d etector. A time-of-flight (TOF) analyzer measures
the time ta ken by ions to travel down a flight tube to the detector, a factor that depends on the
mass/Charge ratio.
Two or more mass analyzers are operated in series. Various MS/MS instruments have been
described, including triple quadrupole and hybrid quadrupole/ t ime-of-flight instruments. The mass
analyzers are separated by a collision cell that contains in ert gas and causes ions to dissociate into
fra gments. The first analyzer selects a particular peptide ion and directs it into the collision cell,
where it is fragmented. A mass spectrum for the fragments is then obtain ed by the second analyzer.
These two function s may be combined in more sophisticated instruments, such as the ion-trap and
252 Chapter 8: Analyzing the Structure and Expression of Genes and Genomes
exact match
Jlm
LGM DGYR
) GI$LANWMCLAK ~ human lysozyme
WE$GYNTR
1 MALDl·TOF
r
analysis peptide mass
trypsin fingerprinl
~J
~
digestion )
..
,':;J'
ESI--MSj MS
$LANW
RATNYN
FQINS
l
human lysozyme
sheep lysozyme
check
collection
of EST hits
rat lactalbumin
analysis WESGYN
~ GI$LAN
GISLANW
GISLANWM
fragment ion GI$LANWMC ) peptide ladder
masses GI$LANWMCL
GISLANWMCLA
GISLANWMCLAK
Figure 8.27 Protein annotation by mass spectrometry. Individual protein samples (such as spots from two ·dimensio nal gels) are
digested with trypsin, which cleaves at the (-terminal side of lysine (K) or arginine (R) residues as long as the next residue is
not proline. The tryptic peptides can be analyzed as intact molecules by MALDI-TOF, and the masses used as sea rch queries against
protein databases. Algorithms are used that take protein sequences, cut them with the same cleavage speCificity as trypsin, and
compare the theoretical masses of these pep tides with the experimental masses obtained by MS. Idea lly, the masses of several
peptides should identify the same parent protein, in this case human lysozyme. There may be no hits if the protein is not in the
database or, more probably, that it has been subject to post-translational modification or artlfactual modification during the
experiment. In these circumstances, electrospray ionization coupled with tandem mass spectrometry (ESI-MS/ MS) can be used
to fragment the ions. The fragment ion masses can be used to sea rch EST databases and obtain partial matChes, which may lead
eventually to the correct annotation. Alternatively, the masses of peptide ladders can be used to determine protei n sequences de novo.
FURTHER READING
DNA sequencing TuckerT, Marra M, Friedman JM (2009) Massively parallel DNA
Clarke J, Wu HC, Jayasinghe Let al. (2009) Continuous base sequencing: the next big thing in genetic medicine. Am.) .
identification for single·molecule nanopore DNA sequencing. Hum. Genet. 85, 142-154.
Nat. Nanotechnol. 4, 265-270.
Eid J, Fehr A, Gray Jet al. (2009) Real-time DNA sequencing fro m
single polymerase molecules. Science 323, 133-138. Human Genome Project and human genetic
Gupta P (20081 Single-molecule DNA sequencin g technologies maps (see also Box 8,2 for additional historical
for future genomics research. Trends Biotechnol. 26, 602-61 , . references)
Hodge s E, Xuan Z, Balija V et al. (2007) Genome-wide in situ
exon capture for selective resequencing. Nat. Genet. 39, All AboutThe Human Genome Project (HGP). http://www
1522-1527. .genome.govl1 0001772 [An educational resource maintained
Mamanova L, Coffey AJ, Scott CE et al. (2010) Target-enrichment by the US National Human Genome Research Institute.)
strategies for next generation sequencing. Nat. Meth. Garber M, Zody MC, Arachchi HM et al. (2009) Closing gaps in the
7,111-118. human genome using sequencing by synthesis. Genome Bioi.
Mardis ER (2008) The impact of next-generation sequencing 10, R60.
technology on genetics. Trends Genet. 24, 133-141. Genome Reference Consortium. http://www.ncbi.nlm.nih
Metzker M (2010) Sequencing technologies-the next .gov/projects/genome/assembly/grcl [Gives updates and
generation. Na t. Rev. Genet 11 , 31-46. graphical overviews of the human and mouse genomes.]
Chapter 9
Organization of the
Human Genome
KEY CONCEPTS
• The human genome is subdivided into a large nuclear genome with more than 26,000
genes, and a very small circular mitochondrial genome with only 37 genes. The nuclear
genome is distributed between 24 linear DNA molecules, one for each of the 24 different
types of human chromosome.
Human genes are usually not discrete entities: their transcripts frequently overlap those
from other genes, sometimes on both strands.
• Duplication of single genes, subchromosomal regions, or whole genomes has given rise to
families ofrelated genes.
• Genes are traditionally viewed as encoding RNA for the eventual synthesis of proteins, but
many thousands of RNA genes make functional noncoding RNAs that can be involved in
diverse functions.
• Noncoding RNAs often regulate the expression of specific target genes by base pairing with
their RNA transcripts.
• Some copies of a funct.ional gene come to acquire mutations that prevent their expression.
These pseudo genes originate either by copying genomic DNA or by copying a processed
RNA transcript into a cDNA sequence that reintegrates into the genome
(retrotransposition) .
• Occasio nally, gene copies that originate by retrotransposition retain their function because
of selection pressure, These are knovm as retrogenes.
• Transposons are sequences that move from one genomic location to another by a cut-and
paste or copy-and-paste mechanism. Retrotransposo ns make a cDNA copy of an RNA
transcript that then integrates into a new genomic location.
• Very large arrays of high-copy-number tandem repeats, known as satellite DNA, are
associated with highly condensed, transcriptionally inactive heterochromatin in human
chromosomes.
256 Chapter 9: Organization of the Human Genome
nuclear genome
The human genome comprises two parts: a complex nuclear genome with mOre
than 26,000 genes, and a very simple mitochondrial genome with only 37 genes
(Figure 9.1). The nuclear genome provides the great bulk of essential genetic
information and is partitioned between either 23 or 24 different types of chromo
somal DNA molecule (22 auto somes plus an X chromosome in females, and an
additional Y chromosome in males).
Mitochondria possess their own genome-a single type of small circular
DNA-enCOding some of the components needed for mitochondrial protein syn
thesis on mitochondrial ribosomes. However, most mitochondrial proteins are
encoded by nuclear genes and are synthesized on cytoplasmic ribosomes before
being imported into the mito chondria.
As detailed in Chapter 10, sequence comparisons with other mammalian
genomes and ve rtebrate genomes indicate that about 5% of the human genome
has been strongly conserved during evolution and is presumably functionally
important. Protein-coding DNA sequences acco unt for just 1.1 % of the genome.
The other 4% or so of strongly conserved genome sequences consists of non
protein-coding DNA sequences, including genes whose final products are func
tionally important RNA molecules, and a variety of cis-acting sequences that
regulate gene expression at DNA or RNA levels. Although sequences that make
non-protein-coding RNA have not generally been so well conserved during evo
lution, some of the regulatory sequences are much more strongly conserved than
protein-coding sequences.
Protein -coding sequences frequently belong to families of related seq uences
that may be organized into clusters on one or more chromosomes or be dispersed
th.roughout the genome. Such families have arisen by gene duplication during
evolution. The mechanisms giving rise to duplicated genes also give rise to non
functional gene-related sequences (pseudogenesJ.
One of the big surprises in the past few years has been the discovery that the
human genome is transcribed to give tens of thousands of different noncodin.g
RNA transcripts, including whole new classes of tiny regulatory RNAs not previ
ously identified in the draft human genome sequences published in 2001.
Although we are close to obtaining a definitive inventory of human protein-cod
ing genes, our knowledge of RNA genes remains undeveloped. It is abundantly
clear, however, that RNA is functionally much more versatile than we previously
suspected. In addition to a rapidly increasing list of human RNA genes. we have
also become aware of huge numbers of pseudo gene copies of RNA genes.
A very large fraction of the human genome, and other complex genomes, is
made up of highly repetitive no ncoding DNA sequences. A sizeable component
is organized in tandem head-to-tail repeats, but the majority consists of inter
spersed repeats that have been copied from RNA transcripts in the cell by reverse
GENERAL ORGANIZATION OFTHE HUMAN GENOME 257
Lys
starts from common promoters in the CR/D-Ioop region and continues round
the circle (in opposing directions for the two different strands), to generate large
multigenic transcripts. The mature RNAs are subsequently generated by cleavage
of the multigenic transcripts.
Almost two-thirds (24 out of 37) of the mitochondrial genes specity a func
tional noncoding RNA as their final product. There are 22 tRNA genes, one for
each of the 22 types of mitochondrial tRNA. In add ition, two rRNA genes are ded
icated to making 16S rRNA and 12S rRNA (components of the large and small
subunits, respectively, of mitochondrial ribosomes). The remaining 13 genes
encode polypeptides, which are synthesized on mitochondrial ribosomes. These
13 polypeptides form part of the mitochondrial respiratory complexes, the
enzymes of oxidative phosphorylation that are engaged in the production of ATP
However, the great majority of the polypeptides that make up the mitochondrial
oxidative phosphorylation system plus all other mitochondrial proteins are
encoded by nuclear genes (Table 9,) ). These proteins are translated on cytoplas
mic ribosomes before being imported into the mitochondria.
Unlike its nuclear counterpart, the human mitochondrial genome is extremely
compact: al137 mitochondrial genes lack introns and are tightly packed (on aver
age, there is one gene per 0.45 kb) . The coding sequences of some genes (notably
those encoding the sixth and eighth subunits of the mitochondrialATP synthase)
show some overlap (Figure 9.4) and, in most other cases, the coding sequences of
neighboring genes are contiguous or sepatated by one or two uoncoding bases.
Some genes even lack termination codons; to overcome this deficiency, UM
codons have to be introduced at the post-transcriptional level (see Figure 9.4).
The mitochondrial genetic code
Prokaryotic genomes and the nuclear genomes of eukaryotes encode many hun
dreds to usually many thousands of different proteins. They are subject to a uni
versal genetic code that is kept invariant: mutations that could potentially change
GENERAL ORGANIZATION OF THE HUMAN GEN OM E 259
NADH dehydrogenase 7 42
rRNA 2 0
,RNA 22 0
Ribosomal proteins 0 79
alncl udes mitochondrial DNA and RNA potymerases plus numerous other enzymes, structural and
the genetic code are likely to produce at least some critically misfunctional pro
teins and so are strongly selected against. However, the much smaller mitochon
drial genomes make very few polypeptides. As a result, the mitochondrial genetic
code has been able to drift by mutation to be slightl y different from the uni versal
genetic code.
In the mitochondrial genetic code there are 60 codons that specify amino
acids, one fewer than in the nuclear genetic code. There are four stop co dons:
UAA and UAG (which also serve as stop codons in the nuclear genetic code) and
AGA and AGG (which specify arginine in the nuclear genetic code; see Figure
1.25). The nuclear stop codon UGA encodes tryptophan in mitochondria, and
AUA specifies methionine not isoleucine.
The mitochondrial genome sp ecifies all the rRNA and tRNA molecules needed
for synthesizing proteins on mitochondrial ribosomes, but it relies on nuclear
encoded genes to provide all other components, such as the protein components
of mitochondrial ribosomes and aminoacyl tRNA synthetases. Because there are
only 22 different types of hwnan mitochondrial tRNA, individualtRNA molecules
need to be able to interpret several different codons. This is possible because of
third -base wobble in codon interpretation. Eight of the 22 tRNA molecules have
anticodons that each recognize families of four codons differing only at the third
base. The other 14 tRNAs recognize pairs of codons that are identical at the first
two base positions and share either a purine or a pyrimidine at the third base. Figure- 9 .4 The MT·ATP6 and MT-ATPB
Between them, therefore, the 22 mitochondrial tRNA molecules can recognize a genes are transcribed in different
to tal of 60 codons ((8 x 4) + (14 x 2) 1. reading frames from overlapping
segments of the mitochondrial H
strand. MT·ATP8 is transcribed from
nu cleotides 8366 to 8569, and MT-ATP6
f - - - - - - - - MT47P8--------- - - ----> from 8527 [09204. Ahe r transcription,
8366 8522 8577 92029206 the RNA encoding the ATP synthase 6
,
lI.~t
I
53
P'O
p. -If- n~A7.~_-JUc :: .;, ;.;~Q::r.;,t8Q~~~89~Q~~?-,r~lk.
I_o'-..J(_ L-JL-....!L....lL-lL-..J_..
_~~M~~~~_I~~Mlli~~~~
..
06
L~ T~~ lie ~ $er Leu H,s Sel Leu Pt~ Pro Gin Ser Stop
I
-Ir '.--J
A(~h~
~
1 subu nit is cleaved after position 9206
and polyaden ylated, resu lting in a
C-termina l UAA codon where the fi rst
two nucleotides are derived ultimately
from the TA at positions 9205- 9206
1 17 :'126 and the third nucleotide is the first A of
-------~fr"rn; ---------- - - - ' l the poly(A) tail.
260 Chapter 9: Organization of the Human Genome
Number of different DNA molecules 23 (in XX cells) or 24 (in XY cells); all linear one circular DNA molecule
Total number of DNA molecules pe t cell varies according to ploidy; 46 in diploid ce lls often several thousand copies (but copy number
varies in different cell types)
Associated protei n several classes of histone and nonhistone protein IClrgely free of protein
Repetitive DNA more than 50% of genome; see Fi gure 9.1 very little
Transcription genes are often independently transcribed mu[tigenic transcripts are produced from both
the heavy and light stra nds
Codon usage 6 1 amino acid codons pl us three stop codonsa 60 amino acid codons pl us four stop codons~
Recombination at least once for each pair of homologs at meiosis not evident
Chromosome sizes are taken from the EN SEMBl Human Map View (hnp;J!www.ense mbl.org/ Homo_sa piens/Location/Genome). Heteroc hromatin
ligures are estimates abstracted from International Human Genome Sequencing Consortium (2004) Nature 431 ,93 1-945. The size of the total human
genome is estima ted to be about 3.1 Gb, with euchromatin accounting for close to 2.9 Gb and heterochromatin acco unting for 200 Mb.
centromere. Certain chromosomes, notabl y 1, 9, 16, and 19, also have significant
amounts of heterochromatin in the euchromatic region close to the centromere
(pericentromere), and the acrocentric chromosomes each have two sizeable het
erochromatic regions. Bu t the most sign ificant representation is in th e Y chromo
some, where most of the DNA is organized as heterochromatin.
The base composition of the euchromatic component of the human genome
averages out at 41 % (G+C), but there is considerable variation between chromo
somes, fro m 38% (G+C) for chromosomes 4 and 13 up to 49% (for chromosome
19). It also varies considerably along the lengths of chro mosomes. For example,
the average (G+C) content on chromosome 17q is 50% for the distal 10.3 Mb but
drops to 38% for the adjacent 3.9 Mb. There are regions of less than 300 kb with
even wider swings, for example from 33.1% to 59.3% (G+C).
The proportion of some combinations of nucleotides can vary considerably.
Like other vertebrate nuclear genomes, the human nuclear genome has a con
spicuo us shortage of the dinucleotide CpG. However, certain small regions of
transcriptionally active DNA have the expected CpG density and, significantly,
are un methylated or hypomethyJated (epG islands; Oox9.1).
The human genome contains at least 26,000 genes, but the exact
gene number is difficult to determine
Several years after the Human Genome Project delivered the first reference
genome sequence, there is still very considerable uncertainty a bout the total
human gene number. When the early anal yses of the genome were reported in
2001 , the gene catalog generated by the International Human Genome Sequenc
ing Consortiu m was very much oriented toward protein-coding genes. Original
estimates suggested more than 30,000 human protein-coding genes, most of
which were gene predictions without any supportive experimental evidence. This
number was an overestimate because of errors that were made in defining genes
(see Box 8.5).
To validate gene predictions supportive evidence was sought, mostly by evo
lutionary comparisons. Comparison with other mammalian genomes, such as
262 Chapter 9: Organization of the Human Genome
DNA methylation in multicellular animals o ften involves methylation (A) cytosine (8)
of a proportion of cytosine residues, giving S-methylcytosine (mC). m
NH, S· - CpG - 3·
In most animals (but not Drosophila mefanogaster), the dinucleotide
CpG is a common target for cytosine methylation by specific cytosine
I m
'c 3· - GpC 5'
methyitransferases, forming mCpG (Figure lA).
DNA methylation has important consequences for gene
expression and allows particu lar gene expression patterns to be stably
-3 N~
I
' CH
,.,p C , .1"'CH
0'" J. ~M I>
II ! deamination
!
of host defense against transposons. Vertebrates have the highest methylation
at 5' carbon m
levels of S-methyicytosine in the animal kingdom, and methylation 3·- GpC - S ·
is dispersed throughout vertebrate genomes. However, only a small
percentage of cytosines are methylated (about 3% in human DNA,
mostly as mCpG but with a small percentage as mCpNpG, w here N is
NH,
I
c
1 DNA duplICation
those of the mouse and the dog. failed to identify coun terparts of many of the
originally predicted human genes. By late 2009 the estimated number of human
protein-coding genes appeared to be stabilizing somewhere around 20,000 to
21,000. but huge uncertainty remained about the number of human RNA genes.
RNA genes are difficult to identify by using computer programs to analyze
genome sequences: there are no open reading frames to screen for, and many
RNA genes are very small and often not well conserved during evolution. There is
also the problem of how to define an RNA gene. As we detail in Chapter 12. com
prehensive analyses have recently suggested that the great majority of the
genome-and probably at least 85% ofnucleotides-is transcribed. It is currently
unknown how much of the transcriptional activity is background noise and how
much is functionally significant.
By mid-2009. evidence for at least 6000 human RNA genes had been obtained,
including thousands of genes encoding long noncoding RNAs that are thought to
be important in gene regulation . In addition, there is evidence for tens of thou
sands of different tiny human RNAs, but in many such cases quite large nwnbers
of different tiny RNAs are obtained by the processing of single RNA transcripts.
We look at noncoding RNAs in detail in Section 9.3.
The combination of about 20.000 protein ·coding genes and at least 6000 RNA
genes gives a total of at least 26,000 human genes. This remains a provisional
total gene number; defining RNA genes is challenging and it will be some time
before we obtain an accurate human gene nwnber.
GENERAL ORGANIZATION OFTHE HUMAN GENOME 263
such repeats are said to be direct repeatsifthe head of one repeat joins the tail
of its neighbor (-->--.) or inverted repeats if there is joining of heads (-><--) or
A A
+
tails (<-.... ). Over a long (evolutionary) time-scale the duplicated sequences ~ . ....... .
TABLE 9.4 STRUCTURAL VARIATION IN SIZE AND ORGANIZATION OF HUMAN PROTEIN-CODING GENES
Human protein Size of protein (no. Size of gene No. of Coding DNA (%) Average size of Averag e size of
of amino acids) (kb) exons exon (bp) intran (bp)
Where isoforms are evident, the given figures represent the largest isoforms. As genes get larger, exon size remainsfairly co nsta nt but intron sizes can
become very large. Internal exo ns tend to be fairly uniform in size, but the terminal exon Of some exons near the 3' end can be many kilobases long;
for example, exon 26 of the APOB gene is 7.5 kb long. Note the extraordinarily high exon content and comparatively small intron sizes in the genes
encoding type VII COllagen and titin. In addition to SRY, other single-exon protein-cod ing genes in the nuclear genome include retrog enes [see
Table 9.8) and genes en coding other SOX proteins, interferons, histones, many G-protein-coupled receptors, heat shock protei ns. many flOOl1uc!e2.s5,
and various neurotransmitter rece ptors and hormone receptors.
266 Chapter 9: Organization of the Human Genome
introns, there is an inverse correlation between gene size and fraction of coding
DNA (see Table 9.4J. This does no t arise because exons in large genes are smaller
than those in small genes. The average exon size in human genes is close to
300 bp, and exon size is comparatively independent of gene length. Instead, there
is huge variation in intron lengths, and large genes tend to have very large introns
(see Table 9.4J. Transcription oflong introns is, however, costly in time and energy;
transcription of the 2.4 Mb dystrophin gene takes about 16 hours. Thus, very
highly expressed genes often have short introns or no introns at all.
Repetitive sequences within coding DNA
Highly repetitive DNA sequences are often found within introns and flanking
sequences of genes. They will be detailed in Section 9.4. In addition, repetitive
DNA sequences are found to different extents in exons. Tandem repetition of very
short oligonucleotide sequences (1--4 bpJ is frequent and may simply reflect sta
tistically expected frequencie s for certain base compositions. Tandem repetition
of sequences encoding known or assumed protein domains is also quite com
mon, and it may be functionally advantageous by providing a more available bio
logical target.
The sequence identities between the repeated protein domains are often
quite low but can sometimes be high. Lipoprotein Lp (a), encoded by the LPA
gene on chromosome 6q26, provides a classical example. It contains multiple
tandemly repeated kringle domains, which are each about 114 amino acids long
and form disulfide-bonded loops. The different kringle domains are often nearly
identical in amino acid sequence. Even at the nucleotide sequence level the DNA
repeats that encode the kringle domains show very high levels of sequence iden
tity, making them prone to unequal crossover. As a result, the LPA gene is subject
to length polymorphism, and the number of kringle domains in lipoprotein Lp
(a) varies but is usually 15 or more.
GENE NUMBER
Smallest num ber in one gene 1 (no introns- see Tables 9.4 and 9.7 for example)
Average exon size 288 bp (but exons at 3' end of genes tend to be
la rge)
Smallest noncoding RNA < 20 n uc leotides (e.g. many transcriptio nal start
Site-associated RNAs are 18 nucleotides)
aNote that the tota l size can vary between haplotypes because of copy-number variation. bFor the
longest isoform. Data were obtained largely from ENSEMBL rele ase 55 datasets.
268 Chapter 9: Organization of the Human Genome
IA)
o 900 kb
'"""''"''\ ,
+- ~ -+ +- ~~-+-+-+~-+ +- +- ~ ~+- -+~++ ¥ -++- -+ -+ /'
(8)
sense
of NFlstrand
gene .
.....
exon 27b
•
intron 27b
5' [::1C====~~~=====::;;;=;;;;;'::~=====------
==-=':J 3'
.....
exon 28
separate transcript for each gene. Such genes are said to form part of a poly
cistronic (= multigenic) transcription unit. Polycistronic transcription units are
common in simple genomes such as those of bacteria and the mitochondrial
genome (see Figure 9.3). Within the nuclear genome, some examples are known
of different proteins being produced from a common transcription unit. Typically,
they are produced by cleavage of a hybrid precursor protein that is translated
from a common transcript. The A and B chains of insulin, which are intimately
related functionally, are produced in this way (see Figure 1.26), as are the related
peptide hormones somatostatin and neuronostatin. Sometimes, however, func
tionally distinct proteins are produced from a common protein precursor. The
UBA52 and UBABO genes, for example, both generate ubiquitin and an unrelated
ribosomal protein (S27a and L40, respectively).
More recent analyses have shown that the long-standing idea that most
human genes are independent transcription units is not true, and so the defini
tion of a gene will need to be radically revised. Multigenic transcription is now
known to be rather frequent in the human genome, and specific proteins and
functional noncoding RNAs can be made by common RNA precursors. This will
be explored further in Section 9.3.
Growth hormone gene clu ster 5 clustered withi n 67 kb; one pseudogene (Fig ure 9.8) 17q24
Class I HLA heavy chain genes -20 clustered over 2 Mb (Fig ure 9.1 0) 6p21
HOX genes 38 organized in four clusters (Figure 5.5) 2q31 , 7p 15, 12q13, 17q21
Histone gene family 61 modest-sized clusters at a few locations; two large clusters many
on chromosome 6
Olfac tory receptor gene family > 900 about 25 large clusters scattered throughout the genome many
NFl (neurofibromatosis type I) > 12 one functional gene at 22q1 1; others are non processed many, mostly peri centrome ri c
pseudogenes or gene fragments (Figure 9.11)
Ferritin heavy cha in 20 one functional gene on ch romosome 11; most are many
processed pseudogenes
TABLE 9.7 EXAMPLES OF HUMAN GENES WITH SEQUENCE MOTIFS THAT ENCODE HIGHLY CONSERVED DOMAINS
Homeobox genes 38 HOX genes plus 197 homeobox spe<ifies a homeodomain of -60 amino acids; a wide variety of different
orphan homeobox genes subclasses have been defined
PAX genes 9 paired box encodes a pa ired domain of - 124 amino aCids; PAX genes often also have a
type of homeodoma in known as a paired-type homeodomain
sox genes 19 SRY-like HMG box which encodes a domain of 70-80 amino aci ds
Pseudogenes are usually thought of as defective copies of a functional http://www.pseudogene.org). Only about 10% of the 21,000 human
gene to wh ich they show sign ificant sequence homology. They protein-coding genes have at least one processed pseudogene, but
typically arise by some kind of gene duplication event that produces highly expressed genes tend to have multiple processed pseudogenes.
two gene copies. Selectio n pressure to conserve gene fu nc tion need For example, the cytoplasmic ribosom al protein contains 95 functional
on ly be imposed on one gene copy; the other copy can be allowed genes encoding 79 different protein s (16 genes are duplicated) an d
to mutate more free ly (genetic drift) and can pick up inactivating 2090 processed pseudogenes.
mutations, prod ucing a pseudogene. However, some seq uen ces are RNA pseudogenes are often difficult to id entify as pseudogenes
referred to as pseudogenes even though they have not originated (there is no reading frame to inspect, and RNA genes often lack
by DNA copying. For example, as we will see in Chapter 10, humans introns). Nevertheless, pseudogene copies of many small RNAs are
have rare solirary pseudogenes t hat are clearly orthologs offu nctional common (see the t able below), nota bly if they are transcribed by RNA
genes in the grea t apes and became defective after acqu iring harmful polyme rase III (genes transcribed by RNA polymerase III often have
mutations in the human lineage. internal promoters).
Different gene duplica tion mechanisms can give rise to multiple
functional gene copies and defective pseudogenes. Either the
genomic DNA sequence is copied, or a cDNA copy is made (after
RN A famil y 1 Number of
human genes
Number of related
pseudogenes
reverse tra nscription of a processed RNA transcript) that integrates
into genomic DNA. For a protein-cod ing gene, copying at the genomic U6 snRNA 49 - 800
DNA level can result in du plication of the promoter and upstream
U7 snRNA 1 85
regulatory sequences as well as of all exons and introns. A defective
gene that derives from a copy of a genomic DNA sequence is known
as a nonprocessed pseudogene (Flgur.'A). Such pseudogenes
YRNA
I 4 - 1000
usually ari se by tandem duplication so that t hey are located close to As will be described in Sect ion 12.4, the Alu repeat, the most
functional gene counterpa rts (see Figures 9.8 and 9.10B for examples), abundant sequence in the human genome, seems to have originated
but some are dispersed as a result of recombination (see Figure 9.11 by the copying of 7Sl RNA transcripts, and many other highly
for exa mples). repeated interspersed DNA families in mammals are copies of tRNA_
Copying at the cDNA level produces a gene copy that typically So, in a sense, RNA pseudogenes have come to be the most common
lacks introns, promoter elements, and upstream regu latory sequence elements of mammalia n genomes.
elements. Very occas ionally, a processed gene copy can retain some All th e pseudogenes are loca ted in the nuclea r genome, bu t they
function (a retrogene; see Tab le 9.7). However, beca use they lac k do include defective copies of genes that reside in the mitochondrial
important seq ue nces needed for expression, most processed gene ge nome (mitochondrial pseudogenes) . The mitochond ria l genome
copies degenerate into processed pseudogenes (sometimes called originated from a much larger bacterial genome and over a long
retrotransposed pseudogenes; Figure 1B). evolutionary time-scale much of the DNA of the la rge precurso r
Prevalence and functionality of p seudogenes mitochondrial genome migrated in a series of independent integration
Eukaryotic genomes typically have many pseudogenes. A long events into what is now th e nuclea r genome_mtDNA pse ud ogenes
standin g ra tiona le for their abundance is that gene duplication is now account for at least 0.016% of nu clear DNA (or abo ut 30 times the
evolutionarily advantageous. New functional gene variants can be content of the mitochondrial genome).
created by gene duplication, and pseudogenes have lon g been viewed The func ti onalit y of pseudoge nes has been an endu rin g debate,
as unsuccessful by-products of t he duplication mechanisms. Although and different pseudogene classes have been envisaged. A significa nt
some prokaryotic genomes see m to have many pseudoge nes, number of pseudoge nes (mostly processed pseud ogenes) are
pseudogenes are generally rare in pro karyotes because their genomes transcribed, and antisense pseudogene transcripts may regu late
are generally designed to be compact. parent genes. Pseudogenes have also bee n di rectly Implicated in
The great majority of w hat are conventionally recogn ized as the productio n of endogenous siRNAs that regulate transposo ns, as
huma n pseudogenes are copies of protein-coding genes simply described in Section 9.3. Fina lly, some pse udogene seq uences may be
because it is relatively easy to identify them (by looking for co~opted for a different function. Th ey have been described as exapted
frameshifting, splice site mu tations, and so on). There are more than p5eudogenes. An example is provided by the XIST gene. It makes a
8000 different processed pseudogene copies of protein-codi ng noncoding RNA that regulates X·ch romosome inactivation, and two of
genes in the human genome, plus more than 4000 nonprocessed its six exons are known to have originated from a pseudogene copy of
pseudogenes (see the pseudogene database at a protein-codi ng gene.
A A
(A) -~ I I (B) ~
Flgu.nI! "\ Origins of nonprocessed and processed
pseudogenes. (A) Copying of genomic DNA 1gene duplication I transcription
.J, and processing
sequence containing gene A can produce duplicate
copies of gene A. Strong selection pressure needs .! 1 I I ] mRNA
+I
A A reverse
to be applied to one of the copies to maintain
gene fun ction (bold arrow), but the othe r copy -e-r:.=:IIr::1::::II :J- -e--==o=o transcriptase
ca n be allowed to mutate (dashed arrow). If it I mutational : mutationa l II] cDNA
picks up ina ctivatin g mutatio ns (red ci rcles), a
non processed pseudogene ('VA) ca n arise.
~ constraint
V
: flexibility
---1
-L
chromosomal
Integration
A "A
(B) A processed pseudogene arises afte r cellul ar
reverse tra nscriptases co nvert a transcript of a ge ne
- o-t_ _LLi L J-
into a cDNA that then is ab le to integrate back into
the ge nom e (see Fig ure 9.12 for details). Th e lack
o • •
l mutation
of important sequences such as a promo ter usually promoter inactivating non·inactivating --~~--------
res ults in an inact ive gene copy. mutation mutation VA
PROTEIN-CODING GENES 273
(Al 3' UTR Figure 9 .10 The class I HLA gene family: a
[~'~:r~·.2:J : TM j ~~.~ cl u stered gene family with nonprocessed
pseudogenes and gene fragments.
(Al Structure of a class I HLA heavy·chain
(81 mRNA. The full -length mRNA contains a
-2.2 Mb polypeptide-encoding sequence with a
L.
~
1
--
8 C
•
E 2 't' '¥ 3.!. 4't' 5't' .2. 671.
i i
/ ttU
, ;\. . . , _ " "
:;::i'j1
leader sequence (L), three extracellular
domains (aI, 0'2, and a'J), a transmem brane
sequence (TM), a cytoplasmic tail (CY), and
a 3' untranslated region (3' UTR). The three
\\ /
~,.... //
/
I
I
I /
!
3'UTR
extracellular domains are each encoded
essentially by a single exon. The very sma ll
"';;1.. !~~~._
--'
~".v",,,.vv- / ! /-F0i"3I ™"CY}vvvvvvv 5' UTR is not shown. (8) The class I HLA heavy
chain gene cluster Is loca ted at 6p21J and
/ 3'UTR i
~
~
comprises about 20 genes. They include six
--6.t--- ,1M
~ .~
3'UTR
expressed genes (filled blue boxes), four
full -length non processed pseudogenes
(long red open boxes labeled 0/), and a
variety of partial gene copies (s hort open red
boxes labeled 1-7). Some of the latter are
transcripts such as mRNA to make cDNA that can then integrate into chromo truncated at the 5' end (e.g. 1, 3, 5, and 6),
some are truncated at the 3' end (e.g. 7), and
somal DNA (Figure 9.12). Processed pseudogenes are common in interspersed
some contain single exam (e.g. 2 and 4).
gene families (see 'fable 9.5).
Processed pseudo genes lack a promoter sequence and so are typically not
expressed. Sometimes, however, the cDNA copy integrates into a chromosomal
DNA site that happens, by chance, to be adjacent to a promoter that can drive
expression of the processed gene copy. Selection pressure may ensure that the
processed gene copy continues to make a functional gene product, in which case
it is desc ribed as a retrogene. Avar iety of intIonless retrogenes are known to have
testis-specific expression patterns and are typically autosomal homologs of an
intron-containing X-linked gene (Table 9.8) .
One rationaJe for retrogenes may be a critical requirement to overcome the
lack of expression of certai n X- linked sequences in the testis during male meiosis.
During male meiosis, the paired X and Y chromosomes are converted to hetero
chromatin, forming the highly condensed and transcriptionally inactive XYbody.
Autosomal retrogenes can provide the continued synthesis in testis cells of cer
tain crUCially important products that are no longer synthesized by genes in the
highly condensed X'{ body. Fig" '. 9.11 Dispersa l of nonprocessed
NFl and PKDJ pseudogenes as ~ result
of pericentromeric or 5ubtelomeric
instability. (A) The NFl neurofibromatosis
l Ob 13 27b type I gene is located close to the
IAI
centromere of human chromosome 17. It
I I
NFl
1 7QIL2 !t1 1" I t l nOI -II ml- - -- - - lIltrlf nm ' l spans 283 kb and ha s S8 exons. Exons are
represented by thin vertical boxes; intron s
are shown by connecting chevrons (A).
dispersed NFl
pseudogenes
rfn'IIY 'II m1 - '5p,1.2 (2 copies l Highly homologous defective copies of the
Nfl gene are found at nine or more other
/ 2012--q13 gen ome locations, mostly in pericentromeric
I II I""-.. 14pll<>qll regions. Each copy ha s a portion of the full ·
22pll<>qll length gene, with both exons and introns.
Seven examples are shown here, such as
•..-
-•
---
a result of segmental duplication events
.,. --
'f' 1 during primate evo luti on, large components
,¥PKDl 'P2 of this gene have been duplicated and Six
gene 'P:' PKDI pseudoge nes are located at 16p13.11,
clu slGr
't'4' with large blocks of sequence (shown as blue
16p13.11
'f'5 boxes) copied from the PKDl gene (asterisks
..6 represent the absence of counterparts to
PKOI sequences) .
274 Chapter 9: Organization of the Human Genome
-~ 8
and retrogenes originate by reverse
E2 E3 "-.A)o!,AAN, - - 3' transcription from RNA transcripts_
~;i= n 5 ' --
3' - - -TITTTN',, - - 5' (Al In this example, a protein-codi ng gene
'c, with three exons (El-E3) is transcribed from
l transcriPtlon and
RNA processing
1 reverse
transeriptase
3 "-nmW,., TIn --- TTTTT - 5'
second-strand s~thesis
using cellular reverse transcriptase function
(provided by lINE·l repeats). (8) Integration
TABLE 9.8 EXAMPLES OF HUMAN INTRONlESS RETROGENES AND THEIR PARENTAlINTRON-CONTAINING HOMOlOGS
TAF1L at 9p13 TAFI at Xq13 TATA box binding protein associated factor, 250 kD
Proteins cannot self-replicate. and so many evolutionary geneticists consider that auux:aralytic
nucleic acids must have pre-dated proteins and were able to replicate without the help of
proteins. The RNA world hyporhesis developed (rom ideas proposed by Alexander Rich and
Carl Woese in the 19605. It imagines that RNA had a dual role in the earliest stages of life, acting
as both the genetic material (with the capacity for se lf-replication) and also as effector molecules.
Both roles are still evident today: some viruses have RNA genomes, and noncoding RNA molecules
can work as effector molecules with catalytic activity. RNAse P. for example, is a ribozyme that
can cleave substrate RNA without any requirement for protein, and certain types of intron
are autocatalytic and able to splice themselves out of RNA transcripts without any help from
protein s (see text). Another observation consistent with RNA's being the first nucleic acid is that
deoxyribonucleotides are synthesized from ribonucleotides in cellular pathways.
As well as storing genetic information, RNA has been imagined to have been used
subsequently to synthesize proteins from amino acids. Different RNAs, incl uding rRN A and
tRNA, are central in assisting polypeptide synthesiS. Many ribosomal proteins can be deleted
Without affecting ribosome function, and the crucial peptidyl transferase activity-the enzyme
that catalyzes the formation of peptide bonds-is a ribozyme. However, RNA has a rath er rigid
backbone and so is not very well suited as an effector molecule. Proteins ilre much more fl exible
and also offer more functional variety because the 20 amino acids can have widely different
structures and offer more possible sequence combinations (a decapeptide provides 20 10 or about
10 13 different possible amino acid sequences, whereas a decanucleotide has 4 10 or about 106
different possible sequences).
The rep lacement of RNA with DNA as an information storage molecule provided Significant
advantages. DNA Is much more stable than RNA, and so is better suited for thiS task. Its sugar
residues lack the 2' OH group on ribose sugars that makes RNA prone to hydrolyt ic cleavage.
Greater effici ency could be achieved by separating the storage and transmission of genetic
information (DNA) from protein synthesis (RNA). All that was needed was the development o f a
reverse transcriptase so that DNA could be synthesized from deoxynucleotides by using an RNA
template.
We have known for many decades that various ubiquitous ncRNA classes are
essential for cell function. Until recently, however, we have largely been accus
tomed to thinking ofncRNA as not much more than a series of accessories that are
needed to process genes to make proteins. Transfer RNAs are needed at the very
end of the pathway, serving to decode the codons in mRNA and provide amino
acids in the order they are needed for insertion into growing polypeptide chains.
Ribosomal RNAs are essential components of the ribosomes. the complex ribo
nucleoprotein factories of protein synthesis.
Other ubiquitous ncRNAs were known to function higher up the pathway to
ensure the correct processing of mRNA, rRNA, and tRNA precursors. Various
small RNAs are components of complex ribonucleoproteins involved in different
processing reactions, including splicing, cleavage of rRNA and tRNA precursors,
and base modifications that are required for RNA maturation. Jypically these
RNAs work as guide RNAs, by base pairing with complementary sequences in the
precursor RNA.
We have also long been aware of a few ncRNAs that have other functions, such
as RNAs implicated in X-inactivation and imprinting, and the RNA component of
the telomerase ribonucleoprotein needed for synthesis of the DNA of telomeres
(see Figure 2.13). But these RNAs seemed to be quirky exceptions.
In the past decade or so, however, there has been a revolution in how we view
RNA. Many thousa nds of different ncRNAs have recently been identified in ani
mal cells. Many of them are developmentally regulated and have been shown to
have crucial roles in a whole variety of different processes that occur in special
ized tissues or specific stages of development. Several ncRNAs have already been
implicated in cancer and genetic disease.
Now that the human genome is known to have close to 20,000 protein-coding
genes. about the same as in the I mm nematode Caenorhahditis eiegans, which
has only about 1000 cells-the question now is whether RNA-based regulation is
the key feature in explaining our complexity. Certai nly, the complexity of RNA
based gene regulation increases markedly in complex organisms, as is described
in Chapter 10. Maybe it is time to view the genome as more of an RNA machine
than just a protein machine.
276 Chapter 9: Organization of the Human Genome
-
cluster. (8) A single gene can have m ultiple (BI short RNAs snoRNA ~ shortRNAs
transcri pt ional sta rt sites (right-angled arrows) ----t miRNA.
as well as many interleaved coding and proteln..codlng RNA -0
noncoding transcripts. Exons are shown as )
blue boxes. Known short RNAs such as small
nucleolar RNAs (sn oRNAs) and microRNAs c'-
5' _ .-.l _._-I_._L __-II-I.1- 3'
(miRNAs) can be processed from intronic
3'- ...,- - - - - - ·,-- - - - - - - - 5'
sequences. and novel species of short RNAs
that cluster around the beginning and end
J J
of genes have recently been discovered (see a~isense ( antisense
short RNAs
the text). (From Gingeras TR (2007) Genome ncRNA ncRNA
Res. 17, 682-690. With permission from Cold
Sp ring Harbor laboratory Press.] short RNAs
RNA GENES 277
J___ /I
regu,au\on~
control
and export speo jjk
/
~ rL
splicing
splicing regulator
snRNA
general HHIJ-52
tra nsCfi~tjon lSnoRNA
cleavage regulation
mRNA TERC piRNA
Of pre.tR~1
I
long RNAs in
U16nRNA cis-antisense
\
rRNA <RNA
U2 snRNA
regulation I endo-siRNA
RNDsti P I
T$IX
tRNA base RNa50a SIlAJ FlNA
MRP
RN,seMn,/
of pre-rRNA mod ification AIR
7Sl RNA 75K RNA
epigenetic long RNAs as
RNAi-based gene trans-acting
gene silencing regu lation regulators
snoRNA scaRNA
endo-sl RNA H19
HBU-85
S(l\l~N"
Figur. 9.13 Functional diversity of RNA. Various ubiquitous RNAs func tion in housekeeping roles in cells, and in protein synthesis
and prOtein export from cells (using 7SL RNA, the RNA component of the sig nal recog nitio n particle). RNAs involved in RNA
maturation include spliceosomal small nuclear RNAs (sn RNAs), sma ll nucleolar RNAs (s noRNAs), small Cajal body RNAs (scaRNAs),
and two ANA ribonuc1eases. Telome re DNA synthesis is performed by a ribonucleoprotein that consists ofTERC (the telomerase RNA
component) and a reverse transcriptase (see Figure 2.13) . 1he Y RNA family are involved In chromosomal DNA replication, and RNase
MRP has a cruci al role in initiating mtDNA replication as well as in cleav ing p re -rANA precurso rs in the nucleolus. Diverse classes of
RNA serve a regulatory function in gene expression. Although some do have general accessory roles in transcription, regulatory
RNAs are often restri cted in their expression to certain cell types and/or developmental stages. Three classes of tiny RNA use RNA
interference (RNAi) pathways to act as regulators: Piwi protein-interacting ANAs (pIRNAs) regulate the activity of transposons in
germ-line cells; mi croANAs (miRNAs) regulate the expression of target genes; and endogenous short interfering RNAs (endo-siRNAs)
act as gene reg ulato rs and also regulate some types of transposon. Some members of the snoRNA famil y, such as H811-85 snoRNA,
are involved in epigenetic gene regulation. A large number of long RNAs are Involved in regul ating various genes, often at the
transcriptiona l level. Some are known to be involved in epigenetic gene regulation in imprinting, X-i nactivation, and so on. RNA
familie s are shown in black type; individual RNAs are shown in red.
Ribosomal 125 rRNA, 165 rANA ) of each components of mitochondrial ribosomes cleaved from multigenic transcripts produced
RNA (,RNA), by H strand of mtDNA (Figure 9.3)
-120-5000
nucleotides 55 rRNA, S.8S rRNA, 1 of each components of cytoplasmic ribosomes 5S rRNA is encoded by multiple genes in
185 rRNA, 28S rRNA various gene dusters; 5.85, 18S, and 28S
rRNA are cleaved from multigenic transcripts
(Figure 1.22); the multigenic 5.85-185-285
transcription units are tandemly repeated on
each of 13p,14p, 15p, 21p, and 22p (= rONA
clusters)
Transfer mitochondrial family 22 decode mitochondrial mRNA to make 13 single-copy genes. tRNAs are cleaved from
RNA ('RNA), proteins on mitochondria l ribosomes multigenic mtONA transcripts (Figure 9.3)
- 70-80
nucleotides cytoplasmic family 49 decode mRNA produced by nuclear genes 700 tRNA genes and pseudogenes dispersed
(Figure 9.13) at multiple chromosomal locations with some
large gene clusters
Small nuclear spliceosomal fami ly 9 U1, U2, U4, US, and U6 snRNAs process about 200 spliceosomal snRNA genes are
RNA (snRNA), with subclasses Sm standard GU-AG introns (Figure'.1 9); found at multiple locations but there are
-60-360 and Lsm (Table 9.1 0) U4atac, U6atac, Ul 1, and U12 snRNAs moderately large clusters of Ul and U2 snRNA
nucleotides process rare AU-AC introns genes; most are transcribed by RNA pol (I
non-spliceosomal several U7 snRNA: 3' processing of histone mRNA; mostly sing le-copy functional genes
snRNAs 7SK RNA: general transcription regu lator;
Y RNA family: involved in chromosomal
DNA replication and regulators of ceil
proliferation
Small C/O box dass 246 maturation of rRNA, mostly nucleotide usually within introns of protein-coding
nucleolar RNA (Figu re 9.1 SA) site-specific 2'-O-ribose methylations genes; multiple chromosomal locations, but
(snoRNA), some ge nes are found in multiple copies in
-60- 300 H/ACA class 94 maturation of rRNA by modifying uridines gene clusters (such as the H611-52 and HBII-aS
nUcleotides (Figure 9.156) at specific positions to give pseudouridine clusters-Figure 1 , .22)
Sma ll Cajal 25 maturation of certain snRNA classes in usually within introns of protein-coding
body RNA Cajal bodies (coiled bodies) in the nucleus genes
(scaRNA)
Vault RNA 3 components of cytoplasmiC vault RNPs VAULTRCI , VAULTRC2, and VAULTReJ are
that have been thought to function in clustered at Sq31 and share -84% sequence
drug resistance identity
YRNA 4 components of the 60 kO Ro RNY1, RNY3, RNY4, and RNYS are clustered at
ribonucleoprotein, an important target of 7q36
humoral autoimmune responses
RNA GENES 279
MicroRNA > 70 families of -1000 multiple, important roles in gene see Figure 9.' 7 for examples of genome
(miRNA), related miRNAs regulation, notably in developm ent, organization, and Figure 9.16 for how they are
-22 nucleotides and implicated in some cancers synthes ized
Piwi-binding RNA 89 individua l > 15,000 often derived from repeats; expressed 89 large clusters distributed across the genome;
(piRNAJ.-24-31 clusters only in germ-line cells, where they individual dusters span from 10 kb to 75 kb
nucleotides limit excess transposon activity with an average of 170 piRNAs per cluster
Endogenous many probably often derived from pseudogenes, clusters at many locations in the genome
short interfering more than inverted repeats, etc.; involved in
RNA (endo 10,QOoa gene regulation in somatic cells and
siRNA), -21 -22 may also be involved in regulating
nucleotides some ty pes of transposon
long noncoding many > 3000 involved in regulating gene usually individual gene copies; transcripts often
regulatory RNA, expression; some are involved undergo capping, splicing, and polyadenylation
often> 1 kb in monoallelic expression but antisense regu latory RNAs are typically long J
(X-inactivation, imprinting), and/ or as transcripts that do not undergo splicing
antisense regulators (Table 9.1 1)
Rfam noncoding RNA families and sequence alignments httpjI rfam .sa nger.a c. u k!
sno/scaRNAbase smail nucleolar RNAs and small Cajal body-specific RNAs http://gene.fudan.sh.cn/snoRNAbase.nsf
p rRNAbank empirically known sequences and other related information http://p irnabank.i ba b.ac.i nJ
on piRNAs reported in various organisms, including human,
mOuse, rat, and Drosophifa
280 Chapter 9: Organization of the Human Genome
UAC GUA 14
Cy
s
[UGU
uGC
'" GeA
ACA 30- Genomic tRNA Database (see Tab le 9.9)
Ser Dec GG.l. as carrying that anticodon . Note that 12
l eu [UUA - UAA 7 tJCA-UGA 5 stop [OM - UUA - stoP [UGA-UCA - (3) of the 61 anticodo ns t hat could recog nize
VVG-eM 7 lieG - eGJI.. 4 stop [UAG - eUA Trp [UGG - ee A 9 the codons that specify the stand ard 20
amino acids are not represe nted in the
tRNAs (s how n by d ashes). Th is happens
[cuur'G 12 [CU?,GG 10 H- [eAU~ AUG - [CGU7 ACG 7
leu cee GAG - P eee GGG - IS CAe GUG 11 beca use of w obble at the third ba se
ro Arg eGC GeG
eUA - VAG 3 CCA - vGG 1 CGA - UCG 6 pOSition of most codons w here the th ird
CtiG-CAG 10 eeG - CGG 4
GIn [CM - VUG 11
G.G - CUG 20 eGG - CCG , base is U or C (excepti ons are fo r codon
pairs AU U/AU C, AAu/AAC, and UAUI
UAC). The (3) indicated by an asteris k
lie
[AUU7AAU
Aue
l4
GAV 3
I GUC
[GUU 7AAC 11-
GAe
[CU7~GC
Ala GCC ",GC
29- Asp [GAU
GAC
~ Aue
Gue 19 Gly
[GGU'
GGC
" Gce
ACC 15
a modified form of adenine known
as in osi ne, in which the am ino group
Va
Com ponent of minor spliceosome U11, U12, U4atac, and U5 snANAs U6atac snRNA
Bound core proteins Sm protein s (SmB, SmDl, 5mD2, SmD3, SmE, 5mF, 5mG) seven l sm proteins {lSM2-LSM8}
l ocation synthesized In the nucleus and then exported to the never leave the nucleus; undergo maturatio n In
cytoplasm. where they each associate with seven Sm the nucleolus and then accumulate in speckles to
proteins and undergo S' and 3' end processing; then perfo rm spliceosomal function
re·imported into t he nucleus to undergo m ore RNA
processing in Cajal bod ies, before accumulating in
speckles to perform spJiceosomal fun ction
Site-specific nucleotide performed by sca RNAs in the Cajal bodies of the nucleus performed by snoRNAs in the nucleolu s
modification
"A lth oug h not a spliceosomal sn RNA, U7 snRNA has a sim ilar structure to the Sm class of snRNAs and fi ve of its seven core proteins are identical to COre
proteins bou nd by the Sm snRN As.
a 3' stem-loop structure. The Sm site and the 3' stem elements are requi red
\
Lsm site
assembly of a ring of the seven Sm core proteins (see Table 9. 10). Th e TMG ca p ,
and the assembled Sm core proteins are required for recogn ition by the nuclear
MPG cap
g uanosine (MPG) ca p and a 3' stem, and termin ate in a stretch of uridi ne
uuuu·J
resid ues (the l sm site) th at is bound b y the seven lsm core proteins.
282 Chapter 9: Organization of the Human Genome
some 15. Nonstandard functions are known or expected for some snoRNA genes
that do not have sequences complementary to rRNAsequences. For example, the
HBIl-52 snoRNA has an 18-nucleotide sequence that is perfectly complementary
to a sequence within the HTR2C (serotonin receptor 2c) gene at Xp24, and
regulates alternative splicing of this gene. The neighboring HBll-85 snoRNAs
have recently been implicated in the pathogenesis of Prader-Willi syndrome
(OMIM 176270).
Small Cajal body RNA genes
The scaRNAs resemble snoRNAs and perform a similar role in RNA maturation,
but their targets are spliceosomal snRNAs and they perform site-specific modifi
There are at least 25 human genes, each specifying one type of scaRNA. Like
snoRNA genes, the scaRNA genes are typically located within the ilurons of genes
RNA interference (RNA!) Is an evolutionarily ancient mechanism t hat is argonaute enzyme will cleave the RNA, causing it to be degraded. Viral
used in animals, plants, and even single-celled fungi to protect cells and transposon RNA can be inactivated in this way.
aga inst viruses and tra nsposable elements. Both viruses and active Another class of argonaute complex is the RNA-induced
transposable elemen ts can produce long double-stranded RNA, at transcriptional Silencing (RITS) complex. Here, the single-stranded RNA
least transiently during their life cycles. Long double-stranded RNA is guide binds to compleme ntary RNA transcripts as they emerge during
not normally found in cells and, for many organisms, it triggers an RNA transcription by RNA polymerase II. Thi s allows the RrTS com plex
interference pathway. A cytoplasmic endoribonuclease called dicer to position itself on a specific part of the genome and then attract
cuts the long RNA into a series of short double-stranded RNA pieces proteins such as histone methyltransferases (HMTs) and sometimes
known as short Interfering RNA (siRNA). The siR NA prod uced is on DNA methyltransferases (DNMTs), which covalently modify histones in
ave rage 21 bp long, but asymmetric cutting produces two-nucleotide the immediate region. Th is process eventually causes heterochromatin
overhangs at their 3' ends (Figurf: 1). formation and spreading; in some cases, the RITS complex can
The siRNA duplexes are bound by different complexes that induce DNA methylation. As a result, gene expression can be
contain an argonaute-type endoribonuclease (Ago) and some other silenced over long periods to limit, for example, the activities
proteins. Thereafter, the two RNA strands are unwound and one of of transposo ns.
the RNA strands is degraded by argonaute, leaving a single-stranded Although mammalian celis have RNA interference pathways,
RNA bound to the argonaute complex. The argonaute complex is now the presence of double-stranded RNA triggers an interferon response
activated; the single-stranded RNA will gu ide the argonaute complex that causes nonspeCific ge ne silenci ng and cell death. This is
to its target by base-pairing with complementary RNA seq uences in described in Chapter 12 when we consider using RNA interference
the cells. as an experimental tool to produce the speCific silenci ng of pre
One type of argonaute complex Is known as the RNA-induced selected target genes. In such cases, artificially synthesized short
silencing complex (RiSe). In this case, after the single--stranded guide double-stranded RNA is used to trigger RNAi-based gene
! dlcer
c9 "--- cQJ
Ago+ othe~
RISC protei~s J-. siRNA r Ago + other
RITS proteins
@ ~
activated RiSe act'lvated RITS
1 I
~V>-DNA
cap - - - - - - -- - - poIy(A)
1
cap ---""I£~ ...... - - - polY(A)
1
cap - - -- -
1-+
- -_ _ _ poly(A)
1 1
histone met~lation
DNA methylation
degraded RNA
1
transcription repressed
represent a small component of what are a huge number of different tiny regula
toryRNAs made in animal cells. In manunals, two additional classes oftiny re gu
latory RNA were first reported in 2006, and these are being intensively studied.
Because huge numbers of different varieties of these RNAs are generated from
multiple different locations in the genome, large-scale sequen cing has been
Piwi-protein-interacting RNA
24-31 nucleotides long; they are thought to have a major role in limiting transpo
sition by retrotransposons in mammalian germ-line cells, but they may also reg
required because by integrating into new locations in the genome, active trans
pasans can interfere with gene function, causing genetic diseases and cancer.
Figu~ 9 .1' Human miRNA synthesis. (A) General scheme. The primary transc ript. pri-miRNA, has a 5' cap
(m' GpppG) and a 3' poly(A) tail. miRNA precursors have a prominent double-stranded RNA structure (RNA
<AI
hairp in), and processing occurs through the actions o f a series of ribonuclease complexes. In the nucleus,
nucleus Anasen, the human homolo g of Drosha, cleaves the pri-miANA t o release the hairpin RNA (pre-miANA); this
is then exported to the cyto plasm, where it is <:leaved by the enzyme dicer to produce a miRNA duplex.
The duplex RN A is bound by an argonaute complex and the helix is unwound, whereu pon o ne strand (the
passenger) is degraded by the argonaute ribonuclea se, leaving the mature miRNA (the guide strand) bound
! RNA pol II
to argonaute. miA, miRNA gene. (8) A specific example: the synthesis of human miR-26a 1. Inverted repeats
(show n as highlighted sequences overlined by long arrows) in the pri-miRNA undergo base pairing to form
a hairpin, usually with a few mism atches. The sequences that will form the mature guide strand are show n in
~!t
red; those of the passenger strand are show n in blue. Cleavage by both the human Drosha and dicer (green
arrows) is typicalJy asymmetric, leaving an RNA duplex with overhanging 3' dinucleotides .
pri-mIRNA
(B)
----------------------4) ~(-----------------------
cap poly{A) cap -If- GUGGCC UC GuuCAAGUAAUCCAGGAUAGGCUGUGCAGGUCCCAAUGGGC CUlI.UUCUUGGUUAC UUGCAC GGGG.t..CG': -{~ AAA.M
ca p '7:; l
~~
9 U U 9 c a 9
~ 9Ug ccu cg '.l Ca'tpHII,'~ "';'-~'~"'!JI\'~~C:l! gu 9
pn-miRNA
pre-miRNA III 111 11 1 111 111 1 111 11 1111 111 11 u
, ........ cgc g9"9 ) .'1. gl.llK"a:.lLJ -,i!]'J"'~'J""JC~ 9Ij1 ua a c
AAAAA;>-"'''' a l ec
! RNASEN complex
1 cytoplasm
U
IJ C"''''IJU :'')
IJ
~r;b-CJ-?4UC'l~gC'_~ gu
I
'" g
c a
g9
.JU 1 pre.miRNA 1"il
I 11 11 11 1 11 ' 11 1111 ' III i
..J " ... ~ ..I .. .;q '... ..... :.l !_c .:- ggud a c
u
pre-miRNA
\IT e t cc
! DICER compleK ! DICER complex
u u
JL miRNA duplex
5' II c,',.g-...;.~
i 111 111 1 11 11111 11 1
i: ~>;I<}l1ud';N· .• u 3'
miRNA duplex
IA)
Figure 9.17 The structure of human pri
miR-21
miRNAs. (A) Examples of transcri pts th at are
used exclu sively to make miRNAs: miR-2 1
cap - - - - -- - - ' ' -- - - - po ~IA) is produced from a single hairpin within a
dedicated primary transcript RNA;
miR-17-18-19a- 20-19b-1-92- 1 a single multigenic transcript w ith six
hairpins that will eventu ally be cleaved to
oap _ _ _ __
poly(A) give six miRNAs, namely miR-17, miR-1S,
miR·19a, and so o n. (8, C) Examples of
m iRNAs that are co·transcribed with a gene
IB) encoding either (8) a long noncodi ng RNA
miA·155 in BIG neRNA (n cRNA) or (Cl a polypeptide. In each part,
the upper e)(ample shows single miRNAs
- - - - poly(A)
located w ithin (B) an exon of an ncRNA
(miR·1SS) and (C) in the 3' untranslated
regi on (UTR) within a terminal exon of an
miR-15a-16-1 in DLEU2 ncANA intron
mRNA (mi R-198). The lower examples show
poly(A) multip le miRNAs located within intronic
sequences of (8) an ncRNA (miR- 1Sa and
miR-16-1) and (Cl a pre-mRNA (m iR- 106b,
miR-93, and miR-25). Cap, m 7G(S')ppp(5')
IC) G. (Adapted from Du T & Zamore PD
miR-198 in FSTLI mRNA 3' UTR
(2005) Development 132, 4645-4652. With
cap poIYlA) permiSSion from th e Company of Biologists.]
ca p
J\ f JlRJl ,,~ prny(A)
• Y\........J
More than 15,000 different human piRNAs have been identified and so the
piRNA family is among the mos t diverse RNA family in human cells (see Table
9.8). The piRNAs map back to 89 genomic intervals of about 10-75 kb long (for
more information see the piRNAbank database; Table 9.9). They are thought to be
cleaved from large multigenic transcripts. In humans, the multi-piRNA tran
scripts contaln sequences for up to many hundreds of different piRNAs.
piRNAs are thought to repress transposition by transposons through an RNA
interference pathway, by association with piwi proteins, which are evolutionarily
related to argonaute proteins (Figure 9.18).
Endogenous siRNAs
Long double-stranded RNA in mammalian cells triggers nonspecific gene silenc·
ing through interferon pathways, but transfection of exogenous synthetic siRNA
duplexes or short hairpin RNAs induces RNAi·mediated silencing of specific
genes with sequence elements in common with the exogenous RNA. As we will
see in Chapter 12, this is an extremely important experimental tool that can give
valuable information on the cellular functions of a gene. Very recently, it has
become clear that human cells also naturally produce endogenous siRNAs (endo·
siRNAs).
[n mammals, the most comprehensive endo·siRNA analyses have been per
formed in mouse oocytes. Like piRNAs, endo·siRNAs are among the most varied
RNA population in the cell (many tens of thousands of different endo·siRNAs
have been identified in mouse oocytes). They arise as a result of the production
of limited amounts of natural double-stranded RNA in the cell. One way in
which this happens involves the occasional transcrip tion of some pseudogenes
[Figure 9.19J.
RNA GENES 287
r ..
sense
transpas!ln
I ~-
T?
...
piRNA cluster
an1isar.se
trilflSp05Qll
from long RNA precursors transcribed
from defined loci called piRNA clusters.
Any transposon inserted in the reverse
orientation in the piRNA cluster can give rise
to antisense piRNAs (shown in red),
(8) Antisense piRNAs are incorporated into
(B) amplification
loaded piwi
! a piwi protein and direct its slicer activity on
sense transposon transcripts. The 3' cleavage
product is bound by another piwi protein
and trimmed to piRNA size. This sense piRNA
protein is, in turn, used to cleave piRNA cluster
"\
transposon
/ transcript
cleaved
complexes to cDNA for DNA methylation
(left) and/or histone modification (right).
DNMT, DNA methyltransferase; HMT, histone
tca",poson ~
.--.:: transcriPt ,_ methyltransferase; HP1, heterochromatin
" X /
piRNA cluster
transcript
~~';~tone
" "~,cat,ons
(C) repression
1 cytoplasm
RNA that is cleaved by dicer to give si RNA.
Endogenous siRNAs can also be produced
from duplicated inverted seq uences such
mRNA
SOliilili.I ••• li + I'
\
antisense transcript
I J II'
I
I II I I I I 3'
1IIIIIiJrO
1 as th e example shown here of an inverted
dup licati on of the p se ud ogene ('VA 'fA) at
the right. Transcription throug h both copies
of the pseud ogene results in a long RNA with
invened repea ts (blue, overlined arrows)
11111111111111 hairpin
causing the RNA to fold into a ha irpin that
is cleaved by dicer to give siRNA. In either
ca se, the endogenous siRNAs are guided
by Ri se to interact with, and degrade, the
parent gene's remain ing mRNA transcripts.
11111/1 111111 Gree n arrows indicate DNA rearrangements.
[Adapted from Sasidharan R & Gerstein M
21-nucleotide
(2008) Nature 453, 729-731. With permi ssion
siRNA
from Macmil lan Publishers ltd.)
mRNA Cleavage
H19 2.3 kb llp15 5 exons spanning 2.6 7 kb involved in imprinting at the II pIS imprinted cluster
associated with Beckwith-Wiedemann syndrome
KCNQTOT! (= LIT!) 595 kb llp1 5 1 exon antisense regulator at the imprinted cluster at 11 pI 5
PEG3 1.8 kb a 19q13 variable number of exons but maternally imprinted and known to function in tumor
up to 9 exons spanning 25 kb suppression by activating p53
HOTAIR 2.2 kb 12q13 6 exons spanning 6.3 kb trans-acting gene regulator; although part o f a
reg ulatory region 1n the HOX-C cl uster on 12q13, HOTAIR
RNA re presses transcription of a 40 kb region on the
HOX-D cluster on chromosome 2q31
alargest isoforms.
HIGHLY REPETITIVE DNA: HETEROCHROMATIN AND TRANSPOSON REPEATS 289
Satellite 3 ATICC/GGAAT 13p, 14p, lSp, 21 p, 22p, and heterochromatin on lq, 9q,
andYq12
,.
DYZ19 125 bp -4<10 kb at Yq 11
~
Microsatellite DNA < 100 bp often 1-4 bp widely dispersed throughout all chromosomes
irfhe distinction between satellite, min isatellite, and microsatellite is made on the basis of the total array length, not the size of the repeat unit.
bSate!lite DNA arrays {hat consi st of simple repeat units often have base compositions that are radically different from the average 41% GTe
(and so cou ld be isolated by buoyant density gradient cen trifugation, when they would be differentiated from the main DNA and appear as satellire
bands-hence the name).
290 Chapter 9: Organization of the Human Genome
lINE-1 family
LlNE-2 family -370,000
ORFl on"". members of any of the illustrated transposen
LlNE-3 family -44,000 - ~:;~~1-- 1AI" ~
- ~p'--DL fam ilies may be capable of transposing;
6~kb many have lost such a capacity after
SINEs
__ acquiring inactivating mutations, and many
-1,200 ,000
(A~"
Alu family
MIR -450.000 100-400 bp are short truncated copies. Subclasses of
MIR3 -85,000 the four main families are listed, along with
sizes in base pairs. ORF, open reading frame.
retrovirus-like (LTR transposons) LTR LTR [Adapted from International Human Genome
HERV familie s -2 40,000 P &>8 LI ("~ / LTR (l',i! lTR Sequencing Consortium (2001) Nature 409,
MaLR -285,000 I» 6-ll kb a OJ ~ 860-921. With permission from Macmillan
1.5-3 kb
Publishers Ltd.]
DNA transposon foss ils
MERl (Charlie) -213,000 ~ : 1.rft'I''tX,~,,- 4
............ )-------4
Of the three human LINE families, LINE-I (or Ll) is tile oruy family that con
tinues to have actively transposing members. LINE-I is the most important
human transposable element and accounts for a higher fraction of genomic DNA
(17%) man an y orher class of sequence in me genome.
Full-length LINE-I elements are more man 6 kb long and encode two pro
teins: an RNA-binding protein and a protein with both endonuclease and reverse
transcriptase activities (Figure 9.21A). Unusually, an internal promoter is located
within the 5' untranslated region. Full-length copies mere fore bring with them
meir own promoter mat can be used after integration in a permissive region of
me genome. After translation, the LINE-I RNA assembles with its own encoded
proteins and moves to the nucleus.
To integrate into genomic DNA, me LINE- I endonuclease cuts a DNA duplex
on one strand, leaving a free 3' OH group that serves as a primer for reverse tran
scription from the 3' end of the LINE RNA. The endonuclease's preferred cleavage
site is TTTTJ,A; hence the preference for integrating into AT· rich regions. AT-rich
DNA is comparatively gene-poor, and so because LINEs tend to integrate into
AT-rich DNA they impose a lower mutational burden, making it easier for their
host to accommodate them. During integration, the reverse transcr iption oft.en
fails to proceed to me 5' end, resruting in truncated, nonfunctional insertions.
Accordingly, most LINE-derived repeats are short, with an average size of900 bp
for all LINE·] copies, and oruy about] in 100 copies are full length.
The LI NE· 1 machinery is responsible for most of the reverse transcription in
tile genome, allowing retrotransposition of me nonautonomous SINEs and also
of copies of mRNA, giving rise to processed pseudogenes and retrogenes. Of the
6000 Or so fuIl·length LI NE- l sequences, about 80-100 are still capable of trans
posing, and mey occasionally cause disease by disru pting gene function after
insertion into an important conserved sequence.
Alu repeats are the m05t numerous human DNA elements and
originated as copies of 7SL RNA
SINEs (short interspersed nuclear elements) are retrotransposons about
100-400 bp in lengdl. They have been very successful in colonizing mammalian
genomes, resulting in various interspersed DNA families, so me with extremely
high copy numbers. Unlike LINEs, SINEs do not encode any proteins and they
cannot transpose independently. However, SINEs and LINEs share sequences at
their 3' end, and SINEs have been shown to be mobilized by neighboring LI NEs.
By parasitizing on the LINE element transposition machinery, SINEs can attain
very high copy numbers.
The human AJu family is the most prominent SINE family in terms of copy Fi gure 9.11 The human LlNE~' and Alu
number, and is the most abundant sequence in the human genome, occurring on repeat elements. (A) The 6.1 kb UNE·'
average more than once every 3 kb. The full-length AJu repeat is about 280 bp element has two open reading frames: ORF1,
a 1 kb open reading frame, encodes p40,
long and consists of two tandem repeats, each about 120 bp in length fOllowed by
an RNA-binding protein that has a nucleic
a short An/Tn sequence. The tandem repeats are asymmetric; one contains an acid chaperone activity; the 4 kb ORF2
internal 32 bp sequence that is lacking in the other (Figure 9.21B). Monomers, specifies a protein with both endonuclease
and reverse transcriptase activities. A
(AI LlNE'l repeat element bidirectional internal promOter lies within
the 5' untranslated region (UTR). At the
(;~
ORFl ORF2 3' UTR other end, there is an AnfTn sequence, often
I
~
,A
described as the 3' poly(A) tail (pA). The
UNE-l endonuclease cuts one strand of a
DNA duplex, preferably w ithin the sequence
! hi
! nnJ. A, and the reverse transeri ptase uses
the released 3' -OH end to prime eDNA
symhesis. New insertion sites are flanked
, 40 endonuclease reverse by a small target site duplication of
transcriptase
2-20 bp (flanking black arrowheads).
(B) An Alu dimer. The two monomers have
(8) Alu repeat element similar sequences that terminate in an
left monomer right monomer Anrrn sequence but differ in size beca use of
~I II I
A 8
~:> L..J
32 bp
AAAAAA'
TTTTTT ·
> the insertion of a 32 bp element within the
larger repeat. Alu monomers also exist in
the human genome, as do variOu S truncated
copies of both monomers and dimers.
CONCLUSION 293
containing only one of the two tandem repeats, and various truncated versions
of dimers and monomers are also common, giving a genomewide average of
230 bp.
Whereas SINEs such as the MIR (mammalian-wide interspersed repeat) fami
lies are found in a wide range of mammals, the A1u family is of comparatively
recent evolutionary origin and is found only in primates. However, A1u sub
families of different evolutionary ages can be identified. In the past 5 million or
so yea rs since the divergence of humans and African apes, only about 5000 cop
ies oftheAlu repeat have undergone transposition; the most mobileAlu sequences
are members of the Yand S subfamilies.
Like other mammalian SINEs, A1u repeats originated from cDNA copies of
small RNAs transcribed by RNA polymerase III. Genes transcribed by RNA
polymerase III often have internal promoters, and so cDNA copies of transcripts
carry with them their own promoter sequences. Both the Alu repeat and, inde
pendently, the mouse Bl repeat originated from cDNA copies of 7SL RNA, the
short RNA that is a component of the Signal recognition particle, using a retro
transposition mechanism like that shown in Figure 9.12. Other SINEs, such as the
mouse B2 repeat, are retrotransposed copies of tRNA sequences.
Alu repeats have a relatively high GC content and, although dispersed mainly
throughout the euchromatic regions of the genome, are preferentially located in
the GC- rich and gene-rich R chromosome bands, in striking contrastto the pref
erentiallocation of LINEs in AT-rich DNA. However, when located within genes
they are, like LINE- 1 elements, confined to introns and the un translated regions.
Despite the tendency to be located in GC-rich DNA, newly transposing A1u
repeats show a preference for AT-rich DNA, but progressively older A1u repeats
show a progressively stronger bias toward GC-rich DNA.
The bias in the overall distribution of A1u repeats toward GC-rich and, accord
ingly, gene-rich regions must result from strong selection pressure. It suggests
that A1u repeats are not just genome parasites but are making a useful contribu
tion to cells containing them. Some Al u sequences are known to be actively tran
scribed and may have been recruited to a useful function. The BCYRNl gene,
which encodes the BC200 neural cytoplasmic RNA, arose from an A1 u monomer
and is one of the few A1u sequences that are transcriptionally active under nor
mal circumstances. In addition, the A1u repeat has recently been shown to act as
a trans-acting transcriptional repressor during the cellular heat shock response.
CONCLUSION
In this chapter, we have looked at the architecture of the human genome. Each
human cell contains many copies of a small, circular mitochondrial genome and
just one copy of the much larger nuclear genome. Whereas the mitochondrial
genome bears some similarities to the compact genomes of prokaryotes, the
human nuclear genome is much more complex in its organization, with only
l.l % of the genome encoding proteins and 95% comprising nonconserved, and
often highly repetitive, DNA sequences.
Sequencing of the human genome has revealed that, contrary to expectatio n,
there are comparatively few protein -coding genes-about 20,000- 21 ,000 accord
ing to the most recent estimates. These genes vary widely in size and internal
organization, wi th the coding exons often separated by large introns, which often
contain highly repetitive DNA sequences. The distribution of genes across the
genome is uneven, with some functionally and structmally related genes found
in clusters, suggesting thatthey arose by duplication of individual genes or larger
segments of DNA. Pseudo genes can be formed when a gene is duplicated and
then one of the pair accumulates deleterious mutations, preventing its expres
sion. Other pseudo genes arise when an RNA transcript is reverse transcribed and
the cDNA is re-inserted into the genome.
The biggest surprise of the post-genome era is the number and variety of non
protein-coding RNAs transcribed from the human genome. At least 85% of the
euchromatic genome is now known to be transcribed . The familiar ncRNAs
known to have a role in protein synthesis have been joined by others that have
roles in gene regulation, including several prolific classes oftiny regulatory RNAs
and thousands of different long ncRNAs. Our traditional view of the genome is
being radically revised.
294 Chapte r 9: Org ani zati on of the Human Geno m e
FURTHER READING
Human mitochondrial genome Co nrad B& Antonarakis SE (2007) Gene du pli ca ti o n: a d rive for
ph enot ypi c diversity and cause of hum an di sease. Ann u. Rev.
Anderson 5, Ba nkier AT, Barre ll BG et al. (1981) Sequence and
Genomics Hum. Genet. 8, 17-35.
organization of th e hum an mitochondrial genome. Nature
Kaess man n H, Vin ckenbosch N & Long M (2009) RN A- based gene
290,4 57-465.
dupl ication: mechanistic and evo lutionary in sig ht s. Not. Rev.
Chen XJ & Butow RA (2005) Th e orga nization and inheritance of
the mitochon d ria l geno me. Nat. Rev. Genet. 6, 81 5- 825. Genet. 10, 19- 31.
Linardopoulou EV, William s EM, Fan Y et al. (2005) Human
Falkenberg M, Larsson NG & Gus tafsson CM (2007) DNA
subtelo meres are hot spot s of interchromosomal
replication and t ra nscript ion in mammalian mitocho ndria.
reco mbination and segmental duplication. Nature 437,
Annu. Rev. Biochem. 76, 679- 699.
94-1 00.
MITOMAP: human mitoc hondrial genome database. http://www
Redon R, Ishikawa 5, Fitch KR et al. (2006) Globa l variatio n in copy
.mitomap.org
number in the hum an genom e. Nature 444, 444-454.
Wallace DC (2007) Why do we still have a maternally inherited
Tuzun E, Sharp AJ, Bail ey JA et al. (2005) Fine-scale st ructura l
mitocho ndrial DNA? Insig hts from evolutio nary m ed icine.
vari at ion of t he human genome. Nat. Genet. 37, 727- 73 2.
Annu. Rev. Biochem. 76, 78 1-821.
KEY CONCEPTS
• Mod el organisms have long been used for understanding facets of basic biology and for
applied research, with important roles in modeling disease and helping evaluate potential
therapies. Genome sequences for many model organisms are now available.
• Sequences from different genomes can be aligned. Comparing genome sequences provides
genomewide information on the related ness of DNA sequences and inferred protein
sequences, and gives an insight into the processes shaping genome organization and
genome evolution.
• Many functionally important sequences are subject to purifying (negative) selection
whereby harmful mutations are selectively eliminated; these sequences are said to show
evolutionary constraint and seem to be strongly conserved.
• Some mutations in functionally important sequences are subject to positive selection
because they are beneficial. Such sequences underlie adap tive evolution.
• Ne utral mutations, which are neither harmful nor beneficial, can become fixed in the
populatio n through random genetic drift.
• Evolutionary novelty often arises through duplicated genes; after duplication, genes can
diverge in sequence and ultimately in function.
• Genes that originated by duplication of a gene within a single genome are known as
paralogs; orthologs are homologous genes in different species that descended from the
same gene in a common ancestor.
• One Or both members of a duplicated gene pair can acquire a new function. Changes in
gene function can be due to mutations in the cis-regulatory elements that regulate the
gene's expression.
• Comparative genomics has contributed to the assessment of evolutionary relationships
between organisms (phylogenetics).
Morphological evolution is largely driven by mutations in cis-regulatory elements rather
than differences in protein sequences.
• Fragments from transpos able elements can be co-opted to make new functionally
important sequences (exaptation); they can give rise to newexons (exonization) or to new
cis-regulatory elements.
• Closely related species such as humans and chimpanzees have almost identical gene
repertoires, but there are important differences between them in copy number of some
gene families, in differential inactivation or modification of some orthologs, and in
regulation of gene expression.
298 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
Now that we have the sequence of the human genome, biologists ultimately hope
to get detailed answers to two big questions. First, at the molecular level, how
does genome information program the normal functioning of cells and organ
isms? The corollary to this, of course, is: How does genome information become
altered or misinterpreted to lead to hereditary diseases? Second, what makes
individuals and species different? Getting detailed answers to these two funda
mental questions will take a long time.
Obtaining genome sequences seemed to be an extremely arduous enterprise
in the pre-genome era of the 1990s and early 2000s. In the post-genome era, as
next-generation sequencing technologies become routine, we will soon be inun
dated with genome sequences from a vast array of organisms and from multiple
individuals within species. With hindsight, we can now recognize that genome
sequencing was the easy part, and just the beginning of a fascinating voyage of
discovery.
To start to answer the question of how our own genes operate, we need to
identify all the genes and gene products, all the regulatory elements, and all the
other functional elements. In Chapter 12, we look at how gene function and gene
regulation are studied in cells and model organisms. Genome sequences also
give us the opportunity to compare representative individuals within and between
species. In Chapter 13, we consider variation between individual humans. In this
chapter, we address va riation between species. We will consider how humans
relate to other organisms in the tree of life and the forces that have shaped genome
evolution.
Sequence compatisons across whole genomes from different organisms are
now the mainstay of a new discipline, comparative genomics, and they are begin
ning to provide powerful new insights into our relationship to other organisms.
We have a major interest in a variety of model organisms from both basic and
applied research perspectives.
(-'1 ::.--_....' \ • in the gut of humans (and other 10)lm in diameter. It has one
'. ,
,. '....... l,I
/ /
.I
'/
vertebrates) in a symbiotic
relationship (they synthesize
large chloroplast and mUltiple
mitochondria, and uses two
II /lJ' ') ,
''-'1'',Q_
.1 - '
vitamin Kand B·com plex vitamins,
which their hosts gratefullY
absorb). Through intensive studies
anterior flagella for propulsion and
mating. It is amenable to genetic
analyses, and many mutants have
been characterized; it has provided
over decades we have built up
more knowledge of E. (01/ than a useful model for understand ing
of any other type of celt, and how eukaryotic flagella and basal
most of our understanding of bodies function. Cilia and flagella
the fundamental mechan isms of both orig inale from basal bodies,
life. including DNA replicatfon, so Chlamydomonas has been
DNA transcription, and protein valuable in helping identify and
synthesis, has come from studies study genes involved in huma n
of thiSorganism. ciliary disease.
cals). Other unicellular organisms have been studied because they are patho
gens, notably diverse bacterial species.
Archaea are not known to be associated with disease, and the major interest
in studying them is to know more about how they evolved to be so different. They
were initially found in unusual, often extreme environments, such as at very high
temperatures, in waters of extreme pH or salinity, or in oxygen-deficient muds of
marshes. They are now known to inhabit more familiar envi ronments such as in
soils and lakes and have been found thriving inside the digestive trac ts of cows,
termites, and marine life, where they produce methane. The metabolic and
energy conversion systems of archaea resemble those of bacteria but the systems
used to handle and process genetic information (D NA replication, transcription,
and translation) are more closely aligned to those of eukaryotes.
To model eukaryotic cell functions, most reliance has been placed on yeasts
because they are easy to culture and genetically very amenable, and because cer
tain key molecules are known to have been strongly conserved in eukaryotes,
from yeasts to mammals. Yeasts are common on plant leaves and flowers, in soil,
and in salt water, and are also found on the skin surface and in the intestinal
tracts of warm-blooded animals, where they may live symb iotically or as para
sites. They typically replicate asexually by budding rather than by binary fission:
the cytoplasm and dividing nucleus from the parent cell are initially a continuum
with the bud, or daughter yeast, before a new cell wall is deposited to separate the
two. Usually, yeasts are not associated with disease, but some Candida species
can cause co mmon health problems, including candidiasis (thrush).
Two genetically amenable yeasts, Saccharomyces cereuisiae and
Schizosaccharomyces pombe, have been particularly well studied (see Box 10.l ).
Genetic SCreens have provided a host of valuable mutants that have been partic
ularly valuable in understanding aspects of the cell cycle and DNA repair that are
very relevant to our understanding of human cells and cancer and birth defects.
Various protozoan models are studied, including amebae, flagellates, and cili
ates that move using, respectively, pseudopodia, flagella, and cilia. They are of
interest to biomedical researchers as models of various facets of cell and develop
mental biOlogy. Many are also parasites associated with disease, acting as hosts
for pathogenic bacteria that cause diseases such as legionnaire's disease, salmo
nellosis, and tuberculosis, or they may cause disease directly as in some trypano
somes and malaria- causing Plasmodium species. As models of cell and develop
mental biology, most interest has been focused on the ameba Dictyostelium
discoideum and the ciliate Tetrahymena thermophila.
Algae are plant-like organisms that exist in unicellular and multicellular
forms. The most relevant uniceUular model is Chlamydomonas (see Box lO.l).
With the advent of massively parallel DNA sequencing (see Chapter 8, p. 220) ,
unicellular genome sequencing is now routine. As a result, it is now possible to
conduct a comprehensive investigation of the human microbia me, the collective
term for the microorganisms that live and interact inside and on humans (resid
ing mostly on the skin and in the digestive tract, and outnumbering human cells
by a factor of!O or so). As an extension to the Human Genome Project, an inter
national effort oflinked projects called the Human Microbiome Project has been
set up to sequence diverse genomes within the human microbiome.
There is also the expectation that synthetic model organisms will be created in
the near future. The first cellular genome to be completely synthesized-that of
Mycoplasma genitalium-was reported in early 2008. The 0.53 Mb synthetic
genome was assembled by a variety of genetic engineering procedures, but ulti
mately derived from approximately 10,000 synthetic oligonucleotides about 50
nucleotides in length. TWs achievement came one year after another important
breakthrough: the first successful case of cellular genome transplantation. One
species, Mycoplasma capricolum, was changed into another, Mycoplasma
mycoides, by removing the host cluomosome from M. capricolu.m and replacing
it with an equivalent chromosome isolated from M. mycoides.
Together with the ability to mutate genes, the above breakthroughS herald the
start of a new era of synthetic biology that will lead to the production of artificial
life. Synthetic unicellular organisms llold great promise for biomedical researc h
and the possibility of developing novel solutions to environmental problems,
such as the production of green biofuels and breaking down toxic waste. However,
MODEL ORGANISMS 301
even with inbuilt protection systems, there are important safety issues and the
threat ofbioterrorism on a large scale.
Various fish, frog, and bird models offer accessible routes to the
study of vertebrate development
Mammalian development is not easy to analyze. First, egg cells are very small and
difficul t to manipulate. Second, developing embryos implant into the uterine
epithelium; de velopment then proceeds inside the mother, making it difficult to
a.ccess material for study. Fish, frogs, and birds produce eggs that can be large
and easy to manipulate, and the subsequent development proceeds externally to
the mother. For these reasons, various non-mammalian vertebrate models are
ro utinely studied to illuminate our understanding of vertebrate development.
302 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
Caenorhabditis e/egans is a roundworm 1 mm long . It ca n be The fru it fly Drosophila melanogaster has been studied extensively for
cultured easily in the laboratory, on agar plates (feeding on bacteria) decades, during which a va st amount of information has been built
or in liquid culture, and is very amenable to genetic anal yses. The up about gene function. Some of the major features and approaches
predominant sex is hermaphrodite (XX), a modified female. By that have assisted genetic mapping and functional analyses are listed
producing both sperm and eggs it can self-fertilize, resulting in below.
increased homozygosity. Rarely, a male (XO) form develops through
(Courtesy of Andre
occasional loss o f one of the two hermaphrodite X chromosomes, and
Karwath under the Creative
hermaphrodites w ill mate preferentially with males when available.
Commons Attribution
Several areas of study take advantage of the unique featu res of
ShareAlike 2.5 license.)
C. elegons:
Two types of small freshwater fish have been popular developmental models:
the zebrafish, which originated in India, and more recently the medalea (Japanese
killifish- Box 10.3). In addition, the genomes of pufferfish are remarkably free of
repetitive DNA. and two such species-Takijugu rubripes and Tetraodon nigro·
uiridis-have been useful in aiding the discovery of genes and regulatory ele
ments by comparative genomics.
Frogs of the genus Xenopus (African clawed frog) have been particularly
important models for investigating both embryonic development and cell bioi·
ogy. The name Xenopus means strange foo t and originates from the sharp claws
on the toes of the large, strong, webbed hind feet. Xenopus has been an important
MODEL ORGANISMS 303
TABLE 10.1 PRINCIPAL REASONS FOR STUDYING MAJOR INVERTEBRATE MODEL ORGANISMS
Organism class Principal models Major rea sons why studied
parasitic Wuchereria bancrofti (causes elephantiasis) to understand pathogenesis and for economic reasons (some
Onchocerca volvulus (causes river blindness) nematodes attack (rops)
Arthropods nonpathogenic Drosophila mefanogoster (fruit fly) model of development behavior, and neuroscience; also for
gaining insights into gene function, modeling some human
diseases at the cell ular level, and rapid drug screening
Danaus pfexippus (monarch butterfly) to study the cell ula r and molecular mechanisms underlying a
sophisticated circadian clock
pathogenic Anopheles gambiae (tra nsmits Plasmodium these mosquitoes can transmit a variety of different parasites
(aleiporum, which causes malaria) and viruses
Aedes aegypti (transmits yellow fever and
Dengue fever viruses)
Echinoderm s sea urchins, notably Strongylocentrotus models of development, and as basic deuterostomes
purpurotus
Ascidians Ciona intestina lis (sea squirt) (Figu re 10.7B) model of development, and as a primitive chordate
Cephalochordates Amphioxm (Figu re 10.7A) model of development, and because evolutionarily closely
related to vertebrates
model for establishing the mechanisms for early fate decisions, patterning of the
basic body plan. and organogenesis. There has also been seminal work on chro
mosome replication, chromatin and nuclear assembly. cell cycle components,
cytoskeletal elem ents. and signaling pathways.
Birds have amniotic membranes, and avian development closely resembles
that of mammals. However. where as a mammalian embryo depends on the
mother for its nutrition (with exchange occurring across the placenta). avian
embryos do not have a placenta and so are self-developing systems. Because a
bird embryo develops outside the body. it is accessible at all stages in develop
ment. The chick is a good model largely because its embryo is easily obtained
(see Box 10.3).
-~-
-
~-
has a sh ort generation
time, and large numbers
embryo is easily obta ined
and also has the advantage of
...r -
of eggs are produced at being very large and relatively
each mating. Fertilization is translucent, making delicate
external, so that all aspe<:ts o( microsurgical man ipulations
development are accessible; the embryo Is transpa rent, facilitating easy. it thus offers an excellent
the identification of developmenta l mutants. The hig h conservation system in which molecular
of genes that are importa nt in vertebrate development means that studies can be combi ned with
human developmental control genes normally have easi ly identifiable classical embryology. Popu lar
orthologs in zebrafish. Large-scale mutagenesis screens have experimental man ipulations
produced many valuable developmenta l mutants, and some of of chick embryos in clude
these have been used to model human di sorders. Antisense surgical manipulations and
tech nology and/or protein-based translation blocking are also widely tissue grafting, retrovirus
used to inactiva te specific genes. mediated gene transfer,
electroporation of developing
2.0 Generic license.) embryos. and embryo culture.
The medaka (Oryzias /aripes ) Rapid advances are being
is a distant relative of the made in chicken transgenics,
zebrafi.sh (the two most likely embryonic stem (ES) cell technology, and cryopreservatio n of sperm,
separated from a common blastod isc cells, primordial germ celJs. and ES cells.
ancestor more than
11 0 million years ago) and is The mouse (Mus musculus)
also well suited to genetic and is particularly well suited
embryological analyses, with to genetic studies and is an
(Courtesy of Seata ro under the a short generation time extensively used model of
Creative Commons Attribution (2-3 months). inbred strains, mamma lian development.
ShareAlike 3.0 license.) genetic maps, transgenesis. Its small size and sho rt
en ha ncer t rapping, and generation time have allowed
availability of stem cells. large-scale mutagenesis
progra ms and extensive
Xenopus is a genus of Africa n «(ourtesy of Rana under the genetic crosses. and various
clawed frog s of which two Creative Commons Attribution features aid in mapping
species- K/aevis (pict ured ShareAJike 2.0 France.) genes and phenotypes. The
on the left) and the much ability [ 0 construct mice with
smaller X. tropicalis (right) predetermined genetic modifications to the germ li ne (by transgenic
have been wide ly used in technology and gene targeting in embryonic stem cells) has been a
studying development. powerful tool in studying gene func tion and in creating models of
All developmenta l stages human disease, as will be detai led in Chapter 20.
are accessible, and the
comparatively large size of The rat, being considerab ly
[From Amaya E, Offield MF & Xenopus eggs and embryos larg er than the mouse, has for
Grainger RM (1998) Trends Genet. facilitates micromanip ulation, many years been the mammal
14,253- 255. With permission from including microinjections of choice for physiological,
Elsevier.] (of m RNA, antibodies, and neurological, pharmacological,
antisense oligonucleotides), and bio chemical analyses.
ce ll grafting, and labeling experiments. Each cel l of an amphibian Rat models have helped
embryo contains its own yolk supply, and so the ce ll s are well able to advance medical research
cope with being transplanted and to continue to differentiate in an in cardiovascular diseases
explant (where they can be incu bated with selected protein factors) (hypertension), psych iatric disorders (stud ies of behavioral
or as an individual cell in simple salt so lu tio ns. Powerful methods fo r intervention and addiction), neural regeneration. diabetes, surgery.
generating tran sgenic embryos were developed in the mid-1990s. t ransplantation. autoimmune disorders (rheumatoid arthritis),
X.laevis has a longish generation tim e of 1-2 yea rs, and can be cancer, wound and bone healing, and space motion sickness. In
induced to produce 300- 1000 large (1.0-1.3 mm) eggs at a time. drug development, the rat is routinely employed to demonstrate
X. tfopicaJis has a comparatively short generat ion time (less than therapeutic efficacy and assess the toxicity of drug compounds in
5 months) and can be indu ced to produce 1000-3000 eggs at a time. advance of human clinical trials. Genetic analysis in la boratory rats,
0.6-0.7 mm in size. Because of a recent genome duplication. X.lae vis however, is much less advanced than in mice. partly because of
has a pseudotetraploid genome and is not suited to genetic analyses. the relatively high cost o f rat breed ing programs and because until
X. cropica/is has a diplOid genome, however, and is increaSing ly being recently it has been m uch more difficult to modify the rat germ line by
used as a model. gene targeting.
cousins. Livestock animals, such as sheep and pig~, have some useful applica
tions, but rodents are by a long way the most studied mammalian models, partly
because of their greater amenability to genetic studies and partly because of
financial and ethical considerations. Of these, the mouse is the premier model for
a variety of reasons, although an increasing number of studies involve rat models
MODEL ORGANISMS 305
(see Box 10.3). We consider mouse and rat models, largely in the context of genetic
models of disease, in Chapter 20.
Rodents are not good models for understanding some human systems, nota
bly how our brain functions. Nonhuman primates resemble humans in physiol
ogy, cognitive capabilities, detailed brain organization, social complexity, repro
duction, and development, and have been particularly important in neuroscience
research. Being large and normally living in complex social groups, great apes are
expensive to maintain and their use in animal experimentation has been conten
tious. As a result, most primate studies ha ve been conducted on smaller primates
with shorter generation times, notably the rhesus macaque and marmoset (Table
10.2).
TABLE 10.2 PRINCIPAL REASONS FOR STUDYING MAJOR MA~MALIAN MODEL ORGANISMS ~
NON-PRIMATE MODELS
Cat interactions between host gene and infectious pathogens, notably invo lving feline immunodefiCiency virus; as a model of
genetic disease, and testing treatment regimens; model of reproductive physiology, endocrinology, and behavior
Cow, sheep, pig as models of human endocrinology and physiology; host-pathogen interactions in relation to food safety. Transgenic
sheep have also been used to produce valued human proteins in their milk
Dog as a model of genetic diseases (a long history of selective breeding means that many types of dog are prone to genetic
diseases such as cancer, heart disease, deafness, blindness, and autoimmune disorders); an important model for the
genetics of behavior, and used extensively in pharmaceutical research
Mouse principal model for understanding gene functio n and mammalian development, and for modeling human disease and
testing treatment regimens; as detailed in Chapter 20, sophisticated genetic tools and large numbers of mutants have
long been available
Rat large size compared with mouse makes it more suited to physiological, neurological, pharmacological, and biochemical
analyses. Has provided useful models for various complex human disorders, but genetics not as well developed as mouse
genetics; routinely used in testing drug toxicity and therapeutic efficiency
Chimpanzee fOT evolutionary reasons, being the species most closely related to humans; to identify the basis of its resistance to
developing Plasmodium falciporum -induced malaria and its comparative resistance to the HIV progression to AIDS
Marmoset used in studying brain function, drug sensitivity, immunity, and autoimmune disease. Also used for studying cognitive and
behavioral disorders
Rhesus macaque the most widely used primate model. The preferred model for vaccine development. Important model in neuroanatomy
and neurophysiology, and beneT SUi ted than marmosets for studying higher brain function
A series of white paper discussion documents on indivi dual model organisms and proposed genome projects can be accessed after dicking
on individual genome seque ncing projects listed on the US National Human Genome Research Institute's list of approved seq uencin g targets
(http:.!lwww.genome.gov/ lOOO2154).
306 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
BOX 10.4 PRINCIPAL MECHANISMS THAT AFFECT THE POPULATION FREQUENCY OF ALLELES
Individuals within a populatio n differ from each other, mostly as Random genetic drift has little effect in large popu lations. Even
a result o f inherited genetic variation. Several factors affect the though the gam etes that are transmitted are a small fraction of the
frequency of any mutant allele in a population. total gametes in the population, they are a statistically representative
Natural selection sample. However, when population sizes are smail (e.g. as a result of
Natural selection is the process whereby some of the inherited genetic geographical isolation), the number of gametes that are transm itted
variation will resu lt in differences between individua ls in their capacity to the next generation w ill be proportionately smaller, and random
to engage in reproduction (affected by parameters such as m ortality, genetic drift can cause considerable changes in allele frequencies
(Figure 1). tn the absence of new mutations and factors such as
health, and mating success) and to produce healthy offspring
selection that affect allele frequency, alleles subject to random
(d ifferences in fertility, fecundity, and viability of the offspringl. The
fitness of an organism is a measure of the individual's ability to survive genetic drift will eventually become fixed (the point at which the allele
and to reproduce successfully. In the simplest models, the fitness of an frequency in the population is 0% or 100%).
individ ual is determined solely by its genetic makeup, w ith each locus Interlo cus sequence exchange
contributing independently. As a result, o ne can also ta lk about the Freq uent sequence exchanges can occur between highly homologous
fitness of a genotype. gene loci that encode essentially identical products. For example,
Many mutations that occur in functionally important sequences huma n S.8S rRNA, 18S rRNA, and 28S rRNAare encod ed by multiple
decrease the fitness of the individ uals that carry them (the mutat ion tandemly repeated transcription units (see Figure 1.22) that show
could remove or change a functionally important amino acid or very high levels of sequence identity. Mechani sms such as unequal
nucleotide, or alter splicing, or cause premature term ination of an crossover and gene conversion cause sequence exchanges between
encoded protein). Carriers of the mutation w ill be reprod uctively less the different repeats, resulting in sequencehomogenizarion (one
successful and so, to conserve functionally important seq uences, sequence is replaced by another, so t hat one type of repeat can
harmful mutations are selected against and removed from the increase in freq uency). As a result of frequent sequence exchanges,
population. Th is type of selection is known as purifying selection (also the mu ltiple gene loci effectively behave as alleles. The frequency
called negative selection), and mutations are said to be evolutionarily of a specific gene repeat (allele) can be determin ed in part by the
constrained. Purifying se lection results in a decrease in genetic frequency with whic h it engages In sequence exchange
diversity as the population stabilizes on a particu lar trait value, an
effect known as stabilizing selection. generation allele
Very rarely, a new mutation confers a selective advantage and frequency
increases the fitness of i ts carrier. Such a mutation will be subjected
to positive selection (also called Darwinian selection or advantageous
I'.'!!l o 0.5/0.5
lli!J
selection), which would be expected to foster its spread through a .;.
· -·.._.. ...
population. A reSUlting change in phenotype can be described as • • • • • • •t :
t · • •
adaptive evolution. Mutations that do not confer any advantage or available
·....
...
..
0.5/ 0.5
harm to their carriers are described as neutral muratiom. gametes
If a locus has two alleles with different fitnesses, the fitness
of the heterozygote m ay be intermed iate between the two types
of homozygote. The mode of selection in this case is co-dominant
1 0.4/ 0.6
and the selection will be directional, resulting in an increase in
the advantageous allele. In some cases, however, a new mutatio n
may be advantageous in heterozygotes but not in homozygotes ·..... :
·.. , .
··.
(heterozygote advantage). This situation, in which the heterozygote • ••: • • - I 0-410.0
·r
has a higher fitness than both the mutant homozygote and the normal
homozygote, is a form of balanCing selecrion known as averdominant
selection (see the example of cystic fibrosis in Box 3.7).
Random genetic drrrt 2 0.'/0.3
Changes in allele frequency can occur simply by chance, a process
known as random geneticdr;ft. Even if all the indiv iduals in a
..·'... ....
·...
··.....· .. . .
population had exactly the same fitness so that natural selection could
not operate, allele frequen cies would nevertheless change because 0.7/ 0.3
of random sampling of gametes. Such sampling occurs because,
through chOice or circumstances, not all individuals in a population
will reproduce, and so o nly a small proportion of the available gametes
in any generation is ever passed on to the next generation. Even if
every single individ ual were to contribute two gametes to the next Flgu.re 1 Random sampling of gametes in small populations can
generation, sampling would still occur beca use heterozygotes can lead to considerable changes in allele frequencies. [Adapted from
produce two types of gamete bearing different alleles but the two Bodmer WF and Cava lii-Sforza LL (1976) Genetics, Evolution And Man .
gametes passed to the next generation may, by chance, carry the With permission from WH Freeman & Company.)
same allele.
are subj ect to purifying selection seem to be highly conserved between species
when compared with sequences in which no selection of this kind is operating.
Comparison of three or more aligned sequences will reveal regions that show
sequence conservation probably as a result of purifying selection rather than just
by chance (which can occur if just two sequences are aligned). Because regions in
which purifying selection is operating do not tolerate many mutational changes,
the sequences are said to be evolutionarily col1strail1ed (figure 10.1).
308 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
BOX 10.5 ASSAYING THE EVOlUTION OF CODING SEQUENCES BY USING K,IK, RATIOS
Coding sequences are, by definition, functionally important and so t hat is silent. However, for immune system genes that co-evolve with
t hey are subject to selection pressure. Most cod ing sequences are genes of invading microorgan isms in a battle to recognize and kill t he
subject to purifying selection to conserve functionally important invaders, t he KalKs ratio may be much greater than 1.0, which is very
amino acid sequences, and they seem to evolve slowly. However, strong evidence that positive selection is involved .
some coding sequences are subje<t to positive selection and Working out a KaIK.. ratio over a full -length coding sequence
diverge rapidly in sequence in a way that is beneficial to the can be misleading because one part of a protein may be subject
organism. They include many immune system prote ins that need to purifying selection while another domain is su bject to positive
to be able to adapt q uickly to be able to recognize rapidly evolving selection . Modern methods can use multiple alignments of
microbi al antigens, many proteins involved in reproductive processes, or tho logo us sequences from several species w ithin a phylogeny to
and many others. work out a KalKs ratio for each codon.
The sta ndard way of asses sing whether a seq uence is subject to Various potential pitfalls may affect the accuracy of the
positive or purifying selectio n is to calculate the ratio of the number calculations. One assum ption-that synonymous substitutions are
of non-synonymous (am ino acid-changing) substitutions per non neutral substitutions- may not always be true. For example, the silent
synonymous site (~ or dN) to the number o f synonymous (silent) site may be part of a regulatory seq uence within an exon, such as
substitutions per synonymous site (Ks or ds). This ratio is variously an exonic splice enhancer. Also, if the compared sequences are from
known as the KJKs ratio or the dNlds ratio, or 00. When comparing distantly related organisms, mu lti ple consecutive substitutions may
orthologs from different species, the overall KaiKs ratio is often very have taken place at a specific site but go unnoticed.
much tess than 1.0 because a subs t itution that changes an amino For a We b-based server for calculating Kal Ks ratios, see Selecton
acid is much tess likely to be different between two species than one (http://selecton.tau.acj ll).
COMPARATIVE GENOMICS 309
~
(8) Cl.."' UC A A by purifying selection. (8) Suggestive
-.
~ S;¥~~c;:
C A
C
CA.ifiJ G u A C
to form hairpins with a 5 bp stem. The
midd le nucleotides in the 5 bp stem are
highly variable, but compensatory double
substitutions at the DNA level maintain the
sequences, such as those of vertebrates, is remarkably inefficient. A significant stem. The middle nucleotides of the stem
difficulty is the extent of large-scale genome rearrangements (involving regions seem to be selected to be diversified within a
larger than 50 kb) that have occurred during evolution. highly conserved RNA structure.
For the comparison oflarge genome sequences, more powerful programs are
needed. For a collection of extant (currently existing) genomes, the challenge is
to find the minimum set of subsequence groups so that inside each group the
sequences to be compared are both homologous and collinear (Fig,,", 10.3). This
collinearity reconstruction problem seeks answers to two basic questions. First,
which segments of extant genomes arose from the genome of a common ances
tor? Second, what was the likely evolutionary path of events that generated the
different segments? Various computer programs have been developed to recon
struct collinearity for use in large-scale alignment rrable 10.3).
Once collinear relationships between homologous genomic regions have
been established, computer programs take the unalign ed sequences and deter
mine which bases in each sequence are orthologous. Multiple alignment of very
large sequences consumes huge computing power and inevitably the programs
use heuristic methods; that is, they apply workable solutions that are not formally
correct so as to reduce computational time. Nevertheless, by 2007, megabase
chunks of the human genome sequence could be aligned with sequences from 27
other vertebrates (see the TBAIMULTIZ program entry in Table 10.3). Additional
programs are used to inspect the aligned sequence to identify sequences that are
evolutionarily constrained, suggesting a conserved function .
[ ........... : =: :=r===-==
rea rrangements since they diverged from
a common ancestor (interchromosomal
rearrangements are not shown for the
(8) !
sake of simplicity). (B) The homologous
sequences need to be grouped into sets
that are then arranged in a linear sequence
~ ~~~~~:lUmm:::!:l
nnmltm:l li so as to simplify subsequent whole genome
I alignments. [From Margulies EH & Birney
.~ ~
E (2008) Nat. Rev. Genet. 9, 303-313. With
[" I r;;r;;; "--j I ! c=J permission fro m Macmi llan Publishers Ltd.)
310 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
TABLE 10.3 SOME COMPUTER PROGRAMS USED TO ALIGN COMPLEX GENOME SEQUENCES AND DETECT
EVOLUTIONARILY CONSTRAINED SEQUENCES
Task Prog ram Characteri stics URL/ reference
Reconstructing MERCATOR uses on ly cod ing exons as initial anchoring poi nts http://www.bi ostat.wisc.ed u/-cdewey Imerca torI
homologou s
colJi nearity GR IMM ana lyzes rearrangements in pa irs of genomes http://g rimm.bioprojects.org/GRIMM/index.html
MLAGAN multiple globa l alignment of genomic sequences Brudno et al. (2003) Genome Res. 13,72 1-73 1
Multiple TBAiMULTIZ TBA is designed to be able to alig n ma ny Miller et al. (2007) Genome Res. 17, 1797-1808; the
alignment megabase-sized regions of multiple mammal ian 28-way vertebrate alignment can be viewed with
genomes; to perform the dynamic-programming the UCSC genome browser at http://genome.ucsc
step. TBA runs a stand-alone program called .edu, downloaded in bulk by anonymous FTPfrom
MULTIZ, which has been used to align much of the http://hgdownload.cse.ucsc.edu/goldenPath/hg18
human genome sequence with the sequences of I multiz28way, or analyzed with the Galaxy server at
27 other vertebrate genomes http://g2.tr ac.bx.psu.edu
Constraint PHASTCONS identifies evolutionarily conserved elements in a Siepel et aJ. (200S) Genome Res. 15. 1034- 1050;
detection multiple alignment, given a phylogenetic tree http://compgen .bscb.comelI.ed u/- acs/phast
Cons-HOWTO.html
SCONE gives a sequence conservation score after Asthana et at (2007) PLoS Comp. BioI. 3, 2559-2568
estimating the rate at which each nucleotide is
evolving. and then computes the probability of
neutrality given the estimaled rat ~
For further information on these and addition al programs, see Margulies & Birney (2008) in Further Reading.
* • * * ..
SB - - -- -------- --- - -G~CT TTTCAACAAACGTAG T ~CTTGGTGCCGTC TTAA
* ..
...
8B AAG_t.!CA~GCAGCAGTAGCGAAAGGAACCGCCTGA.AJ\fATAGTACTCGGTAATeTTCA
* • • ** ... * *
positions where there is identity in the four
sequences. Trinucleotides underlined in red
indicate potential premature termination
codons. Not only is the start codon (green
se TCGTGAATTTACAT- TGACTGCTCTTTCGTGCAATTCATACTGTATTATTATAGTTGCAA
box) not conserved but the alignments
sp CAGTGAACCCATG~TGACTGC TCTTTeGTGCAATTCAAT CTGTA ---TTCTAGC CGCAA require frequent insertions and deletions
8M CAG ATAACTCATCG - TGAC TGCTC TTCTGTACGGT CCAAC CTGTATT CTCTT~C TTAAA (arrows). Thus, the homologs in the other
*** *** ..
8B CATACTTCTTAC.'!:.-CAGCTGTTCTCTCGTGTGGGCCAATCTGTATCTGACTATCATCTA
• ** ** ... ** • *
Saccharomyces species could not make a
protein like the one that had been predicted
for (he YDRI 02e ORF, raising doubts about
sc CTAGATGGCAAAAAATT--GATAATGCCC TAAAAATG AeAGGCC ~ whether (he YDR 102e sequence is really
SP TTAAA CGACAGAAAAT Tf~ATAATGCCC TAATGATGAC AGGTTGA TAA C J -start a coding sequence. [From supplementary
SM TTAGAAGACGGGAAGTT--GATAATGCCCTATTGACGACAGGGCATAAA c=J 'SIOP information in Kellis M, Patterson N, Endrizzi
S8 C TAGCC ~TCAAAGAATT-- GATAATGCC CTAACGGCGACAGGCCACAA~ 'fA . In/del Met al. (2003) NalVre 423, 241-254 . With
permission from Macmillan Publishers ltd.]
** * ••• ***** ...... **** ... ... **** •
312 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
inert, but high-throughput sequence sam plin g now suggests that at least 85% of
the human genome is transcribed. In addition, as we describe later in this chap
ter, highly repetitive retrotransposons can give rise to fun ctionally important
sequences including enhancer sequences, exons, and even some genes.
TABLE 10.4 PRINCIPAL GENOME BROWSERS FOR IDENTIFYING CONSERVED NONCODING SEQUENCES IN VERTEBRATES
Genome browser Conserved elements identified with Sequence alignment based on URL
aln global alignment, one sequence string is transformed into the other; in local alignment, all locations of similari ty between the sequence
strings under comparison are returned. Gl obal alignments are less prone to demonstrating false homology because each letter of one sequence
is co nstrained to being aligned with onl y one letter of the other. However, local alignments can cope w ith rearrangements betw een non-syntenic,
orth olog ous sequences by identifying Simila r regions in sequences; this, however, comes at the expense o f a higher false positive rate. Glocal
alignment is a combination of global and local alignment that allows the u ser to create a map that transform s one sequence into ano ther w hile
allowing for arrangement events.
I;
100%
identified, spanning the regions shown by
-(] I II II II II II II II II I I/
E2 ------ 17 Al'. PE QS HVVQ DCYH G DG Q SYRGT YSTTVTGRTC Q.~.W SSMT PH Q HNRTTENYP N Agun!! 10.8 Exon duplication. (A) The
E3 A CLU lNYCRNPD.WAAPYCYTRD PGVRWEYCNLTQCSDAEGT AVAP PTVTPV PSLElAPSEQ 1 30 hum an LPA gene encodes apolipoprotein(a)
1 31 AAP EQSHVVO DC YHGDGQSYRGTYSTTVTGRT C Q,~.W SSMTPHQH N R TT E NY PN and is implicated in coronary heart
AG L Il-lNYCRNP DAVl'J\PYCYTRDPGVRWEYCNL TQC SDAEGTAVAP P TVTPVPSL E .~. PSE Q 244 disease and stroke. There is extensive size
245 AA PEQS HVVQDC YHGDGQS YRGT YST TVTGRTC QAWS SMTPH QHNRTTENYPN polymorphism, and the N-terminus of the
AG L I MNYC RN P D.~. V.~.•~. PYC YT RDPGVRWE YCNL T QCS DAE GT AVA P PTVT PV PSLeAE SEQ 358 protein has a large but variable number
359 AJ. PEQSHVVQDCYHG DGQSYRGTY$T TVTGRT CQAWS SM'I'PHQHNR'l' TENYPN of repeats of a 114 amino acid kringle IV
A GL !l-lNYCRNPDl , VA/'.PYC YT RDPG VRWEYCNL TQC SDAEGT AVAPP TVTPVP SLEAPSEQ 472 domain. Each kringle repeat is encoded
473 Jo.J.. PEQSHVVQDCY HGDGQS YRGT Y STTVTGRTCQAWS SMTP HQHNRTTENYPN by a tandemly repeated pair of exons. The
first nine kringle domains (spanning amino
A GL IMNYCRN PDAV l,l'.PYC YT RDPGVR WEYCNL TQ CS DAEGT AVA PPT VT PV PSL £APS EQ 586
acids 17-1042) are shown here. They have
587 M PEQS HVVQDCYHGDGQSYRGTY ST TVT GRT CQl'.WSSI-1TPHQHNRT T ENYPN
identical amino acid sequences and are
AG L I MN YCRN PDA VAl'. PYC YT RDPGVRWE YCNL TQC SDJI.EG T A VAP P'I 'VT PV PSLEA PSEO 700
encoded by exons 2- 19 of the LPA gene;
70 1 M PEQSHVVQDCYHG DGQSYRGT YS TT V TGR T CQAW S SI-1TPHQRNRTTEN YPN
grey arrows represent tandemly repeated
A GL I NNYCRN PDAVAA PYCY T RDPGVRWEYCNL TQCS Dl'.EGT AVJ..PPTV T PVP SLEAPSEQ 814
exon pairs, starting with exons 2 and 3 (E2
8 1 5 AAPE05HVVQDC YH GDGQSY RGT Y STTVTG RT CQA'""S51-IT PHQHNRT TENY PN
and E3) and ending in exons 18 and 19 (El 8
lI.GLIl-lNYCRNPDAVAAPYCYTRDPGVRWE YCNL TQCSDAEGT AVA P?TVT PV PSLEAPSEQ 928
and E19). The 18 alanines shown in bold
E18 929 A..~. PE Q SH VVQ D CY H GDGQ S Y R G TYS T TVTGRTC Q AW S S MT PHQHNRTTENY pN
red letters are encoded by codons that are
E19 -- - AG L nlN YC RNPD.~VAAP Y CYTRDPGVRW EY C N L T QC5DJI. E G T AVAPP T VTPVP S L E_~. PSEQ 1 042
each interrupted by introns, (8) Fibronectin,
a large extracellular matrix protein, has
(6) multiple repeats of three different domains,
nbroneclin ~HI... +t-tHHHOtH[}H D KH- In the fibronectin gene, the type I and II
domains are encoded by individual exons
(orange and purple boxes, respectively), but
type III domains are encoded by a tandemly
coding potential
for exons
fibronectin
type I domain I nbronectin
type II domain
fibronectin
type III domain D no known
domain repeated pair of exons (blue boxes).
GENE AND GENOME EVOLUTION 31 S
Gene duplication can permit increased gene dosage but its major
value is to permit functional complexity
Human genes have been duplicated by various duplication mechanisms (see
Chapter 9, p. 263). Gene families are a major general featu re ofmetazoan genomes,
and one gene in a hundred is estimated to be duplicated and fixed in the popula
tion every million years. When a gene duplicates within a genome. the two result
ing gene copies are initially identical. In rare cases, the duplicated genes may be
maintained to produce identical, or essentially identical, gene products because
increased amounts of gene product are advantageous in some way. The large
numbers of genes that make essentially identical copies of different ribosomal
RNA and histone proteins come into this category.
In response to an altered environment. other genes may undergo dupli cation
to provide a selective advantage thro ugh increased gene dosage, as illustrated in
the recent adaptation of human populations to starch diets. Starch is digested
using the enzyme salivary amylase. The chimpanzee has a single sal ivary amylase
,..,
hybrid RNA
! Ll
~ AAAAAAA
E3
elements have weak poly(A) signals, and so
transcription often continues past such a
Signal until another nearby poly{A) signal
hybrid eDNA
1
reverse
transcription
is reached (e.g. after exon 3 in gene A). The
resulting RNA copy con tains a transcript
not just of LlNE-' sequences but also of
a downstream exon (in this case E3).Th e
Ll(E3 LINE·' reverse transcriptase machinery can
Ihe n act on the extended poly(A) sequence
lU
insertion toto
different gene
to produce a hybrid eDNA copy that contain s
both LlNE-' and E3 sequences. Subsequent
transpoSition into a new chromosomal
location may lead to the insertion of exon 3
gene B into a different gene (gene B).
316 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
looln _ _ anc~stral
figure 10.10 Orthologs and paralogs.
r
---, g
~
globin gene
paralogs paralogs
are dire< t/y related through descent from a
- myoglobin - cytoglobin _ - myOglobin common ancestor.
cytoglobin
orthologs orthOlogs
gene, but in humans a series of evolutionarily very recent tandem gene duplica
tions has produced multiple copies of such genes at the AMYl gene cluster at
Ip21. AMYl copy number va ries significantly between haplotypes, and AMYl
copy number positively correlates with the expression of salivary amylase. Human
populations that historically consume a high-starch diet have significantly more
AMYl copies than populations that traditionally consume a low-starch diet;
increased AMYl copy number seems to be an adaptive response to increased
starch in the diet.
Gene duplication to provide increased gene dosage is comparatively rare.
More frequently, after gene duplication the sequences of the two genes diverge
quite extensively. One of the duplicated gene copies may become a pseudogene
(see Box 9.2). Alternatively, two divergent functional gene copies are retained.
The divergent gene copies may acquire different properties and often they come
to be expressed in different ways that are functionally advantageous. Gene dupli
cation is thus thought to be a major motor that drives increasing functional com
plexity. The two homologous gene copies are described as paralogs (as opposed
to orthologs, which are present in different species; figure 10,10).
Retrogenes offer an immediate possibility for functional divergence: because
the gene copy is made at the cDNA level. it lacks the promoter and both neigh
boring and intronic regulatory elements of the parent gene (see Figure 9.12).
Continued expression of a retrogene is therefore dependent on regulatory
sequences in the neighborhood of its integration site that are usually rather dif
ferent from the sequences that regulate the parent gene. As a result, the parent
gene and retrogene copy can be expressed in different ways, often in different cell
types. Afrer being exposed to a different cellular (and molecular) environment, Figure 10.11 Functional divergence of
the retrogene can come to acquire different functions. duplicated genes. (A) Neofuncrionalization.
Here, one of the duplicated genes retains
Genes duplicated at the genomic DNA level can also acquire different func
the function of the ancestral gene. The
tions (J'lgure 10_11 ). One possibility is that one of the duplicate genes acquires a other paralog undergoes mutations (green
distinctive new function (neo[unctionalization). Pure neofunctionalization is arrow) that result in itshaving altered
rare, however. Usually both genes undergo expression changes over evolutionary characteristics (green box, but could be in
time. Sometimes the duplicated genes acquire complementary mutations in dif regulatory sequencesunlike as shown here),
ferent Cis-regulatory regions so that they diverge in expression while between leading it to acquire a new function.
them initially maintaining the expression domains of the ancestral gene_ As a (6) Subfunctionafization. The duplicated gene
result, different subsets of the functions of the ancestral gene can be partitioned copies undergo compl ementary deleterious
mutations, often at the level of regulatory
(A) elements. In this example, the ancestral
- ---+ gene is imagined to be regulated by two
sets of upstream control elements (b lue and
red symbols) so that it is express ed in two
---+ different cell types, or differe nt tissue types,
or at different developmental stages (blue
and red arrows). Mutations (jagged arrows)
(8)
inactivate one set of control elementsin
one paralog and the other set in the second
para log so that the express ion patterns
and functions of the ancestral gene are
partitioned between the two daughter
genes .
GENEAND GENOME EVOLUTION 317
ancestral globin
-800 MYA-,
/
NGB CYGB MB HBZ HBA2. HBAl HBEl HBG2 HBGl HBD HBB
_____. 14q
I
_____. 17q _ _ 22q
1
~ 16p
cL.:L •1 ...........a-.
1 1
t - - llp
neuroglobin
L
cytoglobin myoglobi n ,
1 a , G, '" 8 ~
43.2%
Figurl! 1D.12 Evolution of the vertebrate globin superfamily. Human globin genes are located on five chromosomes (bottom).
Those reSUlting from evolutionarily ancient gene duplications have subsequently been separated by chromosomal rearrangements.
The more recent duplications have resulted in gene clusters on chromosomes 11 and 16. Globins encoded within a gene cluster
show a greater degree of sequence identity (64.1-99.3%) than globins on different chromosomes (e.g. 43.2% between a- and
P-globin; sequence identities of the a- and P-globin genes to globins on the other chromosomes are significantly less than 30%).
Some of the duplications have been recent: the HBA 7 and HBA2 genes encode identical a-globins, and the HBG1 and HBG2 genes
encode y-globins that differ by a single amino acid. Other vertebrates show some differences. Sometimes there are copy number
differences (for example, mice have a single a-globin and two P-globin genes, goats have a triplicated p-globin gene cluster,
and many fish have two cytoglobin genes). In addition, there are qualitative differences, such as a globin gene that is expressed
specifically in the eye of birds. MYA, million years ago.
318 Chapter 10: Model Organism s, Comparative Genomics, and Evolution
functions in the case of neuroglobin. Fish have two cytoglobin genes that are
expressed widely but have diverged in gene expression: the cygb-2 gene is
expressed very strongly in neuronal tissues, unlike the cygb-l gene.
The blood globins have subsequently undergone further specialization, but
this time differences in gene regulation result in differences in developmental
timing. Thus, in early development, r;-globin is expressed instead of e<-globin,
which will take its place at later stages, and £-globin (embryonic period) and
y-globins (fetal period) are used instead of ~-globin, which begins to be synthe
sized at three months' gestation and grad ually accumulates as y-globin produc
tion declines. One rationale is that the glob ins used in hemoglobin during embry
onic and fetal periods are be tter suited to the more hypoxic environment at these
stages.
As we shall see later on in this chapter, different metazoan species can have
very different copy numbers for a specific gene family and, in some cases, the dif
ferences can clearly be seen to be evolutionarily advantageous.
! WGD
..,..> f.J·a-.Ii-. . . .o-HO ~ 1].___
tetraploid genome
.-ti3-. fGO
" -O IHl • • • -0-0-0-£1____
.0.
gene loss and
J, diploidization originating from whole genome
duplication followed by large-scale
• • • ___ {V cJ - -fI-- j. O Cl-o-o----a--
dIplOid genome gene loss. This hypothetical genome has
0 . . . . O- -GO ___ 22 different genes; only one haploid set is
! WGD
shown but the genome is diploid. Whole
genome duplication (WGD) results in a
. - -. _ -[jJJ Iii! EH J D-O El- 0 complete series of paralogs in identical
~----G~.~~r--j~~ ___ letraplold genome
order and, initially, a tet raploid genome.
__- __ ~~ ~ ~O~ In most of the paralogous gene pairs,
~----ju. t:i-.-tl-I'J--!3-D--.- one gene acquires disabling mutations to
become a pseudogene and is eventually
!
gene loss and
diploidization
(A)
II 01 ~ ft '
related phenotypically but have very
different karyotypes. (A) The Chinese
muntjac (Muntiacus reevesi) has 2n = 46
chromosomes. (B) Because of a series of
~o nn on 00 00 00
Y2 XY l
chromosome fusions, the Indian muntjac
(Muntiacus muntjak) has twO pairs of
very large autosomes plus two large X
on on "'0 "0 ,,0
" n chromosomes in females (2n = 6) or, as
shown here, an X chromosome and two
--, ~,. ,,~
A~ _A ~ ,. different Y chromosomes in males
(2n = 7). lKaryotype images adapted from
Austin CR & Short RV (1976) The Evolution
~,. AA
~A
x-v XY (male); xx (female) preva lent in mammals, but also in Y chromosome carries male-determining
some amphibians and in some fish b factor; sperm are X or Yand so determine
the sex
z-w zv.t (female); ZZ (male) prevalent in birds and reptiles, but also eggs are Z orW and so determine the sex
in some amphibians and in some fish b
X-autosome ratio XX (female/hermaphrodite); in many types of insect and some C. elegans is an example in which
XQJ/XY (male) nematodes XX individuals are hermaphrodites;
D. melanogQster has a Y chromosome but it
does not confer maleness
Haplo-diploid relies on ploidy only in many types of insect unfertilized eggs produce haploid fertile
males; fertilized eggs are diploid and give rise
to females or sterile males
"The 0 denotes zero, so that XO means a single X. b"fhere can be different systems in the same kind of organism. For example, some medaka species
have the X-Y system and others have the Z-W system.
(homomorphic) but are presumed to differ at the gene level. In other cases, the
sex chromosomes can clearly be seen to be structurally different and are said to
be heteromorphic.
In heteromorphic sex chromosomes, one sex has tvvo different sex chromo
somes and is said to be heterogametic. being able to produce gametes with eitber
one of the tvJo sex chromosomes. The other sex is homogametic and produces
gametes with the same sex chromosome. For some species, the male is hetero
gametic, whereupon tbe sex chromosomes are designated X and Y For other
species, the female is heterogametic and the sex chromosomes are designated Z
and W (see Table 10.5).
In both the X-Y and Z-W sex chromosome systems, the chromosome that is
exclusively associated with the heterogametic sex (Y and W) is comparatively
small and poorly conserved with very few genes and with large amounts of repet
itive DNA. For example, the human X chromosome contains many genes (includ
ing close to 900 protein-coding genes) within 155 Mb of DNA. The Y chromosome
has just 59 Mb of DNA and much of it is genetically inert constitutive heterochro
matin. There are more than 80 protein-coding genes on the Y chromosome, but
several genes are repeated and effectively the human Y chromosome encodes
only close to 30 different types of protein.
In female meiosis, the two X chromosomes pair up to form bivalents much
like any pair of autosomal homologs, but X-V pairing in male meiosis is more
problematic because of major differences betvJeen the X and Y chromosomes in
size and sequence composition. Recombination betvJeen the human X and Y
chromosomes occurs exclusively in short terminal regions that, as a result of
engaging in frequent X-V crossovers, are identical on the X and Y chromosomes.
Because of recombination these sequences can move betvveen the X and Y chro
mosomes; they are neither X-linked nor Y-linked and are known as pseudoauto
somal regions.
The remaining regions of the sex chromosomes, containing the great majority
of their sequences, do not recombine in male meiosis and so are sex-specific
regions. Because the Y chromosome is exclusively male, the non-recombining
region on the Y chromosome never engages in recombination and is sometimes
referred to as the male-specific region. The equivalent region on the X chromo
some does, however, recombine vvith its partner X chromosome in female
meiosis.
The Xp/Yp andXq/Yq pseudoautosomal regions are allelic sequences, but the
X-specific and Y-specific components outside these areas are rather different in
sequence. Nevertheless, there are multiple regions of X-Y homology in the
X-specific and Y-specific regions that indicate a common evolutionary origin for
the X and Y chromosomes. They include 25 gene pairs with functional homologs
on the X and Y, known as gameto logs (Figure 10.17).
322 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
Y
into pseudogenes (with symbols terminati ng in a P; see examples in the gene SMCX .,
ctusters labeled a, b, and c). As a result of positive selection. the sequence of the c
PRKY
male-determinant SRY is now rather different from its original gene partner on AM£LY
the X chromosome, the SOX3 gene (highlighted in yellow). (Adapted from Lahn
ARSEP
8T & Page DC (1999) Science 286. 964-967. With permission from the American ARSDP
DFFRY
c~~
UTV '
MSY
The pseudoautosomal regions have changed rapidly during fB4Y j
KALP
evolution STSP
Humans have two pseudoautosomal regions. The major one, PARI, extends over $MCY
£IF1AY
2.6 Mb at the extreme tips ofXp and Yp. The minor one, PAR2, spans 330 kb at the RPS4X
extreme tips of Xq and Yq. PARI is the site of an obligate crossover during male
meiosis that ensures correct meiotic segregation. As a result, the sex-averaged RaMY
0.7 Mb long), and in the mouse X chromosome it is located at the tip of the long SQJ(J I regions
boundary
,,
PAR1 ,, se)(-specific
--}--+
ANTJ
I ,,
W ... I138 (;Sr:''i'A DHRSXY
A05E .05,
TEL PGPL ~ .5110X ><7 \:,tJ"rs MTl
/ ASMr TMII1P MIC2 5')(G:3' )(G
":oJ. J.
• • • ... M._ • • ; II
mou..
artiUllO{,5
t
'oJ
E:.O~
~ ~
~
":.)
.
Q
0
~
~
•
'"
~ +
,,
,,
mouse X
crthOl0£5
--c :3
" '0 ,, ~ l•
-
r'ltIU~~ Y
,, ~
..... .
.Ai\lT ....
PP~RJB CSF?R,.I, I MRS)" ~ arthologs
TEL ~Pl \ ~~OX \'l"IRt SWl.,j "".,. "
• ••
'8 T-aII1{J 1\1,C2 S' XG StW
I Fi'ps~y
I • • • _.....IiI!LII
,
, II
Flgu~e 10. 18 Organization and evolutionary instability of the major pseudoautosomal region. The 2.6 Mb human PAR1 region is common to the
ti ps of Xp and Yp and contains multiple genes, of which 13 protein-coding genes are shown here. Adjoining PAR, are the sex-specific parts of the X
and Y chromosomes, with only the first few genes shown here. The PAR1 boundary occu rs within the XG blood group gene, so that the S' end of XG is
in the pseudoautosomal region whereas the 3' end of XG is X-specific. On the Y chromosome, there is a truncated 5' XG gene homolog containing just
the promoter and the first few exons. The adjacent Y·specific DNA contains unrelated sequences including SRYand RPS4Y. Genes w ithin PARl and in
the neighboring sex-specinc regions (which were part of a former pseudoautosomal region) have been poorly conserved during evolution. and mouse
orthologs are o ften undetectable (-) or have extremely diverged sequences with low similarity to the huma n ones. TEL, telomere.
GENE AND GENOME EVOLUTION 323
w
r
-<
ffi
r
@ r
.Xi
of genes so that only a few remain, mostly
ones that confer a male advantage, and the
P8
1m =
u L.LJ
P1P6 P5P4
DO ~=
LJ U'---'
P3 P2 Pi
identical sequences separated by shorter
(2 -170 kb) unique spacer sequences. The
structure of the P6 palindrome is shown
---_----.J~----------
. )
at the bottom as an example. The rest
of the MSY comprises heterochromatin.
There is also a 0.4 Mb internal block of
heterochromatin in the euc hromatic
region. [Ad apted (ro m 5kaletsky H, Kuroda
P6 left arm (110 kb) spacer (46 kb) P6 right arm (110 ko) Kawaguchi T, Minx PJ et at (2003) Nature 423,
825-837. With permission from Macmillan
key: 9 X.transposed 0 X-degenerate 0 ampliconic ~ heterochromatic Publishers ltd .)
GENE AND GENOME EVOLUTION 325
TABLE 10.6 ABOUT 75% OF PROTEIN-CODING GENES ON THE HUMAN MSY REGION FUNCTION IN THE TESTIS
MSY sequence Gene Gene(s) Copy number Expression Function
c1assa number
X-transposed 2 TGIF2LY
j
testis transcription factor
RBMY 6
BPY2 3
OAZ 4
aSee Figure 10.21 for the location of the different gene classes in the human MSY (mate-speci fic region of the Y chromosome). Data adapted from
Ska letsky et al. (2003) Nature 423,825- 837.
DNA sequence (a copy is made of the donor DNA sequence that is then used to
replace the original acceptor sequence-see Box 14.1 for the mechanism). In the
context considered here, the outcome is that one arm of a palindrome acts as a
gene conversion donor sequence and the other arm as an acceptor so that the
sequence of one arm is periodically replaced by a perfect copy of the sequence of
the other arm. Essentially, this is a type of intra chromo somal recombination and
may act so as to preserve the testis-expressed genes from the extinction that they
might otherwise have faced in the absence of recombination.
majority of animal phyla. They mostly have three germ layers and so Hominins-humans plus great apes (see Figure 10.26).
are now almost synonymous with tripfoblasts. Hominoids- humans plus great apes plus ora ngutans plus gibbons
Catarrhine primates-hominoids and Old World monkeys (see Figure (see Figure 10.26).
10.26).
Metatherian mammals-marsupials (e.g. ka ngaroo).
craniates.
wide variety of environments in South and East Asia, the Middle East
and are radially symmetrical (e.g. jellyfish, corals, and sea anemones).
Phylum-a group of species sharing a common body organization.
that have a fluid-filled body cavity but one that is not completely
which the mouth originates from the blastopore (the open ing of the
deuterostomes.
anus will open opposite the mouth. Includes molluscs, annelids, and
and becomes the anus. Later on, the mouth opens opposite the anus.
characterized by a fu lly moveable upper jaw, rayed fins, and a swim
A rooted tree (or cladogram) infers the existence of a common ancestor (rep (AI
resented as the trunk or motof the tree) and indicates the direction of the evolu
tionary process (Figure 10.238). The roo t of the tree may be determined by com
paring sequences against an ou.tgroup sequence (one that is clearly but distantly
related from the sequences under study). An unrooted tree (Figure lO.23A) does
not infer a common ancestor and shows only the evolutionary relationships
internal
between the organisms. Note that the number of possible rooted trees is usually nodes
much higher than the number of unrooted trees.
· .. M.N . • A .• M . • . E. MA.A .• SP . . . . . . . . . . c. Y .
· .. M. N .. A .. M .. . 0. MAPH . EP . . . . . . . . .. S. Y .
· A .. 01'.. KAAVTTI. . . VA . . IE S .. L . S . . . . . A. Y . . . . . . . . . . . VS
0 .465
F-1
0 .720
0 .060
chimp
0. 412 0. 24 0
mouse
0 .540
(SI
0.24 0
human Chimp mouse rat chick zebrafish rat
human 0. 02 0.24 0.26 0.42 0.54
chimp 0.26 0.28 0.42 0.54
t..:':::
.''
:.;5
.:._ _ _ __ chick
mouse 0 .08 0.40 0.54
rat 0.42 0.54
chick 0.56 1.65 7
L.:::=_ _ __ _ _ _ zebrafish
FlguN 10.24 Constructing a phylogenetic tree with UPGMA. In this example, the N-termin al 50 amino acids of S-globins from human and six
other vertebrate species were aligned, and a rooted tree was constructed by using the UPGMA (unweighted pair group method with arithmetic mean)
hierarchical clu stering method, which first calcu lates a distance matrix. (A) The human S-globin sequence is used as a reference for the alignment.
Sequence identity is shown by a dot. (6) A distan ce matrix is constructed in which pai rwise comparisons are used to establish the fraction of amino acid
sites that are different between two sequences. For example, the human and chimp sequences differ at one aminO acid out of 50, and so the fraction is
0.02. (e) The distance matrix values are used to con struct the phylogenetic tree. Note that the lengths of the horizontal branches are pro portional to the
distan ces between the nodes. Phylogeny softwa re packages such as PHYLIP are available at va rious Web servers (e.g. at the Institut Pa steur in Paris,
htlp:llbioweb2.pasteur.fr/phylogeny/intro-en .html) and typically offer neighbor joining (unrooted trees) and UPGMA (rooted trees) as alternatives.
OUR PLACE IN THE TREE OF LIFE 329
may be possible for a small number of sequences. For a large number of sequences,
the number of trees generated becomes extremely large and the computational
time needed to identify the best evolutionary tree becomes impossibly long. In
that case, a subset of possible trees is created by using heuristics: methods are
used that will produce an answer in a computable length of time, even if it may
not be the optimal one.
Once an evolutionary tree has been derived, statistical methods can be used
to gain a measure of its reliability. A popular method is bootstrapping, a form of
Monte Carlo simulation. Typically, a subsample of the data is removed and
replaced by a randomly generated equivalent data set; the resulting pseudose
quence is analyzed to see whether the suggested evolutionary pattern is still
favored. If there is a clear relationship between two sequences, the randomiza
tion introduced by the resampling will not erase it. If, however, a node connect
ing the two sequences in the original tree is spurious, it may disappear on ran
domization because the randomization process changes the frequencies of the
individual sites.
Bootstrapping often involves resampling subsets of data 1000 times. The boot
strap value is typically given as a percentage. A value of 100 means that the simu
lations fully support the original interpretation; values of 95-100 indicate a high
level of confidence in a predicted node; val ues less than 95 do not mean that the
original grouping of sequences is wrong, but that the available data do not pro
vide convincing support.
As a result of a combination of classical and molecular phylogenetic
approaches, humans have been classified as belonging to a range of different
groups of organisms. Figure 10.25 shows Simplified phylogenies for eukaryotes
and metazoans, and Figure 10.26 shows a vertebrate phylogeny.
(A)
I
, BACTERIA
_ 4000 ARCHAEA
MYA
EUKAR'I'OtES
L 3800
MYA
r----------------
taxa, etc. examples
L
Giardia lamblia
parabasilids Trichomonas
' - - - - - - - - - Apicomplexa
Paramecium tetraurella
Tetrahymena thermophlla
Plasmo dium fa/clparum
l Dictyostellda
microsporldlans
Diclyoste/ium discoideum
MYA
Saccharomycoti na Saccharomyces cerevlsiae
Candida alb/cans
Sordarlomycetes Neurospora crossa
r-------- choanoflagellates
' - - - - - - - - - METAZOANS (see panel B)
Figure 10.25 Simplified
eukaryotic and
metazoan phylogenies.
(A) Life is divided into
three doma ins-bacteria,
archaea, and eukarya
(B) tB1>B, etc. examples and the eukaryotic
r------------------- poriferan s sponges phylogeny is shown here.
The choanoflagell ates
DIPLOBLASTS (also called collar
r---------------- cnidarians Jellyfish . corals, flagellates) are considered
sea anemones to be the cl osest livi ng
protist relatives of the
sponges, the most
BILATERA = TRIPLOBLASTS
primitive metazoans. (B) In
the metazoan phylogeny,
PRQTOSTOMES
diploblasts have two germ
~
f======== platyhelminths
flatwonns layers whereas triploblas ts
~ molluscs octopus
annelids earthworms have three germ layers
- - -- - - - - arthropods 0, melanogasler and are bilaterally
Anopheles gamolae symmetrical. Note th at the
- - - - - - - - nematodes C. elegans
C. briggsae fundamental protostome
1000 deuterostome split, which
MYA DEUTEROSTOMES occurred about 1000
hemlchordates acorn worms million years ago (MYA),
echinoderms sea urchins, starfish, means that C e/egans
sea cucumbers and O. melanogaster are
more distantly related
CHORDATES
from humans than are
urochordates some other invertebrates
- - - tunicates eiona intestinafls
cephalochordates (ascldlans) such as the sea urchin
cephalochordates amphio)(U s Strongylocentrotus
~c;:reniates r-- Hyperotreti hagfishes purpurarus, and notably
l - . VERTEB RATES (see Figure 10.26) other chordates including
amphioxus and the
tunicate Ciona intestinalis.
OUR PLACE INTHETREE OF LIFE 331
medaka
estimated d ivergence times in millions of
years. OWM, Old Worl d m o nkeys; NWM, New
Worl d m onkeys.
I____
L
.
I 3&0
reptiles and birds chick
LI 310
MONOTREMES (PROTOTHER1A)
platypus, echidna
LI
kangaroo
Lr" 1",____
951.
75
L "~
L - pigs
horses
r--- dogs
cats
rabbits
r- rats
L mice
'" r NWM, e.g. marmoset
key:
11 (rIlIlOft
catarrhine primate gorll..,
hominoid Cll;mprirlltf" lXlll(1tx:)
• hominid hunlAn II
• homin ln
TABLE 10.7THE INCONSISTENT RELATIONSHIP BETWEEN GENE NUMBER AND ORGANISMAL COMPLEXITY
Species Description Genome size (Mb) Number of protein-coding gen es
"This refers to the macro nuclear (somatic) geno me. Ci liated protozoans are unusual because th ey o ften have two nuctei- a large macronucteus w ith a
genome that ha s a somatic, non-reproductive functio n, and a sma ller micronucleus th at is required for reproduction.
332 Chapter 10: Model Organisms, Comparative Genomics, and Evolution
TABLE 10.8 HIGHLY VARIABLE COPY NUMBER FOR CHEMOSENSORY RECEPTOR GENE FAMILIES IN DIFFERENT
VERTEBRATES
OR 388 399 1063 1259 822 1152 1198 348 300 1024 155
TAAR 6 3 15 17 2 17 22 4 3 6 109
OR. olfactory receptor; VIR. vomeronasal type 1 receptor; V2R, vomeronasal type 2 receptor; TAAR, trace amine-assoCiated receptor, Data abstracted
from Nei et al. (2008) Nat. Rev. Genet. 9, 95 1- 963.
OUR PLACE IN THE TREE OF LIFE 333
is less clear why Kruppel-associated box (KRAB}-zinc finger gene families have
expanded as they have in vario us mammalian lineages.
Although gene family expansions can therefore be an adaptive change. there
also seems to be a significant random element in the way in which gene family
copy number can vary between species. a process that has been called genomic
drift·
Regulatory DNA sequences and other noncoding DNA have
significantly expanded in complex metazoans
If increased gene number does not accowll for organism complexity, what does?
Alternative splicing has been put forward as a possible explanation to the para
dox. It is comparatively rare in simple metazoans such as C. elegan.s, but the great
majority of human genes undergo alternative splicing and there are Significant
human-chimpanzee differences in splicing of orthologous exons.
Another characteristic that clearly distinguishes between the genomes of
complex and simple metazoans and may well underlie organism complexity is
the proportion of noncoding DNA. Simple and complex metazoans can have
much the same amount of coding DNA (e.g. 25 Mb in C. elegans and 32 Mb in
humans) but the amount of noncoding DNA increases hugely in complex meta
zoans (75 Mb in C. elegans. but 3070 Mb in humans).
The increase in noncoding DNA includes a significant expansion in the
amount of un translated regions of protein-coding genes. which are known to
contain a wide variety of regulatory DNA sequences. The amount of highly con
served (and presumably functionally important) noncoding DNA has also
expanded very significantly during metazoan evolution. The most simple meta
zoans have very little highly conserved noncoding DNA. Thus. when the C. ele
gans genome is compared with that of another nematode. C. briggsae. almost all
highly conserved sequence elements are found to overlap coding exons. In
D. melanogaster. however. the amount of conserved noncoding DNA is about the
same as the amount of cod ing DNA. and in humans the amount of highly con
served noncoding DNA is close to four times that of coding DNA.
The great majority of the conserved functionally important noncoding DNA
sequences in the human genome are believed to be involved in gene regulation.
They include cis-acting regulatory DNA sequences (such as enhancers and splic
ing regulators). and sequences transcribed to give regulatory RNA. As detailed in
Section 9.3, recent analyses have uncovered a huge number of functional non
coding RNAs in complex genomes whose existence was not even suspected a few
years ago.
Gene regulation systems in vertebrate cells are much more complex than in
the cells of simple metazoans. As described below, there is considerable evidence
that ever increasing sophistication in gene regulation and lineage-specific inno
vations lie at the heart of organism complexity.
f lgur. 10.27 Humans have a gene that makes flies grow wings. (A)
(A) Drosophila with a mutation in the apterous ge ne lack win gs. The normal
phenotype ca n be rescued aher inserting the wild-type fly gene (B) or its human
onholog, LHX2 (C), into the mutant fly embryo. [Adapted from Rinc6n-Umas
DE, lu CH,Canall et al. (1999) Proc. NotlAcad. Sci. USA 96, 2165- 2170. With
permission from the National Academy of Sciences.]
closely related species. The cumulative data led Sean Carroll to propose a new
genetic theory of morphological evolution in 2008.
Key developmental regulators typically exhibit mosaic pleiotl'Opism: they
serve multiple roles (and so are pleiotropic), but they also function independ
ently in different cell types, germ layers, body parts, and developmental stages.
Because the same protein can shape the development of many different body
parts, the coding sequences of such regulators are under very considerable evo
lutionary constraint, and so are extremely highly conserved. At the protein level,
their funct.ions can be extraordinarily conserved; sometimes, their functions can
be mai ntai ned after they have been artificially replaced by very distantly related
orthologs, and sometimes even paralogs.
As shown in FIgure 10.27, a human ortholog of the Dl'Osophila apterous gene
can regulate the fo rmation of fly wings. The implication is that it is no t just the
gene regulators that are highly conserved, but also the proteins that they interact (C)
with to perform their functjons, and that in some cases the conservation has
been maintained over as much as the billion years of evolution that separate
humans and flies (Figure 10.25B) . Although there was a rapid expansion in the
number of fundam entally different genes encoding transcription factors, signal
ing molecules, and recep tors at the beginning of metazoan evolution, there has
been comparatively little expansion since the period just before the origin ofbila
terians (see Figure 10.25B).l'or example, 11 of the 12 vertebrate Wntgene fami
lies, which encode proteins that regulate cell-cell interactions in embryogenesis,
have counterparts in cnidarians such as jellyfish and, remarkably, six of the major
bilaterian signaling pathways are represented in sponges that diverged from
other metazoans at the very beginning of metazoan evolution (see Figure
10.25B).
The Hox genes, which specify the anterior-posterior axis and segment iden
tity in embryogenesis, have also undergone very few changes over many hun
dreds of millions of years. Coelacanths, cartilaginous fi sh that are very distantly
related to us, have much the same classical Hox genes as we do. More accurately,
they have four more Hox genes than us because of a loss of Hox genes in verte
brate lineages leading to mammals (Figure 10.28).
Thus, although many other protein famili es were diversifying over the last
several hundred million years, there has been comparatively little diversification
of key developmental regulators, possibly because of constraints imposed by
gene dosage. Instead, expansion of gene function without gene duplication has
often been achieved by using the same gene to shape several entirely different
traits. By approp riately regulating gene expression, the same gene product can be
sent to different but specific tissues (heterotopy) and at different developmental fJgu.-. 10.28 Gene duplication has not
stages (heterochrony) where, according to its environment, it can interact with occurred over hundreds of millions
of years of Hox cluster evolution. The
14 1 coelaca nth is a cartilaginous fish that
A~ descended from the earliest diverging
e • • ••••••••• lineage o f jawed vertebrates (see Figure
c-o ···QI ~~
o ••••• • •• , 0.26). Coelacanths, humans, and frogs have
14 "(- -- - - 1 coelacanth
remarkably similar Hox gene organizations,
A OD-Jo=-C---CoQOOOOO
e -(~-0DQi.;.c;ooooo. -410
and there does not seem to have been any
13 1
C ----or~ - - MYA A ~ gene duplication in the 410 million years or
D -OOO-~
common ancestor A14
810
~ I 8
C
o • ••••
•••••••••
•••• m::a •• a
• ••
so since they diverged from a last common
ancestor. Rather, there have been instances
C1 -370 813 western clawed frog of occasional Hox gene loss during this time,
C3 MYA D12 shown by curved arrows. For simpliCity. only
13 1
I A ---e--o • • • • • • • • 11
B
C~
. • • • • • ••••
the classical Hox genes are sh ow n within the
Hox cluster. MYA, million years ago. [Adapted
D •••••• • •• from Hoegg S & Meyer A (200S) Trends Genet.
human 21,42 1- 424. With permi ssion from Elsevier.]
OUR PLACE IN THETREE OF LIFE 335
~- JRE~ =t~:-~:-~~-:-t=~12:~t-~--~--=tRE'P
regions In pleiotropic developmental
regulators. (Al In this generalized scheme, a
gene with four exOns (El to E4) is controlled
by five ds-regulatory elements (CRE1 to
16) Paxfi CR ES). Each regulatory element consists of
deuto--j a variety o f modules that contain binding
muShroom Iritocerebral CNS sites for trans-acting proteins. Clustered
bodies neurons neurons
binding sites have extended sequence
conservation. w hich makes them easier to
muShroom eye
brain detect by comparative genomics. (B) The
bodies and
neurons Drosophila Pax6 gene. also called eyeless (ey).
has three exons (black bars) and encodes
Rhodopsin
a transcription factor that is expressed
.ftm L-J
lkb
in specific parts of the developing brain.
central nervous system (CNS), and eyes.
Expression is regulated b y six CREs (colored
blocks), most being 1 kb in length or longer.
The rhodopsin gene at the bottom is one
different sets of molecules. For example, the Drosophila Decapentaplegic signal of the target genes of Pax6.lt has a single
ing protein shapes embryonic dorsoventral axis polariry, epidermal patterning, fun ction that involves expression in the
gut morphogenesis, and the patterning of wings, legs, and other appendages. To photoreceptor cells of the eye, and it has a
achieve this type of highly specific regulation, pleiotropic gene regulators typi single CR E. as is commonly found in genes
cally have several large, modular CREs that independently regulate a specific pat that are restricted in expression. [Adapted
from Carroll SB (2008) Cell 134, 25-36. With
tern of expression ("'gure 10_291.
permi ssion from El sevier.]
Independent regulation of a developmental regulator by different CREs allows
multifunctionality, and harmful mutations in one CRE will not affect the func
tions of other CREs that regulate the expression of the same protein or the pro
tein itself. Each CRE consists of a series of modules with short binding sites (often
about 6- 12 nucleotides long) for trans-acting regulators, notably transcription
factors. In some regulatory elements, individual binding sequences recognized
by different transcription factors may be very tightly packed; in others, such as
diffuse enhancers, they may be much more spread out along the genomic DNA.
_I. •
(A)
r> pA
I
FigUfl! 10.]1 Transposable elements
can influence gene expression in several
ways. (A) When inserted upstream of a
gene, a transposable element (red box) may
provide new regulatory sequences, such
as an enhancer sequence (green triangle)
(6 )
I pA that alters the expression of a downstream
--~,-.,.~~~, ~~I-- gene. (B) A transposable elemen t inserted
withi n an intron may contain promoter
( • • • ~~ - - - - - - - _ • • ~~~ ~ - • • ~ I
elements (yellow oval) that drive antisense
transcription .The antisense transcript
(C) may interfere with sense transcription and
s
I D A
S pA
potentially regulate tran scription.
(e) An inserted transposa ble element may
carry cryptic splice donor (SD) and splice
acceptor (SA) sequences that allow it to
be incorporated as an alternative exon
(exonization). For various other possibilities,
e
see Feschotte (2008) Nat. Rev.
Genet. 9,397-405. See Figure 10.32 for
specific examples of (A) and (e).
pA, polyadenylation site.
factors, for example, or in the loss or modification of existing binding sites, But
new CREs may evolve by random mutatio n and also as a res ult of sequence inser
tion events involving transposable elements.
Transposable elements have been viewed as genome parasites that owe their
survival to their ability to replicate faster than the host that carries them; RNA
interference mechanisms are needed to limit their spread within a genome.
Transposable element repeats account for aboUl 45% of the DNA in the human
genome, although only a small percentage is active in transposition (see Chapter
9). By integrating into new locations in the DNA, they can influence the expres
sion of nearby genes by a variety of possible mechanisms (FIgure 10.31).
By disrupting the normal patterns of gene expression, transposable elements
are known to cause disease. [n addition to being a potential th reat at the level of
the genome and the individual, however, transposable elements can also be ben
eficial. Comparative genomic studies have shown that at least 10,000 human
transposable element fragments have evolved under strong purifying selection,
indicating that they are functionally important. Many of them are located close
to genes encoding developmental regulators, and many have been strongly con
served over hundreds of millions of years.
The functional importance of transposable elements arises from their occa
sional ability to donate sequences that are then used by the host genome for a
novel function, a process known as exaptallon. Thus, the sequences of a small
number of genes, sometimes known as neogenes, are thought to have originated
in large pan fro m transposons. They include TERT (telomerase reverse tran
scriptase), CENPB (encoding a centromere protein), and the recombination acti
vating genes RAGl and RAG2.
More commonly, transposable element sequences have contributed smaller
gene components. One way involves donating sequence fragments that have
appro priate splice sites so they can be included as novel exons (exo nization
Figure 10.31C). Such exons have contributed to both functional noncoding RNA
(such as the BC200 RNA-see Table 9.9) and various proteins. In addition, trans
posable element seque nces that contain binding sites for trans-acting factors
have been exapted to provide novel CREs.
As an example, the LF-SINE family of retrotransposons are known to have
contributed sequences to generate both a novel exon and a novel enhancer. The
starting point for this discovery was uc.338, a 223 bp ultraconserved element
located at 12q13 whose sequence is 100% conserved in human, m ouse, and rat.
The uc.338 sequence lies within the PCBP2 gene, which encodes an RNA-binding
regulatory protein, and spans an alternatively spliced exon that encodes part of
the PCBP2 protein . Comparative genomics also showed that the sequence of
uc.338 showed a high degree of sequence identity to members of the LF-SINE
retrotransposon family in the coelacanth, a fish that has d escended from one of
the most evolutionary ancient lineages of jawed fishes (FIgure 10.3Z).
OUR PLACE )N THE TREE OF LIFE 337
(A)
1 481 Figure 10.32 Extensive sequence
coelacanth If.-SINE A B ~ homology indicates a transposon origin
111;.3:)13 112013J for an alternatively spliced PCBP2 exon
human ~-
.~ -- . and an ISL 1 enhancer. (A) The COnsensus
PCBP2i12Q13j "--~--------------~. . . .----------- 481 bp coelacanth LF -S INE retrotransposo n
~Ll~nfi8r~~r(5Qll) c-______________________________ ~
II .
do"
[
ISL 7 gene. (6) In the top panel, vertical bars
IF-SINE - mou~
, ]CJ'I
[
hU~an Gi
- 1".. 1 , !.t ~J~ ;.'_ --iS'.,.
,.I ." li-. ;(. lrJ.
denote r~lons of sequ ence homology of
the coelacanth LF-SINE con sen su s with
1SLl. chrmp Gi
en""ncel mouse
dog °1'
G -
'. ) '"' .. .'-" ."'"11':1 . I
. A ClIL.......I[ I..»"~~
GLUD2 glutamate metabol ism enzyme hominoid-specific retrogene; evolved by retrotransposition from the housekeeping
gene GLUD7; believed to have undergone positive selection resulting in specific
targeting to mitoc hon dria with novel, pote ntially brain-specific functions
CDCI48retro cell cycle protein hominoid-specific retrogene that evolved by copying from a microtubule-expressed
isoform of CDC14; believed to have undergone positive selection and now
preferentially associates With endoplasmic reticulum
TCP10L likely spermatogenesis factor also expressed in hep'atocytes, where it may be Involved in maintaining the
differentiation sta te
snaR gene family - , 17-nucleotide noncoding RNAs hominid-specifi c and rapidly evolving; predominantly expressed in testis
that bind nuclear factor 90
Various miRNA genes various microRNAs generated by primate-specific expansions in miRNA gene families; they notably
include brain-expressed miRNAs, some of which have been claimed to be human
specific
Various primate-specific genes have been identified, and during the past 60
million years or so of primate evolution there seems to have been a burst of ret
rogene formation-some estimates suggest that at least one new retrogene has
formed on average every one million years. Other primate-specific genes have
originated through the duplication of genomiC DNA sequences, such as by seg
mental duplication. Some of the primate-specific genes are known to have
evolved recently and may be present in hominoids but not monkeys, and some
others are restricted to hominids (TubIe 10.9). Truly human-specific genes seem
to be extremely rare; however, families of genes encoding brain-expressed
miRNAs have undergone Significant expansion in primate lineages. and some of
the miRNA genes have been reported to be human-specific.
Lineage-specific differential expression of a single gene also plays a part.
Inappropriate ectopiC expression of a gene may have important consequences,
as strikingly revealed with the FGF4 gene in dogs (Agure 10.33). Human
chimpanzee gene differences resulting from lineage-specific gene inactivation
are clearly evident. Since the divergence of humans and chimpanzees from a
(A)
TABLE 10.10 EXAMPLES OF EVOLUTIONARY INACTIVATION OR MODIFICATION OF GENES IN THE HUMAN LINEAGE AFTER
DIVERGENCE FROM CHIMPA;.;.;;.E S _ _ _ _ _ _ _ _ _ _~~_~_ _ _ _ _ _ _ _ _ _ _ __'I
NZ ;.;E;;._
CMAH CMP-N-acetyl frameshifting deletion unlike other primates (and mammalsl. human s do not make
neuraminic acid N-glycolylneuraminic acid (Neu5Gc), which is one of the two
hydrolyase most common sialic acids in mammals and Is known to be
expressed in a developmentally regulated and tissue-specific
fashion
MYHI6 type of myosin heavy inactivation loss of this myosin protein in humans has been suggested to
chaIn correlate with the considerably sma ller masticatory muscles
used to chew food, making larg er brain sizes possible; see text
and Stedman et al. (2004) Nature 428,373- 374
SIGlECl1 type of sialic acid 5' end and first five exans repl aced by new and strong expression in hum an microglia
sequence copied from a pseudogene
common ancestor about 5-7 million years ago. about 80 genes have been
inactivated on the human lineage and now appear as nonprocessed pseudo
genes. and lineage-specific gene modification is also apparent in some cases
(Table 10. 10 shows some examples).
Differential inactivatio n of MYH16. a gene expressed in the jaw muscles. has
been considered to be one of the most biologically meaningful differences
between humans and chimpanzees. Humans cannot make a MYH16 protein
because MYH 16was inactivated by mutation in the human lineage about 2.4 mil
lion years ago. just before the modern hominid cranium evolved. Inactivation of
MYH16in humans may have led to a decrease in the size of the jaw muscle, allow
ing the necessary room for the hominid cranium to be remodeled to accommo
date a larger brain.
TABLE 10.11 EXAMPLES OF DNA SEQUENCES THAT HAVE BEEN POSTULATED TO BE SUBJECTTO DIFFERENTIAL POSITIVE
SELECTION IN THE HUMAN LINEAGE
DNA Type of Properties References
sequence sequence
FOXP2 protein-coding encodes a transcription factor that is important in speech; see the Enard et at. (2002) Nature 418, 869- 872;
gene text Enard et al. (2009) CeJl137, 961-971
MCPH protein-coding encodes microcephalin, a protein that regu lates brain size Evans et al. (2005) Science 309,
gene 17 17- 1720
HARIA (HAR1F) RNA gene expressed specifically in Cajal- Retzius neurons in the developing Pollard et al. (2006) Nature 443, 167-172
neocortex; co-expressed with it is reelin, a protein that is
fundamentally important in specifying the six-layer structure of the
brain
HACNSI enhancer important in early development; in compari son with orthologous Prabhaka r et al. (2008) Science 321,
sequences in chimpanzee and rhesus macaque it has gained a 1346-1350
strong expression domain in the developing limb that has been
postulated to result in altered human limb anatomy, such as
specia lization of the hand to facilitate using tools or the foot to allow
walking on two legs
In several cases of putative positive selection in the human lineage, the great majority of nucleotides altered in the human lineage arose by
replacement of an A orT with a G or C. This AT-tGC bias has prompted the suggestion that the nucleotid e changes sim ply renect biased gene
conversion rather than positive selection; for a recent discussion, see Hurst lD (2009) Nature 457,543-544.
We show in Table 10.11 some examples of DNA sequences that have been
believed to be positively seJected in tbe human lineage, and we outline different
approaches to identify them below. To identify genes that confer human charac
teristics, one major approach starts with phenotyp es that relate to our special
cognitive abilities and brain capacity. Studies of human microcephaly, a neu
rodevelopmental disorder characterized by decreased brain growth, have identi
fied genes that regulate brain growtb such as MCPHI for which there is evidence
for positive selection in tbe human lineage. However, tbe ongoing adaptive evo
lution of such genes is not explained by increased intelligence. Various language
and reading disorders with a significant genetic component have also been stud
ied, including developmental dyslexia (reading disability) , speech language dis
order (SLD), and also specific language impairment (SLl), where the heritability
is particularly high.
The SLD gene FOXP2 has been extensively studied. lt encodes a transcription
factor tbat has been very weU conserved during evolution-out of a total of 715
amino acids, only 3 are changed in the mouse. Of these, two amino acids, N303
and S325, seem to be human·specific when compared with various nonhuman
primates and are thought to have been positively selected during recent human
evolution (Figure 10.34).
Evidence that tbe apparently human-specific N303 and S325 residues are
important in brain function comes from the Foxp~ umJhu.m mouse, which is
homozygous for a humallizedversion oftbe Foxp2 gene (genetically engineered
to express tbe human-specific N303 and S325). Foxp2humlh",,, mice have altered
vo calization and also changes in certain neurons that are expected to enhance
the efficiency of brain mechanisms that in humans are known to regulate motor
control, including speech, word recognition, and several other cognitive
properties.
303 325
modern human LTTNNSSSTT 5 SKASPP ITHHS IVN GQ 55V .RRDS SSHEETGASH TLYGHG VCK\i
r'"
DN GIKHGGLD
Neanderthal DNGIKHGGLD LTTNNSSSTT IT HHSIVNGQ 55V RR DS SSHEETGASH TL YGHGVC KW
chImp DNGI KHGGLD LTTNNSSSTT S S RASPP ITHHSIV NGQ 55V RR D5 SSHEETG ASH TL YGHGVC KH
gorilla ONGI KHGGLD LTTN NSSSTT S S RA SPP ITHHSIVNGQ 55 V I RRDS SSH ~ET G A SH TLYGHGVCKW
orangutan ONGIKHGGLD LTT NN SSSTT SKASPP ITHHSIVNGQ 55V RRDS SSHEETGASH TLYGHGVCK W
rhesus
marmoset
DNGI KHGGLD LTTNNSSSTT
DNGI KHGGLD LTT NNSSSTT
S
S S RA SP!? ITHHSIVNGQ 55V "
RRD 5 SSHEETGASH TLYG HGVC KW
S SJ( A. SPP ITHHSIVNGQ 55 V RRDS SSHEETGASH TLYG HGVCKW
mouse DNGII(HGGLD LTTNNSS STT 5 SKASPP IT HHSIVNGQ SSV RR DS SSHEETGASH TLYGHGVCKW
Figur.l0.34 Human-specific amino acids in the highly conserved FOXP2 protein. The FOXP2 protein ha s been extremely highly conserved during
evolution, showing 99.6% identity to mouse Foxp2. Alignment with nonhuman primate orthologs suggests that amino acids N303 and 5325 result
from non-synonymous substitutions in the human lineage after human-chimp divergence but before the lineages leadi ng to modern humans and
Neanderthals diverged about 600,000 yea rs ago.
CONCLUSION 341
CONCLUSION
Model organisms have long been useful in research, to understand molecular
and physiological processes and to study disease and evaluate potential thera
pies for human diseases. Now that we have the genome sequences for many of
these model organisms as well as humans, the detection of differences and simi
larities between genomes is enabling us to learn more about the processes that
shape a genome and allow it to evolve.
Compara tive genomics has allowed the construction of molecular phyloge
netic trees of extant species, deepening our understanding of our position in the
tree of life, and revealing information about how genomes, including our own,
have evolved.
As well as helping to identify and validate predicted protein-coding genes,
comparative genomics has also been helpful in identifying functional noncoding
DNA. The latter is thoughtto account for about 80% of the highly conserved DNA
in the human genome, being made up of RNA genes and short regulatory ele
ments that control how genes are expressed.
Functionally important sequences tend to be strongly conserved and are sub
ject to purifying (negative) selection, with harmful mutations being selectively
eliminated. Those mutations that are beneficial tend to be positively selected.
These mutations underlie adaptive evolution.
Shuffling and duplica tion of the exons within protein-coding genes increases
gene complexity. Transposable elements can also be copied and used to make
coding sequences or to introduce new functionally important sequences, such as
novel exons and novel regulatory elements.
Evolutionary novelty can also arise tlu-ough the duplication of genes or even
whole genomes. Once a gene has been duplicated, the two copies can diverge
342 Chapter 10: Model Organ isms, Comparat ive Genomic s, and Evolution
FURTHER READING
Model organisms Kaletta T & Hengartner MO (2006) Finding function in novel
Butte AJ (2008) The ultimate model organism. Sdence 320, targets: C elegans as a model orga ni sm . Nat. Rev. Drug Discav.
325-327.
5,387-398.
Davi s RH (2004) The age of model organi sms. Nat. Rev. Genet. Segalat L (2007) Invertebrate animal models of di sease as
5,69- 76. screening tools in drug discovery. ACS Chem. BioI. 2, 231-236.
WormBook. http://www.wo rmbook,org [An online review of
Fields 5 & John ston M (2005) Whither model orga nism research?
Sdence 307, 1885-1 886. C. elegon5 biology.1
Spradling A, Ganetsky B, Hieter P et al. (2006) New roles for
model genetic organisms in understandi ng and treating Vertebrate models (rodents and zebrafish are
human disease: report from the 2006 Genetics Society of referenced in Chapter 20)
America Meeting. Genetics 172,2025-2032.
Beck CW & Slack JMW (2001) An am phi b',an with ambition:
Unicellular model organisms and synthetic biology a new role for Xenopus in the 21 st ce ntury. Genome BioI.
2, reviews1029. 1-1029.5 (doi:10.1186/gb-2001-2-10
Gibson DG, Benders GA, Andrews-Pfamkoch C et al. (2008)
reviews1029).
Complete chemical synthesis, assembly, and cloning of a Capitanio JP & Emborg ME (200S) Contributi ons of non -human
Mycoplasma genitalium genome. Science 319,1215-1220. primates to neurosc ien ce research. Lancet 37', 11 26-1' 35.
Ke ller LC, Romijn EP & Zamora I (2005) Proteomic analysis of Carlsson HE, Sc hapiro SJ, Farah I & Hau J (2004) Use of primates in
isolated Chlamydomonas centrioles reveals orthologs of research: a global view. Am.1. Prima tal. 63, 225- 237.
KEY CONCEPTS
• The characteristic forms and behaviors of different cell types result from different patterns
of expression ofthe same set of genes. Expression is regulated at both the transcriptional
and translational levels.
• RNA polymerase II transcrib es all protein-coding genes, and many that encode noncoding
RNAs. A large multiprotein complex must be assembled at the gene promoter before
transcription can be initiated.
• A fixed set of general transcription factors forms the basal transcription apparatus, with a
large cast of gene-specific or tissue-specific transcription factors, co -activators, and
co-repressors modulating promoter activity.
• Promoter activity is affected by sequence-specific DNA-binding proteins that bind either
directly to the promoter or at more distant positions (enhancers, repressors, and locus
control regions). DNA looping brings distant sites close to the promoter. Co-activators and
co-repressors do not bind DNA but are recruited to promoters by protein-protein
interactions.
• Large-scale studies of human transcripts have shown that the majority of human genomic
DNA is transcribed. The function (if any) of most of the transcrip ts is unknown. The
implications of this are currently not clear, but a substantial revision ofthe traditional view
of discrete genes sparsely scattered along genomic DNA ma y be required.
• The local chromatin conformation may be more important than the local DNA sequence in
defining promoters. It is governed by an interplay between histone-modifying enzymes,
ATP-driven chromatin remodeling complexes, and DNA methyltransferases that methylate
cytosine in CpG dinucleotides. RNA molecules may also have a role.
• Histones in nucleosomes can undergo many different covalent modifications. The pattern
ofthese constitutes an additional layer of heritable information (a histone code) that has a
major, but not exclusive, role in determining chromatin conformation.
• Heritable effects on gene expression that are not caused by a change in the DNA sequence
are called epigenetic changes. The heritability is due to heritable patterns of methylation of
CpG sequences.
• For a few dozen human genes, expression depends on parental origin because of
imprinting. This is an epigenetic process whereby alleles at a locus carry an imprint of their
parental origin that governs the pattern of expression.
• Most protein-coding genes encode multiple polypeptides because of the use of alternative
promoters and/or alternative splicing. In many cases, the resulting protein isoforms have
different tissue specificities, biological properties, and functions.
• Gene expression is also regulated by controlling whether or not an mRNA will be translated.
This often depends on proteins and microRNAs (miRNAs) that bind to sequences in the 3'
untranslated region of the mRNA.
• The role ofthe myriad different miRNAs is still uncertain, but it is thought that they fine
tune the expression of many genes. Major changes in gene expression are more often due to
controls that affect transcription.
346 Chapter 11: Human Gene Expression
All the cells of our body, apart from gametes, are derived by repeated mitosis from
the original fertilized egg. All those cells, with a few minor exceptions, therefore
contain exactly the same set of genes. The differences in anatomy, physiology,
and behavior between cells, described in Chapters 4 and 5, are the result of differ
ingpatterns of gene expression. During normal development, cells follow branch
ing traj ectories in which successively more specialized patterns of gene expres
sion define the transitions from totipotency through pluripotency and onward to
terminal differentiation. Most of these developmental decisions are irreversible
in the context of normal human physiology, although there is great interest in
reversing them by artificial manipulation, to produce induced pluripotent cells
(see Chapter 21) or to promote the regeneration of damaged tissues or organs. In
addition to these fixed patterns, cells have a repertoire of short-term flexible
changes in gene expression that enable them to respond to their fluctuating
environments.
Controls on gene expression operate at several levels. On the largest scale,
megabase regions of chromatin around centromeres and elsewhere form highly
condensed and genetically inert heterochromatin that can be seen by Giemsa
(G-band) staining of metaphase chromosomes. The dark G bands do contain
some expressed sequences, but most expressed genes are located in regions that
stain pale in a standard G-banded preparation (see Figure 2.14). The location of a
chromosome within the interphase cell nucleus can also have a profound effect
on its level of expression.
This chapter explores how gene expression is controlled at the single gene
level, from the selection of which genes to transcribe, through processing of the
transcript, to translat;on to produce the initial protein product.
As described in Chapter 8, techniques for studying gene expression have
evolved rapidly. Early studies used RT-PCR to document the expression or other
wise of single genes of interest in different tissues or at different stages of devel
opment. Gene expression was quantitated by real-time RT-PCR. As with other
aspects of genetics, the emphasis moved on to genomewide studies, using micro
arrays for expression profiling. Typically, the experiments compare expression
profiles in two or more tissues or cell states. Now the focus is moving on to direct
sequencing of cDNA on a massive scale. In some ways, this is just a refinement of
the large-scale generation of expressed sequence tags (ESTs) by partial sequenc
ing of cDNA libraries. However, the new massively parallel sequencing machines
allow an unprecedented depth of sequenci ng, and hence the unbiased quantita
tion of even very rare transcripts and detection ofrare splice isoforms.
Efforts to understand the factors controlling human gene expression have
been focused by the ENCODE (Encyclopedia of DNA Elements) project. This is a
systematic attempt to identjfy all functional sequence elements in human DNA
(see http://genome.ucsc.edu /ENCODE).Apilotphase,launched in 2003, focused
on 44 genomic regions, totaling 30 Mb or 1% ofthe human genome. In 2007, the
project moved into the production phase, studying the whole genome. At the
same time, parallel efforts (ModEN CODE; http://www.modencode.org) targeted
two important model organisms, the fruit fly Drosophila melanogasrer and the
nematode Worm Caenorhabditis elegans. Data from the ENCODE project inform
much of the discussion in this chapter.
There is a growing awareness that the way in which our genome works is
much more complex than it once appeared. Early understanding naturally
focused on the sparsely scattered protein-coding sequences and was guided by
the idea that each gene encoded a single polypeptide. It now turns out that most
protein-coding genes encode multiple polypeptides. Moreover, although less
than 1.5% of our DNA codes for protein, much of the rest is transcribed, and cells
are awash with newly recognized RNA molecules, of whose function (if any) we
still have only a very limited understanding. The genome projects showed that
we humans have scarcely more protein-coding genes than the I mm long
Caenorhabditis elegans worm. It seems inevitable that our greater complexity
must be the result of more complex organization and reguJabon of a broadly sim
ilar set of genes. Some of the new vision was presented in Box 9.5, which dis
cussed reviSing the concept of the gene. We are still at an early stage of unraveling
the mechanisms that control human gene expression, but this chapter outlines
current concepts and sets the scene for emerging new data.
PROMOTERS ANO THE PRIMARY TRANSCRIPT 347
- 41 to - 35
/
- 34lo - 28
\
-2 y: +28 to +32
i
is either necessary or sufficient for p romoter
activity, and many active polymera se II
promoters lack all of them. Py, pyrimidine
(C arT) nucleotide; N. any nucleotide.
BRE TATA box In' OPE (Adapted from Sm ale ST & Kad onaga JT
~g~CGCC TATA*AA~ Py PYAN ~P Y PY AG A.C~ (2003) Annu. Rev. Biochem. 72. 449-479. With
G TTC permission from Annual Reviews.)
348 Chapter 11 : Human Gene Expression
5' RACE (Rapid amplification of eDNA ends) \s a form o f RT-peR t hat The method is a variant of the SAGE (Serial analysis of gene
allows amplification of the whole mRNA seq uence between an internal expression) technique that fi rst introduced the use of type li s
primer and the 5' end of the RNA (FIgure 1). This is a low-throu ghput restriction enzymes and sequencing of concatemerized tags. The short
method but is particu larly useful for identifying unsuspected tags are sufficient to ide ntify the molecule, and the concatemerization
upstream exons o f a gene, the resul t of initiation at a novel promoter. allows huge num bers of tags to be sequenced, permi tting
In the ENCODE project, 5' RACE was systematica lly used to examine unprecedentedly accurate quantification of the relative abundances o f
transcripts from 399 lod in 12 different t issues. RACE products different mRNAs. The purpose of using the type li s enzyme to generate
were identified with the use of oligon ucleotide microarrays. Novel short tags Is to reduce the amount of sequencing needed. With the
fragments were found for 90% of all loci tested. introductio n of ma ssively parallel sequencing tech nologies, this may
CAGE (Cap an alysis of g ene exp ress ion) is a high-throughput no lo nger be an importan t conside ration, and it may be simp ler j ust to
method that sequences j ust the first' 8 nucleotides adjacent to the 5' sequence cap-seleded cDNAs on a massive scale.
cap of all mRNAs in a sample. The sequence allows t he transcription 5' AAAAMA ... A" 3'
•
sta rt site to be identified, and the number of occurrences of that
seq uence is a d ired count of the number of molecules of that mRNA in reverse transcription of RNA using an
the original sample. Th e method depends on three basic steps:
3' MAMAA
" Internal primer; extend 3' end of transcript
cDNAs that co rrespond to t he capped 5' end of mRNAs are
1 using terminal transferase and dATP
' rm •
selected. This is achieved by a chemical reaction t hat biotinylates 5'
the mRNA cap, followed by selection with streptavidin-coa ted 3' anneal adaptor contain ing primer
magnetic beads.
3' MM.~ 5' sequence A and extend
An adaptor containing the recognition site for the type li s enzyme
Mmel is ligated to t he cDNA. Type lis restriction enzymes recognize
a specific DNA sequence but cut the DNA elsewhe re, a defi ned
number of base pairs away. Digestion with Mme l releases an
1
S'<=MAAA
B
3'
, 8-nucleotide fragm ent of the eDNA with a two-ba se overhang (a
3'rIIImT n n - - -. . . . .S. PCR with primers A and B
stlckyend).
The l a-base fr ag ments with their sticky ends are ligated together A
to form long concatemers, which are t hen sequenced. Figu.l'e 1 Principle of 5' RACE_
TFIID
TBP subun it recognizes TATA box
TAF subu nits -11 recognize other DNA sequences near the start point; regulate DNA
bind ing by TBP
TFII F 3 stabilizes RNA polymerase interaction with TBP and TFIIB; helps
attract TFIIE and TFIIH
TFIIH 9 unwinds DNA at the tran scription start point; phosphorylates Ser 5
of the RNA polymerase C-terminal domain; releases RN A polymerase
from t he promoter
TBP, TATA-binding protein; TAF,TBP-associated factors; BRE, TFIIB recog nition elemem.
PROMOTERS AN D THE PRIMARYTRANSCRIPT 349
helix at the tra nscripti on start site and initiates a con form ationa l change in
the polymerase by phos phorylating its protruding (-terminal dom ain. Thi s
l-+
allows the polymerase to commence transcription. Ad ditional DNA- binding
transc riptio n factors and protein-binding co-activators (not shown) control the
level of activi ty of a promoter. TSP, TATA-box binding protein; CTD, C-terminal TAIB
domain of RNA pol II. [From Alberts B, Johnson A, LewisJ et al. (2007) Molecula r
Biology of the Cell, 5th ed . Garland SCiencefTaylor & Franc is LLC.]
Unlike DNA polymerases, RNA polymerases do not need a primer to get started,
but further actions of the TFll facto rs are needed before transcription can start. ~O-OD~
~--1-
One subunit ofTFIlH is a DNA helicase. This uses energy from the hydrolYSiS of
ATP to open up the DNA double helix, giving pol II access to the template strand.
The polymerase then starts RNA synthesis but initially fails to progress away from
the p romoter. Short oligoribonucleotides are p roduced in a process of aborti ve
lAtH
~:J
RNA polymerase II
initiation until a set of conformational changes OCCur that allow the polll to move
into elongation mode. These changes are triggered by TFIlH-dependent phos
phorylation of serine residues in the C-terminal domain of pol II. Some data sug
gest that this transition is the most critical pan of transcription control. Most
promoters, whether active or inactive, are said to be occupied by a pre-initiation
complex, but only those on active genes are able to move into elongation mode.
~ UTP' ATP
The polymerase then moves along the template strand, leaving behind most
of the general transcription fa ctors but picking up new proteins, the elongation
factors such as the Elongin complex. The average speed is about 20 nucleotides CTP, GTP
per second, although there are pauses and spurts. At this speed it would take
more than 24 hours to transcribe the 2.4 Mb dystrophin gene! ~ -'--i..!...L
Termination of transcription
There seem to be no specific sequences that act as stop signals for RNA polymer
ase; th ere are no transcriptional analogs ofthe UGA, UAG, and UAA translational
stop co dons. For many genes, termination occurs at various positions in different RNA
individual transcripts. It turns out that termination of transcription is linked to
cleavage of the transcript. A well-known feature ofmRNAs is the poly(A) addition TRANSCRIPTION
site-AAUAAA or a closely similar sequence. As described in Chapter 1 (see Figure
1.21), the mRNA is cut a few nucleotides downstream of this site. This cleavage
also initiates the events leading to the termination of transcription. The cut is
made by a protein complex that is physically tethered to the RNA polymerase.
The complex includes an exonuclease. Once the cleavage has produced a free
RNA end, the exonuclease m oves along the RNA that is still emerging from the
polymerase, degrading it in a 5' ....3' direction. A sort of race ensues between the
polymerase, continuing along the DNA and elongating the tail of the transcript,
and the exonuclease coming up behind the polymerase and eating up the tran
script (Figure 11.3). The exonuclease is the faster ofthe two, and when it catches
up with the polymerase, transcription terminates.
=================&~~
exonuclease
~- /~\=====
_ AAUAA.I.- - !
e ----
\
polymeras e progression'
=--:=~~
- MUMA- - !
X-
transcription stops when
the exonuc lease meets
the polymerase
just as likely to bind downstream of the start site as upstream (Figure 11.41. ,•
cr
m
Distal DNA sequences where sequence-specific binding of proteins affects
transcription are labeled enhancers or silencers, depending on whether they '".•2
activate or repress transcription. They can lie either side of the promoter. The
chromatin loops ro und to provide direct physical contact with proteins bound to
.."
the promoter (Figure II ,5). Developmental genes, in particular, often have com - 5000 - 3000 -1000 1000 3 0 00 5000
plex sets ofvery distant enhancers. These have usually been discovered by chance, distance to tranSCript iOn sta rt site
when a chromosomal translocation separates a gene from an enhancer. This pro
duces the phenotype associated with loss-of- function mutations of the gene, FlgutellA Transcription factors may
although the coding sequence is intact. Typical examples include the following: bind either upstream or downstream of
the transcription start site. The curves
• Heterozygous loss-of-fun ction mutations in the PAX6 gene on chromo
show t he distri bution of locations at which
some Il pl5 cause aniridia (a bsence of the iris of the eye; OMIM 106210).
the transcripti on factors MYC, E2Fl , and E2F4
Some patients with aniridia have no mutation in PAX6 but have transloca bound to the promoters of a set of diffe rent
tions with breakpoints up to 125 kb downstream of the last PAX6 exon. actively expressed genes, rel ative to the
Experiments with transgenic mice located a cluster of essential regulato ry transcription sta rt site (vertical dashed line).
elements in this region (Pigure 11 .6A). There are several different tissue-spe The curves are roughly symmetrical about
cific enhancers in the cluster. the transcription start sites, showing that the
binding si te for one of these tran scriptio n
• The role of the SHH(soni c hedgehog) gene in limb development was described factors in a particu lar gene is just as likely to
in Chapter 5. Another gene, LMBRl , is located 1 Mb away from the SHH gene be downstream of the transcription start site
on chrom osome 7q36. Several deletions, insertions, or translocation break as upstream in rhe convention al promoter
points in the LMBRI gene in huma ns or mice cause various limb abnormal i location. [Adapted from the ENCODE
ties (Figure 11.6B). It was natural to ass ume that these were the result of a loss Project Consortium (2007) Nature 447,
of function of LMBRl-but point mutations in LMBRl had no effect on limb 799-8 16. Wi th permission fro m Macmill an
development. Instead, these changes disrupt a series of enhancers of SHH Pu blishers ltd.)
that happen to be I?cated in introns of LMBRl.
When a chromosomal rearrangement abolishes the expression of an appar
ently intact gene by separating it from a distant eobancer, as in the PAX6 cases
described above, this is often but incorrectly described as a position effect.
That term was coined to describe the gene silencing observed in Drosophila
when a chromoso mal rearrange ment relocated a gene to a position close to a
heterochromatic region (see Chapter 16). Position effects reflect the tendency
of heterochromatinization to spread along a chromosome. The only undoubted
position effect known in humans is the occasional silencing of autosomal loci
when an X-autosome translocation relocates them to an inactivated X chromo
some (see below).
Wh en clusters of distal regulatory elements were first discovered that we re
both necessary and sufficient to ensure the correct tiss ue-specifi c expression of a
gene, they were called locus control regions (LCRs) . In experimental systems,
locus control regions can confer strong position-independent expression on
trans genes, being required if the transgene is to function as a fully independent
transcrip tional unit. The best-known human examples control expression ofth e
.~)f:m
th ese distal elements and some of the many
protei ns bound to the promoter, For cla rity,
o nly the RNA polymerase is shown at the
~ promoter.
352 Chapter 11 : Human Gene Expression
·• ,•
• I I I
expression. (A) Translocation breakpoints
(vertical dashed arrows) up to 125 kb
downstream of the 3' end of an intact PAX6
:i~:~~
O ~D, ~~4
X6
gene cause a loss o f PAX6 function in patients
with aniridia. Regulatory elements essential
- I_L"_
E_2_[3':"'
___'1s:-::______---""
,0';'I "' I for the expression of PAX6 must lie distal of
all these breakpoints. DNase hypersensitive
lens enhancer Sites (red boxes) mark stretches of DNA
where nucieosomes are absent or unstable,
(B) SSQ In sertion which are therefore available for interaction
••r\.,~!!~o
LM8Rl
acheiropodia
~
II OJ
RNF32
1 MO
II
SHH
r=J
with DNA-binding proteins. Two of these
sites have been identified as retina-specific
and lens-specific enhancers. The PAX6
regulatory sequences lie within introns of
ELP4 (yellow boxes mark exons of this gene),
an unrelated gene that is transcribed in the
opposite orientation. (B) Function of the SHH
deleUon
(sonic hedgehog) gene in limb development
depends on enhancers located' Mb away,
within introns of the LMBRl gene (blue boxes
two clusters of globin genes. In the a-globin cluster on chromosome 16p and the show exons of this gene). An additional
f)-globin cluster on lip. distant LCRs allow tissue-specific expression and devel gene, RNF32, ties between SHH and these
opmental switching of genes in the cluster (Figure 11.7). The LCRs include regulatory elements. Positions of a deletion,
enhancer and insulator elements-the latter preventing more distal sequences an insert ion, and a translocation breakpoint
from affecting the expression of the genes controlled by the LCR. Deletion of the are shown, all of w hich cause phenotypes
LCR prevents expression of the genes it controls. even though the gene sequences. due to abnormal control of SHH expression.
including the promoters. are intact. Many examples are now known of genes in Ssq, Sasquatch mouse mutant; PPD, human
which the regulatory sequences are more scattered. and so the idea of discrete preaxial polydactyly (OMIM 174500);
acheiropodia, human syndrome of bilateral
locus control regions has become less useful.
absence of hands and feet (OMIM 200500).
[F ro m Kleinjan DA & van Heyningen V (2005)
Co -activators and co-repressors influence promoters withou t Am. J. Hum. Genet. 76, 8-32. With permission
binding to DNA from Eisevier.j
... . • .-,
kO
-
with DNA-binding proteins. These sites are
;, 0, a, e
hypersensitive in erythroid cells, where the
-> 1Jf~1 IJfra 1Jfa:1 -; - ; ...
~ - a-glObin
cluster
globin genes are expressed, but not in cells
of other lineages. Blue boxes show expressed
,
LCR
s genes, purple boxes pseudogenes. The
--
Gy Ay 'I'll P
"
• " •
LCR
• .i J}glooln
cluster box) is uncertain. Arrows mark the direction
of transcription of expressed genes.
CHROMATIN CONFORMATION: DNA METHYLATION AND THE HISTONE CODe 353
(AI
- NH -CH-CO- -NH-CH- CO - -NH -CH-CO- -NH - CH- CO - - NH- CH - CO
1 1 1 1 1
(CH Z)3 (CH,h (CH2h (CH 2)3 (CH')3
1 1 1 1 1
CH, CH, CH, CH, CH,
1 1 1 1 +1
NH, NH NH N N
1
eocH)
1
CH,
/"-CH,
CH,
1
(CH 3))
(8~ ~, TAT
can be modified by acetylation or the
addi tion of one to three methyl groups. The
,,
"
p.. , histone Iysines is shown. (B) The N-terminal ta ilsof
H3N - AR T K Q T AR K S, T G G P Q l A T A A A
H3
2 4 910 14 1718 23 histones H3 and H4 are the sites of many of
the mod ifications that control chromati n
structure. The am ino acid sequence is
H,~-t!GTG GTG G
1
TGG 3 5 8
l
12
AKRH R KV L R D
16 20
til I Q G... ... histone
H4
shown in single-letter code, and potential
modifications are indicated. Ac, acetylation;
Me, methylation; P, phosphorylation.
354 Chapter 11 : Human Gene Expression
H3K4 trimethylated a;
monomethylated b
H3K27 trimethylated
H4R3 methylated
H4KS acetylated
Within each categor y of chromatin there are sub-varieties and variant patterns of modification,
but those shown are the most frequent modifications in each broad category of chromatin.
5ee Box 11 .2 for nomenclature of amino acids in histones. Blank boxes signify that there is no clear
pattem. aAt promoters. bAt enhancers.
CHROMATIN CONFORMATiON: DNA METHYLATION AND THE HISTONE CODE 3SS
!
(antibodies are available against many modified histones). crosslink proteins to DNA, lyse cells.
In cubation at 65°( reverses the protein- DNA crosslin king, and and isolate chromatin
protein is removed by digestion with proteinase. The DNA that
was associated with the protein is then identified. PCA with
gene-specinc primers can be used to check what proteins are
=1, / = =€
(8- i/=
"
bound near a gene of interest. More usually the recovered
DNA is hy bridized t o a microarray to identify genomewide
-il - //=
associations- t he so-called ChiP-chip technique.
ChiP-c hip can give a genomewide overview and has been =1/·8- - 8 8 = //=
widely used, but it has several limitations. It requires severa l
micrograms of DNA to get a good signal, so the resu lts are an
average of the state of m illions of cells. Repeated sequences and 1 fragment chromatin and mix with antibody
specific for just one protein
1/ @==
allelic variation are hard to study because of cross-hybridization,
and the resu lts are only semi -quan titative because of possible
bias in the extensive PCR amplification needed to obtain enough
II
!d
[Ch iP-PET (paired end tags) or STAGE (sequence tag analysis precipitate antibody-bound
of genomic enrichment)]. Increasingly, the w hole fragm ent is protein-DNA complexes
sequenced (ChiP-seq). The new generation of massively parallel
sequencers is making th is an attractive method for obtain ing
unbiased data to any desired level of reso lutio n. ~
Chromosome conformation capture (3C, 4() (Figure 2.
!
reverse crosslinking, degrade
n
o\,/ort•• is used to identify DNA sequences that may be widely protein, and purify DNA fragmenls
separated In the genome sequence but lie physically close
together within the cell nucleus. Living cells or isolated nuclei are
treated with formaldehyde to cro sslink regions of chromatin that
lie physically close to one another. The crosslinked chromatin
is solubilized and digested with a restriction enzyme. Dista nt
interacting sequences will be rep resen ted by crossl inked
chromatin containing a DNA fragment from each o f the
J:
test specific
l
hybridize fragments
l
tag sequence
l
sequence
interacting sequences. As a result of the restriction digestion, target regions to a microarray fragments fragments
ChiP-Chip ChiP-PET or STAGE ChlP·seq
the two fragment s will have compatible sticky ends. DNA ligase
is used to try to ligate together the ends from the two different sta ndard ChiP genomewlde ChiP
reg ions ofONA. The ligated DNA fragm ents are released from
the chromatin as before. They ca n then be identified by any Figure 1 Chromatin Immunoprecipitation.
of several methods.
The original3C method used quantitative PCR amplificatio n of inverse PCR t o amplify interac tor seq uences in the circular ligation
the ligati on product t o test the f requency with which two specified product. These are then identified by hybridization to microarrays
DNA sequences associated. One, primer of eac h pair was specific fo r or, increasingly, by mass sequencing. 3C and 4C are quantitative
each of t he two chosen sequences. 4C extends this idea to produce techniques: the aim is to identify interactions th at are present
an unbiased list of sequences (interactors) w ith which a given bait more freq uently t han the many random events identified by these
sequence associates. Primers from the bait sequence are used in techniques.
356 Chapter 11: Human Gene Expression
contrast with TAF3, DNA methyltransferases can bind only to nucleosomes that BOX 11.3 continued
are unmethylated at H3K4. Thus, H3K4 methylation contributes not only to
recruitment of the transcription machinery but also to the protection of pro
moter regions against undesired silencing. This is particularly important for DNA-bi ndlng
housekeeping genes transcribed from CpG island promoters, the natural targets proteins
for DNA methyltransferases.
A web of interactions affects the final outcome, and no one factor is fully
determinative. The proteins that recognize speCifically modified histones may
themselves have histone- modifyi ng activity, producing positive feedback loops
for example, some chromodomain proteins have histone deacetylase activity.
Thus, histone modifications can be interdependent. Certain modifications fol
low on from one another. Phosphorylation of serine at H3S10 promotes acetyla
tionofthe adjace nt H3K9 and inhibits its methylation. Ubiquitylation ofH2AK1l9
is a prerequisite for the dimethylation and trimethylation of H3K4. The protein treat living cells or cell
complexes often also include components of ATP-dependent chromatin remod
eling complexes and DNA methyltransferases that methylate DNA, both described
l nuclei with form aldehyde
crosslink 10rmed
below. Gene expression depends on a balance of stimulatory and inhibitory \. ...
~%
effects rather than on a simple one-to-one histone code.
ATP-dependent chromatin remodeling complexes
The histone-modifying enzymes described above constitute one set of determi
nants of chromatin function; another set comprises ATP-driven multiprotein
complexes that mOdify the association of DNA and histones. The components of
these chromatin remodeling complexes are strongly conserved from yeast to
humans, and studies in many organisms have demonstrated their invo lvement extract and fragment chromatin;
in numerous developmental switches. They can be grouped into families,
depending on the nature of the ATPase subunit (Table 11_3). Each ATPase associ
l dige st with res triction enzyme
and may be involved in containing the spread of heterochromatin. The relative sequences an
tioned above, reflects the fact that nucleosomes at these locations tend to con sequencing
tain H2A.Z and another variant histone, H3.3. This makes them unstable, thus
allowing DNA-binding proteins better access to these sequences. Another vari F1gutlll Chromosome conformation
ant of H2A, macro-H2A, is characteristic of nucJeosomes on the inactive X chro capture (4C). [Adapted from Alberts
mosome. Incorporation ofthese variant histones depends on the ability of some B, Jo hnson A, lewis J et al. (2007)
chromatin remodeling complexes to loosen the structure of nUcleosomes and Molecular Biology of the Cell, 5th ed.
promote histone exchange. For example, the TIP60 complex (see Table 11.3) has Garland Scie ncerra ylor & Francis LLC.]
a functi on in replacing H2A by H2A.Z.
CHROMATIN CONFORMATION: DNA METHYLATION AND THE HISTONE CODE 3S7
(HRAC 3 SNF2H
AU 2 SNF2H AC Fl (bromodomain)
BAF, 6rg- or Srm-associated factor; NuAF. nucleosome remodeling factor; CHRAC, chromatin
(AI
H
I
O ..... H - N\ .
HC' ~N
~ \5'f'
' c• - C
'\_
C -c I; CH
Flgur. 1 1.10 5~ Methylcytosine.
\ I;
~-c
\N-H ..... N /
(Al S-Me thylcytosine base-pairs with guanine
'\~=~
j1 'C - N
in the same way as unmodified cytosine.
/
sugar f" ' '\. (8) There is a symmetrically methylated CpG
- \ N- H ..... 0
seq uence in the center of this molecul e. The
I sugar methyl groups (red) lie in the major groove of
H
the double helix. ((B) designed by Mark Sherman.
G 5MeC Courtesy of Arthur Rigg s and Craig Cooney.]
r
Bisulfite modification is a method for treatment. The status of spec ific CpG
identifying the methylation status of sequences can be examined by sequencing,
cytosines in geno mic DNA. When single restriction digestion, or pe R. A C-)T change
stranded DNA is treated with sodium bisulfite m ay create or destroy a restrictio n site (in
(Na2S0) or meta bisulfite (Na2S20S) under Figure 1, t he treatment destroys a Toql TCGA
carefully co ntrolled conditio ns, cytosine is site jf the C is un methylated), Methylation
deaminated to uracil but 5-methylcytasine speci fic peR uses primers that are specific
remains unchanged. When the treated DNA for either modified (MSP) o r un modified
is sequenced or PeR-amplified, uracil is read (UMP) cytosines in the primer-binding site.
as thymine. By comparing the sequence Genomewid e methylation profiles can be
before and after treatment with bisulfi te, it obtained by analyzing the bisulfite-treated
is possible to identify which cytosines were DNA o n specially designed o ligonucleotide
methylated. microarrays that contain probes to match
Several methods have been used either each normal sequence or its bisulfite
to identify any changes induced by the modi fied counterpart.
®®
T I
5' D;, gggUgg.gOl t Ug.3gtu a 3' 5' U"999C99g-Ot tC ~l!Sg t.tl a 3'
PCR
5' 'l'a g g~'i99gT t t Tqa'9 t'l' a 3' 5' 'ragggCgggTt. tCgag t.'l'a 3'
analyses
~ ______~A~~________~
! ): J.
~ ~~MLJ
Fi gura 1 Modific:ation of DNA by sodium bisulfite. MSP, primer specific for the methylated
sequence; UMp, primer specific for the un methylated sequence.
These dense clusters ofCpG sequences tend to be protected from DNA meth·
ylation regardless of whether the associated gene is active or inactive, and they
are likely to be under strong selective pressure to retain promoter activity. On
occaSion, however, CpG islands can acquire DNA methylation, which inevitably
results in the shutting down oftranscription, as observed on the inactive X duo·
mosome or at imprinted genes (see below). Aberrant methylation ofCpG islands,
particularly those associated with tumot suppressor genes, is a general charac·
teristic of cancer cells (this is discussed further in Chapter 17). A minority of CpG
islands do show variable methylation that correlates with gene silencing, often as
part of a developmental switch. Figure I J.1 I shows examples of the distributions
of CpG methylation in different tissues and at different loci.
Methyl-CpG-binding proteins
The effect of DNA methylation on gene expression can be observed directly, in
that some transcription factors such as YYj fail to bind to methylated DNA. DNA
methylation can also be read by proteins that contain a methyl·CpG-binding
domain (MBD). These can then recruit other proteins associated with repressive
CHROMATIN CONFORMATION: DNA METHYLATION AND THE HISTONE CODE 359
DNMT3L 6065B8 binds to chromatin with unmethylated H3K4 and DNMT3 A, DNMHB; histone deacetylases
stimulates activity of DNMT3N3B
peNA, proliferating cell nuclear antigen; HP1 , heteroc hro m atin p rotein 1. aFar the distinction between de novo and maintenance m ethylases, see
Sec tion 11.3 on p. 365. bDNMT2 turned o ut to have RNA rather than DNA as its substrate. but its structure is that of a DNA m ethylt rans ferase, and it is
"'pose
A
. ..
-:...............
""'
B
......:!
;;:=-;••
C
.......-........
D
or locus-specific. Ea ch square represents
::~:.:.~~~ ~ .
•••
~
.....
........ ..,~~ ~~ ,
c..• • ~ . .. .....:
:::::::=::::::::::::
:::::::::::~ :---
a epG sequence w ithin the major
histocompatibility complex o n chromosome
U""ef rI ..
........ ...
_......-" ••••
..". ...• .......
.........
~::' ......
............ :::::: .,....
~
:1" • . • It •
~.....................
......................
. . . . . . . . . . .. i . . . . . . . . .
~;;;;;;;;;:;~~~~~-=
indicated. Squares are color-coded to
indicate the extent of methy lation (see scale
_- .... ...----
.-.. !
'. and between different samples of the
same t issu e, w herea s region B shows mo re
prO:!;lbtt' I ~ ~~~~_ ~! . -, H ~ I ::::::::::::!':::::::: tissue-specific methylatio n. Regio n C is
largely un methylated, and region D is almost
completely methylated. (Data from www
chromosome 6 , 3 17 77160 31563784 31883848 318 17168
part or n1 position: 3 1793831 31580455 31900519 31833839 .sanger.ac.uk/PostGen omics/ epigenome, as
depicted by Hermann A, Gowher H & Jeltsch
A (2004) Cell. Mol. Ufe Sci. 61, 2571 -2587.
1111:11111111111111111111111111/1111111111111111111111 11111111
r
''r\I-\\
maintenance
methYlatiO\
c
o
••
W'
~
erasure of
parental de nOW)
placenta.
yolk sac
epigenetic
settings
methylation
and
,0'(
.AI
de novo
U
stabllShment
of imprinting methylation in
marks trophoblast
bl~~tocystl
...
______ lineages
--- ~
gonadal germ oells fertilIZa tion morula early pregastrulatiOn gonadal
differentiation dev~9 P blastocyst differentiation
of imprinting m<lrks
developmental slage
the two pronuclei have fused, the early zygote genome undergoes further Flgure,'.12 Changes in DNA methylation
genomewide demethylation until the morula and early blastula stages in the pre during mammalian development. Drastic
implantation embryo. Starting at the blastocyst stage, and concurrently with ini and often tiss ue-specific changes in overall
methylation accompany gametogenesis and
tial differentiation events, the genome undergoes a genomewide wave of de novo
early embryonic development. Th eir causal
DNA methylation. The extent of this methylation varies in different cell lineages: role remains uncertain, althoug h mice that
somatic cell lineages are heavily methylated, whereas trophoblast-derived line are specifically unable to methylate sperm
ages (giving rise to the placenta, yolk sac, and so on) are relatively undermethyl DNA are infertile. Note the breaks (slashes)
ated. The primordial germ cells ofthe embryo (from which the gametes will ulti in the developmental lines. PGCs, primordial
mately be derived-see Chapter 4) undergo a second wave of genomewide germ cells.
reprogramming, with progressive demethylation first occurring on entry of the
migrating primordial germ cells to the genital ridge, followed by genomewide
remethylation as the germ cells begin to develop.
These changes are very dramatic, but it is debatable how far they are causative
ofthe events they accompany, rather than being downstream res ults of the mech
anisms that cause differentiation. DNA methylation is a feature of vertebrates,
flowering plants, and some fungi, but it is virtually absent fro m many of the best
studied model organisms, including yeast, C. elegans, and Drosophila- yet the
mechanisms of differentiation are fundamenta lly similar in all these organisms.
DNA methylation is certainly required for vertebrate development, because an i
mals defi cient in DNA methyltransferase activity die at various stages of develop
ment. Similarly, although pluripotent embryonic stem cells derived from
blastocyst-stage embryos can grow normally even in the complete abse nce of
DNA methylation, these cells immediately undergo apop tosis when stimulated
to differentiate. Extensive epigenetic erasure might be important for rese tting the
epigenetic information originating from the previous generation to achieve a
relatively blank and versatile pluripotent state necessary at the onset of
development.
humans with loss-of-function mutations of DNMT3Bhave the autosomal reces methylated DNA binds
MECP2 and MBDl
sive ICF syndrome (Immunodeficiency, Centromeric instability, Facial anoma
lies; OMIM 242860), in which the pericentromeric heterochromatic satellites 2
these can
DNA methylation and histone modification interact (Figure 1), l3). DNA and hIstone
methyltransferase s
Many reports document how sequences that are located far apart on a chro
mosome, or even on different duomosomes, but that are involved in regulatory
interactions, may come to lie together within the nucleus-a phenomenon
described as gene kissing. The technique of chromosome conformation capture
(3C or 4C) is used to study such effects (see Box 11.3). For example, in differentiat
ing female mouse embryonic stem cells, the two copies of the Xic (X-Inactivation
center) sequence lie close to one another just before the X-inactivation process
starts, when the X chromosomes are being counted, but not at other times.
No single prime cause?
DNA methylation, histone modification, and nucleosome positioning all show
clear correlations with different states of chromatin, but it has not ye t been pos
sible to identify a single overriding control. Heterochromatin contains methyl
ated DNA, tightly packed nucleosomes, and dimethylated Or trimethylated H3K9.
Transcription start sites have un methylated DNA and are marked by monometh
ylated, dimethylated, or ttimethylated H3K4 and acetylation ofH3K9 and H4K5,
8, 12, and / or 16. Additionally, nucleosomes at these sites often contain the vari
ant histones H2A.Z and H3.3, which make tllem unstable. Active genes are
reported to have a 5'....3' gradient of modifications. Near the 5' end, lysines 4, 9,
and 27 ofH3 are all monomethylated . In more distal parts of active genes, H3K36
is trimethylated, and the DNA is often highly methylated. By contrast, inactive
genes have trinlethylation of H3K4 and H3K27. Current understand ing suggests
that the different states of chromatin are tlle result of several different processes,
each of which individually contributes only a small part of the effect. Feedback
loops cause specific combinations of small effects to govern the overall chroma
tin s tate.
2D~eIS
~ mass
1 pllosphorylation
1
human genome. This involves intensive
transcriptome analyses, concerted efforts
to identify both cis-acting regulatory DNA
elements and their respective trans-acting
regulatory proteins, and efforts to define
spectroscopy e.g. comparing two
chromatin structure and some of its
samples representi ng
states or conditions
1
assays interaction applic"tioDS
ioaV
cell biology
subcellular
localization
purifying
protein complexes
1
1
identifying
interactions
protein
targets fo r
drugs
large-scale
assays for
recombinant
proteins,
antibodies ,
inhibitors
etc.
phenotype
protei n
analyses
L
functiona l
drugs to disrupt
protein-protein
analyses networks interactions
known genes by including novel upstream exons. The novel exons were often
50-100 kb upstream of the previously documented transcription start site, with
20% located more than 200 kb upstream. There was also evidence of transcripts
containing exons of more than one gene. What this means in terms of cell func
tion remains to be determined.
In the light of these findings, it is difficult to maintain the traditional idea of
genes as discrete stretches of DNA that constitute clearly delimited transcription
units. The results raise serious questions for future research. Do the novel tran
scripts represent novel functional RNAs, and if so what is their function? Or do
they represent a requirement for transcription of much intergenic DNA, but not
for the transcript itself-as suggested by studies of some imprinted loci (see the
next section)? Alternatively, do the data mean that cells, just like the people stud
)~ng them, find it difficult to identify which parts of the genome are functional
and which are junk? Examination of the novel transcription start sites showed
that they had properties indistinguishable ftom known start sites, which suggests
~
H3KljmcJ HOLa
H3a<: HaLa
1-1.4"" >-Inl"
i •
Figure 11 .15 An example of the ENCODE data. This small subset of the overall data shows genes, CpG islands, transcripts, putative promoters, and
histone modification in 350 kb of genomic DNA adjacent to the telomere of chromosome 16p of HeLa cells. The aim is to identify and map all the factors
that, acting in combination, make up the chromatin landscape and govern gene expression. [Adapted from the ENCODE Project Consortium (2007) Nature
447,799-816. With permission from Macmillan Publishers Ltd.]
364 Chapter 11: Human Gene Expression
that the novel transcripts are not just some sort of noise in the system. It seems
likely that the traditional view of the genome is in need of major revision, but the
implications of this are at present far from clear (see also Box 9.4).
\,
Predicting transcription start sites
Within a core promoter there are typically several alternative transcription start
sites, each operating with a different efficiency. The relative usages of the differ
ent start sites can be predicted fairly successfully from the local DNA sequence.
However, on a larger scale the location of core promoters is not predicted by the
DNA sequence. Transcription start sites can be identified by a combination of
indicators: DNase hypersensitivity, DNA methylation, and histone modifications.
(A)
Figure 11.1 7 shows that a specific pattern of histone modification is clearly asso
ciated with transcription start sites of active genes. -H3K4mel
These sites clearly share certain common properties. What is much less clear - H3K4me2
_ H3K4me3
is why particular genomic areas should show these features. There does not seem _ H3ac
- H4ac
to be any combination of sequence motifs that can predict whether a particular - FAIRE
stretch of DNA will adopt these features. Chromatin structure is clearly crucial, _ DNase I
but what determines it? Experiments that interfere with histone methyltrans
ferases or DNA methyltransferases do not cause the catastrophic global disrup
tion of transcription that might have been predicted if methylation of DNA or
histones were the primary determinants of transcription. Maybe the determi
nants do lie with these factors, but their interplay is very complex and no one distance to transcription start site
factor is important. Or maybe much ofthe non-repetitive chromatin can poten
tially adopt this configuration, and a very heterogeneous set offactors trigger dif (8)
ferent regions to do so. Alternatively, maybe the answer lies in a higher order of
chromatin structure, perhaps at the level of DNA loops and the nuclear matrix or -H3K4mel
scaffold. Certainly, the ENCODE project confirmed previous work that demon - H3K4me2
_ H3K4me3
strated the existence of large (hundreds of kilo bases) functional chromosome _ H3ac
- H4ac
domains. Although we can now describe and manipulate many gene control ele _ FAIRE
ments, we are often still unable to explain why those sequences have acquired ~_ DNasel
that function.
5000
Figure 1'.1., Chromatin conformation at transcription start sites.
distance to transcription start site
Curves show the distribution of markers of chromatin conformation across
transcription start sites of genes within the ENCODE pilot phase regions.
(A) All annotated genes. (6) All active genes associated with a CpG island. (e)
(Cl All inactive genes associated with a CpG island. Data are shown for five
different histone modifications (mono methylation, dimethylation, and
-H3K4mel
trimethylation of H3K4, acetylation of H3 and H4), and for two markers of open - H3K4me2
_ H3K4me3
chromatin [susceptibility to formaldehyde crosslin king, FAIRE (formaldehyde - H3ac
assisted isolation of regulatory elements), and DNase I hypersensitivity]. The - H4ac
- FAIRE
peaks show that the chromatin surrounding active transcription start sites _ DNase I
is in an open conformation and carries a particular combination of histone
modifications. The signature comprises H3K4me3 and H3K9/14ac located just , ,
downstream of the transcription start site, with H3K4me2 further downstream.
There is a weaker and more diffuse peak of acetylated H4. [Adapted from the
~
~- 5000
~
~
~~ 7 (
ENCODE Project Consortium (2007) Nature 447, 799-816. With permission from -3000 - 10 00 , 1000 3000 506.':>
Macmillan Publishers Ltd.) distance to transcription start site
EPIGENETIC MEMORY AND IMPRINTING 36S
. j
and their daughter cells need to remember those decisions long after the signal
has disappeared. The polycomb group (PeG) and trithorax group (TrxG) genes " DNA duplkation
were identified through Drosophila mutants that take correct developmental
decisions but then seem to forget the decision later. Humans have at least two 5' • "CpG CpG :> 3'
polycomb-group rep ressive complexes, PRCI and PRC2. These methylate H3K27 3' -GpC GpC 5'
scription factors. Trithorax group proteins function in complexes whose compo 5' CpG CpG 3'
nents overlap with the SWl/SNF set of chromatin remodeling co mplexes (see 3' ' -GpC GoG=:. -. 5'
Table 1l.3). Anotable exampie is MLL (Mixed-Lineage Leukemia), a gene involved 'I
CpG $ 3'
These systems act to maintain transcriptional states that have been estab 3' GpC GpC 5'
M
lished by other mechanisms. They provide a memory of the transcriptional state, M
+
and at least some of the effects can be maintained through mitosis. We note,
however, that the longer-lasting effects ofthese genes have been described mainly ~: :,:.:--:...-~-~:~~~~~ ~:
M
in Drosophila. It is wo rth bearing in mind that Drosophila does not use CpG
methylation in gene regulation; it may be that in mammals there is less reliance Figure 11 .18 Maintenance DNA
on PeG and TrxG proteins for long-term epigenetic regulation. methylation. A 5'-CpG·3' sequence o n
o ne DNA strand is paired w ith a 5'-CpG-3'
seq uence on the o ther strand. When DNA
X-inactivation: an epigenetic change that is heritable from cell to containing a methylated CpG seq ue nce is
daughter cell, but not from parent to child replicated, the epG in t he newly synthesized
strand is initial'y unmethylated . The DNMTl
In Chapter 3, we described the phenomenon of X-inactivation. To recapitulate DNA methylase specifically methylates epG
briefly: early in embryogenesis, human cells somehow count how manyX chro sequences that are paired with methylated
mosomes they contain, and then permanently inactivate all except one randomly epG, thus preserving the pattern of
selected X. At very early stages in development, both X chromosomes are active, methylation that was present in the DNA
but X-inactivation is initiated as cells begin to differentiate from totipotent or before replication.
366 Chapter 11: Human Gene Expression
example of example of
paternal X Inactivated maternal X inactiVated
It ~
tt~ It~
tt~tt~ tt~tt~
pluripotent lineages, occurring at the late blastula stage in mice, and most prob
ably also in humans. Inactive X chromosomes remain condensed throughout the
cell cycle and can be seen as the Barr body or sex chromatin on the periphery of
the cell nucleus (see Figure 3.8). A 46,XX cell may inactivate the maternal or the
paternal X chromosome, but whichever is chosen for inactivation, that same one
is inactivated in aU daughter cells (Figure 1l.19AJ. The body of an adult female is
thus a mosaic of cell clones, each clone retaining the pattern of X-inactivation
that was established in its progenitor cell early in embryonic life. Figure 11.19B
shows a striking example from the tortoiseshell cat. Inactivation is stable through
n1itosis but not across the generations. A woman's maternal X chromosome can
equally well have been the active or inactive one in her mother, and it has the
same chance as her paternal X chromosome of being inactivated in her own
cells.
Initiating X-inactivation: the role of XIST
Inactivation is initiated at the X-inactivation center (XIC) atXq13 and then prop
agates along the whole length of the chromosome in what may be an extreme
example of the tendency of heterochromatin to spread. Spreading depends on
physical continuity of the chromosome, and also on some unknown special fea
ture of the X-chro mosome DNA. As described in Chapter 2, if the inactive X chro
moson1e is split in two by an X-autosome translocation, inactivation is limited to
the segment that includes XIC It cannot jump to the detached portion of the X
chromosome. It may spread from the XICsome way into the attached autosomal
sequences, but only a short distance; it does not propagate aJI along the auto
somal segment.
As previously mentioned, the earliest observed event in XX cells is a transient
pairing ofthe two XICsequences. This is probably the mechanism by which the X
chromosomes are counted, because counting is disrupted if the XIC sequences
are deleted, replicated, or translocated. XIC encodes a large noncoding RNA, XIST
(X-inactivation- specific transcript), which is expressed only from the inactive X
chromosome. Its primary transcript undergoes splicing and pOlyadenylation to
generate a 17 kb mature noncoding RNA. The RNA is then thought to recruit
some protein factors that organize the chromatin into a closed, transcriptionaJIy
EPIGENETIC MEMORY AND IMPRINTING 367
inactive conformation. In differentiated cells that have already undergone RT-PCR results
from 9 ce ll lines
X-inactivation, loss of XISTdoes not cause reactivation. Thus, XISTis required for
the establishment ofX-inactivation, but not to maintain it.
10
In the mouse, expression of Xist from the active X chromosome is silenced by
an antisense transcript (Tsix) that alters the chromatin configuration at the Xist 20
locus, possibly through an RNAi mechanism induced by double-stranded Xist 30
Tsix RNA. Humans have a TSIX gene, but it has little sequence or structural
homology to mouse Tsix, and it is much less clear whether a similar mechanism 40
operates in humans. However, in both species the XIST! Xist RNA is somehow 50
involved in spreading heterochromatinization along the X chromosome that is to
60
be inactivated. XISTRNA comes to coat the whole inactive X. The chromatin of ~~
- - cen
the inactive X carries modifications typical of heterochromatin. H3K9 is dimeth 70
ylated or trimethylated, H3K4 is unmethylated, and H4 is deacetylated. H3K27 is 80
--------
b" :"! ~ - XIST
also trimethylated, reflecting the involvement of polycomb-group genes. CpG
islands at promoters of inactivated genes are methylated. In addition, many 90 r ----=
t .-.- -
r
nucleosomes carry the histone H2A variant macro-H2A. 100 .. ~
IGF2 insu lin-like growth factor maternal impri nted in many t issues but bia!ie lically expressed in brain, adult liver,
chond rocytes, etc.
PEGIIMEST esterase maternal imprinted in fetal tissues but biaJlelically expressed in adult blood
UBDA ubiq uiti n ligase paternal imprinted exclusively in brain; biallellcally expressed in other tissues
KvLQT1 potassium cha nnel paternal imprinted in several tissu es but bialielicaJly expressed in heart
WT1 Wilms tumor gene paterna l freque ntly imprinted in celis of placenta and brain but biailelically expressed in
kidney
""-7~<r£V
F'gurell..21 A recurrent microdeletion
causing Prader-Willi or Angelman
syndrome. Non-alleliChomologous
recombination between low-co py
--- - - -- -
--~?~~
I!!!!!!I
repeats genera tes recurrent de novo
deletion or dup lication of a 4 Mb region
on chromosome lSql1-q13 . People
1
-_-. . . .--=_.. " .-'
,. .
These regions bind the insulator protein CTCF when unmethylated, but not
when methylated. CTCF can prevent a promoter from having access to an ......--=
9 -~ ~v
enhancer if, but only if, it lies between the two U'igure I 1.23).
done?
The phenomeno n of imprinting raises two difficult questions. The first is: How
does it work? It must be reversible: a man may inherit a particular sequence with
a maternal imprint from his mother, but if he passes it on to his child it will then
carry a paternal imprint. The imprint probably consists of differential methyla
tion of CpG dinucleotides in imprinting control centers in the paternal and
maternal genomes. These are imposed by sex-specific de novo methylation dur FIgur.".23 The H79and IGF2 genes
on chromosome 11 p compete for an
ing gametogenesis. How the specific regions to be imprinted are chosen is not
enhancer. Binding of the CTCF insulator
known. Although methylation patterns are quite different in sperm and eggs, protein to a differentially methylated
most of th ese differences are erased in the wave of demethylation in the early region determines the outcome of the
zygote (see Figure 11.12). However, for these special regions the imprint is erased competition. On the maternal chromosome,
only in primordial germ cells, not in somatic cells. Interestingly, at least in the the imprinting control region (lCR) is
mouse the Dnmtl DNA methyltransferase uses different promoters in spermato unmethylated, allowing it to bind CTCF
cytes, oocytes, and somatic cells, so that each contains a different isoform of the (ora nge oval).This prevents the IGFl gene
enzyme. Maybe gamete-specific isoforms allow this differential control of from accessing the enhancer, and so the
methylation. enhancer drives H J9 expression. On the
paternal chromosome, methylation of the
The second question concerns the funct ion of imprinting. One popular the
imprinting control region prevents the
ory is based on a conflict of evolutionary interest between fathers and mothers. binding ofCTCF, thereby allowing IGFl to
Selfish gene theory suggests that paternal genes might be best propagated by outcompete H1 9 for access to the enhancer.
ensuring that offspring are born as robust as possible, even at the expense of the [Adapted from Wallace JA & Felsenfeld G
mother-a man can father children by many different mothers. Maternal genes, (2007) CurroOpin. Gener. Oev. 17,400-407.
in contrast, are best propagated if the mother remains capable of further preg With permission from Elsevier.]
nancies. So, the theory goes, paternal genes program the fetus to exlIact nutri
ents at the greatest possible rate from the mother through the placenta, whereas
maternal genes act to limit the depredations of a parasitic fetus. This does fit
many imprinted loci, but is unlikely to be the full explanation. Why should some
imprinting be tissue-specific? And not all imprinted genes have functions related
to intrauterine growth, or are imprinted in the direction predicted by the parental
conflict hypothesis.
[1;~~I]
Some genes are expressed from only one allele but independently
of parental origin
X-inactivation ensures the mono allelic expression of most X-linked genes.
However, monoallelic expression, independen tly of parental origin, is also a fea
ture of a surprisingly high proportion (maybe as high as 5-10%) of autosomal
genes. Three processes can be distinguished.
First, the immunoglobulin and I-cell receptor genes show mono allelic expres
sion after programmed DNA rearrangements. Each B or T lymphocyte expresses
only a single allele of the relevant gene. As described in Chapter 4, functional
immunoglobulin and T-cell receptor genes are assembled from large batteries of
potential gene segments by a complicated series of DNA rearrangements. Some
random elements within the process mean that the chance of a productive rear inactive H
rangement is quite low; however, once a functional protein has been prod uced, a (methylated)
feedb ack mechanism inhibits further DNA rearrangements.
Competi tion for a single-copy enhancer provides a second mechanism. The
_1--
olfactory receptor genes are a well-studied example (Figure 11.25). As mentioned
in Chapter 9, olfactory receptor genes are the largest gene family in humans and
mice. Mice have some 1300 ofthem, and humans at least 900. Nevertheless, each active H
olfactory neuron expresses just a single allele of a single receptor gene, so that it
fires only in response to one specific odorant. In mice, and presumably also in
m
humans, expression of any olfactory receptor gene depends absolutely on an
enhancer sequence (H) present as only a single copy in each genome. Chromo OR,
some conformation capture experiments show that the single copy of H can
associate with anyone of the 1300 receptor genes, regardless of their chromo
somal location. Although a diploid olfactory neuro n contains two copies of H, OR2
only one is active; the other is inactivated by methylation (interestingly, appar
ently at CpA rather than CpG sequences). OR3
Finally, the great majority of cases are random. In independent cell clones, a
gene is monoallelically expressed in some clones, but both alleles are expressed
in others. In cell cultures, the effects are stable within a clonal lineage. For a given Figure- 11.25 Competition for a single
enhancer leads to monoallelic expression
gene, some clones may show paternal and others maternal expression. In an
of olfactory receptors. Expression of any
individ ual there is no consistency of parental origin between different mono
of the hundreds of olfactory receptor genes
allelically expressed genes, even when they li e on the same chromosome. (OR1, OR2, OR3, ... ) distri buted around
Polymorphic variation in the expression level of alleles is co mmon and heritable the genome dep end s on association with
(expression quantitative trait loci or e- QTLs), so monoallelic expression may be a single-copy enhancer, H. One of the two
the tail end of the distribution. The cause and significance of all this are unknown, alleles o f H in a diploid ceIJ is inactivated by
but probably e-QTLs and random monoallelic expression explain much of the methylation (green ci rcles). so that there is
phenotypic diversity of our species. only a single active copy o f H in each cell.
372 Chapter 11: Human Gene Expression
At least half of all mammalian genes have two or more alternative promoters.
These will drive transcription from alternative ve rsions of a first exon, which may
or may not be coding. Some genes have whole arrays of alternative first exons,
each with its own promoter. The UDP glycosyltransferas e gene UGT1Al on 2q37
has 13 alternative first exons. On chromosome 5q31, the protocadherin a and y
genes each consist of large tandem arrays of 2400 bp alternative first exons that
encode the bulk of the protein. These are spliced onto three small inva rian t exons
that encode the C- terminal part of the protein. There are some analogies to the
immunoglobulins, in that both have variable N-terminal but constant C-terminal
regions, although achieved by entirely different means.
The 3' end of each first exon of a gene has a splice donor sequence, but the 5'
end does not have any splice acceptor signal, so the splicing machinery can jump
as necessary over any alternative first exons in the primary transcript. Thus, al ter
native first exons can be spliced in each case to a common set of downstream
exons (Figure 1l.26).
The ENCODE pilot project identified additional upstream exons in 273 of the
399 genes investigated, most of which would be alternatives to the previously
annotated first exon. We note, however, that around half of these were already
annotated, but as exons of a different, upstream gene, so these m ay not have
been produced from true alternative promoters. Some alternative promo ters are
internal to a gene, rather than upstream. In such cases, exollS upstream of
the internal promoter are not included in that transcript. The dystrophin gene,
d escribed below, has several examples.
The alternative promoters can serve two purposes. First, they can contain dif
ferent regulatory elements. Tissue-sp ecific or developmental stage-specific alter
native promoters can allow different regulation of gene expression in different
circumstances. It is perhaps significant that the first intron of a gene is often by
far the longest (genomewide mean size 14,186 bp, in contrast with 4847 bp for
internal introns). Thus, promoters often lie far upstream ofthe internal exons of
a gene, and alternative promoters are often well separated from each other. The
novel ups tream exollS iden tified by the ENCODE project we re located an average
of 186 kb upstream of the previously annotated 5' exon. This wo uld allow pro
m oters to sit in a different chromatin environment, maybe in a different chromo
somal loop from each other and from the bulk of the gene.
four alternative
plomoters and
lirst exons
_~,~ ~~~'• •!!.D"':'• • •__ downstream exons :
Figur. 11.l6 A gene with several
four alternatIve
prima ry ttanscripts alternative promoters. The gene has four
alternative pro moters (P 1-P4) and firs t
exons (different-colored boxes), plus two
MAAAAA...
downstream exom (red boxes). The positions
AAAAAAA ... of splice donor (O) and splice acceptor (Al
four al ternative sites are shown . Each splice junction has to
mature mRNAs AAAAAAA...
be made by joining a donor and an acceptor
AAAAAAA... site.The mature mRNA has a poly(A) tail.
ONE GENE-MORE THAN ONE PROTEIN 373
~1500 ~
C M P G Figure 11 .27 Alternative promoters for
c r r , r'!> ,C the d)'strophin gene. The positio ns o f seven
0 500 1000 2000 2500 alternative pro m o ters are shown at the to p.
(Figure 11.27). Transcripts using these promoters encode large s;!ystrophin uro
tein (Dp) isoforms with a molecular weight of 427 kD. The three Dp427 isoforms
differ in their N-terminal amino acid sequence because the three alternative first
exons include coding sequence. However, in addition to these three promoters,
at least four other alternative internal promoters can be used, generating much 16)
smaller transcripts that lack many of the exons of the Dp427 isoforms.
APOBg", 5' 1:1 ,i:, iI 1,11,11 1 Ii 1111 ~.- 3' products of theAPOB gene. In the liver, the
APOB mRNA encodes a 14.1 kb 4536-residue
exon 26 protein product, ApoB100. However, in the
intestine a cytosine deaminase, APOBEC1,
LIVER INTESTINE specifically converts cytosine 6666 to
uridine, changing the CAA glutamine codon
,
6666 6666
,
2153 into a UAA stop codon. The mRNA now
l..
5'- -- CAA - - UAA - - AAAA ..An 3' 5' - - - CAA - - UM - - AAAA ..An 3'
encodes a product, ApoB4B, consisting of
C -+ U J,. RNA edit ing just the first 2152 amino acids of ApoB1 00.
J, translation
1 2152 2153 4536 2152
NH2 le OOH c==== !eaOH protem
ApoBl00 Apo848
N~N/
I
or later in development, the stored inactive mRNAs can be activated by cytoplas R R
adenosi ne inosi ne
mic polyadenylation, which restores the normal-sized poly(A) tail.
External factors can also regulate the translation of mRNAs, as shown by two FigUl'l! l' .31 Deamination of adenosine.
examples in iron metabolism. Increased iron levels stimulate the synthesis of the Enzymes of the ADAR family deaminate the
iron-binding protein ferritin without any corresponding increase in the amount amino group at carbon 6 of adenosine to
of ferritin mRNA. Conversely, decreased iron levels stimulate the production of produce inosine.
376 Chapter 11: Human Gene Expression
+F.•
in the ferritin and transferrin receptor
G U
G
A
C
U
~, .
C Tl-F. mRNAs. {Al Stem- loop structure of an
iron-response element (IRE) in the 5'
A' U
e·G
U' G
1 RNA' bln~g--E'
'R -Bp----!
- r--""'l untranslated region (UTR) of the ferritin
heavy (H)-chain mRNA. (B) In the presence
of iron, a specific IRE-binding protein
U·A
c.Q .c _ AAAAA
5' UTR coding 3' UTR
~AAAAA
5' UTR COding
(IRE-BP) is activated, enabling it to bind the
IRE in the ferritin heavy-chain gene and
U' G
c e-G ferritin H.chain mRNA
3' UTR
U' A
G· C ~
! binding of IRE·BP
tooneormore ~
REs
1 the translation of ferritin but protects the
transferrin receptor mRNA fror;n degradation.
G· C _ AAAAA AAAAA
G· c
5' 3'
translation initiation
IRE in human
the transferrin receptor (TfR). part of the machinery for absorbing dietary iron,
again without any effect on the production of the mRNA, The 5' UTRs of both fer·
ritin heavy-chain mRNA and light-chain mR NA contain a single iron-response
element (IRE), This is a specific cis-acting regulatory sequence that forms a hair
pin structure, Several such IRE sequences are also found in the 3' UTR of the
transferrin receptor mRNA, Regulation is effected by a specific IRE-binding pro
tein that is activated at low iron levels to bind IREs (figure 11.32), Like many
translational controls, the effects depend on the secondary structure of the
mRNAs,
Until a few years ago, these scattered exa mples of RNA-based controls on
gene expression were considered interesting but rare events. That view was
changed with the discovery ofRNAi and miRNAs,
IA)
.'•
transfeotion
8h
pulsed
labeling
~
mlRNA
'" '"
i$--?i
M
..
control
IS·~
L
H
(B)
;;
~
!!
~50
~
miR·155
100 transtection
C EBP-~
IOl
eH
Flgur.1 1.33 Proteomi< analysis of
the effect of changing the level of a
miRNA. (A) An outline of the experimental
procedure. HeLa cells were tr an~fected
wi th a mi(roRNA in a medium containing
light ISOlOpic labe l (L); after 8 hours they
~ were transferred to med ium containing a
,j,.
24 h
.j-
'"°. eo0'··"··
0:0
00
• • e Cl
587 589
mil
591 S93
medium-weig ht isotopic label (M). Control
HeLa cells were tran sferred to a medium
contai ning heavy isotopi( labels (H). Newly
harvest and
combine
,j,.
mass
.,0
0\W:
'"""""+
'; 0°
HIM ratio, 2.6
synthesized proteins carry an M or H label.
Extra( ts are combined and ana lyzed by
mass spectrometry. (B) An example result:
,j,. tran sfection with miR-155 resulted in a
spectrometry
2.6-fold deoease in synthesis of CEBP·p
L HIM (atlo
0 I
protein [compa re the height s of the H
H (control) and M (experimental) peaksJ. [From
proteins:
• produced in M
11 I • M • Selbach M, Schwanhausser B, Thierfelder N el
• produced in H
D pree;(isting al. (2008) Nature 455, 58- 63. With permission
mil from Macmillan Pu blishers Ltd .1
378 Chapter 11: Human Gene Expression
CONCLUSION
Gene expression is controlled by decisions at all levels: on whether or not to tran
scribe a sequence; on the choice of transcription start site; on how to process an
alternatively spliced primary transcript; on how, when, and where to translate
the message in the mature mRNA; and on the modification and transportation,
within or outside the cell, of the initial polypeptide.
Protein-coding genes, and many genes encoding functional RNAs, are tran
scribed by RNA polymerase II. Before transcription can start, a large complex of
proteins must be assembled around the transcription start site. Apart from the
RNA polymerase, these include the TFII transcription factors (a complex of at
least 27 individual polypeptides) that comprise the basal transcription appara
tus, plus many tissue-specific proteins. Some ofthese proteins recognize specific
DNA sequence elements near the transcription start site, such as the TATA box.
Others join the complex through protein-protein interactions. These include
proteins bound to distant regulatory elements that are brought into close prox
imity to the promoter by DNA looping.
Transcription start sites are defined more by the local chromatin conforma
tion than by the local DNA sequence. The conformation depends on a set of
mutually reinforcing processes: histones in the core nucleosomes undergo
numerous covalent modifications or are sometimes replaced by variant mole·
cules; cytosines in CpG dinucleotides may be methylated, and ATP-dependent
chromatin remodeling complexes alter the distribution of nucleosomes. The
many nonhistone proteins in chromatin include the enzymes that add or remove
modifying groups to or from the histones and DNA. They are themselves differ
entially bound to chromatin, depending on its conformation and state of
modification.
Cluomatin conformation controls gene expression, both in the short term as
cells respond to fluctuating metabolic and environmental conditions, and in
the longer term as developmental programs unfold. End uring changes in confor
mation are the basis of epigenetic effects-changes in gene expression that do
not depend on changes in the DNA sequence. For some genes, epigenetic effects
depend on the parental origin-the phenomenon of genetic imprinting.
Transcription requires an open chromatin conformation, whereas heterochro·
matin has a closed, tightly packed conformation and represses gene expression.
Small noncoding RNA molecules are involved in the establishment of stable
heterochromatin.
Primary transcripts are often subject to alternat ive splicing. Exons can be
skipped or included, or variant donor or acceptor sites can be used. Some genes
can potentially encode hundreds of different polypeptides because of extensive
FURTHER READING 379
FURTHER READING
Promoters and the primary transcript Chromatin conformation: DNA methylation and the
Fuda NJ, Al dehali B & Lis JT (2009) De~ning mechanisms that histone code
Vise l A, Rubin EM & Penn ac hi o LA (2009) Genomic views of Buhler M & Moazed D (2007) Transcriptio n and RNAi in
distant-acting enhancers. Nature 461,199-205. [Review of heterochrom atic gene silencing. Nat. Seruer. Mol. BioI. 14,
the nature and fun cti o n of enhancers, with emph as is on 1041-1048. [Involvement of small RNAs in heteroc hroma tin
human exampl es.] formation.]
Chapter 12
KEY CONCEPTS
o Clues to the function of a gene can be inferred by comparing the sequence ofthe gene and
any protein products with all previously investigated nucleotide and protein sequences to
find related sequences. The related sequences may be closely related genes or proteins, or
they may be short functional sequence components such as protein domains or peptidel
nucleotide sequence motifs.
o Inactivating an individual gene often results in an altered phenotype that provides
important insights into the functio n ofthe gene, but useful information can also sometimes
be obtained by overexpressing genes or expressing them ectopically.
o Gene functio n can be conveniently studied in cultured cells by silencing the expression of a
specific target gene using RNA interference. More extensive information about how a gene
works can often be inferred by selectively inactivating a gene in Whole model organism s.
o Yeast two-hybrid screens identify interactions between proteins by splitting a transcription
factor into two functionally complementary domains that are then joined onto test proteins;
interaction between t",vo such test proteins can reconstitute the transcription factor and
switch on a reporter gene.
o Large- scale protein interaction stud ies seek to define interaetomes, global functional
p rotein networks in cells.
• Protein- protein and protein-DNA interactions can be identified with the help of chemical
crosslinking agents such as formaldehyde that can covalently bond proteins to other
proteins or to DNA within cells.
o Chromatin immunoprecipitation identifies DNA-binding sites for a protein ill vivo by using
a specific antibody to capture a protein after it has been chemically crosslinked to DNA
within cells.
382 Chapter 12: Studying Gene Function in the Post-Genome Era
Sequencing of the human genome was, with hindsight. the easy part. Now comes
the hard part: how do we unravel what the sequence means and discover how it
helps construct a functioning human being and differentiates us from other
species?
In the post-genome era, work will continue for some time to get a complete
inventory of human genes. As was described in Chapter 10, comp arative genom
ics has been particularly helpful in identifying regulatory elements and in vali
dating predicted protein- coding genes. The vast majority of protein-coding genes
have been identified in the human genome, and we now know that there are
about 20,000 genes of this kind. However, for the reasons described in Chapter 9,
RNA genes have been much more difficult to identify within genomic DNA, and
our knowledge of human RNA genes, although growing rapidly, remains rather
incomplete.
As the effort to identify all the genes and other functional components in the
human genome intensifies, researchers seek to understand our genome in differ
ent ways. One major priority, as will be detailed in Chapter 13, is prompted by the
biological and medical need to understand genetic variation at the genome level.
In this chapter we consider ano ther priority, the need to understand the fun c
tions of genes. In the post-genome era, whole genome analyses are increasingly
being applied to understand the function and regulation of genes. In many cases,
such analyses have been developed from methods that were used to study one or
a few genes at a time. However, getting a whole-genome perspective allows us to
develop comprehensive networks of interacting genes and gene products, and so
obtain a fuller picture of how genes and cells work.
Chapter 11 (p. 362), the ENCODE project seeks to identify and analyze all the
functional components of our genome by a combination of experimental
approaches and bioinformatics analyses. The widespread availability of many
experimental approaches to perform global gene analyses has meant that a net
work of small laboratories throughout the world are also making invaluable con
tributions in addition to large, well-equipped genome centers.
Gene function can be studied at a variety of different levels
Let us first consider the simple case of a single previously uncharacterized gene
that has been predicted by using gene-finding programs and then validated by
co~parative genomics. How can we study its function?
First of all, we can use bioinformatic approaches to compare the sequence of
a gene and its products with all other known nucleotide and protein sequences.
Additionally, the sequences can be scanned to look for Significant matching to
short nucleotide and peptide sequence components that have previously been
shown to be ilIlportant contributors to gene function. The other major avenue to
understanding gene function requires laboratory investigations, and a whole
range of different types of experimental approach can be used, including those
listed below.
Gene expression studies
Some studies look to see where the gene is expressed at the RNA level by using a
complementary antisense RNA probe, and / or at the protein level (usually by
using a suitably specific antibody). The expression studies often begin byanalyz
ing expression in tissue sections of early embryos, or by probing RNA extracted
from adult tissues, such as in northern blots or using RT-PCR. In which tissues is
the gene of interest expressed? Ubiquitous expression would suggest some very
general (housekeeping) cellular function. Alternatively, if the gene is expressed in
a specific tissue-such as the testis-then a mOre defined role is indicated, in this
case possibly a role in reproductio n. Intracellular localization is another aspect of
gene expression that can be studied in intact cells or by analyzing subcellular
fractions. See Chapter 8 (p. 239) for a detailed descri ption of expression screening
techniques.
Gene inactivation and inhibition of gene expression
To obtain more precise information on gene function, follow-up studies are usu
ally designed to specifically inactivate the gene in vivo or to specifically block its
expression. This can be done at two rather different levels.
One way is to perform the required genetic manipulation in cultured cells,
often standard human cell lines. Conducting these manipulations in cultured
cells can be done quickly and easily, but inevitably rhe results obtained only give
information about the role of the gene within the cell type that is being studied.
We describe this type of approach later in this chapter.
At another level, genetically modified animals can be produced. Earlyembryos
can be manipulated to introduce specific changes in germ-line DNA or to specifi
cally block the expression of a predetermined gene of interest, and the effects on
the developing phenotype can be tracked at different developmental stages and
in the postnatal period. Such studies can be time-consuming and laborious, but
they offer the big advantage that the effect of inactivating or inhibiting a specific
gene can be tracked in the whole organism, providing valuable information about
genes that have roles in the functions of organs, tissues, and other sets of inter
acting cells. The technology for genetically manipulating animals is complex,
and because it is also used extensively to produce animal models of disease we
will consider the details in Chaprer 20.
Defining molecular partners for gene products
To perform their functions, gene products need to interact with other molecules
in the ceUs. Proteins, for example, often bind to other proteins, sometimes
transiently, and sometimes as components of stable multisubunit complexes.
Many regulatory proteins and functional noncoding RNAs bind to specific nucleic
acid sequences, as in the case of transcription factors. To identify molecular part
ners for a specific gene product, various screens can be employed to test protein
protein and protein-nucleic acid interactions.
384 Chapter 12: Studying Gene Function in the Post-Genome Era
compilation
Mouse Atlas of Gene SAGEa and sequence tag data from many http://w ww.mouseatlas.org/mouseatlasjndex_html
Expression mou se tissues during development
Tra nscriptorni c Allen Brain Atlas high-resolution express ion patterns http://www.bra in-ma p.o rg /
(tissue from sections of adult mouse brain
hybridization)
EMAGE high-resolution gene expression during hnp:/lgenex.hgu.mrc.ac.uklEmage/databasel
mouse development emageln tro.hrml
aSAG E(serial analysis of gene expression) is a method that is designed to make double-stranded DNA cop ies of RN A populations. These are then
processed enzymatically so that from individual RNA molecu les short sequence fragments (tags) are recovered and then sequenced. bRNA-seq
involves shotgun seque ncing of the whole transeriptome, using massively parallel DNA sequencing.
databases and new algorithms-for example, to search and analyze the data or to
model structures and molecular interac tions-as it has by experimental advances.
However, in addition, as described in this section, bioinformatics can be used
directly to infer gene function simply by comparing sequences from previously
un.investigated genes with other nucleotide and protein sequences about which
some function is already known.
." o n "
Yo' r.l'G t E.t.KE J [ 0
as X but recog nizes a sig nificant minority
v . " /' .1_.10£
,o ,
E ,
,; II
., ,
'j , .' 1
, o , , .; g number 12. (B) A more comprehensive
,, , "
o,
,. o 0
1 2 3 4 5 6 7 B 9 1 1 11 1 111 H 0 .,
, , ,, position-specific scoring matrix (P55M)
• • ,. "F
0123 4 567 0
, ..• 0 '
"0 j
• •" •
¥
C L ,
E ;:..
L
" C·~ r;)
o • , ,.
~
ov o,
,. 0 , j) t) 1
': ,".
a amino acid residue (vertical axis) at each
~
7 V
r
., ,r ~l
, ,o "u ,
PO I,i
I) " ·1
::r 1
o"
0
" 0 ~ Q Q 'J
.J , 1
amino acid position (ho rizo ntal axis), as
shown here. Subsequently, the observed
o' .
, , ,, -, , ,1 0 frequenc y is compared w ith the frequency
consensus
[' p r , G , C
- . !i.. •
0
0 0
0
'J 0
0
0 c
0 1
at w hich a residue can be expe<ted in a
random sequence so as to provide a score fo r
.. U v 0 0 0 'J , 0 1 eac h position in the matrix (as a logarithm
panern
K"Io4-~-X{?l--;-X-r: XOI-r:ih) · X(~I ~£ y
",1 "
0 0 1
" D
C' 0 ,,
) 00 ;,
- C'
of the ratio of t he o bserved and expected
frequencies-not shown here).
BIOINFORMATIC APPROACHES TO STUDYING GENE FUNCTION 387
BOX 12.1 HOW BIOINFORMATICS CAN HAVE A VITAL ROLE IN ELUCIDATING GENE FUNCTION:
THE EXAMPLE OF THE NIPBL GENE
Positio nal cloning identified a previously unstudied gene at Sp13 Different protein interacti on methods were used to show t hat
as a major locus for a severe developmental malformation, Cornelia the fl y Nipped-B protein and the human NIPBL protein eac h bound a
de lange synd rome (Cd LS; OMIM 122470). The sequence of the small protei n roughly 600 amino acid residues long, w hich turned out
underlying gene provided few dues to its fu nction, but sequence to be closely related orth ofogs of each other and unexpectedl y also
homology searching with the predicted 2804 amino acid protein of the well-studied C e/egans Mau-2 protein. Mau-2 was know n to be
revealed that it had well-studied orthologs in flies and fungi. a regulator of cell and axo n migration In C e/egans, but noth ing was
The Drosophila ortholog, Nipped-B, was known to be a known about any of its orthologs in other species.
d evelopmental regula tor that see med to faCilitate lo ng-range Standard BLAST searches did not find any evidence to suggest
interactions between promoters and remote enhancers in certain that Mau-2 and its ortho logs were related to SCC4 in any way.
target genes, and so the human gene was named NIPBL (Nipped However, su bsequent analyses with PSI-BLAST, a specialized sequence
B-like). Fu ngal orthoJogs of the N1PBL protein, such as bud d ing homology search prog ram that is useful for detecting similarities
yeast SCC2, were known to load cohesins onto chromatin and to between distantly related sequences, showed that SCC4, Mau-2,
be im porta nt in regulating chromosome segregation. Individual s and the proteins bindi ng to NIPBL and Nipped-B were orthologs.
with CdLS possessing a causative NIPBL mutation are heterozygotes Su bsequent experimental analyses confirm ed the idea that vertebrate
and do no t have a significantly increased incid ence of ch romosom e orthologs fun ctioned in load ing coh esin o nto chromatin as partners of
segregation abnormalities. However, when both NfPBL alleles were vertebrate SCO ortholog s. As for the SCO/Ni pped-B/NIPBL
selectively inactivated in human cells by RNA interference, clear family, selective inhibition of express io n of genes encoding Mau-2
defects in sister chromatid cohesion were fo und. Parallel studies in and its ortho logs also resulted in chromosome segregation defects
which the Nipped-8 gene was inactivated in Drosophila cells also (figure 1).
produced sister chromatid cohesion defects. It was concluded that the yeast SCCl-SCC4 co hesi n-Ioading
In b udding yeast, SCD was known to bind a sma ller pro tein com plex and it s chromosome segregation function has been
SCC4, close to 600 amino acid residues long, and it had been shown co nserved during evolution but that, in vertebrates, the same proteins,
that the SCC2-SCC4 complex was responsibl e for loading cohesins, including the NIPBL protein, are involved in various aspects of g ene
multisubunit protein s that regulate chromosome segregation, on to regulation during embryonic development. Subsequent work showed
chromatin. What had been intriguing, however, was that whereas SCO that both NIPBL and human Mau-2 were involved in reg ulating axon
had been highly conse rved during evol ution with clear o rthologs in migration. Genes encod ing certain cohesin subun its were also found
all genomes, including NIPBL and Nipped -B, SCC4 seemed sO poorly to be mutated in CdLS, and cohesi ns w ere found to have roles in
conserved that standard BLAST searching could find no m atches other long-distance g ene regulation in vario us cells including non-dividi ng
than homologs in closely related species of yeast that were confined to neurons.
the Saccharomyces genus. Figu.re 1 Synergy between
bioinformatic. and experimental
investigations to identify functions for
Cornelia de l ange
human phenotype the Cornelia de Lange syndrome gene,
syndrome
NIPBL. Blue arrows, experimental work;
l
TBlAS1N
BlAS", s.rfam• details of databases and homology
translatIOn sea rchi ng programs used. aa, amino acid.
prolein ~
ref : C. elegans : 4('--- NIPBl protein
~
1
()(thOlogs NIPped-B :. _ ~~~~~ __: TBLASTN (2804 aa)
TBlASlN
~~:: BlAS:l~
BlASTF
l
SGO
RNAI l prOtein
PlJbMed
orthologs in
orthologs of
interacting
proteins
CG4203
(632 aa) , TBLASTN
) ~ lel.ASlN
BWTF ~
KlAA0892
(613 aa)
PSI-8LAST • TllLASlN
BLASlP
Saccharomyces
~ genus of yeasts
only
~~:=~~~~~ r8U\S'fN
aU\STP
388 Chapter 12: Studying Gene Funct ion in the Post-Genome Era
TABLE 12.2 SECONDARY DATABASES OF PROTEIN SEQUENCES USED TO IDENTIFY CONSERVED ELEMENTS AND PROTEIN
DOMAINS
Database Contents URL
Inte rPro a search facility that integrates the information from other http://www.ebLac.uklinterpro/
secondary databases
Flgur. 12.2 Searching PROSITE for the presence of domains and motifs in a test protein, Th e
PROSITE database at http://www.expasy.org/prosite/storesalargenumber of documented seque nces
for protein domains and motifs. In this exam ple, searching with a 422 amino acid (a a) protein query
returned two hits when compared against protein domain profiles with significant matching to
both a paired domain (found in the PAX family of transcription factors) at amino acids 4- 130 and a
homeodomain and a homeobox va riant at positions 208-268. A further two hits are eVident when
compared against patterns of protein motifs: a paired domain type motif [R-P-C-x{111-C-V-S) spanning
amino acids 38-54 and a motif characteristic of some homeodo mains spanning amino acids 243-266.
Th e motifs identified suggest strongly that the test protein is a DNA-binding protein and is most
probably a transcription factor.
STUDYING GENE FUNCTION BY SELECTIVE GENE INACTIVATION OR MODIFICATION 389
TABLE 12.3 DIFFERENT TYPES OF REVERSE GENETIC SCREEN IN CULTURED MAMMALIAN CELLS
--------'1
Type of screen Basis of method
l oss-of-func tion usually, the RNA interference (RNAi) pathway is induced to selec tively degrade RNA transcripts of a specific target gene
(see Figures 12.3 and' 2.4)
Gain-at-function involves transfectlo n of an exogenous ge ne copy driven by a suitable promoter to overexpreS5 or misexpress that gene in a
desired cell type
Dominant-negative relies on producing a mutant gene product that interferes with the norma l product of a specific target gene; Similar (0
loss-of-function screen but suppresses gene activity more effi ciently than RNA; often the mutant product is designed to
be a truncated version of the wild type that competes with wild-type protein in some way, or the mutant protei n becomes
incorporated with wild-type proteins in multi subunit complexes, thereby inactivating the complex
Modifier seeks to identify genes that en hance or su ppress a specific phenotype; the in itial phenotype is produced by one of the three
methods above; thereafter, the mutant cellsare subjected to a second genetic modification (wh ich can again be any of the
three methods above); if altered expression of a second gene enhances or su ppresses the initial phenotype, the second gene
is considered to be a modifier of the gene that caused the initial phenotype; it then remains to be worked out how the two
genes interact
For further details see the review by Tochitani & Hayashizaki (2007) .
typicaUy reverse genetic screens. There are four different functional classes of
mutation-gain-of-function, loss-of-function, do minant-negative, and modifi
er-and differe nt scree ns can be conducted depending on the desired effect on
the target gene ("table 12.3).
Of the differen t approaches listed in Table 12.3, by far the most widely used
involves selective gene inactivation to cause a loss of function. An early general
approach used the specificity of base pairing to selectively inhibit the expression
of a predetermined gene. [n antisense technology, RNA or oligonucleotide con
structs are designed to have a seque nce tha t is complementary to that of RNA
transcripts from a gene of interest. By hybridizing to the sense transcripts, the
antisense constructs block gene expression (gene kn.ockdown).
Antisense technology was first developed in the 1980s. Specific antisense RNA
molecules were initiaUy used to knock down the expression of a target gene, but
this was very much a hit-or-miss affair depending on which gene was targeted.
Subsequently, antisense oligodeoxynucleotides were used because they were
more stable than RNA. More recently, morphoJino antisense oligonucleotides are
used. They have a vastly more stable and robust structure than conventional oli
gonucleotides, and they are more consistent in producing significant inhibition
of gene expression. As will be described in Chapter 20, morpholino oligonucle
otides are now ro utinely used to knock down the expression of specific genes in
va rious vertebrate models, notab ly zebrafish embryos.
As explained in the next section, the natural phenomenon of RNA interfer
ence (see Box 9.7) has also been widely applied as an experimental tool to knock
down the expression of specific target genes in cultured cells and in the embryos
of some invertebrates, notably the nematode Caenorhabditis elegam.
,
~
'"
'- shRNA-mlr
1 shows lhe u se o f long double- stranded
RNA (typically longer than 200 bp) that can
be used in invertebra te systems such as
dlcer cleavage
or dsRNA '" shRNA
• however, long dsRNA triggers nonspecific
RNA cleavage (alternative pathways 2
and 3 are used). Short double-stranded
RNA can be produced from an expression
vec tor (pathway 2) transcribed in the
nu cleus to give a single RNA molecule w ith
~ dicer cle""age
inverted repeats. The repeats base-pair to
each other to produce a short hairpin RNA
that resembles initial mJRNA t ranscripts
(shRNA-mir). The S' and 3' tail s of this
.J,. / primary tran script are then cleaved in the
• • siRNAs ~('-----'"
nucleu s to give a more compact short
-i•
hairpin RNA (shRNA) that is exported to the
Ago cytopl asm w here it is processed by the dicer
RI SC endonuclease to give double-stranded short
interfe rin g RNA (siRNA). Alternatively, two
short oli gori bon ucleotides are chemically
(siRNA) (~1gure 12.3. pathway 1). The siRNA associates with a multisubunit pro·
tein complex, the RNA·induced silencing complex (RISe), and the duplex siR NA
is unwound and one strand degraded by a RISe ribonuclease called argonaute.
The RISe with one siRNA strand attached to it is now activated, and the siRNA
strand will guide it to bind RNA transcripts containing complementary sequences.
Any transcript that is bound by the specific antisense siRNA is then targeted for
destruction by the argonaute ribonuclease.
The addition of long dsRNA to mammalian cells does not inhibit the expres
sion of specific genes as in C. elegans.lnste ad, long dsRNA induces a general anti·
viral pathway that produces global, nonspecific gene silenCing. It does this by
activating the protein kinase PKR (which is also induced by interferon). Activated
PKR phosphorylates the translation initiation factor 2 (eIF2a), therehy inactivat·
ing it and causing a generalized inhibition of translation. PKR also phosphory·
lates 2',5' oligoadenylate synthetase (2',S'AS), which, when so activated, induces
a ribonuclease that causes nonspecific degradation of mRNA.
As an alternative to adding long dsRNA to mammalian cells, two approaches
are used (see Figure 12.3). In one method (Figure 12.3, pathway 2; see also n ext
section), transgenes are added that can be expressed to give a partly double·
stranded hairpi n RNA (like the structure of miRNA). Another widely used method
involves che mically synthesizing two short complementary oligoribonucleotides
and allowing them to anneal to form a synthetic equivalent of siRNA. Often, the
392 Chapter 12: Studying Gene Function in the Post-Genome Era
1 I
ge"e m
OOr gene inactivation
-t
tran~n& t:xpr!5'5iDfMi
I
at DNA level a t RNA level at ptoleln level
mutate
gene
to change a mino
gene
I
knockout
by mutating
gerle
I
knockdown
base pairing with
I
prOlt!'ln
block
binding to
. ,,,.....," I
e"'"
~
ec.toplc
ellpl8S8lon
1
random gene
I
targe ted RN A rnorpholino dominant
aptamers Intmbodies
knockout gene knockout interference anUsense oligos negatives
Although all four types of screen listed in Table 12.3 can also be performed in Figure 12..5 Principal ways of studying
uiuG, the most common approach to defining the function of a predetermined gene function in vivo by genetically
gene in whole model organ isms is to inactivate a gene and then assess the result modifying model eukaryotes. (A) Most
ing phenotype. This can be done by m utating the gene to produce a null allele, a methods seek to inactivate a gene to
produce a mutant phenotype that can
gene knockout, or the gene can be effectively inactivated by inhibiting its expres be correlated with a specific gene (peach
sion at the RNA or protein level (Figure 12.5). boxes). Usually this is done at the level of
Valuable clues to gene func tion are usually obtained by this method, but not DNA, where large-scale random mutagenesis
all genes produce a phenotype when knocked out because of genetic redundancy screens have been performed in many
(sometimes two Or more genes have identical or very similar ro les and so double models, includ ing both inve rtebrates
or triple gene knockouts may be needed to elucidate a phenotype). Other genes (such as D. melanogaster) and vertebrates
may be difficult to study because their functions are so crucial that inactivating (no tably zebra fish and mice). Spec ific
them causes cell death. knockouts of predetermined genes have
The technologies involved in designing and studying genetically modified also been conducted, no tably by using
homologous reco mbinatio n in mice. The
animals to understand gene function are highly sophisticated. We examine these
expression of a specific gene can a lso
in Chapter 20, where animal models of disease are described. Different methods be selectively inhibited by targeting its
are popular in different model eukaryotes. tra nsc ripts-large-scale RNA interfe re nce-
Table 12.4 outlines the principal eukaryote models that are currently used to based genetic screens have been conducted
infer how human genes work. The extent to which they are val uable depends on in C. elegans and D. melanogaster and gene
the degree to which functions of genes have been conserved during evolution. knockdown with antisense morpholino
Thus, although it may seem evi dent that mammalian models would generally be oligonucleotides is commonly performed
the most useful models (because of the high biochemical and physiological simi- in zebra fish embryos. Blocking proteins
1arity to humans), even unicellular yeasts have also provided many insights into with dominant-negative mutant proteins,
human gene function. aptamers, or intrabodies also have potential
roles in treating disease (by speCifically
d ownregulating harmful genes) and are
12.4 PROTEOMICS, PROTEIN-PROTEIN INTERACTIONS, discussed in more detail in Chapter 21.
AND PROTEIN-DNA INTERACTIONS (8) An alternative approach is to use a
transgene to overexpress a specific gene
The majority of known human genes encode polypeptides that are incorporated or to express it in tissues or developmental
in proteins. Proteins serve numerous functions in cells, and large-scale studies of stages in which it is not normally expressed
how human proteins function have begun. (ectopic expression) in an attempt to produce
a phenotype that will provide clues to the
function of the gene.
Proteomics is largely concerned with identifying and
characterizing proteins at the biochemical and functional levels
In its traditional, broadest sense, proteomics encompasses ail global, or at least
large-scale, studies of proteins. However, it is now usual to use a more restricted
lneaning that excludes protein structure determination (structural genomics)
and homology studies that compare sets of protein sequences and protein
394 Chapter 12: Studying Gene Function in the Post ~ Genome Era
TABLE 12.4 PRINCIPAL MODEL EUKARYOTIC ORGANISMS USED TO ASSESS GENE FUNCTION
Organism Class Meth ods of studying gene Comments ~~~----------------~
Genes and mutants databases
function
Saccharomyces yeast highly sophisticated genetics despite being unicellular, yeasts have SGD (Saccharomyces Genome
cerevisiae (unicellular and mutant screens aided by been usefu l In modeling how some Databa se) at hnp://www
(ungus) high frequency of homologous genes work to perform some highly .yeastgenome.org!
recombination conserved cellular functions such as
in DNA repair and cell cycle control
Caenorhabditis nematode large-scale genetic screens using extensively studied; invariant and WormBase at hnpJ/www
elegans (roundworm) RNA interference fully mapped cell lineage; the only .wormbase.org/
model in which the complete wiring
system is known for the nervous
system (there are 302 neurons, and all
th ei r co nnections are known)
Drosophila fru it fly highly sophisticated genetics extenSively studied and huge FlyBa se at hnp:/Jnybase.bio
melanogaster and mutant screens, aided by Pl num bers of well -characterized .indiana.edul
transposon mutagenesis mutants
Doriorerio zebrafish selective gene inac tivation principal uses are for understa ndi ng ZFIN at http://znn.o rg /
by using morphoJino how genes function in early
oligonucleotides to inhibit development and in offering a rapid
expression at level o(translation; assessment of how a selected gene
also, use of transgenes to express functions in a vertebrate
dominant-negative mutant
proteins
Mus musculus mouse large-scale random mutagenesis the no. 1 model for understanding MGI (Mouse Genome Informatics)
screens (using chemical human gene fu nction because of the at http://Www.in formatics.jax.org/
mutagens or transgenes); larg e comb ination of being a mammal and
programs using homologou s th e capacity for sophisticated genetic
recombination to knock out manipulation
specific genes; expression
of mutant transgenes and
overexpression/ectopic
expression of normal transgenes;
tra nsgenic RNA interference
Nature Rev;ews Genetics has published a series of review articles on the Art and Design of Genetic Screens in mod e! organisms that includes the above
models as follows: yeast: Forsburg 5L (2001) Nat Rev. Genet. 2, 659-668: C. e/egans: Jorgensen EM & Ma ngo SE (2002) Nat. Rev. Genet. 3, 356-369;
Drosophila: 5t Johnston D (2002) Nat. Rev.Genet 3, 176- 188; zebrati sh: Patton EE & Zon LI (200 1) Nat. Rev. Genet. 2, 956- 966; mouse: Klle BT &
Hilton DJ (2005) Nat. Rev. Genet. 6, 557-567. Many of the technologies are explained in detail in Chapter 20.
~ 1 1
2-0 gels
phosphorylation
mass
spectrometry e.g. compa ri ng two
samples representing
1 J. J. J.
1
purifying
protein large-scale
b;OOh! 'caV
cell biology targets for
proteins and
assays for
assays protein complexes
drugs recombinant
proteins,
subcel lular
localization 1
protein
analyses
identifying
interactions
!
antibodies ,
"Inhibito(s,
drugs to disrupt
etc.
substrates of enzymes such as protein kinases. Antibody arrays have been com
paratively widely used.
Much of the focus of proteomics is currently being devoted to identifying
protein-protein interactions and protein complexes. The ultimate aim is to define
protein networks that will aid our understanding of how cells work.
In addition to binding to other proteins and to nucleic acids, proteins interact
with a large number of small molecules that act as ligands, substrates, cargoes (in
the case of transport proteins), cofactors, and allosteric modulators. In all cases,
these interactions occur because the surface of the protein and the interacting
molecule are complementary in shape and chemical properties. Protein-ligand
interactions provide information on how genes function, and they are also impor
tant for therapeutic purposes: the basis of drug design is to identify mol ecules
that interact specifically with target proteins and change theit activities in the
cell. Most drugs work by interacting with a protein and altering its activity in a
manner that is physiologically beneficial. We will consider such interactions in
the context of therapeutic applications in Chapter 21.
TABLE 12.5 COMMONLY USED METHODS TO DETECT AND VALIDATE PROTEIN-PROTEIN INTERACTIONS
Method Description Comments
Yeast two-hybrid (Y2H) relies on mating of haploid yeast cells express ing hybrid the specificity of interactions is often low; yeast cells
screening (Figure 12.7) proteins w ith complementary transcription factor domains; are not ideal enVironments for testing interaction
positive interactions bring together a novel transcription between mammalian proteins
factor with in the resu lting diploid yeast cells
Affi nity purification- mass a specific protein is covalen tly linked to an insoluble matrix; affinity purification is m ore specific than Y2H but does
spectrometry (AP-MS) this bound protein is then used to capture any proteins that not readi ly detect transient interactions; like Y2H, it has
(Figure 12.8A) it interacts with f rom a cell lysate; captured protein can be been used extensively to work out the detail of protein
eluted, purified, and analyzed by mass spectrometry interactomes in model organisms
TAP (tandem affin ity relies on two sequ ential rounds of affinity purification; this
purification) tagging is ac hieved by transfecting cells with a eDNA ofint'erest
(Figure 12.8B) coupled with a sequence that will add a terminal tag to
express two different proteins/peptides known to be bound
by other proteins
Fluorescence resonance microscopy method that can monitor interactions in real provides circumstantial supportive evidence only for a
energy transfer (FRET) time; it tests interact ion of suggested interact ing proteins supposed protein-protein interaction
(Figure 12.9) after labeli ng them with different fl uoroph ores
Co-i mmunoprecipitatio n a specific antibody is used to preCipitate a desired protein the gold standard for validating protein interac tion,
(co-IP; Figure 12.10) from a cell lysa te; if the protein is complexed to other especia lly when using endogenous proteins
proteins, the interacting protei ns should be co-precipitated
Pull-dow n assay a variant of co-IP that is identical in all aspects except that
it uses a specific protein ligand other than an antibody to
capture a protein complex
For more detail on these and other methods including protein microarrays, see the review by Lalonde et al. (2008) under Further Reading. Note that
the division into (A) and (B) is somewhat arbitrary. For example, a directed Y2H assay can be desig ned to test the interaction between two specific
proteins. On the other hand, although co-IP is often used for validating presumed interactions, it can also be used to isolate entire protei n complexes.
ability to bind to individual test proteins, and various assays can be performed to
validate interactions between specific proteins. We summarize the methods in
Table 12.5 and provide details of different methods in the next sections.
(A)
Sl"pt"idln "8 Flgure 12.7 Yeast two·hybrid IY2H )
screening. (A) An active transcription
factor comprises complementary DNA
~ binding (80) and transcription ac tivation
~r~1.a~
AD hybrid (AD) domains. A cDN A for either domain
can be spliced to cDNA for other proteins,
~
(B)
with a speCific test protei n (X) fused to a
(;';\ . bindin g domain for a specific transcription
00 \ ~ \.:) ~ fa ctor. The BD hybrid protein binds to a
r
II,
~ D"'
.~ ~ ,j (UAS) positioned before a reporter gene but
. 16 is inactive in the ab sence o f the relevant
the bait prey library
AD. Possibl e protein partners for protein
X are sought from withi n a prey library of
numerou s yeast strai ns containing different
proteins linked to the releva nt AD. Crossing
of the bait strain with a prey strain that
makes a protein which binds to protein X
(protein 3 in this example) can produce a
diploid cell in which AD and BD domains
(C) associate. As a result, the transcription factor
l' 2' 3' 4' etc. is reconstituted and drives expression of
12 3 4 etc.
r:O[)j)\II~'};
~Ii\Vtj
~
.
the reporter gene. (C) Crossing of yeast cells
c!~~~
from a library of cells expressing SO hybrid
proteins w ith cells from a libra ry expressing
T
AD hybrid proteins permits massively parallel
protein-protein interaction screening.
library of AD hybrids
library of BD hybrids
massively parallel
protein-protein
Interaction screen ing
(A) o
o
. 0
0 • • • •
• • • 0 high [Na;']
00
o
0 0
insoluble
matrix
'"
•••••
•••
•• •
'"
gel purification
mass spectrometry
e
~
I
- +
express
= ~
binding fEv
~
native
TAP tag
targe t in cells; protease \ elution
Pfo(ein prepare \ cleavage EGTA
contaminant fi , st affinity\ second affinity
cell lysate
column column (Ca2+)
Figur. 11.8 Affinity purification-mass spectrometry and TAP tagging for identifying protein-protein interactions.
(A) Srandard affiniry purification- mass spectrometry. Affinity purification starts with covalently coupling a protein of interest to
beads o f an insoluble matrix that is supported in some way, such as in a chromatography column . A cell lysate is passed over the
bead s to allow possible binding of proteins to the immobilized protein on the column. Proteins that are nonspecincally bound
can be removed by wa shing; any interacting protein can then be eluted with buffers of high salt concentration and subjected to
mass spectrometry analysis (see Figure 8.27) to identify constituent peptides and, ultimately, the protein . (8) TAP (tandem affinity
purification) ragging . An expression vector containing the coding sequence for a TAP-tagged protein is transfected into suitable cells
to all ow the target protein to bind to proteins in the cells. The C-terminal TAP tag shown here has a calmodulin-binding peptide
(CBP) and TWO immunoglobulin (lgG)-binding domains from Staphylococcus aureus protein A (ProtA) separated by a peptide
sequence recognized by the tobacco etch virus (TEV) protease. The protein complex is recovered in a two-step affinity purincatio n
procedure. The complex will be separated from some contaminants by binding to IgG beads. The TEV protea se 1s then used to cleave
the TAP tag. The released protein complex now with just a CBP tag can bind specifically to calmodulin beads and is subsequently
eluted and proces sed for analysis by mass spectrometry. EGTA, ethylene glycol tetracetic acid. [Adapted from Puig 0, Caspar y F,
Rigaut G et at (2001) Methods 24, 218- 229. With permission from Elsevier.)
Figure 12.9 Using fluorescence resonance energy transfer (FRET) to test a IAI
protein interaction. Fluorescence microscopy using FRET can be used to test 436 nm
protein interactions within cells in real time. (A) Two proteins of interest (1 and
2) are expressed as hybrid p roteins that have different fluorescent protein tags,
such as cyan fluorescent protein (CFP) and yellow fluorescent protein (YFP) ""'--,
( yep ,
shown here. The cells are then exposed to light of a defined wavelength that
will excite o nly one of the two flu orophores, in this case CFP. When e)(cited by
exposure to light of wavelength 436 nm, CFP emits blue light with a slig htly
longer wavelength of 480 nm and continues to do so if proteins 1 and 2 do not
Interact. (B) If, however, proteins 1 and 2 do interact. their attached fluorescent
proteins com e into very close proximity, and fluoresce nce energy is transferred
fro m the donor CFP to the acceptor YFP in a non-radiative manner via
dipole-dipole coupling (resonance). After excitation o f (FP, only a small residual
1
cS
amount of the absorbed light is now emitted at 480 nm; most of the energy is
transferred from CFP to the adjacent YFP. The acceptorYFP then emits light at a
181
different waveleng th (535 nm) that is readily distingui shed.
436 nm 480 nm
transferred to the other type of fluoro phore-the acceptor-ifthe two proteins to \ •.:'1 535nm
which the donor and acceptor fluorophores are bound come into extremely close
contact (within 1-10 nm). As a res ult, a significant amount of the energy absorbed
by the excited donor is emitted at a much longer wavelength than if the donor
-ef (v{
particular protein are added directly to a cell lysate. If the protein of interest is
bound to one or more other proteins, the antibody can be used to immunopre
\,--)
cipitate the target protein while it is still bound to interacting proteins. often as
part ofa complex (Flgure 12,10). A variant co -I P technique, called a pull-down
assay, uses a specific ligand in place of an antibody to capture a protein of inter
est as part of a complex. In both methods, the bound proteins can be identified
after peptide cleavage and mass spectrometry analysis.
Figul'e 12.1 0 (o-immunoprecipitation
The protein interactome provides an important gateway to (co-IP). This method is popularly used to
systems biology validate a suggested interaction between
two proteins, but it can also be used as
Affinity purification can be used to isolate entire protein complexes, allowing the a screening method to dire ctly identify
different components to be identified by mass spectrometry. This is one of the novel proteins that bind to a protein of
major technology platfo rms in interaction proteomics. It has been widely used for Interest. A cell lysa te (or other protein mix)
.. ~. ""
is incubated with an antibody specific
.• . ~.J •
for a protein of interest to form antigen
~ ~
• antibody antibody complexes. The resu lting immune
..: ..
..
• • ••
0
A complexes are then captured by binding
• •
-±r ~~
to agarose or Se pharose beads to which
• •
~ ~ ~
protein A or protein G has been coupled
~
•• • • .J •
~ ~~
(protein A and protein G are bac terial
•
• • • •. •
protein s that happen to bind strongly to
~ .: ~ :. ¥ ~
immunoglobulins). As a result, the immune
0
• •
complexes are precipitated out of solution
protein
and, after centrifugatio n and washing,
.c-
A(orG)
and capture
1 --' elute L 2~ the target protein bound via its antibody
to the beads. If the target protein was
already bound to an interacting protei n it
• ••••
0
• • •Q
• I,)
•• •Ct <I
••
• • too would be co-precipitated.The beads
can then be treated to elute the target
protein and any interacting proteins (2). To
• 0 • • '"
validate a suspected protein interaction,
~
western blotting is then conducted with an
antibody speCific for the presumed protein
western peptide partner. Alternatively, the method can be
blottir( cleavage used as a screening tool to identify novel
using anubody
specific for
suggested
interacting
protein
1
mass
spe<:trometry
interacting protein s by treating them with
proteolytic enzymes to generate pep tides
that can be analyzed and identified by mass
spectrometry.
400 Chapter 12: Studying Gene Function in the Post-Genome Era
BioGRID (General Repository for curated set of physical and genetic interactions for http://www.thebiogrid.org/
Interaction Datasets) more than 20 species; searchable by species
DIP (Database of Interacting Proteins) curated experimental ly determined interact ions http://dip.doe-mbi.ucla.edu/
between proteins
DrolD (Drosophila Interactions Database) a comprehensive gene and protein interactions http://www.droidb.org/
database designed specifically for the model
organism Drosophila but with useful links to protei n
interaction data from other organisms
HPRD (Human Protein Reference integrates data deposited in the Human Proteinpedia http://www.h prd.org/
Database) along with literature information curated in the
context of individual proteins
MINT (Molecular INTeraction database) experimentally verified protein interactions mined http://mi nt.bio.uniroma2.it/mintf\Ne lcome.do
from the scientific literature by expert curator,S
o_~ (},:~~
"rPowI
.1I"iJootJ ~ 1.'_..11 "IICNt ~, ~lI: 0 •
n~ O" H1' ;)-;: ~ ~~J ~_'~~ oSJ"''!'''';.
(> II" b '" ...:1
"10.. • 'j\, •. - '\ou,
- -0-:;: -""1 . -:;:" •..-, {Jr...n d...1.:t _ OF'>: ~, O'>1l" Oo,.oCiol1
tl'>.:ocsl O•.A.OtJ
"" "
membrane
fusi on ' "
vesicular
protein
degradatior,
t~or~
---- amino acid metabolIsm
_ - - - meiosis
/~.
I
protein
hybrid screen s. Only con nections for which
c yt 0 k '
lneSIS differentiation synthesis
there are 1 5 or more inter<lctions between
prolelo '",,'ooatlon \ / / proteins of the connected groups are
I
lipid fatty-aci d
and sterol
~
oude.,-<ytopla>m,o /
'''''PO'' oe" W e»
"gnal lr'OSdU<tio' \\ . \
~
s-- '"\.\
~ / '01"
/~
l"O\' C''P
,\ ti,
\. -......... RNA
..
/\pcooe»'og
RNA
turnover
• •
•••
FigUN12.13 DNase I footprinting. DNA molecules that are labeled (red)
at one end are allowed to bind to a protein (green). Partial digestion of any
unbound DNA with pancreatic DNase I can result in a ladder of DNA fragments
• •
of different sizes because the location of a cleavage site (gray vertica l bars) on
• •
••
an individual DNA molecule is essentially random (left). If, however, the DN A
contains a protein-binding site, incubation with a suitable protein ell tract may
• •
result in the binding of protei n at a specific region (right). Subseq uent partial
• •
•
digestion with pancreatic DNase I no longer produces random cleavage; instead,
the bound protein protects the underlying DNA sequence from cleavage. As a
• •
result, the spectrum of fragment sizes is skewed against certain fragment sizes,
leaving a gap when size·fractionated on a gel, a DNase I footprint. •
cell extract can be revealed by taking advantage of their comparative inaccessi
biliry to nucleases, such as pancreatic DNase I (Figure 12.13).
1 partial DNase I
!
In gel mobiliTy shift assays, a suspected regulatory DNA sequence can be
assayed for the abiliry to bind proteins, and the binding proteins can subse
•• •
• I
I
quently be purified. The basis of the method is that a labeled DNA clone bound
to protein migrates more slowly in polyacrylamide gel electrophoresis than does
unbound labeled DNA. Fractions of the gel that contain slowly migrating labeled
•• I
•
.........,
(A)
1 •••••••••••••,
(6) •
Q• •
• •
• label I DNase I
cleavage Site
• bounCl
protein
2 •••••• , ••••••:
.: . ~r
0() UC~
3 • •
4 • •• , ••••,.".1
5 •.,.•••. ~.~ ••••• c==::J
..
6 _!.!...'!· . . ···~_itiL . __ .J
~
10 15
!
••••••••••••• ' -,
oligonucleotides
(crosslil1king) within intact cells by using a reagent such as formaldehyde. The formaldehyde
cells a re then lysed and the crosslinked DNA-protein complexes are purified and
analyzed. ~wC
Formaldehyde is a highly reactive dipolar molecule whose carbon ato m acts
as a nucleophilic center. It reacts readily with both amino and imino groups
found in proteins and DNA. The most available reacting groups in proteins are
H/1"H
.. /H
the amino groups of lysine and arginine and the imino group of histidine. In R-N
DNA, the primary reacting groups are the amino groups of adenine and cytosine.
Figure 12.15 shows an example of how formaldehyde can cause crosslinking of macromQlecule
"H
macromolecules such as proteins and DNA.
The most popular method for mapping DNA-protein interactions ill vivo is
chromatin immulloprecipitation (ChIP; see Box 11.3). Proteins are cross-linked
il
to DNA within cells. The cells are lysed. and the chromati n is extracted and frag H H
mented by physical shearing or enzymatic treatment. DNA fragm ents that have a +1 1£>1
R= N -c -OH
crosslinked protein of interest are isolated by using an antibody specific for that
1 1
protein, and purified DNA fragm ents are analyzed to identify the location of the H H
binding site for the protein.
Schiff's base
The basic ChiP method is often devoted to investigating interactions between
a specific protein and a few genomic regions that are known or suspected to have
binding sites for that protein . More recent ChiP va riants permit the identification
of protein-binding sites on a genomewide scale. Am ajor motivation is to identify
Tl
H H
all the binding sites for transcription factors within a genome. In the ChIP-chip 110, . /
method (also called ChIP-on-chip, ChIPArray, and genomewide location analy R-N= C
sis), the ChIP-purified DNA is amplified, fluorescently labeled, and analyzed by T" H
hybridization to genomic DNA sequences distributed within a micro array. As an
b
alte rnative, sequencing of the ChIP-purified DNA (ChIP-seq) is becoming widely
used. In a test run of ChIP-seq by Johnson et al., a zinc finger repressor protein, W--(J -H
the neuron- restrictive silencing factor (NRSF), was found to bind to 1946 loca
tions in the human genome.
In addition to defining positions on DNA occupied by regulatory proteins,
ChIP technologies are routinely being used to m ap the positions within genomes
T!
H H
of various types of modified histone and methylated cytosine.
R _ L~_/YW
CONCLUSION
H
1 '-N
The study of gene function has been transformed in the post-genome era.
Experimental studies on a gene-by-gene basis have been augmented by figure 12. tS Formaldehyde-induced
genomewide screens. In addition, the many databases that have arisen from crosslinking. Formaldehyde treatment
genome projects and new analytical tools for their interrogation have driven can result in covalent linking of prmeins
much of the functional genomics revolution . to DNA o r other proteins, often throug h
The function of a gene can often be inferred from sequence homology searches amino or imino groups. In this example, a
oUncreasing sophistication. Experimental ap proaches are then needed to verify nucleophilic attack causes cova lent bonding
this putative function. Whereas classical genetics began with an observed pheno between formaldehyde and an ami no group
type and searched to find a gene responsible for it, reverse genetiCS now allows us carried by a macromolecule (R), A Schiff
base intermediate can then form that can
to alter a gene or its expression to create an altered phenotype. Model organisms,
be covalently linked to a suitable chemical
in particular the mouse, have enabled such stu dies at an organismal level. These group (in this case, an imino group) on
are explored fur ther in Chapter 20. More recently developed techniques such as a different macromolecule (R'). The end
RNA interference (RNAi) have been widely employed in genomewide studies of result is that a -CH 2- bond contributed by a
gene functi on in cultured human and animal cells. formaldehyde molecule is used to crosslink
Experimental techniques that determine the specific interactions between the macromolecules Rand R'.
proteins and other ligands underpin the new field of proteomics. Yeast two
hybrid assays and affinity purification-mass spectrometry techniques can be
used to screen for binding partners for specific proteins. Techniques such as fluo
rescence resonance energy transfer and co ~immunoprecipitation can then be
used to validate suggested protein-protein interactions.
The wealth of information from both genomewide experimental screens and
bioinformatics is enabling the mapping of vast protein networks of interacting
proteins in cells. Genes that encode functional noncoding RNA have not been
explored so extensively, partly because identifying RNA genes has been a com
paratively recent endeavor and is ongoing. Nevertheless, we are increasingly
aware of the importance of many different noncoding RNAs in gene regulation
and in disease; as we will see in Chapter 20, genetically manipulated model
organisms are being used to dissect their functions.
Chapter 13
KEY CONCEPTS
Human DNA variants can be classified as large scale versus small scale, common versus
rare, and pathogenic versus nonpathogenic.
• Single nucleotide polymorphisms (SNPs) are the most numerous variants.
Short tandem repeat polymorphisms (STRPs or microsatellites) are very common. The
repeat units are most commonly I, 2, or 4 bp long, and the string of tandem repeats usually
extends over less than 100 bp. The number of repeats in a string often varies between people
as a result of polymerase stuttering during DNA replication.
• Minisatellites are tandem repeats with longer repeat units, typically 10-50 bp. Again, the
number of repeat units in a run often varies between people, but this is most often because
of unequal recombination. MinisateUites are commoner toward the telomeres of
chromosomes.
• Many larger sequences (between I kb and I Mb) show variation in copy number between
normal healthy people. About 5% ofthe whole genome can vary in this way. A common
cause of such variation is recombination between mispaired repeated sequences.
• Genomic DNA is subject to constant processes of damage and repair. Most damage passes
unnoticed because it is efficiently repaired. Failures in repairing damage or correcting
replication errors are the ultimate cause of most sequence variation.
Most common variants are not pathogenic, although they might, in combination with other
variants, increase or decrease susceptibility to multifactorial diseases. They are useful as
markers (variants that can be used to recognize a chromosome or a person) in pedigree
analysis, forensics, and studies of the origins and relationships of populations.
Pathogenic effects can be mediated by either a loss of function or a gain of function of a
gene product. Very many different changes in a gene can cause a loss of function. Usually
only one or a few specific changes can cause a gain of function.
• Loss-of-function changes usually lead to recessive phenotypes, and gain-of-function
changes to dominant phenotypes. Dominant phenotypes can also result from loss of
function if a 50% level ofthe normal gene product is not sufficient to produce a normal
phenotype (haploinsufficiency). or ifthe protein product of the mutant allele interferes with
the function of the normal product (dominant-negative effects).
• Molecular pathology seeks to explain why a certain genetic change causes a particular
clinical phenotype. However, well-defined genotype-phenotype correlations are rare in
humans because humans differ greatly from one another in their genetic make-up and
environments.
406 Chapter 13: Human Genetic Variability and Its Consequences
This chapter is about the differences that exist between different human indi
viduals in the DNA sequence oftheir genomes (differences between humans and
other species were covered in Chapter 10). Now that we have complete genome
sequences for several named individuals, we can see that they differ from one
another in millions of ways. Most of the many differences between individual
human genomes seem to have no effect whatsoever. Other differences do affect
the phenotype, producing the normal range of genetically determined variants in
body bu.ild, pigmentation, metabolism, temperament, and so on, that make each
of us individual. All these are normal variants. Some variants are pathogenic
that is, they either cause disease or make their bearer susceptible to a disease that
they mayor may not actually develop, depending on their other genes, their life
style, their environment, or pure luck. In this chapter, we consider first the nor
mal variants and then those that are pathogenic. There is one further level of
genetic variat ion that we do not consider here: epigenetic variation (variations in
DNA methylation and chromatin conformation) between individuals. This was
described in Chapter 11, but it is still a matter of speculation how much Our indi
,~du al phenotypes are due to genetic and how much to epigenetic differences
between us.
The word polymorphism is used by human geneticists to mean commonest mutation that causes cystic fibrosis in northern
several different things at different times. Needless to say, this can European populations (labeled p.F508del; see Box 13.2 for an
cause confusio n. explanation of the notation) has a frequency of0.Q1 - 0.02 in
Molecular geneticists often describe a va riant as a polymorphism northern European populations. As shown in Chapter 3, t his cou ld
if its frequency in the population is above some arbitrary value, not be sustained by recurrent mutation in the face of the selective
often 0.01. For example, SNP rs1447295 (see Box 13.2 for an pressure against people with cystic fibrosis.
explanation of the notation) at 8q24 is a polymorphism with two Clinica l ge neticists often use polymorphism to mean a non
alleles, C and A, with frequencie s 0,93 and 0.07, respectively, in pathogenic variant, regardless of its frequency. Pathogenic
the European HapMap populati on. This is the meaning we use in variants, whether common or rare, would be described as
t hiSbook, unless otherwise specified. Variants whose frequency is mutations.
below the arbitrary threshold might be described as rare variants. The word mutation can also be used with two different senses,
PopUlation geneticists define a polymorphism as the stable referring to either the process or the product:
coexistence in a population of more than one genotype at an event that changes a DNA sequ ence- UV radiation produ ced a
frequencies such that the rarer type could not be maintained mutation in the DNA.
simply by recurrent mutation. On this definition, some pathogenic a DN A sequence change that may have happened a long time
mutat ions would count as polymorphisms. For example, the ago-she had inherited a mutation from her father.
TYPES OF VARIAT ION BETWEEN HUMAN GENOMES 407
BOX 13.2 NOMENClATURE fOR DESCRIBING DNA AND AMINO ACID VARIANTS
Each single nucleotide polymorphism (SNP) in the public dat abase Amino acid substitutio n s
dbSNP (h ttpj/www.ncbLnlm.nih.gov/ projects/ SNP) can be refe rred Use either on e-letter codes (X m ea ns a stop codo n) or three-letter
to by its uniq ue id enti fie r, such as rs212570, whe re rs stands fo r codes. Proteins are nu mbered w ith the in itiator methionine as
reference SNP and 212570 is a unique serial num ber. Short tandem codon 1.
repeat polymorp hisms (STRPs) have identifi ers such as 065282, w here p.Rl 17H or Arg117His- replace argi nine 117 by hist idi ne.
o stand s for DNA segment, 6 is the number of t he chromosome on p.GS42X o r Gly542Stop- rep lace the co don fo r g lyci ne 542
which the marker is located, S means a single copy seque nce, and with a stop codon.
intervening seq uence) or the num ber of th e nearest exon pos ition.
9.621 +1 G> Tor IVS4+lG> T-re pl ace G byT at the fi rst base of
anything from 0.5 downward. We would not expect h igh-frequency SNP alleles to
h ave any major phenotypic effect, because n atural selection should have ensured
that they were eliminated if harmful or fixed (present in everybody) if beneficial.
However, those evolutionary processes might be incomplete at present for some
particular allele. Additionally, heterozygote advantage is a mechanism that can
produce stable polymorphis ms even when one allele is ha rmful in homozygous
form. This hap pe ns whe n asymptomatic carrie rs of a deleterious recessive condi
tion have some selective adva ntage ove r n ormal homozygotes, as wi th sickle-cell
anemia (see Chapter 3, p. 64). The argument from natu ral selection is less power
ful wi th rare SNPs, but even so, these wo uld also n ot generally be expected to
h ave a phenotypic effect, if only because most SNPs are not located in coding or
regulatory sequences.
Some polymorp hic sites have more complex variants. A SNP may have th ree
alleles, say A, G, or C, or two SNPs may be adjacent to one another. The great
majo rity of mo re complex variants in the dbSNP database (about 2 million of the
12 million enu'ies) are deletion / insertion polym orphisms (DIPs or indels).
TypiCally one or two nonrepeated nucleotides are inserted or deleted (Figure
13.16) . Insertions or deletions ofrepeat units have different ca uses and dynam (A)
ics, and are disc ussed below in the section on short tandem repeats (p. 408) . GCeTGT TT T AT AT '1' AC /T GATe CAi'. T TTT TTC A
So metimes a SNP will affect a nucleo tide that is part of th e recognition site for
one or ano ther restriction enzyme. Al though many hundreds of restriction ~
GAGACAGAGTTTC GC ( T) !CTTGTT GCCCAG GCT
enzymes are known, most of the m recognize a palindromic sequence such as
GAATTC. The sequence is palindromic because the complementary sU'and , read ~
in the 5'--73' di rection, also reads GAATIC: CCAAGCCTGGAGCT1 /0GC CG TGGGCCAGGCA AG
5' - ACGATTGAATTf TAGGA TC- 3'
Flguro 13:.1 Type s of single nucleotide
3' - TGCTAA CT TAAGATCC TAG - 5'
•
Only abo ut 10% of all nucleotides fall in such palindromic sequences, so most
polymorphi sm (SNP). (A) A simple SNP.
rs2 12S70 is a CfT va riant o n chro m o so m e
20. (6) A d eletio n/insertion polym o rphi sm.
SNPs do n ot affect any restriction site. When a SNP happens to lie within a restric
rs36 126541 is insertion or de letio n of
tion site, only one allele will re tain the necessary recogni tio n sequence. Thus, the a sing le T nucleotide in a sequence
polymorphism will create or abolish a restriction site. SNPs of this typ e (Figure o n chromosom e 12. (C) A restriction
13. 1C), known as restriction fragment length polymorphisms (RFLPs) or fr agm ent leng th po lymo rph ism (RFLP).
restriction site polymorphisms (RSPs), were the first widely used DNA markers SNP rs36078338 creates or abolishes th e
and we re used to create the fi rst genetic linkage map of the human genome (see sequence GCTAG(, which is the recog ni tio n
Chapter 8). site for the restric ti on enzym e Nhel.
408 Chapter 13: Human GeneticVariability and Its Consequences
DC 'f fUESCE
-I, ! _~~
(UEC), or they could be on sister chromat id s,
producing an unequal sister chromatid
_ j exchange (UESCE). In either case, the result
1'2 3 is to produce two chromatids, one w ith one
or more extra repeat units, the o ther w ith the
unequal unequal sisler
correspond ing units missing . For simplicity,
crossover chromatid exch ange
the breaks are shown as located between
(UEC I (UESCEI
repeat units, but they could equally occur at
J, J. corresponding positions within units.
:2 3 1 2:2 3
B c'=nJ~D=:l (. C ([T'jIr -x::::t=J 0
: + ,: +
1 1 2 3 1 '3 normal replication
G( l: ITj '=cx=J B oC- Cr'
,
'1--~--yy--~ c
5'
1 2 3
[!IJ1!1lI[!!;1 3'
3' riIIllii rill liD liD rill 5'
1 2 3 4 5 6
defining and mapping sufficient STRPs to construct a high -resolution marker
framework map across the hwnan genome; about 150,000 STRPs have been J.
identified. Although STRPs still remain the tool of choice for forensic work, link 1 2 3 4 5 6
age work has moved to using SNPs because they can be genotyped on a huge
5' [!IJ1!1lI1!1lIr&11!1l1~ 3'
Large, scale variations in copy number are surprisingly frequent in backward slippage causes Insertion
human genomes 2
I!1lI
Variants large enough to be seen by a cytogeneticist under the microscope are 1 ~
almost always the result of isolated pathogenic accidents and are not part of nor
5' 311il1l 3'
3' Ilii[iII 1m riIIl!II !ill 5'
mal human variation. Cytogeneticists recognize just three types of relatively 1 2 3 4 5 6
common normal variant:
• The size of the hete rochromatic regions at the centromeres of chro mosomes
J.
1 2 2 3 4 <; 6
1, 9, or 16 and the long arm of the Y chromosome may vary. 5' 1!I'J~[!1iJ3Il!1lIl!1lIrBiI 3'
• The short arms of the acrocenrric chromosomes (chromosomes 13, 14, 15, 21,
and 22, which have the centromere close to one end-see Figure 2.15) vary
considerably in size and morphology. Often the most distal part appears as a forward sli ppage causes deletion
so-called. satellite connected to the main body ofthe chromosome by a short,
1 2 3 5
thin stalk. Note that this use of the word satellite is unrelated to its use to S' [iil1!1lI1!1lI1!1lI 3'
describe DNA tandem repeats. 3' IiDlmriillmriil 5'
1 2 a 5 6
• A variery of fragile sites-segments of uncoiled chromatin-may be seen
when cells are cultured under conditions tha t make DNA replication diffi
rill
4
cult-for example, by starving the cells of thymidine or adding the DNA
polymerase inhibitor aphidicolin (Figure 13,5). Most fragile sites are normal J.
variants, although the FRAXA and FRAXE fragile sites discussed below (see 1 2 3 5 6
p. 424) are pathogeni c.
5' I!1lIlilIlrBilliDDm 3'
All these changes reflect varying copy numbers of tandemly repeated Flgur.13A Slipped-strand mispairing
seque nces. Until recently it was assumed that any deletion or insertion of a large during DNA replication. A new DNA strand
stretch of non-repetitive DNA would be pathogenic, The advent of techniques (pink) is being synthesized, using the blue
strand as template. During normal DNA
such as array- co mparative genomic hybridization (in which test and control
replication the nascent strand often partly
DNAs compete to hybridize to a micro array: see Chapter 2) and whole genome
dissociates from the template and then
SNP arrays made it relatively easy for the first time to sca n whole genomes for reassociates. When there is a tandemly
extra or missing material, and it quickly became apparent that many large-scale repeated sequence, the nascent strand
changes can be found in the genom es of normal healthy people. A major study by may mispair with the template when it
Redon et al. of 269 health y individuals from four geographically distinct popula reassociates. This can result in the newly
tions (the HapMap sample; see Chapter 15) identified 1447 regions with variable syntheSized strand having fewer or more
numbers of copies of sequences of at least 1 kb in length, In total these variable repeat units than the template w and.
410 Chapter 13: Human Genetic Variability and Its Consequences
regions were reported to cove r 360 Mb, or 12% of the genome (although that esti
mate has since been revised downward). and included many known genes. The
mean size of the variable segments was 250 kb. The techniques used were rela
tively insensitive to smaller, kilobase-sized variants. Many such smaller variable
segmen ts would have been missed. These smaller variants can be identified using
the technique of paired-end mapping. This involves selecting random fragments
of genomic DNA of a known size, and seq uencing a few dozen n ucleotides from
each end of a fragment. One can then compare the distance apart of those end
sequences in the selected fragment (whose size is known) with the distance apart
of the same sequences in the reference human genome (Figure l3.6). If the sepa
ration is greater in the selected fragment than in the reference genome, the frag
ment must contain an insertion that is not present in the reference genome; con
versely, a lesser distance indicates a deletion. Recent studies have shown that
these smaller va riants are even more frequent than larger ones (Figure 13.7).
Although SNPs are individually more numerous, copy-number variants f.lgur.1),$ A chromosomal fragile
(CNVs) are responsible for by far the greatest number of nucleotides that differ site. A stretch of relati vely uncondensed
between two genomes. Thus, the human genome is considerabl y more variable chromatin in an otherw ise highly condensed
metaphase chromosome. There are about
between normal healthy people than had previously been assumed. The studies
120 locations in the human genome where a
mentioned above did not reveal, when there is a gain in copy number, whether chromosome of a normal healthy person ha s
the multiple copies of the sequence are scattered across several chromosomal some lendency to show a fragile site when
locations or clustered in tandem. Other studies show that they most commonly cells are cultured under special conditions.
form tandem clusters. The Database of Genomic Variants (TCAG; http://projects Shown here is an X chromosome with the
.tcag.ca/variation /) is a database of variants seen in apparently healthy people, pathogenic fragile site FRAXA (gray arrow).
whereas rhe Decipher database (http://decipher.sanger.ac.uk/) records variants (Cou rtesy of Graham Fews, University of
seen in people with phenotypic abnormalities (although whether any particular Birmingham .)
variant is the cause of the phenotypic abnormality is seldom known for certain).
The genome sequences of single individuals that are now becoming avaiJable
give a striking picture of the overall variability present in ostensibly healthy
individuals.
The pioneer genome scientist Craig Venter was the first individual whose full
diploid genome sequence was determined. The reference human genome
genomic DNA
1
',agment.
select 3 kb fragmen ts.
ligate biotlnylated adaptors
bio blo
~1-------1~
,
1 ligate to promole
circularization
.
~:~ ,.J
bio
Figure 13.6 The principle of paired end
I digest ...... ith restriction enzyme. mapping. In paired end mapping, small
select fragments containing biotin size-sele< ted fragments of sheared genom ic
DNA from BAC or PAC clones are used,
bio which are marked by ligati on to biotinylated
-w s __ I adaptors. In both cases, the aim is to identify
I a seq uence representing the two ends of the
bia origInal DNA joined by the circularization.
In chromosome jumping this was to fin d
l
sequence .
m ap ends to re fere nce sequences (g reen) that were 80-130 kb
from a known sequence (blue), in order to
.
genome seQuence
reference genome speed up construction of a clone contig
sequence ~. . .- .....
..- -c _ • ~ across a large genomic region. In paired end
r; '. ~ mapping, the aim is to identi fy structural
'"
,
study genome --e!<::--::;_-
~
01-
~
" : variants. by finding sequences that are a
different di sta nce apart in the test genome
than in the reference human genome
no variant deletion Insertion
sequence.
DNA DAMAGE AND REPAIR MECHANISMS 411
E
permission from the Database of Genomic
0
c Va ri an ts http://projec ts.tcag .ca/va riation,
3000
accessed January 2009.)
2000
1000
0
1-10 10-50
I
50-100 10Q-.200 200-S00 500-1000
sIze ranges (kb)
-
>1000
adenine
agents can transfer a methyl or other alkyl group onto DNA bases and can
cause cross-linking between bases within a strand or between different DNA
strands.
Although external agents are more visible, the major threats to the stability of
a cell's DNA come from internal chemical events, such as those illustrated in
Figures 13.8 and 13.9:
• Depurination-about 5000 adenine or guanine bases are lost every day from
each nucleated human cell by spontaneous hydrolysis of the base-sugar link
(see Figures 13.8 and 13.9A).
• Deamination-at least 100 cytosines each day in each nucleated human cell
are spontaneously deaminated to produce uracil (see Figures 13.8 and 13.9B).
Less frequently, spontaneous deamination of adenine produces
hypoxanthine.
• Atta.ck by reactive oxygen species-highly reactive superoxide anions (0,-) and
related molecules are generated as a by-product of oxidative metabolism in
mitochondria. They can also be produced by the impact of ionizing radiation
on cellular constituents. These reactive oxygen species attack purine and
pyrimidine rings (see Figure 13.8).
• Nonenzymatic methylation-accidental nonenzymatic DNA methylation by
S-adenosyl methionine produces about 300 molecules per cell per day of the
cytotoxic base 3-methyl adenine, plus a quantity of the less harmful 7 -methyl
guanine (see Figure 13.8). This is quite distinct from the enzymatic methyla
tion of cytosine to produce 5-methyl cytosine, which cells use as a major
method of controlling gene expression (see Chapter 11). The methylated ade
nine or guanine bases distort the double helix and interfere with vital
DNA-protein interactions.
In addition to damage of these various types, errors arise during normal DNA
metabolism. A certain error rate (incorporation ofthe vvrong nucleotide) is inevi
table during DNA replication. Proofreading mechanisms correct the great major
ity of the resulting mismatches, but a few may persist, producing sequence vari
ants. Failure of the proofreading mechanism in somatic cells is one cause of
cancer (see Chapter 17). In addition, occasional errors in replication or recombi
nation leave strand breaks in the DNA, which must be repaired if the cell is to
survive.
DNA DAMAGE AND REPAIR MECHANISMS 413
cyto sine
(e) 6 H H, / H o H
N
!~':d'
H20 NH3
~
H0 N
) LNAo lNAo
o='I-O-~H, /0, I
H20 NHJ
0= 'I- 0-1H,/0, I
~ ~
~
0- 0-
(e)
° H
5-methylcytoslne
° H
H, /H
N o thymine
H,c0
O=P-O-C , ,jl NAo
H
/N
0='1- 0
H'CY'J:
-1/ 0,,1
N 0
I
0
Q o
0
H
0- ~
o H
I I
'°
H/ I
"-c_ C/
I' CH3
S-m eth yl cytosine to thymine.
radiation-induced thymine dimers.
to) Ultraviolet
o =P-o-
I
o
I 0
[:/0",-<r- 1\=0
''oH H H/
C-C/
,
CH,
Human cells use at least six different DNA repai r mechanisms. Three of the surrounding DNA. As in BER, the gap is filled by resynthesis
of them are used to correct abnormal modified bases when these and sealed by DNA ligase. Nucleotide excision repair is also used
are present in just one of the two strands of the DNA- that Is, they to correct single-strand breaks in the sugar-phosphate backbone.
are paired with a norma l base. The damaged base can be repaired, or Dire<.t reversal of the DNA damage is an infrequently
removed and replaced. used me<:hanism of DNA repair in humans. Tti ree human
Base exdsion repair (BER) corrects much the commonest genes have been implicated in this mechanism, of which the
type of DNA damage (of the order of 20,000 altered bases in best-characterized encodes the 0-6-met hylg uanine-DNA
each nucleated cell in our body each day). BER glycosylase methyttransferase, which is able to remove methyl groups
enzymes remove abnormal bases by breaking the sugar-base from guanines that have been incorrectly methylated. In many
bond (t:=igure lA). Humans have at least eight genes encoding organisms, UV radiation-produced thymine dimers can be directly
different DNA glycosylases, each responsible for identifying and resolved by the enzyme photolyase using the energy of visible
removing a specific kind of base damage. After removal of the light (photoreactivation). Although mammals possess enzymes
damaged base, an endonuclease and a phosphodiesterase cut related to photo lyase, they use them for a quite different purpose,
the sugar-phosphate backbone at the position of the missing to control their circadian clock.
base and remove the sugar-phosphate residue. The gap is filled In the repair processes described above, the second, undamaged
by resynthes is with a DNA polymerase, and the remaining nick is DNA strand acts as a template for accurate reconstruction of the
sea led by DNA ligase III. damaged strand. Damage that affects both strand s of DNA requires
Nucleotide excision repair (NER) removes thymine dlmers different mechanisms. There are two main proceSses.
and large chemical adducts (Figure 1B). NER removes and In homologous recombination, a single strand from the
resynthesizes a large patch around the damage, rather than just a homologous chromosome invades the damaged DNA and acts
sing le base as in BER. The sugar-phosphate backbone is cleaved at as a template for accurate repair (F1gurw lA). This type of repa ir
the site of the damage, and exonucleases remove a large stretch normally occurs after DNA replication but before cell division,
U C
I:
C G c c G C G FIgure 1 DNA repair mechanisms.
A
G
T
U
I I
A
G )
A
G
T
t )
A
G
T
C
(A) Base excision repair. (B) Nucleotide
excision repair.
T A DNA glycosylase T endonuclease T A DNA polymerase T A
G C G phosphodiesterase G C DNA ligase III G C
~
er
G
C
A
G
T
C
A
G
T ~
() C
A
I G C
A
G
T
G C G C r G T C.... O G C
T
C
A
G
T
c
A
G et T
C
t
T
C
A
G
iJ ~
A ) A ) A ) A T
A nuclease A DNA helicase A DNA polymerase A T
T A T A T DNA ligase T A
A T A T A A T
G C G C G G C
G C G C G C G C
1
(A) Homologous recombination . The
loss of nucleotides due
l
nucleotides at the junction . [From
end proceSSing and resynthesis •
deletion of DNA sequence Alberts B, Johnson A, Lewis J et al. (2007)
of gaps using sister chromatid
as a template Molecular Biology of the Cell, 5th ed.
Garland SciencefTaylor & Francis LLC.)
DNA DAMAGE AND REPAIR MECHANISMS 415
and involves the sister chromatid. The eukaryotic machinery chromosomal rearrangements. but it is better than leaving breaks
for recombination repair is less well defined than the excision unrepaired. One reason why chromosoma l telomeres require a
repair systems. Human genes involved in this pathway include special structure is to protect normal chromosome ends from this
NBS (N ijmegen breakage syndrome; OMIM 60266n. BLM (Bloom double-strand break response.
syndrome; OMIM 604610), and the breast cancer susceptibility Afinal repair mechanism Is concerned with correcting mismatches
genes BRCA2 (OMIM 6001 85) and BRCA 1(OMIM I 13705). caused by replicatio n errors. Cells deficient in mismatch repair have
In nonhomologous end joining, large multiprotein complexes mutation rates 100-1 OOO-foid normal, with a particular tendency
are assembled at broken ends of DNA molecules, and DNA ligases to replication slippage in homo polymeric runs (see Figure 13.4). In
rejoin the broken ends regardless of their sequence (Figu re 28). huma ns, the mechanism involves at least five proteins, and defects
There is always some loss of DNA sequence from the broken ends. cause Lynch syndrome (OMIM 120435 and OMIM 609310). Details of
This is a desperate measure that is likely to cause mutations or the mechani sm are given in Chapter 17.
Even if a cell is able to survive with damaged DNA, unrepaired damage is likely
to be mutagenic. If mutations occur in germ cells, they can give rise to new vari
ants in the population that provide much of the raw material for evolution as well
as the intraspecific variants considered in this chapter. A major source of muta
tions is error-prone trans-lesion DNA synthesis. The sloppy polymerases t; (zeta)
and l (iota) are able to replicate damaged DNA, bypassing stalled replication
forks-but at the cost of a high error rate. Additionally, modified bases may mis
pair during DNA replication, introduCing permanent sequence changes. For
example, uracil, produced by the deamination of cytosine, pairs with adenine, as
does the 8-hydroxy guanine that is a product of oxidative attack on DNA.
A special case is 5-methyl cytosine. When cytosine is deaminated, the prod uct
is uracil. This is an unnatural base in DNA that is efficiently recognized and cor
rected. However, 5-methyl cytosine is deaminated to thymine, a natural base in
DNA (Figure 13.9C). If the resulting G-T mispairing is not corrected before the
next round of DNA replication, one daughter cell will have a permanent C....T
mutation. Evidence from both evolutionary studies and human diseases shows
the importance of deamination of 5-methyl cytosine as a source of sequence
changes. Methylated cytosines are almost always found in CpG sequences (that
is, the nucleotide 3' of cytosine is guanosine)' and CpG sequences are mutational
hotspots (see also Chapters 8 and 11).
In response to these various forms of damage, cells deploy a whole range of
repair mechanisms. Different mechanisms correct different types of lesion [Box
13.3). The importance of effective DNA repair systems is highlighted by the
roughly 130 human genes participating in DNA repair, and by the severe diseases
that affect people with deficient repair systems.
251260}, and Bloom syndrome (OMIM 210900) . Clinically, patients may suffer
symptoms including hypersensitivity to sunlight, neurological and /or skeletal
problems, anemia, and especially a high incidence of various cancers. In the lab
oratory, cells from patients with these disorders show hypersensitivity to various
DNA-damaging agents.
Complementation groups
cell A
o
cell B
In any case, for many gene products no laboratory test is available that checks all
aspects of the gene's function in vivo. So me variants may be pathogenic only at
times of environmental stress, and others may have subtle effects that manifest
as susceptibility to a disease, perhaps only when in combination with certain
othe r genetic var iants.
In the absence of a definitive functional test, the nature of the sequence
change often provides a clue. First we can ask whether the variant affects a
sequence that is known to be functional. Such sequences would include the cod
ing sequences of genes, sequences flankin g exon-intron junctions (splice sites),
the promoter sequence immediately upstream of a gene, and any other known
regulatory seq uences. The great majority of all known pathogenic variants affect
sequences that were already known to be functional, and these comprise only a
small percentage of our total DNA. However, it is always possible that a variant
located outside any known functional sequence might lie in a currently unidenti
fied functional element. As we saw in Chapter II , the ENCODE project is reveal
ing many previously unsuspected functional elements in the human genome.
Such elements are suspected to be locations for variants that merely alter suscep
tibility ro a disease, rather than directly causing any disease.
[fa variant does affect a known functional sequence, we must try to predict its
effect. A table of the genetic code (see Figure 1.25) can be used to identify the
effect of a coding sequence variant on the protein product of a gene. As described
below, nonsense mutations, frameshifts, and many deletions can be confidently
predicted to wreck the protein. Similarly, changes to the invariant GT...AG
sequences at splice sites are highly likely to be pathogenic. Changes that merely
replace one amino acid with a different one (missense changes) are more diffi
cult to interpret.
Another approach is to look for precedents. Maybe a variant is already docu
mented in dbSNp, the database of single nucleotide polymorphisms (see above).
Alternatively, it may be documented in one of the databases of pathogenic muta
tions listed in Further Reading. A different sort of precedent can be sought by
checking the normal sequence of related genes. These may be in humans (para
logs) or other species (orthologs). If the variant is present as the normal, wild
type sequence of a related gene it is unlikel y to be pathogenic.
Further aspects of this problem are considered in Chapter 18, where we dis
cuss genetic testing. In the rest of this section we consider some of the many ways
in which a change in a functional sequence can be pathogenic.
IA)
v H L T P E E K S
normal H88 gene GTG CAT CTG ACT CCT GAG GAG AAG TCT
v H L T P V E K S
sickle·cell mutation GTG CAT CTG ACT CCT GTG GAG AAG TCT
ID)
cell mutation is pathogenic because it replaces a polar amino acid with a nonpo
lar one on the outside of the globin molecule (Figure J3.11). This makes the
molecules tend to stick together. Protein aggregation, the result of abnormal pro
teins having sticky external areas, has emerged as a common pathogenic mecha
nism in a variety of diseases, especially progressive neurodegenerative condi
tions, and is discussed further on p. 425.
It is seldom possible to predict these effects with much confidence. It helps if
the three-dimensional structure of the protein has been solved, so that one can
model the likely structural effect of a substitution. If amino acid sequences of
related proteins (from humans or other organisms) are known, we can see which
amino acids are invariant and which seem free to vary widely between species.
Most amino acid substitutions probably have no effect on the functioning of a
protein.
Nonsense mutations
Three of the 64 codons in the genetic code are stop codons, and so it is quite com
mon for a nucleotide substitution to convert the codon for an internal amino
acid of a protein into a stop codon. When ribosomes encounter a stop codon they
dissociate from the mRNA, and the nascent polypeptide is released (Figure
13.12.A). However, genes containing premature termination codons seldom
cause production of the truncated protein that might be predicted. Cells have a
mechanism, nonsense-mediated decay (NMD), that detects mRNAs containing
premature termination codons and degrades them. Thus, the usual result of a
nonsense mutation is to prevent any expression of the gene.
NMD works because the spliced mR NA that travels from the nucleus to the
ribosomes retains a memory of the positions of the introns. The splicing mecha
nism leaves proteins of the exon junction complex (EJCj attached to splice sites.
During the first (pioneer) round of translation, as the ribosome passes each splice
site it clears the ETC proteins attached to that site. Ifthere is a premature termina
tion codon, the ribosome will not have traversed every splice site before it
detaches. Some EJC proteins will remain attached to the mR NA, and this marks
dle mRNA for destruction (Figure 13.128).
PATHOGENIC DNA VARIANTS 419
" •
265 V R R change in exon 6 of the PAX3 gene replaces
5 N R R A
"
GTe TGG TTT AGe MC eGC CGT GCA AGA TGG AGG AAG CAli. GeT GGG Gee AAT
K Q A G A. N
the TGG codon for tryptophan 274 with a
TGA stop codon. (B) Nonsense-mediated
mulant
decay (NMD). Thi s mature mRN A was
265 V W
•
5 N R R A R STOP R K Q
GTC TGG TT T AGe MC eGC CGT GeA AGA TGA AGG AAG e M
A G A N
GeT GGG Gee ..... AT transcribed from a gene that ha s seven
,
exons. Splice junctions (red bars) reta in
proteins of the exon junction complex
-,____I
(B) (EJC, red triangl es). As the first ribosome
~ moves along the mRN A, it displaces the EJC
I I I I proteins. If it en cou nters a premature stop
codon and detaches before displacing all
EJCs, the mRN A is targeted for degradation.
Stop codons in the last exon or less than
50 nucleotides upstream of the last splice
(C)
junction (the g reen zone) do not trigger
exon l 2 3 4
NMD. (C) Depending on w hether or no t a
'==CJ:=:iF==[D=;f/ T H !;:;;:;:;;:' J
0=1 premature stop codon triggers NMD, the
Tm consequences of a nonsense mutation can
be very different. Mutation s in the SOXIO
gene that trigger decay (green arrows)
Nonsense-mediated decay is not always fully effective. It does not apply to result in Waardenburg syndrome type 4
premature stop codons that are in the last exon of a gene, or less than about 50 (hearing loss, pigmentary abnormalities;
nucieotides upsU'eam of the last splice junction. [n some cases, some quantity of Hirschsprung di sease; OMIM 277580).
truncated protein is produced even when the stop codon is not in this protected Nonsense mutation s in t he 3' region of the
zone. Truncated proteins are potentially more pathogenic than a simple absence mRNA that escape NMD (red arrows) cause a
of the protein (Figure 13.12C) because they have the potential to interfere with much more severe neurological phenotype.
the functi on of the normal product. Such dominant-negative effects will be dis Shaded areas are cod ing sequence, clear
cussed later in this chapter (see p. 431). [t is assumed that NMD has arisen to areas are the 5' and 3' untranslated regions
protect against this problem. of the gene.
exon 19
- - II 1
taaaa c~gccg ag taa
codon is in green. (B) A Sing le nucleotid e
change deep within intron 19 o f t he cystic
fibrosis transmembrane conductance
12,183 nt 2,468 nt regulator encoded by t he CFTR gene (called
1
3849+ 10 kb C-+ T. although actually the
-
changed nucleotide is not 10 kb but 12.191
nucleotides from the 3' end of exon 19)
, ,pllllllllllljJ,
II -, t aa " ac g G'rc g agt a ~,- II activates a cryptic splice site. The changed
sequence is a strong donor splice site. (C) An
""""",11" apparently silent change that affe<:ts splid ng .
IC)
The SMNl gene is highly expre ssed. w hereas
exon 7
SMN2 produces almost no protei n. The
SMNI
difference is due to a ITT-+TIC change in
a g GGTTTCAGACAAAA
exon 7. Although both ITT and TIC encode
phenylalanine, the change inactivates a
exon 7 spliCing enhancer and prevents correct
SMN2 spliCing at the intron 6-exon 7 ju nction
aq GGTT T ~AG AC A KAA
inSMN2.
420 Chapter 13: Human Genetic Variability and Its Consequences
/~
(Figure 13.14). Two out of every three random changes to the length of a coding
sequence would be expected to produce a frameshift. Because 3 out of 64 possi
normal L G G V N
ble codons are stop codons, reading a message out of ftame will usually quickly CTG GGG GGT GTG AAC
hit a premature termination codon. Nonsense-mediated decay will then most
probably mean that no protein is produced. An example is found in the GJB2 mutant L G v aTOP
CT G GGG ~G TGA
gene that encodes connexin 26, a component of gap junctions between cells. A
run of six consecutive G nucleotides in the gene is prone to replication slippage. Flgu,.. 13.15 A simple frameshift.
This introduces a premature stop codon (Figure 13.15). This mutation is the most A frameshifting deletion of one nucleotide
frequent single cause of autosomal recessive congenital hearing loss (OMIM in the GJB2 gene arises through replication
220290) in most European populations. sllppage in a run of six G nucleotides. The
Several different types of event can produce a frameshift. As well as small mutation creates a premature stop codon in
insertions or deletions within a single exon, abnormal splicing or whole exon exon 2.
PATHOGENIC DNA VARIANTS 421
Intronjexon size (bp) frameshift deletions causing DMD or 8MD F'g\lrtl' 3.16 Effect of deletions in the
fntron 41 31,823
• 1 - , dystrophin gene. The gene consists of
79 small exons su rrounded by large introns.
About 65% of all mutations in t his gene are
intragenic deletions, Because the introns are
intron42 22.380 much larger than the exons, the breakpoints
intron 43 70,465
~
1= -; _ ] nearly always lie within introns. The effect is
to del ete one or more whole exons. When
this results in a frameshift (red bars) it causes
intron 44
~
248,401
~
Intron 46 2334
-
~
I -
-
1 -
I
]
~~J -~ 1- J
lntron 47 54,222
-
~
----. - l
~I
intron 48 38,3 68
- -l
Intton 49 16.634
deletions or duplications usually cause frameshifts. Most exons are not frame
neutral (that is, the numberofnucleotides in rhe exon is not a multiple ofthreel,
so excluding one or more exons because of a deletion or spliCing error will more
often rhan not create a frameshift. Most introns are much larger rhan rhe exons
they surround, so in most genes the breakpoints of random intragenic deletions
or duplications will lie wirhin introns. The result is to delete or duplicate one or
more whole exons [FIgure 13.16).
Changes that affect the level of gene expression
A variant in a control sequence might affect the level of transcription of a gene, so
that although the gene product is entirely normal, too little (or too much) of it is
produced. Most obviously this would happen ifthe variant changed the promoter
sequence. In reality, few such variants have been described. This is partly because
diagnostic laboratories do not routinely sequence promoters. If they did find a
change, rhey would seldom know how to interpret its significance. As described
in Chapter 1, promoters can include consensus binding sites for a variety oftran
scription factors, and the effect of a sequence change can usually only be deter
mined by experiment.
Patients with rJ.- or ~-thalassemia have been thoroughly investigated in rhe
search for mutations affecting transcription. Their diseases are forms of anemia
caused by a quantitative deficiency of a- or ~-gJobin, respectively, and so were
prime candidates for mutations of this type. However, a-rhalassemia is most
commonly caused by reduced numbers of active a-globin genes (see below).
Some mutations in the Jl-globin gene promoter have been identified (see OMIM
141900, listed mutations 370-381, and also Figure 13.HAI, but rhe great majority
of~ -rhalassemia mutations work by producing an unstable mRNA or an unstable
globin protein, rarher than directly repressing transcription. Three common
types of event have been noted:
• Many thalassemia mutations are splicing errors or nonsense mutations that
produce a prematw'e termination codon, resulting in nonsense-mediated
decay ofthe mRNA.
• Changes in the 3' untranslated region may cause the mRNA to be unstable. A
change in rhis region may create or abolish a critical binding site for a micro
RNA or a protein that regulates translation (see Chapter Ill.
422 Chapter 13: Human Genetic Variability and Its Consequences
• Cells are very efficient at detecting and degrading abnormal prote ins that fold
incorrectly. For example, one form of a.- thalassemia is caused by a change of
the normal TAA stop codon of an a.-globin gene to CAA. This encodes
glutamine, and translation of the mRNA continues for a further 31 amino
acids until another stop codon is encountered (Figure 13.17B). This variant
hemoglobin (Hb Constant Spring) is unstable, and clinically the effect is
a.-thalassemia resulting from a quantitative deficiency of a.-globin chains.
intron 8 ! axon 9
Near the 3' end of int ro n 8 (lower-ca se
letters) are variable -length {TG)n (green) and
atctttta tttttga tqtqtqbq~tgtqt9tqt;ttt~~ttaacag l ¥t"ifiifF' Tn (red) sequences. Alleles with t he 5T and
G F GEL F
{TG)o variants mis-splice a high proportion
of transcripts at the nearby acceptor splice
site, and so they produce onl y a low leve l
of functioning CFTR protein . 6y themsel ves
Tandem repeats within coding sequences are not normally polymorphic, but these changes are not pathoge nic. but if that
m ay be liable to pathogenic mu tations because of polymerase stutter. Somatic gene copy also contains a coding variant
mutations of this type are a major cause of disease in people with defects in the such as p.R11 7H that lowers the activity of
post-replicative mismatch repair system (see Chapter 17). Expanded polyalanine the intact pro tein, the combined result o f
runs in certain proteins are responsible for several inherited diseases. Examples low pro tein level and low pro tein function
include the PHOX2B protein in people with congenital central hypoventiIation makes it a pathogenic CF allele. Not e that
syndrome (OMIM 209880) and the HOXDI3 protein in patients with synpolydac in the c ystic fibrosis literature, for hi st oric
tyly I (OMIM 186000). Th ese va rian ts presumably originated through polymer reasons, this intron/ exon pai r is usually
ase stutter, but within a family they are stably transmitted, just like any other called intron 8/ exon 9, bu t in the genom e
STRP aUele. In at least some cases, th e expand ed alanine run interferes with cor databases they are intran 9/exo n 10.
rect localization of the protein with in the ceU.
Dynamic mutations: a special class of pathogenic microsatellite variants
So-caUed dynamic mutations are STRPs that, above a certain size, become
intensely unstable. The molecular causes are not weU wlderstood, but they may
be a consequence of the way in which, wh en DNA is replicated, one strand (the
lagging strand) is synthesized as a se ries of d iscontinuous fragments-the Okazaki
fragme nts (see Chapter I). A special endonuclease, FENI , cuts off the overhangs
in overlapping Okazaki fragments. One proposed mechanism for repeat expan
sion is that FENI fails to make the CUIS, and overlapping fragments end up being
joined end-to-end. Repeats up to a certain size are stable, and it may be signifi
cant that, in most cases, the threshold of instability occurs wh en the repeat
sequence reaches the typical size of an Okazaki fragment. Not all dynamic muta
lio ns are pathogenic, but several are (Table 13_1). Others are responsible for the
nonpathogenic fragile sites seen by cytogeneticists when ceUs of some people are
subj ected to replicative stress (see Figure 13.5). For example the FRA I6A fragile
site on chromosome 16 is due to an expand ed (CCG)" repeat, whereas FRA 16Bo n
th e same chromosome is caused by an expanded 33 bp minisateIlite.
The diseases in Table 13.1 are heterogeneous in many respects. There are
differe nt-size d repeat units, different degrees of expansion, different locations
with respec t to the affected gene, and different pathogenic mechanisms. Within
these, the polyglutamine diseases fo rm a weU-defined group, of which Huntington
disease (HD; OMIM 143100) is the proto type. In these conditions, modest expan
sions of a (CAG)n repeat in the coding sequence of a gene lead to an expanded
polyglutamine run in the encoded protein (Figure 13,19C). This, in turn, predis
poses the protein to form intracellular aggregates that are toxic to ceUs, especially
neuro ns (B.. x 13.4). The result is a progressive late-onset neuro degenerative
disease.
Other dynamic mutations affect gene sequences outside coding regions, and
may involve much larger expansion s. The expanded tandem repeat may be in the
promoter, in the 5' or 3' untranslated region, or in an intron. Usually the effect is
to prevent expression of the gene (Figure 13. 19A), and the pathology comes from
the resulting lack of gene functi on. When th is occurs, the same disease can some
times be caused by other loss-of-function mutations in the same gene. In myo
to nic dystro phy 1 (DM1 ; OMIM 160900) the mechanism is quite different. There
is a gain of function involving a toxic mRNA. A massively expanded (CTG)" run in
the 3' untranslated region ofthe DMPK ge ne produces an mRNA that sequesters
CUG-binding proteins in the nucleus. These are required for correct splicing of
the primary transcripts of several unrelated genes, which therefore no longer
fun ction correctly. The result is a mul tisystem disease whose features bear no
relation to the function ofthe DMPKgene product, a protein kinase, which is still
produced in no rmal quantities. No other mutations in the DMPK gene prod uce
myotonic dystrophy, but a very simil ar cli nical disease (DMZ; OMIM 116955) can
be caused by massive expansion of a (CCTG ) n sequence in the completely unre
lated ZN F9 gene.
424 Chapter 13: Human GeneticVariability and Its Consequences
Fragile-X site A (FRAXA) 309550 X FMRI Xq 27.3 5' untranslated (CGG), 6-54 200-1000+
region
Fragile-X site E (FRAXE) 309548 X FMR2 Xq28 5' untransla ted (CCGI, 4- 39 200- 900
reg ion
Myotonic dystrophy 2 (DM2) 602668 AD ZNF93q21.3 intran 1 (CCTG)n 10-26 75- 11,000
Spinocerebel lar ataxia 10 (SeA 10) 603516 AD ATXN1022q 13.31 intran 9 (ATTCT)n 10-20 500-4500
Myoclonic epilepsy of Unverricht 254800 AR CSTB 21 q22.3 promoter (CCCCGC 2-3 40-80
and Lundborg CCCGCG),
Spinocerebellar ataxia 2 (SeA2) 183090 AD ATXN212q24 coding (CAG), 15-24 32- 200
Machado-Joseph disease (SeA3) 109150 AD ATXN3 14q32.1 coding (CAGI, 13 36 61-84
Spinocerebellar ataxia 7 (SeAl) 164500 AD ATXN73p14.1 cod ing (CAGI, 4-35 37-306
Spinocerebellar ataxia 17 (SeA 17) 607136 AD TBP 6q27 coding (CAG), 25-42 47-63
De n tato ru b ra I-pa Ilidol uysia n 125370 AD DRPLA 12p13.31 cod ing (CAG)n 7-34 49-88
atrophy (DRPLAI
Spinocerebellar ataxia 8 (SCAB) 608768 AD ? 13q21 untranslated RNA (CTGI, 16 34 74+
Spinocerebellar ataxia 12 (SeA 12) 604326 AD PPP2R285q32 promoter (CAG}n 7-45 55- 78
Huntington disease-like 2 (HOl 2) 606438 AD JPH3 16q24.3 variably spliced (CTGI, 7- 28 66- 78
exon
IA)
L )( )
axo n 1
no transcription
FlgufI!I13.19 Three mechanisms bywhkh
dynamic mutations may be pathogenic.
(Al In fragile X syndrome, the expand ed
-
,/1~
II
II repeat in the 5' untranslated region (UTR)
(CGG)" of the gene triggers methylation o f the
in 5' UTR promoter and prevents tra nscription . (B) In
myotonic dystrophy, the expanded repeat in
(8) t ranscription CUG repeat in mRNA the 3' untranslated region causes the mRNA
) sequesters CUG-binding t ranscript to sequester splicing factors in the
exon 14 exon 15 splicing factors
cell nucleus. preventing the correct spliCing
// of several unrelated genes. (C) In Hunt ington
~~
in 3' UTR
disease, the gene containing the expanded
repeat is transcribed an d translated as
normal, but the protein product ha s an
IC)
~~
i'l ===== and coding regions in dark blue.
in coding sequence
Comparisons of repeat sizes in parent and child are hard to interpret mecha
nistically because they may reflect the mitotic or meiotic instability of the repeat
in the parental germ line 01; alternati vely, the ability of gametes to transmit large
repeats. Moreover, repeat sizes are usually studied in DNA extracted from periph
eral blood lymphocytes, and if there is mitotic instability the lymphocytes may
have very different repeat sizes from those in sperm or egg, or in the tissues
involved in the disease pathology. Some repeats are unstable somatically, and so
give a smeared band when blood DNA is analyzed by electrophoresis. Others are
W1stable between parent and child but stable in mitosis, and so blood DNA gives
a sharp band, but of different sizes in parent and child.
The formation of protein aggregates is now recognized as a common it seems that a conformational change ca n propagate th rough a
feature of several adult-onset neurological diseases. These include t he population of protein molecules. converting them from a stable native
polyglutamine diseases of wh ich Huntington disea se is the prototype, conformation into a new form with different properties, in a process
Alzheimer disease, Parkinson disease, Creutzfeldt- Jakob disease, and that Is perhaps analogous to crystallization.
the heterogeneous group known as amylo idoses. The behavior of pr ion proteins in diseases such as Creutzfeldt
Globular protein molecules resemble oil drops, with hydrophobic Jakob disease is the most striking example. Misfo ldi ng might
residues in the interior and polar group s on the outside. Correct start with a chance misfolding of a newly synthesized structurally
folding is a critical and highly speCific process, and naturally occurring normal molecule (sporadic cases), a mutant sequence wi th a greater
proteins are selected from among all possible polypeptide sequences propensity to misfold (a genetic disease), or a misfolded molecu le
partly for their ability to fold correctly. Mutant proteins may be more somehow acquired from the environment (infectious cases). Thus,
prone to misfolding. Misfolded molec ules with exposed hydrophobic this final common pathway brings together a set of diseases with
groups can aggregate with each other or with other proteins, and very disparate origins. This topic is reviewed by Hardy & Orr (see
this is somehow toxic to cells and. in particular, neurons. Sometimes Further Reading).
426 Chapter 13: Human Genetic Variability and Its Consequences
(A)
- - 1- ---
a
"L. :~
"
-~
(X
>-
a
(B)
I I
I I I
al
I I
al ·1 al
<=I , 0
~
a~ a ~ '."."1 nl
0'
One common mechanism generating changes in gene dosage is non-allelic Ftgure13.20 Deletions of a-globin genes
homologous recombination (NAHR). Segmental duplications (often defined as in a-thalassemia. (Al Normal copies of
sequences 1 kb or longer witb 95% or greater sequence identity) may misalign chromosome 16 each carry two a.-globin
genes (red) arranged in tandem. Repeat
when homologous chromosomes pair in meiosis. NAHR then produces deletions
blocks flanking the genes (shown in gray
or duplications. The misaligned repeats have the same sequence but not the and yellow) may misa\ign, allowing unequal
same chromosomal location, so recombination is homologous but the sequences crossover. The figure shows one of several
are not alleles. Many (but by no means all) of the common nonpathogenic vari such misaligned recombinations that can
ants in copy number seen in normal healtby people are generated by this mecha produce a chromosome carrying only one
nism. a- Thalassemia provides a good example ofNAHR producing a pathogenic active a gene. Unequal crossovers between
variation in gene dosage. Most people have four copies oftbe a-globin gene (aa/ other repeats (not shown) can produce
a.a) as a result of an ancient tandem duplication. As shown in Figure 13.20, NAHR chromosomes carrying no functional a
between low-copy repeat sequences flanking the a-globin genes can produce gene. Thus, the number of a-globin genes
chromosomes carrying more or fewer a-globin genes. Reduced copy numbers of may vary in individuals from none to four or
more. (B) The consequences become more
a-globin genes produce successively more severe effects. People with tbree cop
severe as the number of a genes diminishes.
ies (aa/a-) are healthy; those with two (whetber the phase is a- /a- or aa/--)
See Weatherall et al. (Further Reading) for
suffer mild a-tbalassemia; those with only one gene (a-/- -) have severe disease; detail.
and lack of all a genes (- -/- -) causes lethal hydrops fetalis (fluid accumulation
in the fetus).
X-chromosome monosomy and trisomy are particularly interesting because
X-inactivation (inactivation of all except one of the X chromosomes in a cell; see
Chapter 3) ought to render them asymptomatic in somatic tissues. However, as
noted in Chapter 11, a surprisingly large number of genes on tbe X chromosome
escape inactivation. Some of these have a counterpart on the Ychromosome, but
most do not. For those genes tbat escape X-inactivation but lack a Y-linked coun
terpart, normal females would have two functional copies and males only one.
Turner (45,X) females would have the same single active copy as normal males
but perhaps in tbe context of female development, a single copy is not sufficient.
The skeletal abnormalities of Turner syndrome are caused by haploinsufficiency
for SHOX (50% of the normal gene product is not sufficient to produce a normal
phenotype). This is a homeobox gene tbat is located in the Xp/Yp pseudoauto
somal region, and so is present in two copies in botb males and normal females.
Below the level of conventional cytogenetic resolution but above the single
gene level, pathogenic variations in copy number are classified as microdeletions
or microduplications Cfable 13.2). Among tbese, three different molecular
pathologies can be distinguished:
• Single gene syndromes, in which all the phenotypic effects are due to the
deletion (or sometimes duplication) of a single gene. For example, Alagille
syndrome (OMIM 118450) is seen in patients with a micro deletion at 20p11.
However, 93% of Alagille patients have no deletion but instead are hetero
zygous for point mutations in tbeJAGl gene located at 20p12. The cause of the
syndrome in all cases is a half dosage of the JAGI gene product.
Contiguous gene syndromes are seen primarily in males vvith X-chromosome
deletions (Figure 13.2lA). The classic case was a boy BB who had Duchenne
muscular dystrophy (DMD; OMIM 310200), chronic granulomatous disease
(CGD; OMIM 306400), and retinitis pigmentosa (OMIM 312600), together
witb mental retardation. He had a chromosomal deletion inXp21 that removed
a contiguous set of genes and incidentally provided investigators witb the
means to clone the genes whose absence caused two of his diseases, DMD
and CGD. Deletions ofthe tip ofXp are seen in anotber set of contiguous gene
syndromes. Successively larger deletions remove more genes and add more
PATHOGENIC DNA VARIANTS 427
Prader- Wi lli 176270 15ql1 - q13 segmental aneuploidy (lack of paternal copy)
Those syndromes caused by having only one functional copy of a single gene are often also seen in patients who do not have a microdeletion, but a
point mutation in the gene. WAGR, Wilms tumor, aniridia, genital abnormalities, mental retardation; VCFS, velocardiofacial syndrome.
of the X chromosome (such as Xp21 and proximal Xq) but are rare or unknown
Similar contiguous gene syndromes are much less common with auto somes
patient no. 1 2 3 4 5 6 7 8
ii !~ tj
x,
x, 111
A,t
A,t
A, .
II
II
II
II
1
x,
~gUI""f! 13..11 X-linked and autosomal
x, A, microdeletion syndromes. (A) On the
x, A, X chromosome, male patients 1-3 show
a nested series of contiguous gene
syndromes, whereas patient 4 has a different
x,
x,
"'A,t+ contiguous gene syndrome. Deletion of
gene Xl or Xs is lethal in males. (8) On the
l
autosome, contiguous gene syndromes
T A,t
1, 1,
x, are rare because of the balanCing effect of
the second chromosome. In this example,
only gene As is dosage sensitive. Patients
~ =;; 5-7, with different-sized deletions all
phenotype x, x, x, x, A, A, A, A, encompassing gene As, all show the same
of patient: + + +
x, x, x, phenotype as patient 8, who is heterozygous
+ for a loss-of-function point mutation in the
x, As gene.
428 Chapter 13: Human Genetic Variability and Its Consequences
I II
•• .... . 1111
~
II••111111111
• - .... .! •• ~ ••• - .
•
. . ....
,
Flgur.13.23 loss-of-function mutations
,
-_ ..
,
,
,
, . ••
•• • ••, •••
• •• "
in the PAX3 gene in patients with type I
Waardenburg syndrome. A wide range
of both truncating and non-truncating
• .. .
1 2 3 4 5 6 7 8 9 10
mutations are seen, similar to the data for
the ATM gene in Figure 13.2 1. Missense
x-:- • • changes are concentrated in the two shaded
• ,, areas that encode key functional DNA
, binding domains (ora nge, paired domain;
green, homeodomain) of the PAX3 protein.
truncating mut.a tions; non-truncating mutatIons: Although the disease is dominant, the data
• nonsense mutation • missense mutation in the figure show that the cause must be
• frameshlfting insertion/deletion in·frame deletion a loss of function of the PAX3 gene. This is
_ splice site mutation therefore an example of haploinsufficiency.
- deletions of aU Of part of the gene
One might reasonably ask why there should be haploinsufficiency for any
gene product. Why has natural selection not managed things better? Jf a gene is
expressed so that two copies make only a barely sufficient amount of product,
selection for variants with higher levels of expression should lead to the evolution
of a more robust organism, with no obvious price to be paid. The answer is that in
most cases this has indeed happened-which is why relatively few genes are
dosage-sensitive. There are a few cases in which a cell with only one working copy
of a gene just cannot meet the demand for a gene product that is needed in large
quantities. An example may be elastin. In people heterozygous for a deletion or
loss-of-function mutation of the elastin gene, tissues that require only modest
quantities of elastin (skin and lung, for example) are unaffected, but the aorta,
where much more elastin is required, often shows an abnormality, supravalvular
aortic stenosis (OMIM 185500) . However, certain gene functions are inherently
dosage-sensitive. These include:
• gene products that are part of a quantitative signaling system whose function
depends on partial or variable occupancy of a receptor or a DNA-binding site,
for example;
• gene products that compete with each other to determine a developmental or
metabolic switch;
• gene products that cooperate in interactions with fixed stoichiometry (such
as the Ci.- and ~-globins or many structural proteins).
In each case, the gene product is titrated against something else in the cell.
What matters is not the correct absolute level of product, but the conect relative
levels of interacting products. The effects are sensitive to changes in all the inter
acting partners; thus, these dominant conditions often show highly variable
expression. Genes whose products act essentially alone, such as many soluble
enzymes of metabolism, seldom show dosage effects.
IIlndividuals with a single loss-of-function allele are affected, because the output of a single
functional allele is insufficient for normal functioning or development.
MOLECULAR PATHOLOGY: UNDERSTANDING THE EFFECT OF VARIANTS 431
r'1 r'1 r r
locus locus mutation mut ation
1r1
~ ~I
17q 7q
.
e'p>8SS'OO
, boO' m,1
expression
Il1 Il1 e,~~:s;oo
n chains n chains n chains
Figure l3.24 Dominant-negative effects
2n ch ain s n norm al chains n chains
+ n mutant chains of collagen gene mutations. Collagen fibrils
are built of crosslinked arrays of triple-helical
~ ~
~
~
vvvv 166210). mutations that replace glycine w ith
%surplus (12 chains any o ther amino acid usually have strong
~~ ~
(degraded) dominant-negative effect s because they
disrupt the packing. The helix is assembled
~ ~ starting at the C-terminus, and substitutions
10 normal Pfocollagen of glycines close to that end have a more
4 molecules
severe effect than substitutions nearer the
substitution in N-terminal substitution in C-terminal
N-terminus. Null mutations in either g ene
part of triple helix: part of triple helix:
result in fewer but otherwise normal triple
minor disturbance maj or disturbance
helices forming (simple ha ploin su fficiency)
of packing of packing
Overexpression NROB7 300078 male-to-female sex reversal gene du plications cau.se overexpression and sex reversa l
Acqu ire new PI (Pitts burgh allele) 107400 lethal bleeding disorder a rare gain-of-function allele of the Ctl -antitrypsin gene
substrate (Figure 13.25)
Ion channel open SCN4A 168300 paramyotonia congenita specific mutations in this sodium channel gene delay
inappropriately of von Eulenburg closing
Protein aggregation HD 143100 Huntington disease proteins with expanded polygl utamine runs form toxic
aggregates
Receptor GNAS7 174800 McCune-Albright disease somatic mutations only; constitutiona l form would
perma nently on probably be lethal
Chimeric gene BCR-ABL 151410 ch ronic myeloid leukemia somatic mutations only
MOLECULAR PATHOLOGY: UNDERSTANDING THE EFFECT OF VARIANTS 433
acid substitution in the active site converts an elastase inhibitor into a novel,
constitutively active, antithrombin, resulting in a lethal bleeding disorder.
Diseases caused by gain of function of G-protein-coupled hormone 1 .
-~ .. \
\ "
receptors
The G-protein-coupled hormone receptors provide good examples of activating ,
,~__ / t ...1.·,-'""-\
l~'·~
p ~' ,~-\, . ... ~ -- ,-"\
'
. /.r ' :'~-" /.
\)'~~'i" - ;~
c
mutations. Many hormones exert their effects on target cells by binding to the
extracellular domains of transmembrane receptors. Binding of ligand to the
,~ ~ "-' :\ !
receptor causes the cytoplasmic tail of the receptor to catalyze the conversion of
an inactive (GDP-bound) G- protein into an active (GTP-bound) form, and this ,of.', ~,·4
relays the signal furth er by stimulating ion channels or enzymes such as adenylyl ....... ...
.'
I,. '" - .
~. (' . :''''(i -'
.'
cyclase. Some mutant receptors are constitutionally active: they fire (signal to
downstream effectors) even in th e a bsence of ligand: - ,
• Familial male precocious puberty (OMIM 1764 10: onset of puberty by age 4
years in affected boys) is found with a constitutively active luteinizing hor
mone receptor. FIgure 13.25 A mutation causing the
• Autosomal dominant thyroid hyperplasia can be caused by an activating a,-antitrypsln molecule to gain a novel
mutation in the thyroid-stimulating hormone receptor (see OM[M 275200). function. (X, -Antitrypsin is an inactivator of
specific proteases. The protea5e cleaves the
• The bone disorder Jansen metaphyseal chondrodystrophy (OM[M 156400) peptide bond between residues 358 and 359
can be caused by a constitutionally active pa rathyroid hormone receptor. in the (X ,-antitrypsin molec ule. As a result,
the two residues (shown as green balls)
Allelic homogeneity is not always due to a gain of function sp ring 65 Aapart. as shown by the arrOW5
here. The conformationa l change tra p5 and
We stated above that allelic heterogeneity is normally a hallmark of loss-of inactivates the protease. In the wild-type
function phenotypes. However, it is not safe to assume that any condition that allele, reSidues 358 and 359 are methionine
shows allelic homogeneity is caused by a gain of function. There are other possi and se rine, respectively. This creates a
ble explanations: sub5trate for elastase; thus, the wild-type
(X,-antitrypsin functions as an anti-elastase
• In some diseases, the phenotype is very directly related to the gene product
age nt. In the Pittsburgh va riant a missense
itself, rather than being a more remote consequence of the genetic change. mutation, p.M356R. replaces the methionine
The disease may then be defined in terms of a particular va riant product, as in with arginine. The structu re, and the effect
sickle-cell anemia. of cleavage of the peptide bond between
Some specific molecular mechanism may make a certain sequence change in residues 358 and 359, are unaltered, but
a gene much more likely than any other change-for example the (eGG)" now the specificity is for thrombin rather
than elastase. Thus, the Pittsburgh varia nt
expansion in fragile X syndrome.
has no anti-elilstase activity, but has become
• There may be a founder effect. For example, certain disease mutations are a powerful antithrombin agent. The result
common among Ashkenazi Jews. Present-day Ashkenazi Jews are descended is illethal bleeding disorder. Orilnge ilnd
from a fairly small number of founders. [f one of those founders carried a blue mark the parts o f the (X,-antitrypsin
recessive allele, it may be found at high frequency in the present Ashkenazi molecule N-termina l and (-terminal of
population. the cleavage site; carbohyd rate reSid ues
are shown in gray. [Image 53000427 from
• Selection favoring heterozygoteS enhances found er effects and often results University of Geneva ExPA5y mOlecular
in one or a few specific mutations being common in a populatj on. biolog y Web server, httpJlwww.expasy.chl]
-
non-allelic homologous recombination Figure 13.26 In creased or decreased
gene dosage of the PMP22 gene are both
-
between flanking repeats
pathogenic but cause different diseases.
:y.: Unequal crossovers between flanking
repeats (yellow boxes) generate duplications
or deletions of a 1.5 Mb region containing
- -" -
1 the PMP22 gene (blue box). Duplication
~ causes Charcot- Marie-Tooth disease;
-•
activating p oint mutation
point mutations
--
•
loss-of-function
point mutation
d -- -- -- -- -- ------ ~tn
floJ eHe<:t severity depends on the level of residual
function, so that there is a genotype
~ phenotype correlation. (D) If the clinical
consequences are very different, depending
(D) no ''',0<' I. ,."-.' ,,, Co,, " , "eo' on '''''em A.a on the degree of residual function, the
results may be desc ribed as different
syndromes (A and A + B). and they may
100 50 o have di fferent modes of inheritance, as here.
level of residual gene function (%) Spec ific examples are discussed in the text.
436 Chapter 13: Human Genetic Variability and Its Co nsequ ences
>60 no disease
< 1.4 Lesch- Nyhan synd rome (OMIM 300322): spasticity, choreoathetosis,
self-mutilation, menta l re tardation
HPRT is an X-linked gene (OMIM 308000); the phenotype in malesdirectly renects the activity of
same gene can produce two or more dominant conditions, dle milder one
have different mutations in their mtDNA. In addi tion, because cells contain
thousands of mtDNA molecules, they can be homoplasmic (every mtDNA mole
cule is the same) or hereropiasmic (a mixed population of normal and mutant
mitochondrial DNA). Unlike mosaicism, heteroplasmy can be transmitted from
mother to child through a heteroplasmic egg. Egg cells contain more than 100,000
mitochondtia, so every child of an affected mother will inherit at least some of
her mutant mitochondria even if the mother is heteroplasmic. However, mito
chond rial diseases often show very reduced penetrance (see Figure 3.10, p. 68).
Leber hereditary optic neuropathy (OMIM 535000: sudden irreversible loss of
vision) illustrates some of the problems. Eighteen different point mutations in
mtDNA have been associated with this condition; 5 of these have a sufficiently
serious effect to ca use the disease on their own, and the o ther 13 have a contribu
tory effect. Fifry per cent of affected people have a g.1l778G-7A substitution.
Most of these patients are homo plasmic, but about 14%, no less severely affected,
are heteroplasmic. Even in homoplasmic families the condition is highly'varia
ble; penetrance overall is 33-60%, and 82% of affected individuals are male.
There are several possible reasons for this poor correlation:
• Mitochondrial DNA is much more variable than nuclear DNA, and some syn
dromes may depend on the combination of the reported mutation with other
unidentified variants.
• Som e mitochondrial diseases seem to be of a quantitative nature: small muta
tional changes accumulate that reduce the energy-generating capacity of the
CONCLUSION
In this chapter, we have described the many ways in which the genomes of two
humans can differ. Principally, these are through single nucleotide polymor
phisms, variable numbers oftandem repeats, and large -scale copy-numbervari
ants. Studies of these differences are directly relevant to understanding many
features of human populations. The structure of populations in terms of inbreed
ing and reproductive isolation, the relationship of different populations to each
other, the geographical and historical origins of populations- all these aspects
are illuminated by studies of genetic variation.
When we come to studies of health and disease, it is not quite so straightfor
ward. Connecting sequence variants with health and disease requires approach
ing the question from both ends. We have seen how one can try to decide whether
a particular variant may be pathogenic or not. In some cases the answer is clear,
but all too often simply looking at the DNA seq uence does not give a clear ans wer.
Even when we can be confident that a sequence change would be pathogenic. we
can very rarely predict the actual clinical sympto ms it would produce. To relate
DNA to disease we need also to start from the disease end of the connection.
Studying genetic diseases allows us to identify the underlying pathogenic
sequence variants. This approach has been hugely successful with the Mendelian
diseases, as the next chapters demonstrate. Given su itable fa milies, it is relatively
easy to identify the chromosomal location of the gene underlying a Mendelian
character (Chapter 14), which usually allows the researc her to go on to identify
the relevant gene (Chapter 16). Often it remains puzzling why a malfunction in a
particular gene should cause that particular set of symptoms, but at least the
pathogenic variants can usually be identified unambiguously.
Many genetic variants do not cause Mendelian conditions but may neverthe
less have an influence on health. Genetic susceptibility factors are important
determinants, though not the sole ones, of many common diseases. Linking spe
cific variants to specific disease susceptibilities has been a very difficult task that
is still far from fully accomplished. Chapters 15 and 16 review progress in that
area. Hopefully we will eventually be in a position to derive useful predictive
health information from analysis of a person's genome sequence- which will
raise a whole series of issues about genetic testing and population screening.
These are covered in Chapters 18 and 19.
FURTHER READING
Types of variation between human genomes Kid d JM, Cooper GM, Donahue WF et al. (2008) Mapping and
Bentley DR, 8alasubramanian S, Swerdlow HP et al. (2008) sequencing of structural variati on from eight human
Accurate whole-genome sequencing using reversible genomes. Nature 453, 56-64. [A systematic study of variants
terminator chemistry. Nature 456,5 3-59. (Mainly a technical 8 kb and larger in individuals from Asia, Europe, and Africa.]
account of using the lIIumina technology to sequence an Korbel JO, Urban AE, Affo urtit JP et al. (2007) Paired-end mapping
indi vidual's genome.) reveals extensive structural variation in the human genome.
DbSNP www. ncbi.nlm.nih.gov/ projec ts/SNP [The t2 millio n Science 31 8, 420- 426. [Results of usi ng a method of detecting
entries in Build 126 include some rare pathogenic va ri ants CNVs down to 3 kb in size.]
and some changes that are more complex than single Lev y S, Sutton G, Ng PC et al. (2007) Th e diploid genome
nucleotide polymorphisms.] sequence of an individual human. PLoSBioI. 5, e254.
Decipher database of pathogenic large-scale va riants. [Craig Venter's sequence.]
http://decipher.sang er.a c. uk! Li JZ, Absher DM, Tang H et al. (2008) Worldwide human
Den Ounnen JT & Antonarakis SE (200 1) Nomenclature for the relati onship sinferred from genome-wide patterns of
description of human sequence variations. Hum. Gene r. 109, variation. Science 319,1100-11 04. [U sing SNP data to infer
12 1- 124. po pul ati o n rel ationships .]
Horaitl s 0, Talbot CC Jr, Phommarinh Met al. (2007) A d at abase of Lu kusa T & Fryns JP (2008) Human chro mosome fragility. Biochim.
locus-specific databases. Nat. Genet. 39,425. Biophy. Acta 1779,3- 16. [A review of chromosome fragile
fA manifesto for locus-s pecific mutation databases.] sites.]
Jakobsson M, Scholz SW, Scheet Pet al. (2008) Genotype, Mutation databases. http://www.genomic.unimelb.ed u.au/mdil
haplotype and copy-number variation in worldwide human Th e Human Genome Variation Society Mutation Database
populations. Nature 451 , 998-1 003. [Analysis of SNPs and Initiative li sts details and lin ks for many databases, including
CNVs in individuals from 29 population s.] http://www.hgvbaseg2p.org [Human Genome Va ri ation
Chapter 14
Genetic Mapping of
Mendelian Characters
KEY CONCEPTS
<} Genetic mapping is needed to localize genes underlying observable phen otypes, which cannot
be identified by searchin g the DNA sequence of the human genome.
During meiosis the re is n ormally at least one crossover in each arm of each pair of syna psed
homologous chromosomes. Pedigree analysis reveals th e result of these crossovers as genetic
recombination.
• Genes th at are physically close together on th e same chromatid tend to travel together duri ng
meiosis; genes that are fa rth er apart are liable to separate th rough crossovers between
h omologo us ciuomosomes.
• The genetic distance between two loci is defined as the frequency with whi ch recombination
separates th em during meiosis. This is ofte n measured as the fraction of gametes that are
reco mbinant. A recombinant fraction of 0.01 corresponds to 1% recombination and a genetic
distance of 1 centimorgan (eM) .
• Recombinant fractions never exceed 0.5, regardless of whether the tv•.1 0 loci concerned are on the
same or different chromosomes.
Genetic maps are co-linear with physical maps, but distances do n ot necessarily correspo nd
beca use the probability of recombination pe r megabase of DNA is not un iform along the length
of a chromoso me, and differs between males and females.
To map h uman traits, it is necessary to assemble family pedigrees containing a sufficient number
of informative meioses.
Genes are mapped by checking a collection of pedigrees for co -segregation orthe relevant
phenotype and th e alleles of genetic markers. The main markers used are sho rt tand em repeat
polymorph isms (STRPs or microsa teilites) and single nucleo tide polymorphisms (SNPs).
The statistical criterion for assessi ng linkage is the logarithm of the odds or lod score. In simple
situati ons, the threshold of significance fo r linkage is a lad score of +3 and against linkage -2.
• Human pedigrees often have complex structures in wh ich recombinants cannot be identified
thro ugh simple inspection, and so software programs are needed that calculate the relative
likelihood of the observed data on the co mp eting hypoth eses of linkage Or no linkage.
• Linkage analysis can be more efficient if data for more than two loci are analyzed simultaneously,
an a pproach known as mul tipoint ma ppi ng.
• Multipoint li nkage analysis has been used on large fam ily pedigrees to construct marker
framework maps, which can then be used to pinpoint the location of a disease ge ne.
• Autozygosity mapping is a powerful meth od of mappi ng genes for rare re cessive conditions. it
works by hunting for ancestral chromosome segments shared by affected people in large
co nsanguineous families.
The reso lution of linkage analysis depends on the num ber of meioses analyzed. The ultimate
reso lution is o btained by typing individual sperm.
(, The methods described above have been very successful for mapping Mendelian co nditions, but
loci con ferri ng susceptibility to common co mplex diseases require different approaches, which
are described in Chapter 15.
442 Chapter 14: Genetic Mapping of Mendelian Characters
The finished sequence of the human genome is the ultimate physical map, allow
ing the chromosomal location of physical entities (such as DNA sequences and
breakpoints) to be defined to the nearest nucleotide_However, nothing in the raw
sequence wo uld have allowed a researcher to locate the gene, HD, responsible for
Huntington disease, or any ot her phenotypic character. Of course, the HD gene
can now be located by using a genome browser, but this is only possible because
the gene was previously localized and identified by other means, and tha t infor
mation was added to the annotation used by the browser. To locate genes respon
sible for a particular disease or any other phenotypic character, we still need a
genetic map.
Genes defined through phenotypes can be identified in various ways (as will
be described in Chapter 16), but the usual methods all involve first discovering
the chromosomal location of the gene. Such genetic mapping can be performed
for any phenotypic determinant. The first step is to collect a series of clinically
well-phenotyped cases, which are then genotyped in the laboratory for a series of
genetic markers. These genotypes are then analyzed to map the determinant. In
this chapter we describe the basic principles of genetic mapping and their appli
cation to Mendelian characters. Mapping susceptibility factors for complex dis
eases, where there is genetic susceptibility but not a Mendelian inheritance pat
tern, is based on the same principles, but brings in extra complications, which
are considered in Chapter 15.
Recombination is believed to involve the following steps (Figure 1): Figure t The stages of recombination and gene conversion.
A double-strand break is made in one parental DNA molecule Note that here we show the individual strands of each double
(yellow). Exonuc1eases strip back the 5' ends of the broken strands, helix; Figure 14.3 shows chromatids, each of w hich contains an
leaving single-stran ded 3' overhangs. The single-stranded ends intact double helix. [From Alberts B, Johnson A, l ewis J et al. (2007)
are stabilized by cooperative binding of Rad51 or related proteins. Molecular Bio logy of the Cell, 5th ed. Garland SciencefTaylor &
The single strand invades the complementary double helix Francis l lC.]
opposite partner.
1 strand invasion
blvalenls in
prophase of
A, A,
r A, A, A, A, two of the four chromatids of the two
synapsed homologous ch romosomes.
One chromosom e carries alleles A I and
8\, the other alleles A2 and 82. Chromatids
meiosis I in the gametes labeled N carry a parental
combi nation of alleles (A IBI or A2Bl).
Chromatids labeled R carry a recombinant
B, ~I W ~ ~ "~ ~ m~~ combination (A IB2 or A2B\). Note that
recombinant and non-recom binant are
1 1- 1 1
defined on ly in relatio n to th ese two loc i.
For example, in the res ult of the
three-stra nd double crossover, the second
n chromatid from the left has been involved
much lower than the theoretical chance of a double crossover between two mark {AJ
ers 10 cM apart if there were no interference (J 0% x 10% ~ 1%) . The interested
reader should consult Ott's book (see Further Reading) for a fuller discussion of
mapping functions.
Interference makes true close double recombinants very unlikely, but gene
conversion (see Box 14.1) can produce results that are indistinguishable from
very close double recombinants. Only very short sequences are involved in gene
conversion-o f the order of 1 kb. This is far below the resolution of almost all
'-,
genetic mapping, and so most gene conversion events have no influence on .'
genetic maps. However, they could pose a problem for identifying conserved
ancestral chromosome segments, as explained in Chapter 15 (see Section 15.4).
~/~r
~
(b! ue) and MlHl foci in oacytes (red)
'- -- ---- - ~ ...
on chrom osomes 13, 18, and 21. Each
----
--- chro mosome pair is divided in to 5% len g th
in tervals. (8) The sam e crossover d istribution
------
----- ----- patterns shown for ea ch chromosome as
:::--- ... ---::: :: recombination maps. These ma ps highlight
1 f,~~~~~
j 96.5 eM
>~~~;;;;
OeM
OeM
:I
~
-a
•
chromosome 18 95.5 eM
118 eM
OeM
O eM
-
MP:~-
~
...
chromosome 21 53.5 eM
'1i ::::;;~~~-
61.5 eM
pseudoautosomal region at the tip of the short arms ofthe Xand Ychromosomes.
Males have an obligatory crossove r wi thin this 2.6 Mb region, so that it is 50 cM
long. Thus, for this region in m ales I Mb = 19 cM, whereas in females 1 Mb =
2.7 cM. Uniquely, the Y chromosome, outside the pseudo autosomal region, h as
no genetic map because it is not subject to synapsis and crossing over in
me iosis.
Two lines of evidence show that recombination is also highly non- random at
a m uch fin er, kilobase, level of resol utio n:
Whe n individuals are genotyped fo r a de nse map of SNP m arkers, and the
results fro m m any different individuals are com pared, it seems that our
genome consists of a set of conserved ances tral segments separated by fairly
sharp boun daries. This structure implies that, historically, very few recombi
nation events have talen place within the ancestral blocks, whereas block
boundaries are ho tspo ts for recombination. Typical block sizes are 5- 15 kb.
These fi nd ings, from the Interna tional HapMap Project, are discussed in mo re
detail in Cha pter 15.
448 Chapter 14: Genetic Mapping of Mendelian Characters
haplotype blocks
targets for sperm crossove r analysIs _
300
- • - •
••
..",
Figure 14.6 At the highest resolution,
recombination is concentrated in
small hotspots. DNA-level distribution
of recombination in males in a 206 kb
region of chromosome 1q42.3; 9S% of all
c<> recombination occurs at highly localized
.Q ::E 200
~~
hotspots, with average recombination
~"
:.c.e frequencies up to 70 cM/Mb. Outside the
E.?;o
a '5 hotspots, recombination is no more than
g~ 100
" 0
m 0.04 cMlMb. Green bars mark relatively
stable haplotype blocks where crossovers
are infrequent. Black bars mark the short
0
- lSO - 100 -SO 0 sequences that were selected for analysis
chromosome distance (kb) of crossovers in individual sperm. [Adapted
from Jeffreys AJ, Neumann R, Panayi M
et al. (200S) Nat. Genet. 37,601-606. With
Recombination events can be observed directly at I kb Or higher resolution by permission from Macmillan Publishers ltd.}
genotyping single sperm from suitably heterozygous men. Such work by
Jeffreys and colleagues on two different chromosomal regions has clearly
shown that almost all recombination occurs in scattered 1-2 kb hotspots
(Figure 14.6). By and large (although not entirely). these hotspots correspond
to the gaps between ancestral conserved segments. It is not possible to study
female meiosis in the same way, but the general agreement with the block
structure defined tluough population studies suggests that recombination in
females must also be largely confined to the same small hotspots.
The restriction of recombination events to such very localized hotspots has
major implications for studies of the mechanism of recombination. Tantalizingly,
the DNA sequence at these points does not have any obvious special feature to
explain this behavior. Identical DNA sequences can be hotspots in humans but
not in chimpanzees. However, genetic mapping in general is not affected by this 40
very fine-scale irregularity. The best current human genetic map has a resolution males
of only about 500 kb. 35
Individuals also vary in the average number of crossovers per meiosis, and if 30 •• : eO 0.. :'.:: •• :
II
I I " ..
they do, then they will also have different genetic map lengths IFlgure 14.7). • •• . . , t·· .: .!
25 • :.1 \" I, ••• I
Thus, genetic maps are always probabilistic and sho uld not be over-interpreted. •• ', I ,j' j ',' ' . , '
I'::' •• ::.:•• !' ,.' ..":
In fact, the same is true of physical maps. Although physical distances in anyone
individual's DNA can be stated with complete precision, we saw in Chapter 13
.~:: ::.!;,.::::.. ':, .' ..... 0 ••
'w
that human populations are polymorphic for numerous large deletions, inser E
tions, and inversions. In the end, both genetic and physical maps are tools for §. 10
Individual
achieving certain purposes, not absolute descriptions of human beings. •ac
~ 70
c females
The main purpose of genetic mapping is to locate the determinants of pheno ~ 50 .'. :-'. ·:~·'!.I
-.' ., .•- I~"" • '.
types. As mentioned above, these can not be inferred by any inspection of the
unannotated human genome sequence. Genetic mapping depends on estimat
ing the recombination fraction between pairs of loci. This requires an individual
40
30
," .::' .
.::'!·:·,!:il·:"1
..:,III":: ' ....
S":' 1 o ' , ..
I, t.
. , I'
'1: .. -.·
:'..
-;,:,:::'1.
.
who is heterozygous for both loci, as shown in Figure 14.1. People heterozygous 20
for two different genetic diseases are extremely rare. Even if they can be found, 10
individual
tlley ,"viII probably ha ve no children or be unsuitable for genetic analysis in some
other way. Thus, it is seldom possible in humans to map diseases or other pheno
Figul'. 14.7 Individual differences in
types against one another, as one would in Drosophila, numbers of recombinations per meiosis.
Cheung and colleagues studied an average
Mapping human disease genes depends on genetic markers of eight meioses per individual in 34 women
To map a human disease (or the locus determining any other uncommon pheno and 33 men by genotyping members of
the Centre d '~tude du Polymorphisme
type), we need to collect families in which the character of interest is segregating,
Humain (CEPH) Utah families for 6324 SNP
and then find some other Mendelian character that is also segregating in these markers. Each vertical line of dots shows the
same families, so that individuals can be scored as recombinant or non number of recombinations identified in the
recombinant. A suitable character would have the follOwing characteristics: different meioses in one person. The scatter
• It should show a clean pattern of Mendelian inheritance, preferably co within and between individuals shows that
dominant so that the genotype can always be inferred from the phenotype. there is no one correct human map length.
(Fro m Cheung VG, Burdic k JT, Hirsc hmann
• It should be scored easUy and cheaply using readily available material (e.g. a D & Morley M (2007) Am. J. Hum. Geller. SO,
mouthwash rather than a brain biopsy). 526-530. With permission from Elsevier.)
MAPPING A DISEASE LOCUS 449
Blood groups 1910-1960 - 20 May need fresh blood, and rare antisera . Genotype cannot always be inferred from
phenotype because of dominance. No easy physical localization
Electrophoretic mObility 1960-1985 -30 May need fresh serum, and specialized assays. No easy physical localization. Often
variants of serum proteins limited numbers of alleles and low frequency polymorphism
HLA tissue types 1970 One set of closely linked loci, usually scored as a haplotype. Highly informative;
many alleles with high or moderate frequency. Can only test for linkage to 6p21.3
DNA RFLPs 1975 > 10 5 Two-allele markers, maximum heterozygosity 0.5. Assayed previously by Southern
bloning, now by PCR. Easy physical localization
DNA VNTRs (minisatellites) 1985-1990 _10 4 Many alleles, highly informative. Assayed by Southern blotting. Easy physical
localization. Tend to cluster near ends of chromosomes
DNA VNTRs (microsatel lites) 1990 _10 5 Many alleles, highly informative. Can be assayed by automated multiplex PCR.
Easy physical localization. Distributed throughout "genome
DNA SNP, 2000 _ 107 Two-allele markers. Densely distributed across the genome. Massively paralle l
technologies (microarrays, etc.) al low a sample to be assayed for thousands of
$NPs in a single operation. More stable over evolutionary time than microsatellites
RFLP, restriction fragment length polymorphism; SNP, single nucleotide polymorphism; VNTR, variable number of tandem repeats.
III
available. In theory, linkage could be detected between loci that are 40 cM apart, Flgur~ 14.8 Informative and
but the amount of data required to do this is prohibitive ('fable 14.2). uninformative meioses. In each pedigree,
Obtaining enough family material to test more than 20-30 meioses can be individual 111 has a dominant condition that
we know he inherited along with marker
seriously difficult for a rare disease. Thus, as can be seen in Table 14.2, mapping
allele A, because he inherited the condition
requires markers spaced at intervals no greater than 10-20 cM across the genome. from his mother, who is homozygous AlA,.
Given the genome lengths calculated above, and allowing for imperfect informa Is the sperm that produced his daughter
tiveness, we need a minimum of several hundred markers to cover the genome. recombinant or non-recombinant between
Human genetic mapping only became generally feasible with the identification the loci for the condition and the marker?
of DNA markers, starting with restriction fragment length polymorphisms (A) This meiosis is uninformative: the father
(RFLPsj in the 1970s. As well as being more numerous than earlier protein-based (II,) is homozygous and so his marker alleles
markers, DNA markers have the advantage that they can all be assayed with the cannot be distinguished. (B) This meiosis
same techniques. Moreover, their chromosomal location can be easily deter is uninformative because both parents are
mined by reference to the human genome sequence. This allows DNA-based heterozygous for alleles A, and A2: the child
could have inherited A, from the father and
genetic maps to be cross-referenced to physical maps.
A2 from the mother, or vice versa. (el This
meiosis is informative and non-recombinant:
Linkage analysis normally uses either f1uorescently labeled the child inherited A, from the father. (0) This
micro satellites or SN Ps as markers meiosis is informative and recombinant: the
child inherited A2 from the father.
The first DNA-based linkage studies used RFLPs that had to be assayed by the
very laborious Southern blotting technique, which made a genomewide search
for linkage extremely onerous. Subsequent developments have steadily made
linkage analysis quicker and easier. Microsatellite (STRP) markers were intro
duced that were highly informative because they had many possible alleles, and
could be assayed using PCR. In the early days, each microsatellite was amplified
individually and run individually on a manually poured slab gel. Later, compati
ble sets of microsatellite markers were developed that could be amplified together
in a multiplex PCR reaction to give non-overlapping allele sizes, so that they
could be run in the same gel lane. With fluorescent labeling in several colors, it
became possible to genotype a sample for 10 or more markers in a single gel lane
or capillary of a sequencing machine.
A major achievement of the early stages of the Human Genome Project was to
identify upward of 10,000 highly polymorphic microsatellite markers and place
them on framework maps (see Chapter 8). These maps made possible the spec
tacular progress ofthe 1990s in mapping Mendelian diseases. The original micro
satellites were mostly repeats of the dinucleotide sequence CA. Trinucleotide and
tetranucleotide repeats have gradually replaced dinucleotide repeats as the
markers of choice because they give cleaner results.
With the widespread adoption of microarray technology, SNPs have come to
the fore as genetic markers. SNPs normally have only two alleles, which means
that a higher proportion of meioses are uninformative for linkage using a SNP
than using a microsatellite. Two to three SNPs will generally be required to
TABLE 14.2 THE NUMBER OF INFORMATIVE MEIOSES NEEDED TO OBTAIN EVIDENCE OF LINKAGE BETWEEN TWO LOCI
The higher the recombination fraction is between two linked loci, the more meioses are needed to obtain eVidence that they are linked. See Box 14.2
for an explanation of how these figures were calculated.
TWO-POINT MAPPING 451
(A) (8 )
A2~ AlAs
II
Ai ~ I A3 A4
"' Ai A3 A2 A3 AI A~ Al A~ A2 A4 A, A,
"' AIA3 ~A3 AI A4 AIA4 A2A4 A2A3
N N N N N R
(C)
"'
A1 AJ A,A 3 AIAa AIA4 ~A4 A,As At ~ Al A6
Figur. 14.9 Recognizing recombinants. Three versions of a family with an autosomal dominant disease, typed for a ma rker with alleles A , - A6'
(AJ All meioses are phase -known. We can identify !II,-Ills unambiguously as non-recombinant (N) and 1116 as recombinant (Rl. (9) The same famil y, but
phase-unknown. The mo ther, II" could have inherited either marker allele A, or A2 with the disease; thus, her phase is unknown . Either 11I,-lIIs are
non- recombinant and 1116 is recombinant; or Ilil-lils are recombinant and 1116 is non -recombinant. (C) The same family after furth er tracing of relati ves. III,
and ilia have also inherited marker alle le A, along w ith the disease from their father-but we canno t be sure whether their father's allele A, is identical by
descent to the allele Al in his sister 11 ,. Maybe there are two copies of allele A , among the four grandpa rental marker alleles. The likelihood of this depends
on the gene frequency of allele A, . Thus, although this pedigree contains linka ge informati on, extra cting it is problematIc.
452 Chapter 14: Genetic Mapping of Mendelian Characters
recombinants unambiguo usly, even if the first alternative seems much more
likely than the second. Figure 14.9C adds yet more complicatio ns: furrher rela
tives have been traced who have also inherited marker allele Al along with the
disease from rheir father Ih-but we canno t be sure whether this Al allele is iden
tical by descen t to the A l allele in their aunt III . However, if this were a familywirh
a rare disease, no researcher wo uld be willing to ignore it. Some method is needed
to extract the linkage information fro m a family wirh only the incomplete or
ambiguous identification of recombinants and non· recombinants.
Let us take as an example a calculation of lod scores for the families in Family B
Figure 14.9A and B. The mother (11 1 in Figure 14.9B) is phase-unknown.
Given that th e loci are truly linked, with recombination fractio n If she inherited Al w ith the disease, there are five non-recombinants
9, the likelihood of a meiosis being recombinant is 9 and th e and one recombinant.
likelihood of its being non·reco mbinant is 1 - 9. If she inherited A2 with the disease, there are live recombinants and
Ifthe loci are, in fact. unlin ked, the likelihood of a meiosis being one non-recombinant.
either recombinant or non·recombinant is 0.5. The overall likelihood is ~[(1 - 9)s x 9/(0.5)6] + V2[(1 - 9) x 951(0.5)6].
Family A This allows for either possible phase, with equal prior probability.
There are live non-recombinants (1 - 9) and one recombinant (9) .
The lad sco re, Z, is the logarithm of the li kelihood ratio.
The overall likelihood, given linkage, is (1 - 9)5 x 9.
e
The likelihood ratio is (1 - 9)5 x 1 (0.5)6.
r;-r; 0.1 0.2 0.3 0.4 0.5
I:
0 0.2 0.3 0.4 0.5
- 0.509 0.299
0.7
-~ 0.5 77 0.623 0
0.6
The loci score, Z, is the logari thm of the likelihood ratio. 0 .5
0 .7 ~ 0.4
8
0 .6 •
~ 0 .3
~
0.5 " 0.2
•8 D.' 0. 1
~ 0.3
must take likelihoods calculated for each possible genotype of II, 12, and 11 3,
weighted by the probability of that genotype. For Il and 12 , the genotype proba
bilities depend on both the gene frequencies and the observed genotypes of III,
III?, and IIIe. Genotype probabilities for 113 are then calculated by simple
Mendelian rules. Human linkage analysis, except in the very simplest cases, is
entirely dependent on computer programs that implement algorithms for han
dling these branching trees of genotype probabilities, given the pedigree data
and a table of gene frequencies.
The result oflinkage analysis is a table oflod scores at various recombination
fractions, like the two tables in Box 14.2. Positive lod scores give evidence in favor
oflinkage, and negative lods give evidence against linkage. Note that only recom
bination fractions between 0 and 0.5 are meaningful, and that a1llod scores are
zero at e = 0.5 [because they are then measuring the ratio of two identical prob
abilities, and 10g IO (lJ = OJ. The results can be plotted to give curves (see Box
14.2J.
Returning to the two questions posed at the start of this section, we now see
that the most likely recombination fraction is the one at which the lad score is
highest. If there are no recombinants, the lad score will be maximum at e =o. If
there are recombinants, Z will peak at the most likely recombination fraction
(0.167 = lf6 for family A in Figure 14.9, but it is harder to predictfor fanlily B).
Lod scores of +3 and - 2 are the criteria for linkage and e><clusion
(for a single test)
The second question posed at the start of this section concerned the threshold of
significance. Here the answer is at first sight surprising. Z =3.0 is the threshold for
accepting linkage, with a 5% chance of a type 1 error (fal sely rejecting the null
hypothesis). For most statistics, p < 0.05 is used as the threshold of Significance,
but Z = 3.0 corresponds to 1000:1 odds [IoglO (1000) = 3.01. The reason why such a
stringent threshold is chosen lies in the inherent improbability that two loci, cho
sen at random, should be linked. With 22 pairs of a utosomes to choose from, it is
not likely the two loci would be located on the same chromosome (syntenic) and,
even if they were, loci well separated on a chromosome segregate independently.
Common sense tells us that if something is inherently improbable, we require
strong evidence to convince us that it is true. This common sense can be quanti
fied in a Bayesian calculation (Box 14.3), which shows that odds of 1000: 1 in fa ct
correspond preCisely to the conventional p= 0.05 threshold of significance. Thus,
Z = 3.0 is the threshold for accepting linkage with a 5% chance of falsely rejecting
the null hypothesis. Linkage can be rejected if Z < -2.0. Values of Z between -2
The likelihood that two loci should be linked (the prior probability of linkage) has been argued
over, but estimates of about 1 in 50 are widely accepted.
I Hypothesis
i
Loci are linked loci are not linked
(recombinat ion fraction (recombinati on
=9) fraction = 0.5)
Because of th e low prior probability that two randomly chosen loci shou ld be linked, evidence
giving 1000:1 odds in favor of linkage is required in order to give overal l 20: 1 odds in favor of
linkage. This corresponds to the conven tional p = 0.05 threshold of statistical significance. The
calculation is an example of the use of Bayesian statistics to com bine probabilities. See the text
for a description of the lod score.
454 Chapter 14: Genetic Mapping of Mendelian Characters
FiguMI!! 14.10 Lod score curves. Graphs of lod score against recombination
5
fraction (9) from a hypothetical set of linkage experiments. The green curve shows
eviden ce of linkage (Z > 3) with no recombinants. The red curve shows evidence
of linkage (Z > 3)' with the most likely recombination fraction being 0.23. The
purple curve shows linkage excluded (Z < -2) for recombination fractions
below 0.12; inconcluslye for larger recombination fractions. The blue curve is
inconclusiye at all recombination fractions.
and +3 are inconclusive. The same logic suggests a threshold lod score of 2.3 for
establishing linkage between an X-linked character and an X-chromosome
marker (prior probability of linkage ~ 1110).
Confidence intervals are hard to deduce analytically, but a widely accepted
support interval extends to recombination fraction s at which the lod score is 1
unit below the peak value (the lad -1 rule). Thus, the red Curve in Flgure 14.10
gives acceptable evidence oflinkage (Z> 3), with the m ost likely recombination
fraction 0.23 and support interval 0.17-0.32. The curve will be more sharply
peaked for greater amOuntS of data but, in general, peaks are quite broad. It is
important to remember that distances on human genetic maps are ofren very
imprecise estimates.
Negative lod scores exclude linkage for the region where Z < -2. The purple
curve in Figure 14.10 excludes the disease from 12 eM (= 0.12 recombination frac
tion) either side of the marker. Although gene mappers hope for a positive lod
score, exclusions are not without value. They tell us where the disease is not
(exclusion mapping). This can exclude a possible candidate gene, and if enough
of the genome is excluded, only a few po ssible locatio ns may remain.
Table 14,3 shows a real example of two-point lod scores from a study offami
lies with migraine with aura. There are several points worth noting:
There is evidence oflinkage-some lod scores are well above the 3.0 threshold
(green shaded rows).
TABLE 14.3 TWO-POINT LOD SCORES FOR A SERIES OF MARKERS IN FAMILIES SHOWING AUTOSOMAL DOMINANT
INHERITANCE OF MIGRAINE WITH AURA
Marker led (Z) at e= -
0.001 0.01 0.05 0.1 0.2 0.3 0.4
I1AIIItJ ..... ~
01591 ...,
1M 5.41
$21
5.04
4.80 4.2111
UJ
3.ll
2.10
1.94
W
0.73
Markers from chromosome lSq 11-q13 are listed in chromosomal order. For each marker, the maximum lod score is shown in bold .The markers
highlighted in green show signilicant evidence of linkage to the locus (Le. Z> 3.0) determining migraine in these families. A multipoint analysis of the
same data is shown in Figure 14.10. IData from Russo L. Mariotti P. Sangiorgi Eet al. (2005) Am. 1. Hum. Genet. 76, 312-326.]
MULTIPOINT MAPPING 455
• For some markers, the maximum lad score is at zero recombination (like the
green curve in Figure 14.10; actually e = 0.001 is used to avoid problems with
minus infinity scores for unlinked markers). For some others (D15S986 and
D15S113) it is slightly negative at e = 0.001 but rises almost to the threshold of
significance at e= 0.1, like the red curve in Figure 14.10 altll0ugh less extreme;
for others (D15S1012 and D15S971) it is always negative, like the purple curve
in Figure 14.10.
Tables like this should not be over-interpreted. The precise lad score depends
on the vagaries of which meioses were informative with which markers. The thing
to note is the overall trend, showing strongly positive scores for markers toward
the center of the region covered, with negative scores at each end of the region.
ABc/abc (A,B)-x-C 5
abO abc
Abc/abc A-x-(B, C) 47
aBC/abc
One thousand offspring have been scored from crosses between female Drosophila fruit flies
heterozygous for recessive characters at three linked loci (ABC/abc) and triply homozygou s
males (abc/abc). This design allows the genotype of the eggs to be inferred by inspection of the
phen otype of the offspring. The rarest class of offspring will be those whose production requires
two crossovers. Of the 1000 flies, 142 (95 + 47) are recombinant between A and B, 52 (47 + S)
between A and C, and 100 (9S + S) between Band C. Only five flies are recombinant between A and
C but not between A and B, so these must have double crossovers, A- x- C- x-8. Therefore the map
order is A-C-B and the genetic distances are approximately A- (S cM)-C-(1 0 cMJ-B.
serious difficulty for map makers. Something more intelligent than brute-force
computing had to be used to work out the correct order. Even without large-scale
sequence data, physical mapping information was immensely helpful. Markers
that can be assayed by PCR can be used as sequence-tagged sites (STS) and phys
ically lOCalized, either by database searching or experimentally byusing radiation
hybrids. The result is a physically anchored marker framework.
Disease-marker mapping suffers from the necessity of using whatever fami
lies can be found in which the disease of interest is segregating. Such families will
rarely have ideal structures. Ali too often the number of meioses is undesirably
small, and some are phase -unknown. Marker-marker mapping can avoid these
problems. Markers can be studied in any family, so families can be chosen that
have plenty of children and ideal structures for linkage, liJce the family in Figure
14.1. The construction of marker framework maps has benefited greatly from a
collection offamilies each with eight or more children and all four grandparents,
assembled specifically for tile purpose by the Centre d'Etude du Polymorphisme
Humain (CEPH; now the Fondation Jean Dausset) In Paris. Immortalized cell
lines from every individual of these CEPH families ensure a permanent supply of
DNA, and sample mix-ups and non-paternity have long since been ruled out by
typing with many markers. As an example, the widely used 1998 CHLC
(Cooperative Human Linkage Center) map was based on the results of scoring
eight CEPH families with 8325 microsatellites, resulting In more than 1 million
genotypes.
5/ ' 10
I \
a multipoint mapping program (Simwalk).
The genetic map of markers is taken as fixed;
the disease locus is placed successively at
Ii
§
"
,•"
1"1 (marker positions are shown by black arrows;
colored arrows indicate the poSition of
two candidate genes) and the multipoint
lod score is calcu lated for each possible
~ -5
locatio n of the disease locus. Lad scores
Yl
dip to stron gly negative values near the
position of loci that show recombinants
- 10 with the disease. The highest peak marks
the most likely location; odds in favor of
this location are measured by the degree to
which the highest peak overtops its rivals.
-15 The peak 100 score is 6.54; that is, a tenfold
stronger likelihood than the maximum of
5.56 obtained by two~p oint analysis. There
which are usually only very roughly known. The physical distances may be known
were no recombinants with in this region in
precisely, but for this purpose we need the genetic distances. For markers 1 cM this data set, so the candidate region could
apart on the highest resolution map currently available (the rcelandic map of not be narrowed down further. (From Russo
Kong et al.J. the 95% confidence interval for their true distance is 0.5-1.8 cM. L, Mariotti p, Sa ngiorgi Eet al. (2005) Am. J.
Moreover, those figures relate only to the individuals whose data were used in Hum. Genef. 76, 327-333. With permission
constructing the map, who will not be the same individuals as those used in any from Elsevier.]
particular disease mapping study. As Figure 14.7 shows, total map length varies
between individuals. Additionally, none ofthe mapping functions in linkage pro
grams even approximates to the real complexities of chiasma distribution (see
Figure 14.5). However, unless distances on the marker map are radically wrong, it
remains true that the highest peak marks the most likely location. If the marker
framework is physically anchored, as described above, the stage is then set to
search the DNA of the candidate interval and identify the disease gene.
• ' \1
"' 52 52
12 12
12 12
5 2 52
'v
32
6 1
22
3 3
"
6 1
, 2
33
6
02$305 , 5 65 57 57 57 56
02S310 21 12 2 1 12 7 3 7 3 13
.. ..
12 6 3 6 3
, 5
02S144
0 25171
AFMhJ46ye5
6 1
75
13
23
21
3 ,
27
32
"
5 3
34
12
53
34
12
53
32
AfMd052',b5 56 63 6 5 6 5 62
O:;:Sl ~ 22 , 2 21 22 22 22
025174 33 56 53 53 56
0251365 9 7 5 2 28 27 , 7 2 7
025170 56 26 6 2 65 65 6 2
V
, 5 , 6 , 5 5 ~ 55
025305
025310
D2S1<l<l
,
1
2
3
5 5
12
13
5 6
1 1
12
21
12
"
12
13
22
6 3
"
12
13
7 7
12
52
77
12
52
57
22
, 2
7 7
12
52
5 5
2 1
, 1
52
12
12
11
11
52
12
12
5 2
1 2
11
56
13
12
55
11
11
56
11
1 1
11
11
025171 5 1 5 1 52 52 5 1 7 1 5 1 73 73 , 3 7 3 25 52 55 52 5 5 53 55 55 5 5
AFMD3%ylii6 32
M'M£)S2".tr5
0251'\",
"6226 "66
22
33
65
2 4
33
65
24 "
G6 ",226 34
6 6
2 4
3 5
12
24
3 5
12
"6225 24
3 5
33
6 6
3 2
6 1
33
66
22
6 1
22
33
6 6
22
32
62
33
6 6
33
66
3 3
6 6
D2 S-174
0251365
02$ 110
35
7 ,
3 5
7 2 "7 5 7 5
34
22
35
72
35
9 2
22
35
7 2
63
8 7
, 5
6 3
8 7
53
2 7
6 3
8 7
12 22
5 5
22
22
5 3
23
53
27
5 3
23
5 3
2 7
22
36
7 7
6 ,
"
35
7 2
22
3 5
72
22
3 5
72
66 66
" " 66 56 66 25 6 5 25 6 6 6 3 66 6 3 66 66 6 6 66
Flgul'.14.1 3 Autozygosity mapping. A large. multiply consanguineous family in which several members suffer from autosomal recessive profound
congenital deafness (fi lled symbols). Shaded blocks mark a ha plotype of microsatellite markers from chromosome 2 that segregates with the deafness.
MarkersAFMa052yb5 and 025158 (blue) are homozygous in all affected people and no unaffected people. The deafness gene should lie somewhere
between the two markers that flank these (AFMb346ye5 and 025174). (Adapted from ( haib H, Place C, Salem N et at (1996) Hum. Mol. Genet. 5, 155- 158.
With permission from Oxford University Press.)
FINE-MAPPING USING EXTENDED PEDIGREES AND ANCESTRAL HAPLOTYPES 459
X"K, 3 49
X,.K2 147 19
X2,K, 8 70
Xl< K2 8 25
Data from typing for two RFLP markers, the first with alleles XI and X2 and the second with alleles
K, and K2 in 114 British families with a child with cystic fibrosis (CF). Chromosomescarrying the CF
disease mutation tend to carry allele XI and allele K2. [Data from Ivinson AJ, Read AP, Harris Ret al.
(1989) J. Med. Genet 26, 426-430.1
460 Chapter 14: Genetic Mapping of Mendelian Characters
1-.
--.. defined using 16 markers, shown in
chromosomal order acrossthe top of the
tabl e. The beige color markslocations with
allelesid entical to those of the inferred
ancestral haplotype. Allelescolored orange
1-' . • differ from the putative ancestral version
but might be the result of a mutation of the
ancestral allele. Allelescolored blue differ
r-:: substantially from the ancestral allele. and
are probably the result of recombination.
L '= Blanks mark loci for which there are no
data. Only at loc i 11 and 12 are there no
recombinant (blue) alleles. suggesting that
the NBS gene maps to thisposition. (Data
reported in Varon R. Vissinga C, Planer M et
--" at (1998) Ceff93, 467-476. With permission
from Elsevier.]
---- - +
generated 102 hap lotypes of chromosomes that carried an NBS disease muta
tion_ Of these, 74 looked like derivatives of a common ancestral haplotype, most
probably of Slav origin. The most highly conserved region lay between markers
11 and 12 (FIgure 14.14), which therefore marked the likely location of the NBS
gene. Subsequently, a gene encoding a novel protein was cloned from this loca
tion and was shown to carry mutations in NBS patients. As predicted, patients
with the common haplotype all had the same mutation, whereas those with inde
penden t haplotypes had independent mutations.
Linkage disequilibrium is a central tool in efforts to identify susceptibility
genes for complex diseases, and is discussed in detail in Chapter 15.
marker 8
-t"' + 7
1
i I I
1
-•• •••••••
heterozygous for many SNPs across the region of interest (typically a few
kilobases). (8) His sperm DNA is divided into aliquots sufficiently small that each
is estimated to contain, on average, less than one molecule that is recombinant
••••••• ••
aCrOSS the region of interest along with several thousand non-recombinant
molecules. (C) PCR is performed on each aliquot by using allele-specific primers
(red and blue arrows); red circles and blue squares represent the alternative alleles
••
at each heterozygous SNP. Only recombinant molecules (the bottom molecule in
• • •
-• • • • • •
• • • •
•• •
the figure) w ill hybridize to both primers and be amplified. Any peR product can
th·en be genotyped for the individual SNPs to identify the exact position of the
crossove r that produced each recombinant spe rm. • •••
CONCLUSION 463
CONCLUSION
The human genome sequence provides the ultimate physical map, enabling
physical entities such as genes, DNA sequences, and chromosomal breakpoints
to be localized to the nearest nucleotide. However, localizing the determinants of
phenotypes such as genetic diseases requires an alternative approach, namely
genetic mapping. Genetic mapping depends on identifying a physical entity that
consistently travels together (co-segregates) with the phenotypic determinant
through multiple meioses in a collection of families. Because of recombination
during meiosis, only sequences that lie very close together on the same chromo
somal segment will consistently co -segregate.
The usual physical entities used in genetic mapping are microsatellites or sin
gle nucleotide polymorphisms (SNPs). Array-based technologies allow hundreds
of thousands of SNPs, distributed across the whole genome, to be tested quickly
and easily for co-segregation with a phenotype of interest. If there is significant
but imperfect co-segregation, tile fraction of meioses in which the two fail to co
segregate (the recombination fraction) is a measure of their genetic distance
apart. Genetic distances are nor ide ntical to physical distances, measured in kilo
bases, because tile probabiliry of a crossover is not uniform along the whole
length of every chromosome. It varies systematically between the sexes and
between individuals. The relationship between the genetic distance, measured in
centimorgans, and the recombination fraction is not linear, but must be expressed
mathematically by a mapping function. This is necessary because recombination
fractions never exceed 0.5, no matter how far apart physically two loci are.
464 Chapter 14: Genetic Mapping of Mendelian Characters
FURTHER READING
General ideas in genetiC recombination, from the man who proposed
the Holliday junction. The full text is available electronically.]
Laboratory of Stati stical Genetics, Rockefeller University. http:/ /
Hulten MA & Lindsten J (1973) Cytogenetic aspects of human
linkage.rockefeller.edu/ [A very useful website, giving access
to all the programs mentioned in this chapter, and a wealth male meiosis. Adv. Hum. Genet. 4, 327-387. [Classical
of general information about human linkage analysis.] microscope study of male meiosis.)
Jeffreys AJ, Neumann R, Panayi M et al. (2005) Human
Ott J (1999) Analysis Of Human Genetic Linkage, 3rd ed. Johns
Hopkins University Press. [Gi ves a full discussion of linkage recombination hotspots hidden in regions of strong marker
analysisand mapping functions.] association. Nat. Genet. 37, 601-606. [An elegant and
comprehensive exploration of recombination hotspotsin
Terwilliger J & Ott J (1994) Handbook For Human Genetic
a region of human chromosome 1; also a useful source of
Linkage. Johns Hopkins University Press. [Full of practical
references to earlier work.]
advice indispensable to anybody undertaking human linkage
Keeney S (2001) Mechanism and control of meiotic
analysiS.]
recombination initiation. Curro Top. Oev. BioI. 52, 1-53.
The role of recombination in genetic mapping [A detailed review of the molecules involved in meiotic
recombination, focusing on yeast.]
Broman KW & Weber JL (2000) Characterization of human
crossover interference. Am. J. Hum. Genet. 66, 1911-1926.
(An attempt to deline the best mapping function for human Mapping a disease locus
linkage analySiS.) Botstein D, White RL, Skolnick M & Davis RW (1980) Con struction
Cheung VG, Burdick JT, Hirschmann D & Morley M (2007) of a genetic linkage map in man using restriction fragment
Pol ymorphic variation in human meiotic recombination. Am. length polymorphisms. Am. J. Hum. Genet. 32, 314- 331.
J. Hum. Genet. 80, 526-530. [A study of an average of eight [A cla ssic in the history of developing markers for linkage.]
meioses in each of 67 individuals using 6324 SNPs, gi ving Shete S, Tiwari H & Elston RC (2000) On estimating the
detail of inter-individual and chromosomal variations in heterozygosity and polymorphism information content
recombination .) value. Thear. Popul. Bioi. 57, 265-271 . [For those interested
Holliday R (1990) The history of the DNA heteroduplex. BioEssoys in a thorough mathematical analysiSof informativeness of
12, 133-142. [An interesting personal review of the history of markers.]
Chapter 15
KEY CONCEPTS
Many common diseases are multifactorial, with a variety of environmental and genetic
factors each reducin g or increasing an individual's susceptibility to that disease. They are
also usually complex, having many different possible causes.
The major genetic effects on human health and disease come from genetic factors that
affect susceptibility to common complex diseases, rather than those that cause Mendelian
diseases. Amain aim of much current hwnan genetic reseruch is to identify such factors.
• Evidence for the role of genetic factors in many common complex diseases comes from
studies of families, twins, and adopted people. Such studies need careful interpretation to
disentangle genetic effects from the effects of a shared family environment.
• Linkage analysis has shown only very limited success in mapping susceptibility factors for
commo n complex diseases. Standard lod score analysis cannot be used because it is not
possible to specify certain parameters for the susceptibility factors, such as the mode of
inheritance, gene frequencies, or penetrances. Instead, non-parametric methods are used,
which do not require a detailed genetic model to be provided. Non-parametric linkage
methods take affected relatives and look for chromosomal segments that they share more
often than expected by chance. Affected sib pairs are the most freque ntly used sets of
relatives.
• An alternative approach to finding susceptibility factors is to seek populationwide statistical
associations between a certain genotype and a disease. Such associations may arise because
many supposedly unrelated people share a chromosome segment inherited from a distant
common ancestor who carried a susceptibility factor. However, not all populationwide
associations have a genetic cause.
• The International HapMap Project has defined the ancestral chromosome segments in four
human popUlations, and cataloged markers (tagging SNPs) that can be used to identi.fy
them.
• Shared ancestral chromosome segments are extremely small (typically a few kilobases).
Thus, seeking associations requires the use of a dense array of closely spaced markers. It has
only recently become feasible to conduct genomewide association studies.
• Identifying susceptibility factors, either by linkage studies or by association, has proved
unexpectedly difficult. Only recently have studies become sufficiently powerful to identify
susceptibility factors reliably.
• Association studies can only identify factors that are present on chromosome segments that
are shared by many individuals in the study group. Thus, the ability of association stu dies to
identify susceptibility factors for common complex diseases depends on the common
disease-common variant hypothesis, which supposes th at most susceptibility factors are
ancient variants found on shared ancestral chromosQ,m e segments.
An alternative hypothesis, the lIlutation-selection hypothesis, supposes 1Il0st factors to be
the result of a highly heterogeneous set of individually rare recent mutations.
• If the mutation-selection hypothesis is true, identifying susceptibility factors will require
large-scale resequencing of individual genomes from cases and controls. New sequencing
technologies make this feasible.
468 Chapter 15: Mapping Genes Conferring Susceptibility to Complex Diseases
The main genetic contribution to morbidity and mortality is through the genetic
component of common complex diseases. These diseases have no one single
cause, but result from the cumulative effects of a variety of genetic and environ
mental factors, often different in different affected individuals. Identifying the
genes concerned is a central task for medical research. However, this is a much
more formidable task than identifying the mutations that cause monogenic dis
eases. For Mendelian diseases, given a sufficient collection of affected families,
linkage analysis as described in Chapter 14 can usually local ize the causative
gene to a small chromosomal segment containing only a handful of candidate
genes. Similar studies in complex diseases have been much less successful.
A meta-analysis by A1tmillier and colleagues in 2001 reviewed !OI linkage
studies in 31 complex diseases. The result was sobering. Candidate regions
defined in different linkage studies of the same disease rarely coincided. There
were some real successes, reviewed by Lohmueller and colleagues (see Further
Reading) but, despite huge efforts by leading research teams, overall the studies
had made only limited progress in localizing susceptibility genes. Recently this
has changed. A combination of new technology (high -resolution SNP chips) and
new understanding of the structure of human genomes (the HapMap project) is
finally allowing susceptibility factors to be reliably identified. In this chapter, we
describe the ways in which this difficult problem has been approached.
Numbers at risk are corrected to allow for the fact that some at-ri sk relatives were below or only
just within the age of risk (or schizophrenia (say, 15-35 years). A values are calculated assu ming a
population incidence of O.8%.lData from McGuffin P, Shanks MF & Hodgson RJ (eds) (1984) The
papers by Risch (see Further Reading). As an example, TabJe 15.1 shows pooled
data from several studies of schizophrenia. Family clustering is evident from the
raised A values, for example a sevenfold in creased risk for somebody, one of
whose parents is schizophrenic. As expected, A values drop back toward 1 for
more distant relationships such as nephews, nieces, and cousins.
MZ DZ
Kringlen et al. (1968) Norway 14/50 (0.28) 6/94 (0.06)
The numbers show the total number of twin pairs ascertained and the number that were
concordant (both twins diagnosed as schizophrenic). Diagnostic and inclusion criteria varied
between studies; despite the heterogeneity there is a clear tendency for more monozygotic (MZ)
than dizygotic (DZ) pairs to be concordant. [For references, see Onstad S, Skre I, Torgersen S &
Kringlen E (1991) Acta Psychiatr. Scand. 83, 395-401.]
both twins have the condition (concordant, +1+) and the number of pairs in
which only one twin is affected (discordant, +1-). Probandwise concordance is
obtained by counting a pair twice if both were probands, and thus gives higher
values fo r the concordance. For example, in the study by Onstad et a1 cited in
Table 15.2, in 7 of the 24 MZ twin pairs, both twins were independently ascer
tained as affected probands. Thus, the probandwise concordance was 0.48, cal
culated as [(8 + 7)1 (24 + 7)], in comparison with the pairwise concordance of 0.33
(8/24). Probandwise concordances are thought to be more comparable with
other measures of family clustering.
Genetic characters should show a higher concordance in MZ than DZ twins,
and many characters do. However, a higher concordance in MZ twins than in DZ
twins does not prove a genetic effect. For a start, half of DZ twins are of different
sexes, whereas all MZ twins are the same sex. Even if the comparison is restricted
to same-sex DZ twins (as it is in the studies shown in Table 15.2), at least for
behavioral traits the argument can be made that MZ twins are more likely to look
very similar, to be dressed and treated the same, and thus to share more of their
environment than DZ twins.
Separated monozygotic twins
Monozygotic rwins separated at birth and brought up in entirely separate envi
ronments seem to provide an ideal experiment for separating the effects of shared
genes and shared environment. Francis Crick once made the tongue-in-cheek
suggestion that one of each pair of twins born should be donated to science for
this purpose. Such separations happened in the past more often than one might
expect-the birth of twins was sometimes the last straw for an overburdened
mother. Fascinating television programs can be made about twins reunited after
40 years of separation, who discover they have similar jobs, wear similar clothes,
and like the same music. As research material, however, separated twins have
many drawbacks:
Any research is necessarily based on small numbers of arguably exceptional
people.
The separation was often not total-often the twins were separated some time
after birth, and were brought up by relatives.
similar separated twins, but separated twins who are very different are not
newsworthy.
Research on separated twins cannot distinguish intrauterine environmental
causes from genetic causes. This may be important, for example in studies of
sexual orientation (the gay gene), in which some people have suggested that
maternal hormones may affect the fetus in utero so as to influence its future
sexual orientation.
SEGREGATION ANALYSIS 471
Thus, for all their anecdotal fascination, separated twins have contributed
relatively little to human genetic research.
Index cases (47 chronic schizophrenic adoptees) 44/279 {1 5.8%1 2/1'1 (1.8%)
Control adoptees (matched for age, sex. social status of adoptive 5/234 (2 .1 %1 2/117 (1 .7%)
-
family, and number of years in institutional care before adoption)
The study involved 14,427 adopted persons aged 20-40 years in Denmark; 47 of them were diagnosed as chronic sch izophrenic. The 47 were
matched with 47 non-schizophrenic control subjects from the same set o f adoptees. (Data from Kety 55, Wender PH, Jacobsen Bet al. (1 994)
Arch. Gen. Psychiatry 51,442-455.1
472 Chapter 15: Mapping Genes Conferring Susceptibility to Complex Diseases
Major dominant locus l.()() 7.56 1.2 x 10- 5 0.19 2.8 0.42
Data are for families ascertained through a proband with long-segment Hirschsprung disease (OMIM 142623). Parameters [hat can be varied are as
follows: d, the degree of dominance of any major disease allele; r, the difference in liability between people homozygous for the low-susceptibility
and the high-susceptibility alleles of a major susceptibility gene, measured in units of standard deviation of liability; q, the gene frequenc y of any
major disease all ele; H, the proportion of total variance in liability that is due to polygenic inheritance, in adults; z, the ratio of heritability in children to
heritability In adults; x, the proportion of cases due to new mutation. The values shown are those that best account for the family data using the stated
model. The X2 statistic is a standard test that compares the performance of each model with the mixed model, in which a mix of all mechanisms is
allowed. A single major locus encoding dominant susceptibi lity explains the data as well as the mixed model. [Data from Badner JA, Sieber WK,
Garver Kl &'Chakravarti A (l990) Am. J. Hum. Genet. 46, 568-580.)
of these, and the environmental factors may include both familial and non
familial variables. [n complex segregation analysis, a whole range of possible
inheritance patterns, gene frequencies, penetrances, and so on, are modeled by
computer analysis to find the mix of scenarios that gives the greatest overall like
lihood for the observed data. Factors in this mixed model are then omitted or
constrained to identify the minimum that must be included so as to avoid a sig
nificant loss of explanatory power. As with lod score analysis (see Chapter 14), the
question asked is how much more likely the observations are when one hypoth
esis is compared with another.
In the example of Table 15.4, the ability of specific models (sporadic, poly
genic, recessive, dominant) to explain the data was compared with the likelihood
calculated by a general mixed model, in which the computer program could
freely optimize the mixture of single-gene, polygenic, and random environmen
tal causes. All models were constrained by the overall epidemiology, sex ratios,
and probabilities of ascertainment estimated from the collected data. A single
locus dominant model is not significantly worse than the mixed model at explain
ing the data (r
= 2.8, P = 0.42). [n contrast, models assuming no genetic factors
(sporadic), pure polygenic inheritance, or pure recessive inheritance perform
very badly. On the argument that simple explan ations are preferable to compli
cated explanations, the analysis suggests the existence of a major dominant sus
ceptibility to Hirschsprung disease. One such factor has now been identified, the
RET (rearranged during transfection) oncogene (OMIM 164761).
However clever the segregation analysis program is, it can do no more than
maximize the likelihood across the parameters it was given. If a major factor is
omitted, the result can be misleading. This was well illustrated by the data in
Table 15,5. McGuffin & Huckle asked their classes of medical students which of
their relatives had attended medical school. When they fed the results through a
The data are taken from a survey of medical students and their families. The meaning of the
symbolsis exp lained in the foornote to Table 15.4. Affected is defined as attending medi ca l school.
The analysis seems to support recessive inheritance, because this accounts for the data equally well
as the unrestricted model (but see the text). n.S., not significant. {Data from McGuffin P & Huckle P
(1990 ) Am. J. Hum. Genet. 46, 994-999.1
LINKAGE ANALYSIS OF COMPLEX CHARACTERS 473
~'
In the first case, identifying the Mendelian subset is intrinsically valuable, but
it does not necessarily cast any light on the causes of the non -Mendelian disease.
That was the case with breast cancer and Alzheimer disease. In breast cancer this
led to the discovery of the BRCAl and BRCA2 genes, as described in Chapter 16
(see Case 7, p. 526). Mutations in the PSEN1, PSEN2, and APP genes cause rare
Mendelian forms of early-onset Alzheimer disease. However, the common non
Mendelian forms of breast cancer and Alzheimer disease do not usually involve
Ai A3 A1 A2
Pi
Al A3 Al A2
~
mathematically in terms of the Mendelian probability of inheritance from the
defined common ancestor; for IES alleles, the population frequency is used. for
very rare alleles, two independent origins are unlikely, so IBS generally implies
lED. With common alleles, no such inference can be made. Multiallele microsat
ellites are more efficient than two-allele markers such as SNPs for defining lSD, "Ie AC 'A
and multilocus multi allele haplotypes are better still, because anyone haplotype AO I
Be 'h
is likely to be rare. Shared segment analysis can be conducted with either ISS or 80 %
lED data, provided that the appropriate analysis is used. IBD is the more power (8)
ful of the two, but it requires samples from more relatives because it is necessary
!Ii
to work out the likelihood that sharing is lED ratherthan IBS.
Affected sib pair analysis
Picking a chromosomal segment at random, pairs of sibs are expected to share 0,
I, or 2 parental haplotypes with frequencies of 11" V2, and 11" respectively (ratios of AC Ae 'h
AD 'h
1:2:1). However, ifboth sibs are affected by a genetic disease, then any segment of
chromosome that carries the disease locus is likely to be shared. For a fully pen
etrant Mendelian dominant disease, affected sibs would always share the paren (C)
~
tal haplotype that carried the disease allele; if the tlisease is recessive they would
always share both the releva nt parental hap 10 types (Figure 15.2). If a susceptibil ~B CD
ity factor is neither necessary nor sufficient for disease, then not all affected sib
pairs will share the relevant chromosomal segment, but there will still be a statis
tical tendency to share that segment more often than just by chance. This allows ,\C AC 1
a simple form of Iin kage analysis. Affected sib pairs (ASPs) are typed for markers,
Figure 15.2 Affected sib pair analysis. (A) By random segregation, sib pairs
share 2 (both AC), , (AC and either AD or BC), or 0 (AC and BO) parental haplotypes (D)
Y., 1/ 2 , andY. of the time, respectively. (B) Pairs of sibs who are both affec ted by a
~
Mendelian dominant condition must share the segment that carries the disease
allele, and they mayor may not (a 50:50 chance) share a haplotype from the
unaffected parent. (C) Pairs of sibs who are both affected by a Mendelian recessive
condition necessarily share the same two parental haplotypes for the relevant
chromosomal segment. (0) For complex conditions, haplotype sharing above that AC AC >' A
expected to occur by chance (as in panel A) identifies chromosomal segments AD I
ee >'h
containing susceptibility genes. BD <'A
LINKAGE ANALYSIS OF COMPLEX CHARACTERS 475
and chromosomal regions are sought where the sharing is above the random
1:2:1 ratios of sharing 2, 1, or 0 haplotypes IBD. If the sib pairs are tested only for
IBS, the expected sharing on the null hypothesis must be calculated as a function
of the gene frequencies.
Shared segment analysis can be performed using any set of affected relatives
and without making any assumptions about the genetics of the disease. Affected
sib pairs are especially favored because they are relatively easy to collect.
Multilocus analysis is preferable to single-locus analysis because it more effi
ciently extracts the information about lED sharing across the chromosomal
region. The MapmakerlSibs computer program is widely used to analyze multi
point ASP data. Programs such as Genehunter extend shared segment analysis to
other relationships. The programs calculate the extent to which affected relatives
share alleles IBD. The result across all affected pedigree members is compared
with the null hypothesis of simple Mendelian segregation (markers will segregate
according to Mendelian ratios unless the segregation among affected people is
distorted by linkage or association). The comparison can be used to compute a
non-parametric lod (NPL) score. Theoretical arguments by Lander & Kruglyak
suggest genomewide lod score thresholds of 3.6 for lED testing of affected sib
pairs, and 4.0 for IBS testing.
false positive anywhere in the genome is 0.05. In Chapter 14, we noted the dis
Criteria suggested by lander E& Kruglyak l (1995) Nat. Genet. 11,241-247. Th e figures for p values
and lod scores are for ASP studies as given by Altmuller ), Palmer LJ, Fischer G et al. (2001)
of the data on the given hypothesis. For example, as explained in Chapter 14,
in two-point analysis of a Mendelian cond ition, a lod score of 3 corresponds
to an absolute probability of 0.05.
Complex disease studies often use simulation to estimate their significance
thresholds. In a typical permuta.tion test, 1000 replicates of the family collection
are generated by a computer program with random marker genotypes, but based
on correct allele frequencies, recombination fractions, and so on. A genomewide
search is conducted in each simulated data set and the maximum lod score is
noted. The genomewide threshold of significance is set at a score that is exceeded
in less than 5% of the replicates. This is taken to be a lod score that has no more
than a 5% probability of having arisen purely by chance in the real data set.
Striking lucky
threshOld of
The actuallod score for any particular factor depends on how it happened to significance
segregate in the particular families studied. But there are many susceptibility fac
tors, and a genomewide scan tests for all of them. If there are a dozen or more initial study
cohort
susceptibility loci, but the studies are only marginally powerful enough to detect
the effect of any of them, then there is a large and unavoidable element of chance.
In one set offamilies, a certain factor may happen to segregate in a way that gives
a significant lod score, whereas in an independent set of families used in a repli threshold of
cation study, that factor may not happen to show such a favorable pattern (Plgure significance
15_3). In fact, targeted replication requires a study design with a much higher
statistical power than the original study. Thus, failure to replicate the conclusion replication
study cOhort
of a study does not necessarily mean that the original report was a false positive.
Nevertheless, for a long time the paucity ofwell-replicated findings cast a shadow
over linkage studies of complex disease. ten susceptibility loci (LCllO)
suggestive 8p
Williams et al. (1999) ASPs (UK or Irish Caucasian) 196 ASPs suggestive 4p, 18q, Xcen
aNumber of affected individuals. ASPs, affected sib pairs. Significance levels follow the Lander-Kruglyak criteria shown in Table 15.6. These studies
formed part of the meta-analysis by AltmGller J, Palmer LJ, Fischer G et al. (2001) Am. J. Hum. Genet. 69, 936-950. Full references are given in that paper.
~20 ~
g
0'
z_l
1
H~
z_l
-2 -2 -2 -2
-5 95 195 295 -5 95 195 -5 95 195 o 50 100 150
distance (eM) distance (eM) distance (cM) distance (eM)
.~ ~ ~. !~ . ~AA !~ 'f·
.~~
• 2
0'0 §10 ~
0'
z_l Z_l ~~ \. z_l 2_
1
-2 -2 -2 -2
-5 95 195 -5 95 195 -5 45 95 145 -5 45 95 145
distance (eM) distance (eM) distance (cM) distance (eM)
~~ ~
z_l
-2 -2 -2
-5 45 95 245 -5 45 95 145 -5 45 95 145
distance (eM) distance (eM) distance (cM) distance (eM)
~~~ §.21 ~
~ 2
~2 ~
~~~~
g 1
0' 0 0'0
~
z_l z_l 2 _1 --~ z_l
-2 -2 -2 -2
-5 45 95 145 -5 45 95 -5 45 95 -5 45 95
distance (eM) distance (eM) distance (cM) distance (eM)
!~A ~
• 2 • 2 • 2
o
~~ ~
g 1 8 1
z~O
_l ~~
o'O ~
z_l 2_1 v ~ z_l ~
-2 -2 -2 -2
-5 45 95 -5 45 95 -5 45 95 -5 45 95
distance (eM) distance (eM) distance (cM) distance (cM)
~~ ~
0' 0
z_l
-2 -2 -2
-5 15 35 55 -5 20 40 60 -5 45 95 145
distance (eM) distance (eM) distance (cM)
FlgUf215.4 Results of a genomewide association study of schizophrenia. A total of 171 affected individuals from 70 multiply affected sibships were
genoryped for 338 microsatellite markers spaced across the genome. The blue line shows results when only people with schizophrenia (DSM-III criteria)
were counted as affected. For the red line, individuals with DSM schizoaffective disorder were also included. The green line uses a broad definition of
affected (schizophrenia, schizoaffective disorder, paranoid or schizotypal personality disorder, delusional disorder, or brief reactive psychosis). In the
event, no significant or suggestive lad scores were observed. NPL, nonparametric lod score. [From Shaw SH, Kelly M, Smith AS et al. (1998) Am. J. Med.
Genet. (Neuropsychiatr. Genet.) 81, 364-378. With permission ofWiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.]
ASSOCIATION STUDIES AND LINKAGE DISEQUILIBRIUM 479
-t.j.. ,1.~
•
ancestor, 20 generations ago, of four
present-day individuals. There will be one or
two randOm crossovers in each chromosome
-t.j..,1.~
-t .j.. ,1. ~ arm in eac h ofthe 20 meioses linking
t 2ime,O: S\
,J.. .j.. .j.. ~
,J.. .j.. ~ ~
asswnptions about population structure and history to estimate the size distri
bution of shared ancestral segments. However, the wide stochastic variance and
reliance on unknowable details of population history make even the most elabo
rate calculations unreliable. What we need is data, and recently increasing quan
tities of real data have become available.
Studying linkage disequilibrium
LD is a statistic about populations, not individuals. To stud y it, a sample of indi
viduals from the chosen population is genotyped for a series of linked polymor
phic markers. SNPs are the markers of choice, for three reaso ns:
• They are sufficiently abundantthat they can be used to check very short chro
mosome segments for djsequilibrium.
• Tn comparison with microsatellites, they have a far lower rate of mutation.
This is important when the aim is to identify ancient conserved chrom osomal
segments.
• SNPs are easy to genotype on a large scale, up to 1 million at once, spread
across the genome. This is very hard with other polymorphisms.
Unless genotyping is done on individual chromosomes isolated by laboratory
mauipulation (an expensive option). the raw data will consist of the genotypes at
each individual locus, rather than haplotypes of alleles across multiple linked
loci. The raw genotype data must be phased-that is, converted into haplotypes
before any useful analysis can be done. This can be done by genotyping other
family members to infer haplotypes; alternatively, computer programs can be
used to convert genotypes into haplotypes. These programs use a maximwn
likelihood procedure: they guess possible haplotypes, and keep trying until Illey
find the guess that best explains the whole data set with the minimum number of
plausibly related haplotypes (haplotypes within a population should be related
to one another by a minimum ntu11ber of recombinations),
Various statistics have been used as measures of LD (Box 15,11.
If two loci have alleles A,a and B,b w ith frequenci es PA, Po, PrJ. and Pb, D' = (PAB - P~sJIDma~1 where Omax is the maximu m value o f
there are four possible haplotypes: AB, Ab, aB, and abo Suppose that IPAS - PAPsl possi ble with the given allele frequencies; [he vertical
the frequenc ies of the fou r haplotypes are PAS. PAb. PaB, and Pob. If lines indicate the absolute value or modu lus of the expression .
there is no LD, PAS = PAPs and so on, because the haplotype will be .12 = (PAS - PAP(j)2;(PAPaPBPb).
const ru cted just by random assortment of the constituent alleles. The D' is th e most widely used. It varies between 0 (no LD) and ± 1
degree of de parture, 0 , from thi s random association can be measured (comp lete association) and is less dependent than D on the all ele
by 0 = PASPob - PAbPaB. As a measure of LD, 0 suffers from the property frequencies. As a rule of thumb, D' > 0.33 is often taken as the
that its maximum absolute value depends on the gene frequencies threshold level of LD above w hich associations will be apparent in
at the two loci, as well as on the extent of disequilibrium. Among the usual size of data set. The prolifera tio n of alternative measures
preferred measures are: suggests t hat none is ideal.
distance apart. Closely spaced SNPs often showed little or no LD, whe reas some ·
times there was strong LD between more widely separated SNPs. Box 15.2 shows
how the patterns of disequilibrium can be represented graphically, and also illus·
trates just how complex and irregular the patterns can be. These irregularities
reflect the combined effects of several factors:
• Recombination, the force that breaks up ancestral segments, occurs mainly at
a limited number of discrete hotspots (see Chapter 14). SNPs that are close
together but separated by a hotspot show little or no LD, whereas, conversely,
LD ma y be strong across even a large chromosomal region if it is devoid of
hotspots.
• Gene conversion (see Box 14.l) may replace a small internal part of a con
served segment, producing a localized breakdown of LD, whereas markers
either side of the replaced segment continue to show LD with each other.
• Population history is impo rtant. The older a population is, the shorter are the
conserved segments. LD is more exte nsive and of longer range in populations
derived from recent founders, as often occurs with small, genetically isolated
populations. LD will have a shorter range in popul ations that have remained
constant in size than in populations that have undergone a recent expansion.
Sup erimposed on this regularity are many stochastic effects. Chromosome
segments may carry a mixture of marker alleles thai are IBD and alleles that
are only IBS (through independent mutations, back mutation, and so on), but
LD statistics do not distinguish betwee n these two cases.
Efficient disease association studies depend on a detailed knowledge of the
patterns of LD ac ross the genome. It is important to know how big and how
diverse the ancestral chromosome segments are, so that SNPs can be chosen to
test each conserved ancestral segment. The International Hap Map Project was
established to provide this detailed knowledge. A conso rtium of academic insti
tutions and pharmaceutical companies typed several million SNPs in 269 indi
vid uals drawn from four human populations: 30 white American parent--<:hild
trios from Utah (CEU). 30 Yoruba parent-child trios fro m Ibada n, Nigeria (YRI),
45 individual Han Chinese from Beijing (CHB), and 44 individual Japanese from
Tokyo (JPT ). All were ostensibly healthy and, although not a formal popnlation
sampl e, were hopefully typical of the population from which they we re drawn.
The Phase I report in 2005 (see Further Reading) summarized res ults from
1,007,329 SNPs; Phase 1I has added a further 3 million. The same individuals have
been used in several other studies, for example in surveys of copy-number vari
an ts and of gene expression levels, adding value and depth to the Hap Map data.
The CEU and YRl samples, which each consisted of parent-child trios, could
be phased directly. For each trio, the genotype of the child could be used to infer
phase ill the parental genotypes. Thus, each set of 30 trios yielded 120 phased
parental haplotypes. For the CHB and JPT samples, computer programs were
used to convert genotyp es into haplotypes.
The key findings can be summarized as follows (Table 15,8):
All four pop ulations show a similar structure of blocks of strong linkage dis
I
about the diagonal because coordinate (m,n)
283b16 2
contains the same data as coordinate (n,m). 063b 15- l ::L b , ' ~
0.00
In Figure 2 this redundancy is removed by
showing only one ha lf of the square. Tilted 80 kb 400 kb
over so that what was the diagonal is now the
Flg"u ... 1 The pattern of linkage disequilibrium across a 500 kb section of chromosome 13,
horizontal axis, this axis can be envisaged as
as displayed by the GOLD program. (Courtesy of Wil liam Cookson, University of Oxford.)
a map of the chromosome, wi th the colored
triangles giving an imp ression of the size
of haplotype blocks. In this case, the actual
physical distance between markers is not
represented; t he axes simply list them in order.
Only haplotypes having a frequency of O.OS or greater in the relevant population are reported here.
The CHB and JPT samples have been combined because they show very similar patterns. A different
statistical method of defining blocks gave somewhat different detailed figures but a similar overall
pattern. YRI, Yoruba from Ibadan, Nigeria; CEU, white Americans from Utah; CHB, Han Chinese from
Beijing; JPT, Japanese from Tokyo. [Data from The International HapMap Consortium (200S) Nature
437. 1299-1320.J
484 Chapter 15: Mapping Genes Conferring Susceptibility to Complex Diseases
• Blocks vary in size, but are typically 5-15 kb. Much longer blocks can be found
in chromosomal regions where there is very little recombination, such as cen
tromeres, or that have been subject to recent strong selection.
• The blocks, as defined here, cover only 67-87% (depending on the statistic
used to define a block) of the genomic regions examined; in other words,
13-33% of the regions analyzed show a more chaotic and diverse pattern of
variation between individuals.
• The general size of blocks is similar in the CEU, CHE, and /PT samples
(13.2-16.3 kb), but smaller in the YRI samples (7.3 kb). This accords with the
Out of Africa model of human origins that suggests that humans originated in
Africa and evolved there for a considerable time before the progenitors of
modern non-African populations migrated out. Sub-Saharan African popula
tions are older and more variable than all others. Their founders lie further in
the past, and the present structure reflects a greater number of generations
during which meiotic recombination has had the chance to fragment the
founder chromosomes.
• Human genetic variability is much more limited than the number of SNPs
might suggest. In each population studied, on average 4.0-5.6 haplotypes
acco unt for 93-95% of copies of any given block.
the data are checked for another possible association, there is another chance
of a type I error. The overall threshold of significance must be adjusted for the
trols of alleles at three HLA loci (HLA-A, HLA-B, and HLA-DR) are compared,
Each allele that was checked was a separate, although not fully independent,
positive result, the threshold pvalue for a single question is 0.05, for 10 ques
tions 0.005, and for 1,000,000 questions (typical of studies using high-density
SNP arrays) 5 x IO-B For large values of N this is overconservative, and the
The transmission disequilibrium test (TOT) statistic, which is based on 94 families with one
the standard X2 statistic, is or rTlOf'e diabetic children
(0 - W/(o + b) ,--_--'J.'--_-.
where 0 is the number of times that a heterozygous parent trans mits
I to the affected offspring, and b is the number of times that the
other allele is transmitted, The mathematically inclined can find the
!
neither patent
1
1000 isolated cases and their parents than it is to collect 1000 affected sib pairs, at
least for early-onset diseases, for which both parents are likely to be still alive. The
paper helped trigger a widespread move away from linkage studies and toward
studies of association.
BOX 1S.4 SAMPLE SIZES NEEDED TO FIND A DISEASE SUSCEPTIBILITY LOCUS BY A GENOMEWIDE SCAN
Risch & Merikangas (1996) calculated the sample sizes Tilbl., Comparison of the power of linkage and association studies
needed to find a disease susceptibility locus by a genomewide --
scan by using either affected sib pairs (ASPs) or the transmission Susceptibility factor ASP analysis TOT analysis
disequilibrium test (TDT). This Box summarizes their
formulae and equations, but the origina l paper (see Further
y P IY No. of pairs P(M) No. of trios
Reading) shou ld be consu lted for the derivations and for details.
5 0.01 0.534 2530 0.830 747
A standard piece of statistics tells us tha t the sample
size M required to disti ngu ish a genetic effect from the null
hypothesis with power (1 - fJJ and significance level a is
given by (Za - aZl _ rYII!2, where Z refers to the standard
I
0.1
0.5
0.634
0.591
161
355
0.830
t--
0.830
-83
108
-
a2 are calculated -
I
normal deviate. The mean ,u and variance
om 0.509 33,797 0.750 1960
as functions of the susceptibi lity allele frequency (P) and the
relative risk yconferred by one copy of the susceptibility allele. 0.1 0.556 953 0.750 25 1
- -
The model assumes that the relative risk for a person carryi ng
two susceptibility alleles is y2, that the marker used is always , 0.5 0.556 953 0.750 150
informative, and that there is no recombination with the
, 2 0. 1 0.5 18 9167 0.667 696
susceptib ility locus.
For ASP, the expected allele sharing at the susceptibility
0.5 0.526 4254 0.667 340
-
locus is given by Y = (1 + wl/(2 + w), where w = [pq(y - 1)2]1
(py+ q).,U = 2Y - 1 and if = 4 y(1 - Y). The genomewide 1.5 0.1 0.505 115,537 0.600 2219
threshold of significance (probability 0.05 of a false positive I
anywhere in the genome; testing for sharing ISO) requires a 0.5 0.510 30,660 0.600 950
lod score of 3.6, corresponding to a = 3 x 10- 5, and Za = 4.014. -f
For 800,.6 power to detect an effect, 1-{3 = 0.2andZ1 _/3:C=: -0.84. 1.2 0.1 0.501 3,951 ,997 0.545 11,868
For the TDT, the probability that a parent will be
0.5 0.502 696,099 0.545 4606
heterozygous for the allele in question is h:c=: pq(y+ 1)1 -_ . . -
(py+ q). P(trA), the probability that such a heterozygous parent The table shows the number .of affected si b pairs (ASPs) or parent- child
will transmit the h igh-r isk allele to the affected child, is ")1(1 + y). trios needed for 800,.6 power to detect sign ificant linkage or association
J1 = '-ih(y - 1 )/(y+ 1), and if = 1 - [h(y- , )2/(y+ 1)2]. As discussed in a genomewide search. yis the relative risk for individuals of genotype
above, for a genomewide screen involving 1,000,000 tests, Aa compared w ith aa; p is the frequency of the susceptibility alle le, A. For
a = 5 x 10- 8 ,Za = 5.33, and, as before, Zl _ p = -0.84. affected sib pair analysis, Y is the expected allele sharing; for the t rios analyzed
In Tabl. 1, the Zu, Zl _ p,IL and a2 va lues are used to by the transmission disequilibrium test, P(trA) is the probabi lity that a parent
calculate sample sizes by substitution in the formula heterozygous for allele A will transmit A to an affected child. For low values
M = (Za - aZl _p)2/,u 2. For the TOT, the answer is halved of y, unfeasibly large numbers of affected si b pairs are needed to detect an
because each parent-child trio allows two tests, one on each effect. [Data from Risch N & Merikangas K (1996) Science 273,15 16-15 17.]
parent.
488 Chapter 15 : Mapping Genes Conferring Susceptibility to Complex Diseases
Different studies of the same disease and candidate gene may report associations with different
markers, or with multilocus haplotypes. ilThese odds ratios and all allele or haplotype frequencies
are representative examples of data from one out of several reports on each association.
bThese odds ratios are from the meta-analysis by Lohmueller KE, Pearce CL, Pike M et at (2003)
15 ~
10
bipolar disorder
... Fig-uri! 15.9 Results of the Well come
Tru st Case -Control Consortium (WTCCC)
;.
iIilili
I-' I-' 1-'1-' ...... 1'V1\.lV x
genomewide association study. For each o f
the seven diseases studied, the distribution
W
'" '"
A
~
'"'" "'"
I-' I-'
15 W .b (JIm ....... ClI<DO I-"f\.) of p values (as - log loP) for t he association
...
15 ~
10
; ~ .
'" "
,.
W A
'" '" ~
I
'" '" 15 '"'" "'"
_.
'"w I-' I-' I-' I-' 1-'1-'1'V1\.lV
.b(JIm ....... ClI<DOI-"f\.)
x
the appropriate chromosom al position .
Most of the 469,557 SNPs th at passed all
the qu ality con trol checks showed wea k or
no association with the respective d isease
(blue dots, merg ed together). Those showing
strong er evidence of associa tion (p < 10-5)
Crohn disease
...... ... ... are marked with green dot s. Th e 25 mos t
15
10~ I
5 l1li
o
...
'" "
W
'illI• I-'
W
I-' 1-'I-'I-'I-'I-'I'VfoJ:V
./:10 (J1 OJ ....... CO<DOi->f\J
... x
strongly assoc iated SNPs or clusters o f SNPs
(p < 5 X 10- 7, th e thres hold of significa nce
in this study) are marked with red triangles.
[Adapted from Th e Well come Tru st
Case-Control Conso rt ium (2007) Na ture
15~
'~ ./ .. . iii..
'" " w
'" '" ... '" '" 0'" "''" "'" "' W
I-'
.b
............. I-> I-- I\)t.;r\) >0:
(]I m -J (X) <D O f-'/\)
15 ~
10
... type 2 diabetes
... T
5 .
0 '-
"' " w .. '" '" ... '" '" 15 "''" "'"
1IiII ~
W
1-"
.b (]I m ....... ClIU) O ~
'b I!II
........ .... .... 1-" 1-' 1\)t-J\) )<
chromosome
WTCCC study, and one of the remaining two showed positive signals that did not
q uite reach significance. Almost all of the novel susceptibility factors have subse
quently bee n confirmed in independent follow-up studies.
The WTCCC study is not the only one to have exploited the new tools and
knowledge noted above. Between them, these studies clearly show that the era of
reliable GWA studies has finally arrived.
The size of the relative risk
Considering the 25 SNPs showing the strongest evidence of association in th e
WTCCC study, the odds ratios (risk for a heterozygous carrier of the risk allele
compared with a ho mozygous non-carrier) varied between 1.09 and 5.49.
However, only the previously well-known associations of certain HLA types with
autoimmune disease (type I diabetes and rheumatoid arthritis) produced high
odds ratios. For all other cases, the highest ratio observed was 2.08, and in only
five cases did it even exceed 1.5. As already mentioned, the study was predicted
to have 80% power to detect a factor with an odds ratio of 1.5. The clear conclu
sion is that there are no individually strong ancient susceptibility factors for any
of th ese typ ical common complex diseases.
A common story is emerging from this and other similar studies: the ancient
susceptibility factors can be identified by curre nt methods- but the relative risks
they confer are q uite small.
THE LIMITATIONS OF ASSOCiAnON STUDIES 491
genes suggest that maybe half of all such changes may be mildly deleterious (with
a quarter being seriously deleterious and a quarter neutral). For silent changes
and changes in noncoding sequences, the proportion of neutral variants is likely
to be much higher.
Several preliminary resequencing studies have indeed found an excess of rare
missense variants in candidate genes in disease cohorts, but such studies need to
be conducted on a wider selection of genes and in more diseases to see whether
the effect is general. The new maSSively parallel sequencing technologies allow
this to be done, so that definitive tests of the mutation-selection hypothesis are
now possible.
HapMap
ACl D 1 A l 1 1 ! in th e genome are defined by 5NPs (color
versus white). Information from the HapMap
select SNPs to
(l g haplotypes
,I 1 1 U ,I 1 IJ project is used to select a subset of 5NPs that
will serve to identify (tag) each haplotype
!""""""1 I ,... '"i.
purple and blue for loc us 1, and either
1
genotyping
c
. 00 . 0 0 0 0 0 . 00
red or blue for locus 2. Disease cases and
controls are genotyped with microarrays .
5NPs that are associated with di sea se at an
appropriate statistical threshold (stars) are
300,OOO-SOO,000 SNPs 0 0 0 0 0 0 0 0 0 0 00 th en genotyped in a second independent
typed on high-density arrays
0 0 0 0 0 00 0 0 0 00 sample of cases and controls to establish
which of the associations from the primary
1
case ·control study OOtO ~:~t~
, IJ\JU
tO~Q ~~~~Q
hlt~{J hlQ~Q
! J.
genotype associated SNPs in an
replication test independent case ·control sample
confirm scan findings
FURTHER READING
Family studies of complex disease Linkage analysis of complex characters
Burmeister M, Mcinnis MG & ZoHner 5 (2008 ) Psychiatric genetics: Altmuller J, Palmer LJ, Fischer G et al. (2001) Genomewide scans
pro gres s amid controversy. Nat. Rev. Genet. 9, 527-540. [An of complex human diseases: true linkage is hard to find. Am.
overview of progress since the pioneering studies used as J. Hum. Genet. 69, 936-950. [A sobering meta-analysis of 101
exa mples in this section.] linkage studies in 31 complex diseases.]
McGuffin P, Shanks MF & Hodgson RJ (eds) (1984) The Scientific Lander ES & Kruglyak L (1995) Genetic dissection of complex
Principles Of Psychopathology. Grune & Stratton. traits: guideline s fo r interpreting and reporting linkage
Risch N (1990) Linkage strategies for genetically co mplex traits. results. Nat. Genet. 11 , 24 1-247. [IntrodUCing the widely used
1. Multilocus models. 2. The power of affected relative pairs. categories of sugge stive, significant, and highly significant
3. The effect of marker polymorphism o n analysis o f affected results.}
rela t ive pairs. Am. J. Hum. Genet. 46, 222-228; 22 9-241 ;
242-253. [Three key papers establishing the statistical basis Association studies and linkage disequilibrium
o f familial clu stering and shared segment analysis.) Cardon LR & Bell JI (2001 ) Association study designs for
Rosenthal D & Kety SS (1968) The Transmission Of Schizophrenia . complex diseases. Nat. Rev. Gener. 2, 91-99. fA general no n
Pergamon Press. mathematical discussi o n o f de signs for association studies.]
International HapMap Con sortium (2003) The International
Segregation analysis HapMap Project. Noture 426, 789-796. [A description of the
Badner JA, SieberWK, Garver KL & Chakravarti A (1990) A aims and methods of the project.]
genetic study of Hirschsprung disease. Am. J. Hum. Genet. 46, International HapMap Consortium (2005) A haplotype map of
568-580. [A good example of segregation an alysis applied to the human genome. Nature 437,1299-1320. [The primary
a non-Mendelian condition.] report of Phase I of the HapMap project; available for
McGuffin P & Huckle P (1990) Simulation of Mendelism revisited: download from http://www.hapmap.org]
the recessive gene for attending medical school. Am. J. International HapMap Co nsortium (2007) A second generation
Hum. Genet. 46, 994-999. [A warning about some pitfalls of human hapl o t y pe m ap of over 3.1 million SNPs. Nature 449,
segreg ation analysis.] 851-861 . (Report of Phase II.]
Chapter 16
KEY CONCEPTS
• Positional cloning has been the main route through which the genes underlying Mendelian
diseases have been identifi ed. In positional cloning, a gene is identified from its
chromosomal location alone.
Positional candidate genes are identified by drawing up a list of genes at that location, on
the basis of their function, expression pattern, and homologies.
• Model organisms can provide clues to gene identification because the structure and
function of many genes are conserved between humans and other organisms.
Mice are parti cularly useful for identifying disease genes and studying pathogenic
mechanisms because of the ease with which genes can be mapped and manipulated in
mice, and because of the relatively close simil arity of mice and humans.
• Patients with chromosome abnormalities can provide essential pointers to the location of a
disease gene. Large abno rmalities may be recognized by conventional cytogenetics, and
small ones by comparative genomic hybridization or using SNP arrays. These approaches
have been especially useful for severe dominan t conditions, in which most cases are new
mutations, making linkage analysis impossible.
Some disease genes have been identified by alternative, position-independent approaches.
These include functional studies of cells in culture, direct identification of altered proteins
or changed gene expression, and extrapolation from an imal models.
• Candidate genes are tested by seeking mutations in collections of unrelated affected cases.
Functional studies may be needed to confum an uncertain identification.
• Susceptibility factors for complex disease are mainly sought through association studies.
These may be targeted at genes whose function makes them likely candidates, or studies
may be conducted on a genomewide scale.
• Association studies identify haplotype blocks that are associated with susceptibility to a
disease. Identifying the actual causal variant is difficult because of linkage disequilibrium
between all the variants in the block. Both statistical and functional evidence are needed.
Association studies have identified susceptibility factors for many complex diseases, but for
most diseases the known facto rs only weakly influence susceptibility and collectively
acco unt for only a small part of the overall genetic determination, leaving a problem of
hidden heritability.
• The hidden heritability in complex diseases may consist ofthe cumulative effects ofvery
many extremely weak genetic factors, or ofthe effects of a highly heterogeneous collection
of individually rare mutations, or possibly of epigenetic changes.
498 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
Few subjects have moved as fast as human disease gene identification. Before
1980, very few human genes had been identified as disease loci. The few early
successes involved a handful of diseases with a known biochemical basis, for
which itwas possible to purify the gene product. [n the 1980s, advances in recom
binant DNA technology allowed a new purely genetic approach, sometimes given
the rather meaningless label reverse genetics. The number of disease genes identi
fied started to increase, but these early successes were hard-won, heroic efforts.
With the advent of polymerase chain reaction (PCR) techniques for linkage stud
ies and mutation screening, it all became much easier. Now that the human and
other genome proj ects have made available a vast range of resources, the ability
to identify a Mendelian disease gene depends almost entirely on having access to
suitable families.
The 1990s saw triumphant progress in identifying the causes of Mendelian
diseases, so that the genes underlying most common Mendelian diseases have
now been identified. The exceptions tend to be those such as nonsyndromic
mental retardation, in which the hunt for causative genes is bedeviled by irreduc
ible genetic heterogeneity or, alternatively, very rare conditions for which it is
difficult to find enough suitable families. However, identifying the factors confer
ring susceptibility to common complex diseases is much more challenging. As
described in Chap ter 15, only association studies have sufficient power to iden
tify most susceptibility factors; linkage studies are seldom productive. For years,
researchers were unable to make much progress. After many false dawns, the
new generation of large-scale genomewide association studi es, reported in
increasing numbe rs from 2005 onward, has finally broken the logjam. Many sus
ceptibility factors have now been mapped, but identifying the actual causal
sequence change remains very difficult.
[n this chapter, we first consider positional cloning, which has been the most
successful general approach to the identification of Mendelian disease genes.
Most commonly, the candidate chromosomal region is identified through link
age studies, but alternatives are to use chromosomal breakpo ints or animal mod
els. We next consider some other approaches: cloning based on knowledge of the
gene product, and identifying causal variants through genomewide association
studies. An important question is how one knows when one has succeeded. How
can we know that a candidate sequence variant really is the cause of the pheno
type? This connects with the discussion in Chapter 13 of pathogenic versus non
pathogenic sequence changes. Toward the end of the chapter is a series of case
studies that show how the various methods have been used to identify important
phenotypic determinants.
The approaches described are just as applicable to identifying determinants
of normal variations such as red hair or red-green color blindness as they are to
identifying disease genes. The determinants that we wo uld wish to identify may
not necessarily be genes, in the sense of protein- coding sequences. The primary
determinant might affect noncoding RNAs or alter chromatin structure, and it
might be located some distance from any protein-coding sequence. Whatever
the mechanism, however, we are looking for DNA seq uence variants that cause
phenotypic variation. Epigenetic changes that do not alter the DNA sequence
(such as changes in DNA methylation) also cause phenotypic variation, ulti
mately mediated by changes in the expression of protein-coding genes.
It is important not to be misled by phrases such as the gene for cystic fibrosis,
the gene for diabetes, and so on. You would not describe your domestic freezer as
a machine for ruining frozen food. Genes do a job in cells; if the job is not done,
or is done wrongly, the result may be a disease. However, many human genes
were first discove red through research on the diseases caused by mutations in
them, so terminology such as the cystic fibrosis gene does reflect historical, if not
biological, reality.
Ai 91 C1 D1 E1
mapped V candidate N collect gene tic
candidate --+ chromosomal --+ families --+ mapping: ~ successfully
located?
homolog? region? for mapping genome search
V )
v1 .L N
1 T
92 C2 D2 E2
thi nk again
check chromosoma l clone the
databases
( deletions or
) chromosoma l
about mode of
inheritance.
for genes translocations breakpoints
heterogeneity
A3
NI
cloned
candidate
homolog?
+1.
y
93
possible
~n!Jla.jjote
Refle
(I
I
D3
)
identify
new hUman
genes
(
E3
think again
about candidate
genes
B4 D4
N
has it ) work out
been fully full sequence
characterized? ( and structure
'~
N
I cs DS ES
o collect look pa thogenic
I unrelated ) for mutations ~ mutations
patients found?
v1
- A'
...,:191 IdOOtiJiOr~
WOI~
~m_
major efforts in screening cDNA libraries to identify and characterize the genes.
define the candidate region
By 1995, only about 50 inherited disease genes had been identified by this
approach.
The first step in positional cloning is to define the candidate obtain clones of all the DNA of the region
I •~ I :
01
1 2
II II
1 2 3 4 5 6 7 1 2 3 4
~~ m~u~
I ~~
I"~ O_
il
~~ m ~" m~o-
j 5 ! 2 012S105
31
8 3
2
~3 012S105
8 2 0125234
1 012S234
6 9 Dl2S129 2 3 4 3 0125129
2 5 0125354 7 ,1 7p5 2 5 0125354
5 51 2 5 2 8 012579 l!2I 6 ~i 8 3 8 012579
II I
012584
0125105
0125234
0125129
D125354
012579
FJgurt! 16.31 Defining the minimal candidate region by inspection of haplotypes. The two pedigrees show a dominantly inherited skin disorder,
Oarier-White disease (OMIM ' 24200), which had previously been mapped to chromosom e ' 2q. The 12q marker haplotype that segregates with the
disease is hig hlighted in yellow. Gray boxes mark inferred haplotypes in dead people. Numbers within the boxes identify different alleles of the relevant
marker. (A) The recombination in individual 116 in this pedigree shows that the disease must be located distal of marker 012584. 0 125105 (gree n) is
uninformative because II was evidently homozygous for allele 5 (II) and Ii] evidently inherited opposite haplotypes from 1\, but both have 0125105 allele
5). The recombination shown in 1111 suggests that the disease gene maps pro ximal to 0125 129, but this requires confirmation because the interpretation
requires the genotypes oflh and 112 to be inferred correctly, and that 1111 no t be a non-penetrant gene carrier. (B) The recombination in indi vidual 114 in
this pedig ree provides the confirmation. The combined data locate the Darier-Whitegene to the interval between 012584 and 01 25129. (Fro m Carter SA,
Bryce SO, Munro CS et al. (1994) Genomics 24, 378-382. With permission from Elsevie r.]
POSITIONAL CLONING 501
ScalI! 1OOkb l
r ------,----I
Chr6: 4300000O I 43050000 I 43100000 I 43150000 43200000 ) 43250000 43300000 I 43350000 I 43400000 I
Auembl~I1• • • • • • •II:==::J• • • • • • •===~e~m~~!]~bly~frp~m~F~"~$j~=~
n!'!:
RelSeq Genes
R;>L n.d . 1 PTCRA I-;tt GNMT " MEo\l 1 MRPL.HHI SRFtf.. csarll0e H SLC22A7 III
C60r1 22 ~ I CNPY3 HH PPP2RSO 1- -1 CUL7 111. CUUlIe-Hll-III-II SLC22A7 HI
PE Xt3 fI--f-I KLHOC3 1-11 KLC4 +I C60rfI CB H CRIP3 1
The third step Is to prioritize genes from the candidate region for
mutation testing
To identify mutations, samples from a collection of unrelated affected people
need to be sequenced. Typically, a candidate region defined by linkage analysis
would contain several genes with 100 or more exons in total. One approach, made
increasingly possible by new sequencing technologies. is simply to sequence
every exon across the candidate region. More usually, however, a candidate gene
is selected, and each exon of that gene is individually amplified by PCR and
sequenced. In choosing genes for sequencing it makes sense to start with the
most promising ones, although ease of analysis is also a consideration. All things
being equal, one would deal with a gene with 4 exons before tackling one
with 65.
Appropriate expression
A good candidate gene should have an expression pattern consistent with the
disease phenotype. Expression need not be restricted to the affected tissue,
because there are many examples of widely expressed genes causing a tissue
specific disease, but the candidate gene should at least be expressed in the place
where the pathology is seen, and at or before the time at which the patholOgy
becomes evident. For example, genes responsible for neural tube defects should
be expressed in the neural tu be shortly before it closes (which happens during
the third and fourth weeks of human embryonic development; see Chapter 5).
The expression of candidate genes can be tested by RT-PCR, northern blotting
(see Chapter 7), or serial analysis of gene expression (SAGE; see Box 11.1). Much
of the preliminary work can be performed with databases (dbEST or the SAGE
database, both accessible through the NCBI homepage at hrtp:llwww.ncbi.nlm
.nih.gov!) rather than in the laboratory. In situ hybridization against mRNA in
tissue sections, or immunohistochemistry with labeled antibodies, provides the
502 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
most detailed picture of expression patterns. Studies are usually done on mouse
tissues, especially for embryonic stages. The common assumption that humans
and mice will show similar expression patterns is not always justified, and cen
tralized resources of staged human embryo sections have been established to
allow the equivalent analyses to be performed, where necessary, on human
embryos (see http://www.hdbr.orgl).
Appropriate function
When the function of a gene in the candidate region is known, it may be obvious
whether or not it is a good candidate for the disease. For example, rhodopsin was
a good candidate for retinitis pigmentosum, and fibrillin for the connective tis
sue disorder Marfan syndrome (OMIM 154700). For a novel gene, sequence anal
ysis will often provide clues to its function through the recognition of common
sequence motifs such as transmembrane domains or tyTosine kinase motifs.
These may be sufficient to prioritize a gene as a candidate, given the pathology of
the disease. For example, ion transport in the inner ear is known to be critical for
hearing, so an ion channel gene would be a natural candidate in positional clon
ing of a deafness gene.
Homologies and functional relationships
Sometimes a gene in the candidate region turns out to be a close homolog of a
known gene (a paralog in humans, or an ortholog in other species). If mutations
in the homologous gene cause a related phenotype, the new gene becomes a
compelling candidate. For example, Marfan syndrome is caused by mutations in
the fibrillin gene. A clinically overlapping condition, congenital contractural
arachnodactyly (CCA; OMIM 121050), was mapped to chromosome 5q. The can
didate region contained the FBN2 gene, which is a paralog of fibrillin. FBN2
mutations were soon demonstrated in CCA patients. Weak homologies may be
missed by the programs used for automatic genome annotation. Amore directed,
hypothesis-driven search may come up with significant weak homologies that
can point to the gene's function.
Candidate genes may also be suggested on the basis of a close functional rela
tionship to a gene known to be involved in a similar disease. The genes could be
related by encoding a receptor and its ligand, or other interacting components in
the same metabolic or developmental pathway. For example, mutations in the
RET gene were known to cause some but not all cases of Hirschsprung disease.
RETencodes a cell-surface tyrosine kinase receptor. The genes encoding the RET
ligands GDNF and NRTN were then obvious candidates for hunting for further
Hirschsprung mutations.
Now that we have genome sequences of many different species, ithas become
clear how far structural and functional homologies extend across even very dis
tantly related species. The great majority of mouse genes have an identifiable
human counterpart, and the same is true of other mammalian species. As
described in Chapter 10, extensive homologies can be detected between genes in
humans and model organisms such as zebrafish, the fruit fly Drosophila, the
nematode worm Caenorhabditis elegans and even the yeasts Saccharomyces
cerevisiae and Schizosaccharomyces pombe. For these distantly related organ
isms, homology may be more evident in the amino acid sequence of the protein
product than in the DNA sequence of the gene. Looking at the amino acid level
allows synonymous coding sequence changes to be ignored.
A very powerful means of prioritizing candidates from among a set of human
genes is therefore to see what is known about homologous genes in well-studied
model organisms. Such data might include the pattern of expression and the
phenotype of mutants. Papers listed in Further Reading show how data from
Drosophila and yeast can be used to infer the function of human genes. Mice are
especially useful for such investigations, and their use is considered in more
detail below. Even more than gene sequences, pathways are often highly con
served, so that knowledge of a developmental or control pathway in Drosophila
or yeast can be used to predict the likely working of human pathways-although
mammals often have several parallel paths corresponding to a single path in
lower organisms.
By contrast, mutant phenotypes are less likely to correspond closely. A strik
ing example is the wingless apterous mutant of Drosophila. Ahuman gene, LHX2,
POSITIONAL CLONING 503
Several methods are available for easy and rapid mapping of Recombinant inbred strains
phenotypes or DNA clones in mice. Together with the ability to These are obtained by systematic inbreeding of the progeny of a cross..
construct transgenic mice, they make the mouse especialty useful for For example, the widely used BXD strains are a set of26 lines derived
comparisons with humans.
by more than 60 generation s of inbreeding from the progeny of a
Interspecific crosses (Mus musculus crossed with either Mus sprerus or
CS7BU6J x DBA/lJ cross. They provide unlimited sup plies of a panel of
Mus castaneus) chromosomes with fixed recombination points. DNA (s availableas a
Different mouse species have different alleles at many polymorphic public resource, analogous to the human CEPH families. Recombinant
loci, making it easy to recogniz.e the origin of a marker allele. Thi s is inbred strains are particularly suited to mapping quantitative traits,
which can be dermed in each parent strain and averaged over a
exploited in two ways:
Constructing marker framework maps. Several laboratories numberof animals of each recombinant type. In comparison with mice
have generated large sets of F2 backcrossed mice. Any marker from interspecific crosses, it may be harder to find a marker in a given
or cloned gene can be assigned rapidly to a small chromosomal region that distinguishes the two original strains, and the resolution is
segment defined by two recombination breakpoints in the lower because of the smaller numbers.
collection of backcrossed mice. For example, the Collaborative Congenic strains
European backcross was produced from a M. spretus x These are identical except at a specific locus. They are produced
M. musculus (C57Bl) cross. Five hundred F2 mice were produced by repeated backcross ing and can be used to explore the effect of
by backcrossing with M. spretus, and 500 by backcrossing with changing just one genetic factor on a constant background.
C57Bl. All microsatellites in the framework map are scored in
every mouse.
Mapping a new phenotype. A cross must be set up spedfically
to do th is but, unlike with humans, any number of F2 mice can be
bred to map to the desired resolution. M. musculus x M. casraneus
cros ses are easier to breed than M. musculus x M. spretus.
is able to complement the deficient function of the mutant, so that the flies grow
normal wings (see Figure 10.27). We humans must have a virtually identical
developmental pathway to Drosophila, but clearly we use it for a different pur
pose. No human mutation has been described, but Lhx2 knockout mice die
before birth with defects in development of the eyes, forebrain, and erythrocyt es.
Branchio-oto-renal syndrome (OMIM 1l3650), the subject of case study 3 at the
end of this chapter. provides another example of a conserved pathway used for a
different purpose.
.1
chromosomes in mouse
.3·2 .a
0 4 (] 7 8 10 . 13 016.19
• 5 . 11 8 1" .17 0X
. 6 !J 9 . :l.2 . ~ 018
triangles mark the human ce ntromeres.
(From Church DM, Goodstadt l, Hillier
l Wet ,I. (2009) PloS 8;0/.7(5), e 1000112 .
1 dOi:l0.1371/journal.pbio.lOOO112.]
504 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
Clinicians have made major co ntrib utions to identifying disei:'!se Additional mental retardation
genes by identifying patients who have causative ch romosome A patient may have a typical Mendelian disease but may in addition
ab normalities. be mental ly reta rded. This may be coincidental, but such cases ca n be
A cytogenetic abnormality in a patient with the standard caused by deletions that elimi nate the disease gene plus additional
neighboring genes. Large chromosomal deletions almost always
clinical presentation
cause severe mental retardat ion, re flecting the involvement of a
If a disease gene has already been mapped to a certain chromosomal
high proportion of our genes in fe tal brain development. When the
location, and then a patient with that disease is found who has
a chromosome abnormality affecting that same location, the
patient is also a de nCNO case of a condition that is normally familial.
chromosome abnorma lity most probably caused the disease. cytogenetic and molecular analysis Is warranted.
tr12 t
-
ASH
probe
A
. G
Hybridization
pattern
8, der(8 )
8, der(16)
_.---------<
This result would normally be confirmed by
using clones from chromosome 16.
, 3 B 8, der(8)
r--,----
4 F 8, der(16)
I-c--
5 E 8, der(16)
6 D 8, det(S). der (16)
8 der(8) det( 16) 16 -
pie by splicing together exons of two genes to create a novel chimeric gene-this
is rare in inherited disease but common in tumorigenesis (see Chapter 17). in
either case, the breakpoint provides a valuable clue to the exact physical location
ofthe disease gene. The position of the breakpoint is most easily defined by using
FISH (Figure l6.6). This will often be sufficient to identify the gene involved. if
necessary, once the breakpoint has been localized to a single clone, its exact posi
tion could be defined by finding a sequence that can be amplified by PCR from
DNA of the patient by using a pair of primers, one from each of the two chromo
somes involved. An alternative approach would be to make a genomic library
from the patient's DNA and then sequence appropriate clones. Breakpoints of
deletions or duplications can be easily mapped on microanays by looking for a
change in the hybridization intensity.
An example of the power of this approach is the identification of the Sotos
syndrome gene (Figure 16.7)_This syndrome (OMIM 117550) involves childhood
overgrowth, dysmorphic features, and mental retardation. Because it always
occurs de 110VO, there were no multigeneration families that might allow the caus
ative gene to be localized by linkage analysis. A single patient with a de novo bal
anced 5;8 translocation, 46,XX,t(5;8)(q35;q24.1), provided the vital clue. As shown
in Figure 16.7, the translocation breakpoint disrupted the NSDl gene. This could
have been purely coincidental. Proof that NSDl was the gene mutated in SolOS
syndrome came from showing that 4 out of 38 independent patients with
SOlOS syndrome had point mutations in the NSDl gene, and 20 out of 30 patients
had microdeletions involving this gene (l'lgure 16.8).
-
c2S === C4D were then demonstrated in unrelated
(C) , CTC-549a4 (AC0085 70)
genome sequence ! CTC·286c20 (AC027314)
patients with Sotos syndrome (nor shown).
[From Kurotaki N, Imaizumi K, Harada N
JAZ r(jrl'<4 NSOj
gene
.... ~
12 3
et al. (2002) Nat. Genet 30, 365-366. With
permission from Macmillan Publishers Ltd.]
506 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
-+ part of autosome
inactivated Figure 16.9 Non-random X-inactivation
occurs in female Duchenne muscular
one active
dystrophy patients with Xp21-autosome
OMO dys~rophin gene
trans locations. The translocation between
random
inactive the X chromosome (X) and an autosome
X·lnactivalion (A) is ba lanced, but the X chromosome
Ia~
breakpoint disrupts the dystrophin gene
000 act;'e X (red). X-inactivation is random, but cells that
two actIVe inactivate the translocated X die because
~ autosomes ~ OMO? of lethal genetic imbalance. The embryo
X X/A
A/X "
develops entirely from cells in whic h the
no active normal X is Inactivated, leading to a woman
carrier of a balanced
dystcoph;o geoe
reciprocal X-aulosome with no functional dystrophin gene. The
translocation resulting failure to produce any dystrophin
inactive causes DMD.
THE VALUE OF PATIENTS WITH CHROMOSOMAL ABNORMALITIES 507
.-•
(AI cells from conlrol cells from patient (BI
2.0 lr - - - - -- --
1.01---1- - - - - - -
J, extract DNA
! 0 1- ~;*~;i;i!·"" "-"'''",;;.'i.;<M,·
....~ '. '.
~'1: \. .' ,,_.:r-'l' --'-.....,...,' ~~
'~"''j~rtY-~~'~
~
•
. ~
• - 1.0 1 i!:i i
v
1 !
. -
la bel with two different -2.0 I chromosome 15
flUOfOchromes
o =
-
~
•
______ e . ~
.!l_ • ~
- 0"
C'
f
, I from Read A & Donna; D (2006) New Clini cal Genetics. With
<:> .., 0 permiSSion from Scion Publishing Ltd.]
508 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
intensity. The SNP genotypes can also be informative: SNPs from within a deleted
region will show only a single allele (apparent homozygosity), and SNPs from a
duplicated region may show three alleles.
Whichever technique is used, it is important to confirm the results by FISH.
Balanced abnormalities would not be detected, and the results for a duplication
or amplification would not distinguish tandem repeats from dispersed repeats.
The main limitation of these techniques is in the interpretation of the results. As
mentioned in Chapter 13, large-scale application of these methods has identified
innumerable copy number variations in normal healthy people. Just because a
micro deletion is seen in a patient with a clinical phenotype, it does not follow
that the deletion caused the phenotype. Databases of normal variants. such as
TCAG (see Further Reading), can be consulted to eliminate known nonpatho
genic varian ts. but there always remains the possibility that the patient carries a
novel but harmless variant. Thus. as always with chromosomal abnormalities,
variants detected by CGH suggest hypotheses; these must then be checked by
other means.
studying gene expression in cells from that patient, or by finding unrelated chro
mosomally normal cases with point mutations in the relevant gene.
deaf; vestibular -----+ use mouse crosses to map sh2 to a -----+ isolate SAC clones -----+
malfunction causes 1 cM region of mouse chromosome 11 covering the candidate
circ ling and head- region from the wild-type mice
tossing this corresponds to 17p11.2 In humans,
the candidate region for the DFNB3 inject into sh2/sh2
deafness gene fertili zed eggs to create
transgenic mice
1. isolate human
MY015 gene, using
D.IB92F primers designed p.,C674 V sequence SAC 425 p24
p.N800Y from the mouse
p.Kl300.x myo15 mutalion + computer analysis
2. use somatic cell IdentifIed in sh2 identifies a novel myosin
MY015 mutations hybnds to check that mouse gene, myo15
identified In three it maps to 17pl1.2
unrelated OFNB3
patients
• The Fanconianemia (see OMIM 227650) groups A, C, and G genes (see Chapter Figwe 16.12 Functional complementation
13, p. 4(6) were identified by screening cDNA libraries for clones that could In transgenic mice as a tool for identifying
cure the DNA repair defect in cells from patients. a human disease gene. In mice, the shaker-2
mutation causes deafness. This mutation was
• The second of the two ways in which the gene responsible for multiple sulfa identified by finding a wild-type clone that
tase deficiency was identified (see case study 4 at the end of this chapter) corre<:ted the defect. Members of a human
involved finding a chromosome fragment that was able to correct the defect family with a si milar phenotype that mapped
in a cell line from a patient. to the corresponding chromosomal location
proved to have mutations in the orthologous
• A similar functional complementation approach, but in this case in a whole
gene, MYOI5.
animal, was used to identify the mouse shaker-2 and human DFNB3 deafness
genes (Figure 16.12). In fact, this approach did use some positional informa
tion, but it is described here because it is another example of functional com
plementation. A form of autosomal recessive hearing loss was mapped to a
unique locus, DFNB3, in a very large extended Indonesian kindred. Because
all the patients were likely to be homozygous for the same single mutation, it
was unlikely that the pathogenic change could be lU1ambiguously identified
by sequencing DNA from the candidate region in family members.
Comparative mapping showed that DFNB3 and shaker-2 were likely to be
orthologs. Transgenic shaker-2 mice were constructed that contained wild
type BAC clones from the shaker-2candidate region. A BAC that corrected the
phenotype was identified and turned out to contain a n unconventional
myosin gene, myo15. The human MY015 gene was then isolated on the basis
of its close homology with the mouse gene; its position within the DFNB3
candidate region was confirmed, and the Indonesian patients were then
shown to have a mutation in MY015.
When a disease affects a biochemical pathway, each component of the path
way is a non-positional candidate for the disease. The components may be iden
tified through protein or RNA studies.
• Protein-protein interactions can be identified through yeast two-hybrid
experiments or by using mass spectrometry to identify the components of a
multiprotein complex. For example, BRIP 1 and PALB2 (see case study 7 at the
end ofthis chapter) were identified as susceptibility factors for familial breast
cancer because the proteins they encode interact with the proteins encoded
by the known breast cancer susceptibility genes BRCA1 and BRCA2.
POSITION-INDEPENDENT ROUTES TO IDENTIFYING DISEASE GENES 511
• Microarrays are often used to compare mRNA profiles from patients and con
trols, or from cells before and after the expression of some protein is knocked
down by RNAi (RNA interference; see Box 9.6, p. 284 and Chapter 12). This
produces a list of genes whose expression is altered-but the list is usually
very long. It is then necessary somehow to distinguish primary effects from all
the downstream changes.
Waardenburg- Hirschsprung
1
no famHes
Dominant mB$Jco/Qr) mouse:
a possible orthofog
Waardenbufg-Hir5ch5prung di5ea5e.
Individuals affected by th is condition
(OMIM 277580) have pigmentary anomalies,
suitable for
hearing loss, Hirschsprung disease, and
linkaee
L so metimes other neurological problems.
1
posiUOnal
Dom loOus mapPed lao
mouse ellrOO1oscrne .15
The condition is very rare, and affected
people seldom reproduce, so it was not
possible to identify the gene responsible
clori l~no t
possible 1 by first mapping it in families and then
testing positional candidates.The Dominant
1
human SOXI0 Dorn gene idMtifiM megacolon (Dom ) mouse has a similar
cloned as SoJt1D combination of pigmentary anomalies and
lack of enteric ganglia. Breeding experiments
in mice allowed the ca usative gene (Soxl0)
to be identiAed. Th e human homolog was
SOXI0 mutations found In then cloned, and DNA from patients was
some human patients
tested directly for mutations.
512 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
In so me cases, the desired mutations may be hard to find. Apart from the
practical problem of screening a large gene with many exons, some mutations
are not readily discoverable by PCR testing of genomic DNA. For example, one
frequent mutation of the cystic fibrosis tra nsmembrane receptor gene (CFTR)
that causes cystic fibrosis, c.3849+lOkb C>T, is a single nucleotide change deep
within an intron that activates a cryptic splice site. It would be missed by the
standard exon-by-exon peR and sequencing protocols. Initial detection of this
mutation required RT-peR, although once it had been identified a peR assay
could easily be designed to detect further cases by using genomic DNA.
10 another example, most mutations that cause severe hemophilia A
(OMIM 306700) could not be detected even afrer sequencing all 26 exons of the
F8A gene. It turned out that a recurrent inversion disrupted the gene. The
intragenic breakpoint was in a32 kb intran, and so each exon of the gene remained
intact and appeared normal on peR or sequencing (Figure 16.14). The inversion
was detected by Southern blotting. It is caused by non-allelic homologous recom
bination between a repeat located in intron 22 of the F8A gene and copies of the
repeat located 360 kb and 435 kb away. Once the relevant sequences were known,
a long PCR test could be designed to detect it.
the central nervous system, many more or less exclusively. All of them are candi
date genes for mental retardation. Although mental retardation is an extreme
example, almost all observable phenotypes depend on the action of more than
one gene, and so locus heterogeneity is a very common and unsurprising obser
vation. Indeed, at first sight it seems surprising that any condition should not be
grossly heterogeneous. Locus homogeneity probably results in part from the way
in which most conditions are defined by combinations of symptoms. For exam
ple, each individual feature of cystic fibrosis can have a variety of causes, but the
combination is specific to people who have no functional copy ofthe CFTR gene.
Additionally, we are good at seeing very subtle differences between people and
labeling them accordingly. If mutations in two different genes cause very similar
but subtly different phenotypes, we would probably label them differently in
humans but not in mice, flies, or worms.
region is usually too large to search for mutations. However, starting in about
2005, genomewide association studies in many different complex diseases have
identified SNPs that are reproducibly associated with susceptibility. As described
in Chapter 15, these studies typically genotype 500,000 or more tagging SNPs (as
defined by the HapMap project) and copy-number variants in very large case
control designs. For example, the Wellcome Trust Case-Control Consortium
(wrCCC) genotyped 500,568 SNPs in 14,000 patients with seven diseases and
3000 controls.
disease-common varia.nt hypothesis (see Chapter 15). Because such va riants are
necessarily ancient, they cannot have been subject to strong negative selection.
Thus, they are likely to have fairly subt le effects on gene expression or function,
!
a large panel of the cases and controls will reveal all variants in the study sample.
These will include copy-number variants and polymorphisms with minor allele
frequencies below the 5% threshold normally used in association studies.
Next, each individual variant will be tested for association with the disease,
ideally in an independ ently ascertained set of patients and controls. Haplotype
blocks, such as those in the HapMap data, are defined by an arbitrary cutoff of
linkage disequilibrium that is always less than 100%, so that the degree of asso
ciation is not necessarily identical for every variant within a block. Moreover,
most variants are two-allele SNPs, but within the population there will usually be
more than two alternative haplotype blocks at any location. Thus, particular va ri
ants will probably be present on more than one ofthe alternative blocks. If a vari
ant that is present on two different blocks is causative, it will show a much stronger
association with the disease than variants that are present on only one of the
blocks. Conversely, a nonfunctional variant that is present on one block that also
carries the function al variant but is also present on another block that does not
do so will show a weaker association with susceptibility (FIgure 16.16 shows this
in a simplified example). If the investigators are lucky, testing every variant for
association will show one peak of especially strong association for some small
cluster of SNPs, against a background of the general association of all the SNP
alleles that are found on the rele vant haplotype block.
Several different variants in a region may each contribute, independently or
in combination, to susceptibility. Sorting out the causal relationships is extremely
difficult when all the variants are in linkage disequilibrium with each other. The
main tool is logistic regression: a principal variant is selected, and the effect of
other variants is studied, conditioned on the effects of the principal variant. If
this procedure still shows an effect, the second variant does indeed make an
independent contribution. An additional problem is that the main determinants
of susceptibility may be different in different populations. Apart from differences
due to interaction of genetic factors with different environments, associations
identify shared ancestral chromosome segments, which may be different in dif
ferent populations. Readers interested in a more detailed discussion orthe statis
tical methods used In association studies should consult the review by Balding
(see Further Reading).
When a candidate causal variant has been identified, some sort of functional
test must be used to confirm that it has a biological effect. Missense changes in
coding sequences are the easiest to investigate, through direct tests of protein
function . The balance of splice isoforms can be checked for variants located
within gene sequences, and variants in promoters or enhancers can be studie d
with the use of appropriate expression systems. But laboratory studies seldom
capture the full function of a complex gene in all cells and all tissues, and under
all conditions. In a systemat.ic study of gene knockouts in the budding yeast
Saccharomyces cerevisiae, only a minority of all knockouts had any detectable
effect. It is highly unlikely that most yeast genes really have no function-what
the result tells us is that standard laboratory biochemical tests are not able to
assess the full function of a gene, even in a simple organism. In a similar study in
the mouse, 96% of all knockouts did show some ab normality, but the abnormali
ties co uld be subtle, often behavioral. If this is true of knockouts, it is likely to be
A B C D
4 haplotypes, 3 SNPs J I J J
Flgur. 16. 16 SNP associations with
frequency in population 0.1 0.2 0.3 0 .' disease susceptibility. In this imaginary
population there are four alternative
(15k for someone haplotypes (A-D) at a certain location. Three
2x 2x lx lx
with this haplotype different SNPs are shown as three colored
bars; the alternative allele in each case is
relative likelihood that a 0.2 0.3 blank. The red SNP is a true susceptibility
0.' 0 .'
case has this haplotype
factor for the disease under study, doubling
probability that a case the risk. The blue and green SNPs have no
D.2/ T a. 4f T O.3fT O.4fT
has thiS haplotype =0.23 causal effect. When each SNP is tested for
(T =0.2 + 0.4 + 0.3 + 0.4 = 1.3) =0.15 =0 .31 = 0.31
association with the disease, all will show
an association because each is present on
pr'oportlon of cases haplotype A, but it will be strongest for the
having a given SNP I 0.'6 • 0.1 5 I 0 .38 true causal SNP.
518 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
doubly so for the more subtle functional changes that are suspected of causing
susceptibility to complex diseases.
In a few cases, such investigations have convincingly explained the associa
tion of a variant with disease susceptibility or resistance. A likely example comes
from the genomewide association study of Crohn disease described in the case
studies. The p.T300A variant in the ATG16Ll gene is an amino acid change in a
protein that has convincing functional links to the disease, and variation at this
amino acid accounts for the whole ofthe observed association.
Functional analysis of SNPs in sequences with no known function
is particularly difficult
Often the candidate variant will be in intronic or intergenic DNA that has no
known function. This makes it extremely difficult to produce the sort of func
tional data that would confirm a variant as being truly causal. In the current state
of knowledge there are many variants that have statistical but not functional sup
portfor a role in disease susceptibility. How many ofthese will eventually be fully
confirmed remains to be seen. The two cases below illustrate this frustrating
situation.
Calpain-l0 and type 2 diabetes
A pioneering study of type 2 diabetes (T2D) by Horikawa and colleagues in 2000
identified one susceptibility factor as heterozygosity for a haplotype of three
intronic SNPs within the calpain-l0 (CAPNlO) gene on chromosome 2q37.
Linkage analysis in Mexican-Americans had identified a 7 cM candidate region at
2q37. Physically, this corresponded to 1.7 Mb of DNA (recombination rates tend
to be higher near telomeres, so there was a 7% chance of a crossover occurring in
this 1.7 Mb region). Polymorphisms from the region were tested, not just for asso
ciation with T2D but also for association with the evidence for linkage. The ratio
nale was that only a subset of cases was linked to the 2q37 locus, but these should
be the ones carrying the susceptibility determinant. Initial analyses suggested a
66 kb target region. This was sequenced in a panel of io Mexican diabetics and it
turned out to contain three genes-CAPNlO, RNPEPLl, and GPR35-and 179
sequence variants. Eventually a variant, UCSNP-43, was identified in which the
homozygous GIG genotype showed association with the evidence for linkage
and also probably with diabetes (odds ratio 1.54, confidence interval 0.88-2.41).
This SNP lies deep within an intron of the CAPNIO gene, and the G allele has a
frequency of 0.75 in unaffected controls. A search was then initiated for haplo
types that were (a) increased in frequency in the patient groups with greatest evi
dence of 2q37 linkage, (b) shared by affected sib pairs more often than expected,
and (c) associated with an increased risk of diabetes. A heterozygous combina
tion of two haplotypes (defined byUCSNP-43 and two other SNPs within introns
of the CAPNlO gene) fulfilled all the criteria. Neither haplotype in homozygous
form was a risk factor. CAPNIOwas an entirely unexpected gene to be involved in
T2D. Later studies have suggested some reasons why this gene might be relevant,
but nine years after the original publication, the role of the SNPs, and indeed
their validity as susceptibility factors, remains unclear.
Chromosome 8q24 and susceptibility to prostate cancer
Several studies of susceptibility to prostate cancer have implicated a region on
chromosome 8q24. In 2007, three groups independently reported large-scale
association and resequencing studies across the candidate region (see Further
Reading). Very strong associations were seen for several SNPs-but these came
from three separate regions of 8q24, spread across 600 kb and not in linkage dis
equilibrium with each other (Figure 16, 17) . SNPs in the three regions seem to be
independent risk factors that have a multiplicative overall effect. The three stud
ies are mutually confirmatory-yet none of the SNPs lies near a gene and none
has any obvious functional effect. The MYC oncogene lies only 260 kb telomeric
of region 1. Clearly an effect on MYC expression would be a very attractive
explanation of the data, but none of the groups was able to provide any data to
support this.
The case studies of breast cancer and Crohn disease below illustrate success
ful genomewide association studies. In each case, novel susceptibility loci were
EIGHT EXAMPLES OF DISEASE GENE IDENTIFICATION 519
),
-.
~
" . " "
p..... .
I .f'.
"
~ •• ...,
•
-" "
128.10 128.20 128.30 128.40 128.50 128.60 128.70
posilion on 8024 (Mb)
identified, and for most ofthese new loci plausible candidate genes could be sug
gested. The odds ratios for the susceptibility alleles were low, mostly in the range
1.1-1.5 (an odds ratio of 1 means thatthe factor has no effect on risk). The studies
demonstrate that it is possible to identify weak susceptibility factors while leav
ing open the question of whether doing so is wo rth the trouble.
was autosomal recessive, and families were generally small in the countrjes where
it was prevalent. Thus, mapping relied on combining large numbers of affected
sib pairs and hoping that there was no locus heterogeneity. Frustratingly, the first
linkage had been to the locus encoding the enzyme paraoxonase--whose chro
mosomal location was unknown. This was rectified whe n paraoxonase, and
hence CF, were localized to chromosome 7q31.
Unlike with Duchenne muscular dystrophy, there were no patients with CF
who had appropriately located chromosomal breakpoints ro speed the discovery
process. Instead, years of grind ing effort were needed ro clone as much as possi
ble of the DNA from the candidate region. This became an intensely competitive
race between three major groups. The winners, a US-Canadian group led by Lap
Chee Tsui, used 12 different genomic libraries, including one made from a
human-hamster hybrid celJ that contained only human chromosome 7. A major
problem was to build up a clone contig across the candidate region. Chromosome
jwnping was used to bypass the size limitation of cosmids, which have a maxi
mum cloning capacity of 45 kb. This was an early version of the pall-ed-end
mapping technique that is now used to scan the human genome for struct
ural variants (see Figure 13.6). In this early manifestation, DNA fragments of
80-130 kb from a partial Mbal digest of genomic DNA were recovered from pre
parative pulsed-field gels. The fragments were mixed in very dilute solution with
an excess of a short marker sequence that had compatible sticky ends; they were
then exposed to DNA ligase. The hoped-for ligation products were circles of
80-130 kb of genomic DNA with the marker fragment located at the position
where the ends joined. Circles were fragmented with a restriction enzyme and
the fragments were cloned. The desired clones contained the marker and a known
7q31 sequence, joined to an unknown sequence, which was hoped to come
from a position in 7q31 that was 80- 130 kb away from the known sequence
(FIgure 16. 19).
genomic DNA
~
~
~
~
--1
contain a known sequence ( _ ) at one end,
-
0°
-0--0 circles With one or more copies
of marker sequence at junction plus
concatamers of marker sequence
, I
digest into fragments of -10 kb, Mbol digest of genomic DNA, isolated
c lone Into A phage,
from preparative pulsed·field gels.These
se lect phag-e containing both
_ and were exposed to DNA ligase in very dilute
SOlution in the presence of a large excess
of a short marker sequence (SupF) that
phage had compatible sticky ends. Some of the
restriction fragments formed circles with the
two original ends joined by a SupF sequence.
KEY: _ '" known sequence
The circles were cut and fragments isolated
(sta rting point for Jump)
that contained a known sequence from 7q31
linked by a SupF sequence to a novel 7q3 1
- '" unknown sequence
ca . 100 kb aW(yf from sequence.The latter represented new loci
(target for jump) 80-130 kb away from the original sequence.
522 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
A restriction map was constructed across the region and the available genomic
clones were hybridized to zoo blots to fi nd genomic sequences that were con
served across species. Such sequences were putative gene sequences. Segments
that showed conservation across species were used to screen cDNA libraries
made fro m tissues that were affected in CF. In add ition, a search was made for the
CpG islands that often mark the 5' e nds of genes, and clones were seque nced to
look for open reading fram es. Eventually, and with great difficulty, a 6.5 kb cDNA
clone was isolated that led to the characterization of the CF transmembrane
receptor (CFIR ) gene. The o riginal 1989 paper is weIJ worth reading to get a feel
for just how heroic these early efforts were.
! sulfatase·modifylng factor 1
nonfunctional.
o
II
- C-CH-NH
I
HC =O
formylgly<:ine
mass spectrometry. This allowed them to identify a partial amino acid sequence.
Searching genetic databases for sequences that could encode their peptides
identified a single human cDNA. This corresponded to a gene they named SUMFl
on chromosome 3p26. They then sequenced the SUMFl gene in a series of
patients with MSD and demonstrated mutations (Figure 16.21 ).
Meanwhile, Andrea Ballabio's group in Naples took a different approach.
Ballabio had long been interested in sulfatases as a result of early positional cIon·
ing work in which he had identified four sulfatase genes on the short arm of the X
chromosome. He used a purely genetic approach, micro cell-mediated chromo·
some transfer, to identify the MSD gene. This technique had had an important
role in physical mapping during the earlier phases of the Human Genome Proj ect.
The starting material is a set of human-mouse somatic cell hybrids each contain·
ing a single human chromosome tagged with a dominant selectable marker (see
Chapter 6, p. 182). Panels of such cells have been established as publicly available
resources (US National Institute of General Medical Sciences Human/Rodent
Somatic Cell Hybrid Mapping Panel #2, http://ccr.corieILorg/Sections/
Collections/NIGMS/Map02.aspx?Pgld=496).
The monochromosomal hybrid cells were subjected to prolonged mitotic
arrest by continued exposure to cytochalasin B, an inhibitor of mitotic spindle
formation. This causes the chromosome content to become partitioned ioto dis
crete sub nuclear packets (micronuclei). The micronuclei can be physically iso e:o::tract of bovine testis
lated from the cells by centrifugation in the presence of cytochalasin B, resulting
Chromatographic
in the formation of microcells, which are particles consisting of a single micron u separation,
cleus and a thin rim of cytoplasm, surrounded by an intact plasma membrane. guided by
Microcells, each containing a single known human chromosome, were fused functional assay
with cells from a patient with MSD, which were then assayed to see whether they
had recovered any sulfatase activity. These difficult experiments showed that highly enriched enzyme preparation
chromosome 3 could restore the activity.
To narrow down the candidate region, thechromosome3 monochromosomal
! mass spectrometry
hybrid cells were irradiated before micro cells were made. The radiation frag
partial peptide sequence
mented the chromosomes, so that in the subsequent microcell-mediated chro
1
mosome transfer, only random small chromosome fragments were transferred. database searches for
Fusion cells that had recovered sulfatase activity were tested for the presence of cDNA that could encode
the se peptides
donor- specific alleles of micro satellite markers from chromosome 3. By this
method, the candida te region was defined as a 2.6 Mb region of3p26. This con candidate eDNA
tained seven genes, all of which were sequenced in a panel of patients. Mutations
in one gene were found in every patient (Figure 16.22).
The two competing groups had identified the same gene by very different
routes. Their results were published back to back in Cell in 2003 (see Further
1 design peR pnmers : test
for mutations in patients
have high levels of persistence, whereas persistence is rare in those that do not. panel of mouse--/luman hybrid cells, each
For example, whereas the majority of African adults are lactose-intolerant, the containing the whole mouse genome plus
one specifIC human chromosome
Beja, Tutsi. and Fulani, who are traditionally pastoralist, are mostly tolerant.
Biochemical and genetic evidence suggested that the main genetic determi treat with the mitotIC
nant of persistence was linked to the lactase (LCD gene on 2q21, but no relevant inhibitor cytochaiaslO 8 .
change could be found in the coding exons or promoter. Eventually a single CIT
SNP in intron 13 of the unrelated but neighboring gene MCM6was shown to be
perfectly associated with persistence in Northern Europeans. Functional assays
confirmed that this polymorphism, -1 3910 CIT, affected transcription of the
lactase gene. In Africans, however, no such association could be demonstrated.
1
then select those
mlcrocells that conLaln
a hUman chromosome
The -13910 T allele that determines persistence in Northern Europeans is rare Or fu se each microcell in
absent in lactose-tolerant Africans. Instead, other SNPs in the same intron 13 of turn with cells from
the MCM6 gene are associated with lactase persistence and have been shown to patients With mu ltiple
sulfatase defiCiency.
affect levels of transcription of the LCT gene. It is interesting to speculate that, look for one that
had these effects been found in a genomewide association study, and if we were corrects the
biochemical defect
unaware of the function of the LCT gene, MCM6 might have been the gene for
lactase persistence. human chromosome 3 corrects the defect
Lactose intolerance is the ancestral state, in man and other mammals.
Tolerance has been strongly selected for in pastoralist populations because of the repeat the whole
same result. The haplotypes that carry the persistence alleles are very much
a 2.6 Mb fragment from human 3p26
longer than those associated with non persistence (Figure 16.23). In Eurasians, corrects the defect
1 identify genes in
lhis regIon
".. ""r'"
0.3 1•• ~lJ . ...-~~.
0
-0.3 ~.:!...~I
,/
, U 13 15 111921X II
-0.2
chromosome poSition
58 Mb 59 Mb 60 Mb 61 Mb 62 Mb 63 Mb
(8) cen--f ' , ,
t 1~t,--- rox
~ lel
~ M3C3'l3~5 ~ 010 f- -7 .-~ -----7 f -
()I<Ft:"'1. f~2 AC8.119W/ ' ; , /iSM.O.' coe Iljll,ll r;>C.o1 I M3C34348 ~-
l'l'F'l~l ~ .J
~ ~ ~ ~
(C)
r ,••
ATG
• II I I 1111..
• •
•• •• • 38
1111 11111111.1111111.,
- chromo chromo
- - SNF2- TAA
helicase
F1gu,.16.24 Discovering the cause of CHARGE syndrome. (A) Initial detection of a deletion (arrow) in one CHARGE patient by array-CGH . (8) The
candidate region on 8q12 and genes in the region. Combining data from the first deletion case with data from a second patient with a chromosomal
rearrangement identified two candidate regions (red arrows) for the CHARG E gene. The genes in green were screened for mutations. (C) Genomic structu re
of the CHD7 gene. showing exons (black rectangles) and functional domains (colored blocks). Mutations found in lOnon-deletion patients are shown as red
dots. [Adapted from Vissers LE, van Ravenswaaij CM, Admiraal R et al. (2004) Nat. Genet. 36, 955-957. With permission from Macmillan Publishers Ltd .)
526 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
1500 families
l comPlex segregation
analysIs
\~-----------------i
lad score a nalYSIS
locus at 17q21
017S250 · - - -- _ __ (l = 3.28-5.98
with 017S84 )
BRCAl gene I
24 exons 8 eM 10') I BRCAl
1863 amino
I
-iiE
acids 17eMI?) Figure 16.25 How the BReA 1 gene was
pOSitional 2 14 selected found. Initial segregation analysis of 1500
cloning multi-case fam ilie s families with multiple c as~s of breast
+---=--
( D17S588 . ( (international cancer suggested that some cases might be
= mutation
analysis
1M score analysis
collaborative study)
heritable in a Mendelian manner. Linkage
== SRCU candidate region
3' ~
analysis of 23 ofthese families identified a
susceptibility locus at 17q21, which led to
the identification of the BRCA 1 gene.
EIGHT EXAMPLES OF DISEASE GENE IDENTIFICATION 527
seemed that BRCAl might account for 80- 90% of families with both breast and
ovarian cancer but a much smaller proportion of families with breast cancer
alone. Male breast cancer was seen mainly in BRCA2 families. Evidence concern
ing the function of these two genes, and why mutations cause cancer, is discussed
in Chapter 17.
Important though these discoveries were, it would be a mistake to suppose
that we nOw understand th e genetic basis of breast cancer, or even of familial
breast cancer. Mutations at these two loci account for only 20-25% of the familial
risk of breast cancer. A survey by Ford and colleagues of 257 families with four or
more cases of breast cancer suggested th at. of those with four or five cases of
female breast cancer, but no ovarian or male breast cancer, 67% probably did not
involve mutations in BRCAl or BRCA2, Despite intensive research, no thi.rd major
gene has been discovered that could explain the remaining familial tendency.
Segregation analysis is consistent with the remaining genetic susceptibility being
polygenic-that is, due to a large number of genes, each having only a modest
effect.
The BRCA1I2 genes were identified by focusing on small numbers of special
families, Identifying the other susceptibility factors requires large-scale case
control studies, Many candidate genes have been the subject of focused studies
that have identified additional susceptibility factors.
A specific variant, c, I 100de1C, in the CHEK2 gene (encoding a protein impor
tant in repair of DNA damage, and hence a good candidate for involvement in
cancer) was seen in 201 out of 10,890 cases and only 64 out of9065 controls,
giving an odds ratio of 2.34.
• By sequencing aU 62 exons of the ATM gene (described in more detail in
Chapter 17), mutations were found in 12 out of 443 cases but only 2 out of 521
controls. The odds ratio was 2.37. AU cases were familial, but they did not have
BRCAll2 mutations.
By sequencing all 20 exons of the BRlPl gene (encoding a protein that inter
acts with the BRCAI protein), mutations were found in 9 out of 1212 cases but
only 2 out of 2081 controls, giving an odds ratio of 2.0. Again, the cases were
chosen to be familial but negative for BRCA1I2 mutations.
• By sequencing all 13 exons of the PALB2 gene (encoding a protein that inter
acts with the BRCA2 protein). mutations were found in 10 out of 923 cases
but none of 1084 controls, suggesting an odds ratio of 2.3 for mutation carri
408 cases
ers. As before, the cases were familial and had been checked for BRCAl l2 stage 1 400 con trols
These factors are relevant in the small minority of families in which they are
!
select the 5'¥.. most
segregating, but at a population level their effect is small , For example the CHEK2 associated SNPs
c. ll OOdelC variant was seen in only 1.9% of cases. In fact. all these susceptibility
factors together, including BRCAl and BRCA2, accounted for only abo ut 25% of 3990 cases
the familial tendency. A very large genomewide association study attempted to 1 2,711 SNPs
were considered plausible candidates (FGFR2, TNRC9, MAP3Kl, and LSPl) , The 21,860 cases
fifth did not fall close to any likely candidate gene but, interestingly, was close to stage 3 22,578 controls
the region on chromosome 8q24 that has been implicated in a complex pattern 30 SNPs
of susceptibility to prostate cancer (see Figure 16,17), The odds ratios forthe five
loci were 1.26 (FGFR2), 1.11 (TNRC9), 1.1 3 (MAP3Kl), 1. 07 (LSPl), and 1.08 (the
anonymous 8q24 SNP) . All the putative risk alleles were common in the popula
tion so, despite the low odds ratios, these factors may have an appreciable role in
!
the overall incidence of breast cancer. Whether knowing one's genotype for these 4 contained plausible candidate genes
<0.00001 13 0.96
0.00001-0.000 1 12 7.0
0.0001-0.001 88 53
Data are from stages 1 and 2 of the study outlined in Figure 16.26. aObserved data after certain
linkage and association studies have suggested locations for a long and growing list of
susceptibility factors. In many cases a candidate gene has been identified. <lRegions supported by
sugg estive linkage data (lad scores ~ 2.2). Other regions were defined by association data, although
in some cases there was also weak linkage data.
CARD15 mutations account for only part of the susceptibiliry to Crohn dis
ease. As usual in such cases, many other candidate regions were proposed in
these early linkage studies, two of which are noteworthy. A region on 5q31 (lBD5)
showed linkage and association in several studies, with an odds ratio of 1.5. The
candidate region contains several cytokine genes that are plausible candidates,
but at tempts to identify the precise variant causing susceptibiliry have not so far
been successful. A coding variant in th e IL23R gene at chromosome Ip31
(p.R381Q) is strongly protective, with an odds ratio of O.2S-but this variant
accounts for only a small proportion of overall susceptibiliry, because the
arginine-381 allele that confers susceptibiliry is present in 93% of controls as well
as in 98.1% of patients.
530 Chapter 16: Identifying Human Disease Genes and Susceptibility Factors
TABLE 16.3 STATISTICAL ANALYSIS OF THE DATA FROM A GENOMEWIDE STUDY OF CROHN DISEASE
~
Chromosome Initial GWA study Replication Replication cohort Overall Gene
no. cohort 1:TOT result 2: allele frequency
Allele frequency p (transmissions/non- (cases/controls) Odds p
(cases/ controls) transmissions) ratio
The data analyzed here are also shown in Figure 16.27.The frequency of the minor 5NP allele in cases and controls is shown for the initial scan and
second replication set. TDT data show the transmission/non transmission of the minor 5NP allele from parent to affected offspring. GWA, genomewide
associations; N.D., not done. 8Data from an earlier report by the same group. bSNPs showing significant replication in the TDT series.
HOWWELL HAS DISEASE GENE IDENTIFICATION WORKED? 531
~ w ,. -..I 00 I-" I-" I-" I-" I-" I-" I-" I-" I-"l-"tvi'J0 ><
" '" '"
(D
o ~ " W ./>. 01 (j) -..I 00 (DOf...>tv
chromosome
This work has been chosen as a good example of the progress that has been
made over the past few years. The list of susceptibility loci in Table 16.2 is long
and still growing. The work has demonstrated that determinants of susceptibility
to the two types of inflammatory bowel disease, Crohn disease and ulcerative
colitis, are mostly different, and identifying the main Crohn disease susceptibility
factors has produced a better understanding ofthe pathological process.
the only way to identify causative mutations. A pilot study by Tarpey and col
leagues (see Further Reading) attempted to do this for X-linked mental retarda
tion (XLMR). After excluding cases with mutations in known XLMR genes, 208
families were chosen for study, and an attempt was made to sequence every exon
on the X chromosome in an affected male from each family. The study identified
nine novel disease genes, but in 50- 75% of families (depending on the degree of
certainty required) the cause remained unknown. Worryingly, in several genes
the study identified clearly inactivating mutations in normal male controls.
Extencting this to the much larger number of autosomal disease genes is clearly a
formidabl e task.
Genomewide association studies have been very successful, but
identifying the true functional variants remains difficult
Since 2005, genomewide association studies of a great variety of complex dis
eases have produced a torrent of reproducible data. Many disease-associated
SNPs have been identified and confirmed in replication studies. These SNPs sel
dom cause the susceptibility directly, but they define haplotype blocks that con
tain the actual functional variant(s). To understand the pathogenesis of the dis
ease it is necessary to pinpoint the true functional variant but, as discussed above,
this can be very difficult. Crohn disease (see case study 8) is one of the clearest
success stories, as described above. Significant progress has also been made in
understanding the pathogenesis of man y other complex diseases, for example
both types of diabetes.
Psychiatric conditions remain the most intractable. Given their high inci
dence and immense costs, both in money and suffering, this is singularly unfor
tunate. The big problem with these conditions is that we have so little under
standing of the biology. At the level of brain anatomy or physiology we simply do
not know what has gone wrong in schizophrenia, bipolar disorder, or autism.
Studies rely on clinical labels that probably lump together the consequences of a
highly heterogeneous set of causes. Maybe the way forward is to identify
endophenotypes-characters that are correlated with the clinical diagnosis but
lie closer to biology, such as reaction times or eye movements. The problem here
is to know whether the chosen character has any causal relevance or is just a
downstream consequence of the clinical problem. An alternative line of attack is
to look for common variants that, in different combinations, predispose to a vari
etyofpsychiatric conditions. Copy-number variants at lq21 and 15ql3 have each
been associated with a variety of conditions, including schizophrenia, autism,
seizures, and mental retardation. The same variants are also seen in normallndi
viduals, often the parent of an affected child. Thus, they are neither necessary nor
sufficient to cause disease, but evidently confer significant susceptibility to a
range of conditions. Understanding these effects ma y help us understand the
biology.
that modern living conditions (dry air and frequent washing) aggravate the skin
barrier defects in filaggrin-deficien t babies. These babies then first encounter
antigens th ro ugh a defective skin barrier rather than th ro ugh the normal intesti
nal route, maybe during a short sensitive period.
CONCLUSION
There are many ways in which a disease gene may be identified. Knowledge of
the protein product or of a related gene (in humans or another species) can lead
directly to identification of the gene. More often, first the chromosomal location
of the gene is defined, and then a search of that region is made for a gene that
carries mutations in patients but not in controls (positional clo ning). Sometimes
chromosomal abnormalities can suggest a location. Transiocations, inversions,
or deletions may cause clinical problems by disrupting or deleting genes. If two
unrelated patients share a chromosomal breakpoint and also a particular clini cal
feature, maybe a gene responsible for that feature is located at or near the break
point. More often, however, disease genes are localized by linkage analysis in a
collection of families where the disease is segregating, as described in Chapters
14 and 15.
Once a candidate location has been defined, it is necessary to identify all the
genes in the region and prioritize them for mutation analysis. The genome data
bases are a powerful but not infallible tool for identifying candidate genes. It may
be necessary to perform additional searches or laboratory experiments to iden
tify all exons of a gene and other transc ripts. Genes are prioritized for testing on
the basis of their likely functi on and expression pattern, and also on practical
considerations about the ease of testing. As sequencing has become faster and
cheaper, it has become increasingly attractive simply to sequence all exons across
the candidate region, without trying too hard to decide priorities.
FURTHER READING 535
This approach has been outstandingly successful at identifying the genes and
mutations underlying Mendelian conditions, but less so for complex diseases. As
described in Chapter 15, linkage analysis, even using non-parametric methods,
has had very limited success with complex diseases, probably because of the
great heterogeneity of causes. Association studies are much more successfuL
Genomewide association studies of many different complex diseases have used
high-resolution SNP genotyping in large case-control stud ies to identify numer
ous SNPs that are associated with either susceptibility or resistance to the dis
ease. Generally these SNPs will not directly affect susceptibility. Instead, they will
be in linkage disequilibrium with the functional variant. Identifying the true
functional variants is extremely challenging.
The question will remain : whatis the practical value of having a list of suscep
tibilityfactors that confer relative risks of 1.2 or less? Funding agencies and phar
maceutical companies have invested heavily in research in these areas in the
belief, or at least the hope, that such knowledge would revolutionize medicine. In
Chapter 19 we ask how far these hopes look like being realized. Meanwhile, per
haps the area of medicine in which analyzing genomewide changes using micro
arrays is closest to clinical application is oncology. We explore this area in the
next chapter.
FURTHER READING
Positional cloning ways in which Drosophila can be used to help understa nd
human disease.)
Church DM, Goodstadt L, Hillier LW et al. (2009) lineage-speCific
Cuzin F, Grandjean V & Rassoulzadegan M (2008) Inherited
biology revealed by a finished genome assembly of the
variation at the epigenetic level: paramutation from the
mouse. PLoS 8iol. 7(5), e1000112.
plant to the mouse. Curro Opin. Genet. Dev. 18, 193- 196. [A
Comprehensive Mouse Knockout Con sortium (2004) The
review of paramutation by the group that des cribed the Kit
knockout mouse project. Nat. Genet. 36, 921 -924.
para mutation in the mouse.]
Database of Genomic Variants. http://projects.tcag.ca/variation/
Jirtle RL & Skinner MK (2007) Environmental epigenomics and
[A database of copy-number and other structural variants
disease susceptibility. Nat. Rev. Genet. 8, 253-262. (A review
found in apparently normal subjects.]
of possible cases in which epigenetic effects allow acquired
European Mouse MutageneSis Consortium (2004) The European
characters to be heritable.]
dimension for the mouse genome mutagenesis program.
Nat. Genet. 36, 925-927.
Identifying causal variants from association studies
Peters LL, Robledo RF, Bult U et al. (2007) The mouse as a
resource for human bio logy: a resource guide for complex Balding DJ (2006) A tutorial on statistical method s for population
trait analysis. Nat. Rev. Genet. 8, 58-69. [An impressive association studies. Nar. Rev. Genet. 7, 781 -790. [A detailed
detailing o f the very extensive resources in mouse genetics discussion of the statistical methods used in association
that can help in the identification of human disease factors.] studies.]
Steinmetz LM, Scharfe C, Deutschbauer AM et al. (2002) Horikawa Y, Oda N, Cox NJ et al. (2000) Genetic variation in the
Systematic screen for human disease genes in yeast. Nat. gene encoding calpain-l 0 is associated with type 2 diabetes
Genet. 31,400-404. mellitus. Nat. Genet. 26, 163-175. [A heroic early attempt to
identify the precise variant causing susceptibility.)
The value of patients with chromosomal Wellcome Trust Case Control Consortium (2007) Genome-wide
abnormalities association study of 14,000 cases of seven common diseases
and 3,000 shared controls. Nature 447, 661-678. (A good
Kurotaki N, Imaizumi K, Harada N et al. (2002) Haploinsufficiency
overview of the state of the art in genomewide association
of NSOI causes 50tos syndrome. Nat. Genet. 30, 365-366.
studies.]
Vissers LELM, Veltman JA Geurts van Kessel A & Brunner HG
Witte JS (2007) Multiple prostate cancer risk variants on 8q24.
(2005) Identification of disease genes by whole genome CGH
Nat. Genet. 39, 579-580. [A co mmentary on three papers in
arrays . Hum. Mol. Genet. 14, R215-R223.
this number of the journal that report association studies of
Position-independent routes to identifying disease susceptibility to prostate cancer.]
Cancer Genetics
KEY CONCEPTS
• Can cer is the result of somatic cells acquir ing genetic changes that confer on them six general
features: (1) independence of external growth signals, (2) insensitivity to extern al anti-grovoJth
signals, (3) the ability to avoid apoptosis, (4) the ability to replicate indefinitely, (5) the ability of a
mass of such cells to trigger an giogenesis and vascularize, and (6) th e abili ty to invade tissues and
establis h secondary tumo rs.
• These feat ures are acqnired th ro ugh a norm al Darwinian process of nat ural selection acting on
random m utations. The microevolution occ urs in several stages, with each successive change
givi ng an extra selec tive advan tage to descendan ts of a cell.
• Highly evolved and sophisticated defense mechanisms protect the body against the proliferatio n
of such mu tant ceUs, but tumor cells have mutations that disa ble these defenses.
Stem cells, un Hke most somatic cells, have the ability to replica te inde finitely. and so they already
possess one of the features of cancer cells. Thus, stem cells are likely candidates to evo lve into
tumor cells.
• Oncogenes normally act to pro mote cell division, but a co mplex regu latory network limits this
activity. In tumor ce Us, one copy of an oncogene is orren abn ormally activated-th ro ugh point
mutations, copy-nu mber amplificati on, or ch romos omaJ rearrangements- so th at it escapes
regulation.
• Chrom osomal rearrangements in tumor cells often create novel chimeric oncogenes but may
alternatively upregulate the expression of an oncoge ne by placing it under the influence of a
powerful enhancer.
• The normal role of tumor suppressor genes is to limit cell divisi on. In tu mor celis, both copies
of a tumor suppressor gene are often inactivated by deletions, point mutations, or methylation
of the promoter.
• Many tumor suppressor genes have been identified through inves tigation of familiaJ cancer
predisposition synd ro mes. In th ese syndro mes, individuals inherit a muta ti on that inactivates
one allele of a tumo r suppressor gene.
• Oncogenes and tum or suppressor genes normally funct ion in the cell signaJ ing that co ntrols the
cell cycle, or in the respo nse to DNA damage. Understanding th ese processes is central to
W1 derstand ing what goes WTo ng jn cancer.
• Genomic instability is a normaJ feature of tumor cells. Because of this instability, tumors co ntain
large populati ons of cells carrying a great variety of mutati ons, on which natural selection can
act. Driver mu tations co ntribute to tumor dev·elopment and are subject to positive selection;
passenger mutations are chance by-prod ucts of the geno mic instability of mos t tumor cells.
• Most instabili ty is seen at the duomoso maJ level. However, so me tumors are cytogenetically
no rmal bnt have a h igh level of DNA replicati on errors.
New sequencing te ch no logy allows the total ity of ac quired genetic changes in tum or cells to be
cataloged. The results of these studies emphasize the individuality an d large number of such
changes in tumor cells.
Histologi cal and molecular profiling of a tu mor can provide impo rtant prognostiCinfo rmatio n
and a guid e to treatment. Drugs can targe t specific acqu ired genetic changes. Genomevvide
studies using expression arrays are an important too l for this.
Tumorigenesis is best und erstood by th inking in terms of altered pathways rather th an individual
m utated genes.
538 Chapter 17: Cancer Genetics
-
,
. ' .\
• •
•. ,
co
o
oral
mucosa
[ ~-~~~~;Jz)
.......... ~
A
'.
Fl9ut. 17.1 Stages in the development of a carcinoma. Three histological sectionsof oral mucosa, stained with hematoxylin and
eosin, showing stages in the development of oral cancer. (A) Normal epithelium. (B) Dysplastic epithelium, which is a potentially
premalignant change.The epithelium shows disordered growth and maturation, abnormal celis, and an increased mitotic index
(the proportion of celis undergoing mitosis). (C) Cancer arising from the surface epithelium and invading the underlying connective
tissues. The islands of the tumor (carcinoma) show disordered differentiation, abnormal celis, and increased and atypical mitoses.
Pathologists use such changes in tissue architecture to identify and grade tumors. (Courtesy of Nalin Thakker, University of
Manchester.)
Cancer-a condition in which cells divide without control-is not so much a dis
ease as the natural end state of any multicellular organism. We are all familiar
with the basic Darwinian idea that a population of organisms that show heredi
tary variation in reproductive capacity will evolve by natural selection. Genotypes
that reproduce faster or more extensively will come to dominate later genera
tions, only to be supplanted, in turn, by yet more efficient reproducers. Exactly
the same applies to the population of cells that constitutes a multicellular organ
ism. The determining factor can be an increased birth rate or a decreased death
rate. Cellular birth and death are under genetic control, and if somatic mutation
creates a va riant that proliferates faster, the mutant clone will tend to take over
the organism. Cancers are the result of a series of somatic mutations with, in
some cases, also an inherited predisposition. Thus, cancer can be seen as a natu
ral evolutionary process.
Tumors can be classified according to the tissue of origin; thus, carcinomas
are derived from epithelial cells, sarcomas from bone or connective tissue, and
leukemias and lymphomas from blood cell precursors. Solid tumors can be con
sidered as organs, in that they consist of a variety of cell types and are thought to
be maintained by a small population of (cancer) stem cells. An important distinc
tion is between benign (noninvasive) and malignant (invasive) tumors.
Pathologists classify tumors based on the histology as seen under the microscope
(FIgure 17.1). Genetic tests for specific chromosomal rearrangements or genom
ewide expression profiling (discussed later in this chapter) allow further refine
ments ofthe classification. These classifications are important for deciding prog
nosis and management, but they do not explain how the cancer evolved. The aim
of cancer genetics is to understa nd the multi-step mutational and selective path
way that allowed a normal somatic cell (but maybe a stem cell) to found a popu
lation of proliferating and invasive cancer cells. As the key molecular events are
revealed, new prognostic indicators become available to the pathologist. It is
hoped that some of these events will also present new therapeutic targets to the
oncologist.
THE EVOLUTION OF CANCER 539
Turn ing a normal epithelial cell into an invasive cancer cell requires perhaps six specific mutations
in the one cell. It would seem extremely unlikely that anyone cell should suffer so many mutations
(wh ich is why most of us are alive) . If a typical mutation rate is 10-7 per gene per ce ll generation,
the probabil it y of this happening to anyone of t he 10 13 cells in a person is 1013 X 10-42, or 1 in
1029. Cancer nevertheless happens because of a combination of two mechanisms:
Some mutations enhance cell proliferation, creating an expanded target population of cells
for the next mutati on (Figure 17.2). This may require a combination of two or more mutations.
Some mutations affect the stability of the entire genome, at either the DNA or the
chromosomal level, increasing the overall mutation rate. Malignant tumor cells usually
advertise their genomic instability by their abnormal ka ryotypes.
Because ca ncers depend on these two mechanisms, they develop in stages, starti ng with
tissue hyperplasia or benign growths. Within a stage, successive random mutations gen erate an
increasingly diverse cell population until eventually, by chance, one ce ll acquires a change or
combination of changes that gives it a growth advantage.
540 Chapter 17: Cancer Genetics
mutation can increase the likelihood that a cell will pick up subsequent muta
tions, either by conferring a growth advantage (Figu.-e 17.2) or by inducing
genomic instability. Accumulating all these mutations nevertheless takes time, so
that cancer is mainly a disease of post-reproductive life, when there is little selec
tive pressure to improve the defenses still further. \ selective growth of clone
Cell types differ in the extent to which their normal capabilities approach \ "" th muLat.on no. 1
these tumor cell capabilities-for example, some cell types divide rapidly, some
/
are relatively resistant to apoptosis, and so on. Tumors arise most easily from
cells in which that approach is closest. The existence of populations of rapidly • '.:'<\:.; . . ~ _ _ _ mutation no. 2
dividing and relatively undifferentiated cells in fetuses and infants explains the Z,.: ._..,,_~r
...-.:::. ';~
special cancers seen in young children. Stem cells are thought to be important as
..;.. .•-: .. .,...
tumor progenitors, because they already possess the capacity for indefinite pro
liferation. There is controversy over how far all tumors arise from mutated stem
. " :.',
\ ".
cells, but there is agreement that tumor precursor cells have stem cell-like prop
erties, whether these are innate or are acquired by mutation. .
,~,e"i,e gmwth of clone
With mutations 1 + 2
The genes that are the targets of these mutations can be divided into two /
broad categories, although, as always in biology, these are more tools for thinking •• . . ' l ' ..
about cancer than watertight exclusive classifications. ~Il~
• Oncogenes are genes whose normal activity promotes cell proliferation. Gain .~ ' ..
•• "'?;C .r mutation no. 3
of-function mutations in tumor cells create forms that are excessively or inap
propriately active. A single mutant allele may affect the behavior of a cell. The . ' .' .
'i.~.~,··
.
..
• ;0 ,
nonmutant versions are properly called proto-oncogenes.
• Tumor suppressor genes are genes whose products act to limit normal cell J,
proliferation. Mutant versions in cancer cells have lost their function. Some I continuing evolution by
tumor suppressor gene products prevent inappropriate cell cycle progres "" natural selection
sion, some steer deviant cells into apoptosis, and others keep the genome sta
ble and mutation rates low by ensuring accurate replication, repair, and seg J,
.. ..
•
regation of the cell's DNA. Both alleles of a tumor suppressor gene must be ' ~ '.~". '
... ·f·. :::': .
inactivated to change the behavior of a cell.
By analogy with a bus, one can picture the oncogenes as the accelerator and .i).r" '"f;!
the tumor suppressor genes as the brake. Jamming the accelerator on (a domi
tr'l8l..gt'lan1 tumor
nant gain of function of an oncogene) or having all the brakes fail (a recessive loss
of function of a tumor suppressor gene) will make the bus run out of control. Figunl17.2 Multi-stage evolution of
Traditionally, both categories have been seen as protein-coding genes, but simi cancer. Each successive mutati'on gives the
lar reasoning could be applied to other genetic control elements-in particular cell a growth advantage, so that it forms an
expanded clone, thus presenting a larger
microRNAs (see Chapter 9, p. 283).
target for the next mutation.
17.2 ONCOGENES
Oncogenes were discovered in the 1960s when it was realized that some animal
cancers (especially leukemias and lymphomas) were caused by viruses. Some of
the viruses had relatively complicated DNA genomes (SV40 virus and papilloma
viruses, for example), but others were acute transforming retroviruses that had
very simple RNA genomes. The genome of a standard retrovirus has just three
transcription units: gag, encoding internal proteins; pol, encoding a reverse tran
scriptase and other proteins; and env, encoding envelope proteins. Post
transcriptional processing produces several proteins from single transcripts.
Great excitement ensued when it was discovered that the ability of acute trans
forming retroviruses to transform cells in culture (that is, to change their growth
pattern to one resembling that of tumor cells) was entirely due to their posses
sion of one extra gene, the oncogene (Flgu.-e 17.3). For a short time, some enthu Figure 17.3 An acute transforming
siasts hoped that the whole of cancer might be explained by infection with viruses retrovirus. (A) The genome of a standard
carrying oncogenes, and that once these had been identified anti-cancer vac retrovirus has three transcription units: gag,
cines could be developed. encoding internal proteins; pol, encoding
proteins needed for replication; and env,
encoding the proteins needed for packaging
(A) 8500 nt RNA genome new virus particles. There are also long
terminal repeated sequences (LTR). (S) In a
LTR gag pol LTR random processing error, part of the normal
retroviral genome has been replaced by a
(8) cellular oncogene. The virus is no longer
competent to replicate, but it can ferry the
LTR oncogene LfR oncogene into a cell. nt, nucleotide.
ONCOGENES 541
Researchers soon discovered that the viral oncogenes were copies of normal
cellular genes, proto-oncogenes, that had become accidentally incorporated into
the retroviral particles. The viral versions were somehow activated, enabling
them to transform infected cells.
Most human cancers do not depend on viruses, but nevertheless their resi
dent proto-oncogenes have become activated. In the late 1970s an assay was
developed to detect activated cellular oncogenes. NIH 3T3 mouse fibroblast cells
were transfected with random DNA fragm ents from hum an tumors. Some of the
ceUs were transformed, and the human DNA responsible could be identified by
constructing a genomic library in bacteriophage from fragments of the DNA of
the transformed 3T3 cells and then screening for recombinant phage that con
tained the hum an -specificAlurepeat. The oncogenes identified in the transform
ing fragments included many that were already known from the viral studies.
Epidermal growth factor receptor £GFR 7p11 .2 v-erbb chicken erythroleuke mia
Receptor tYfos ine kinase HRAS l l p15.5 v·ras Harvey rat sarcoma
DNA-BINDING PROTEINS
D-type cyclins:
"The viral genes are designated v-sis, v-my" etc., their cellular counterparts can be described as c-sis. c-my', etc. bKaposi sarcoma-associated herpesvirus
encodes a virus-specifi c D-type cyclin, which is an independen t gene, not an activated version of one o f the human cyclin 0 genes. However, oncogenic
activated versions of all three human D-type cyclins, the result of somatic mutations, have been identified in certain leukemias.
542 Ch apter 17: Cancer Genet ics
MYCN neuroblastoma
Chromosomal rearrangement BCR ABLI chronic myeloid leukemia (see also Table 17.3)
creating a novel chimeric gene
meric gene is the Philadelphia (Ph ' ) chromosome, a small acrocentric chromo
some seen in 90% of patients with chronic myeloid leukemia. The Ph'
translocation joins the 3' part of the ABU genomic sequence onto the 5' part of
the BCR (breakpoint cluster region) gene on chromosome 22, creating a novel
fusion gene. This chimeric gene is expressed to produce a tyrosine kinase related
to the ABU product but with abnormal transforming properties (FIgure 17.6) .
Many other tumor-specific recurrent rearrangements that produce chimeric
oncogenes have now been recognized (Table L7.3). This mechanism is seen in
15-25% of leukemias, lymphomas, and sarcomas, but has been reported in only
1% or less of the common solid epithelial tumors. The apparent scarcity of gene
fusions in epithelial tumors may, however, be deceptive, and be due to technical
difficulties. A review by Mitelman and colleagues (see Further Reading) gives a
valuable discussion and many examples. All known examples are cataloged in
the Mitelman database of chromosomal aberrations in cancer (http://cgap.nci.
nih.gov/Chromosomes/). At present, 358 different fusions have been recognized,
involving 337 different genes. Many genes are promiscuous-that is, they are
found in fusions with various different partners. The MLL gene at llq23 is the
most extreme example, having been noted in fusions with more than 40 different
partners. Fusions can be produced not only by translocations but also by inver
sions or, occasionally, by deletions that remove sequences that normally separate
two genes.
Clinically, the specific gene fusions present in a patient's leukemic cells have
an important bearing on the management and prognosis. They can be identified
by a targeted FISH assay. For example, to check for the 9;22 translocation, differ
ently colored FISH probes for the BCR and ABU genes can be hybridized to an
interphase cell. If the translocation is present, there will be one BCR signal, one
ABL signal, and two fusion signals from the reciprocal products of the transloca
tion (see Figure 2.17).
Activation by translocation into a transcriptionally active chromatin
region
Burkitt lymphoma is especially common in malarial regions of Central Africa and
Papua New Guinea. Mosquitoes and Epstein-Barr virus are believed to have some
role in the etiology, but activation of the MYConcogene is a central event. A char
acteristic chromosomal translocation, t(8;14)(q24;q32), is seen in 75-85% of
patients (FIgure 17.7)-. The remainder have t(2;8)(pI2;q24) or t(8;22)(q24;qll).
Each of these translocations juxtaposes the MYC oncogene (normally located at
ONCOGENES 545
Malignant melanoma of soft parts t(12;22)(q13;q 12) fWS- ATF1 transcription factor
Desmoplastic small round cell tumor tl1';22)(p13;q 12) EWS-WT1 transcription factor
Papillary thyroid carcinoma invl1 )(q2 1;qlll NTRK1-TPM3 (TRK oncogene) tyrosine kinase
Acute pro myelocytic leukemia t(15; 17)(q22;q1') PML- RARA transcription factor + retinoic acid receptor
Note how the sa me gene may be involved in several differe nt rearrangements. For a more complete list see http://www.sa nger.ac.uk/genetics/ CGP
/ Census. ALL, acute Iymphoblastoid leukemia; AML, acute myeloid leukemia; CM L, chronic m yeloid leukemia.
8q24) close to an immunoglobulin (IG) locus. This may be IGH at 14q32, IGKat
2p12, or IGLat 22q ll. Unlike the translocations shown in Table 17.3, these trans
locations do not create novel chimeric genes. Instead. they bring the oncogene
under the influence of regulatory elements that normally ensure high expression
ofthe immunoglobulin genes in antibody-producing B cells. In the 8; 14 translo
cation, the MYCand IGH genes are in opposite transcriptional orientations, head
to head. Often, depending on the precise breakpoint, exon 1 of the MYC gene
(which is noncoding) is not included in the translocated material. Deprived of its
normal upstream controls and placed in an active chromatin domain, MYC is
expressed at an inappropriately high level. Between 25% and 65% of all B-cell
malignancies involve the activation of one or another oncogene by an
B 14 der (8) der (14)
~!
E. This ca uses high levels of MYC expressi on
E I I EI in B celis. The tran slocated MYC gene often
C>L.~~
lacks exon 1 (depending on the exact
c c'f2. c1'2t> c: cl'J c ell-
a C, 6 JDV exon 1 exon 2 exon 3 position of the breakpoint), but this exon is
noncoding and its absence does not affec t
IGH MYC function of the MVC protein .
546 Chapter 17: Cancer Genetics
m"kec A Re
marker B
'112
1
+
2
wild-type allele In retinoblastoma.
(A) Loss of a whole chrom osom e by mitotic
nondi sjunction. (B) l oss followed by
redu plicatio n to give (in this case) three
copies of the RB chromosome. (C) Mitotic
'I 'I'.'ft 11
' 2't '1!2'1!2
w as th e first demo nst ration o f mitotic
reco m bination in humans, or indeed in
mammals. (D) Deletion o f the wild-type
RB R8 RBT RB - RB RB RB n8 RB
marker A 1
markerS 1 11 1 1 · 1 1 T 1 2
allele. (E) Path ogenic point mutatio n o f the
wild-type allele. (Adapted f(Om Cavenee WK,
Dryj a TP, Phillips RA et aL (1983) Norure 305,
779- 784. With permission from M ac millan
n0
sample: N TNT N TNT NT N T NTNT NT N T Publishers ltd.1 The figures un derneath show
all.'es l !:, T
B
,
---I
fJ [J - I
--
'---- ___
- - __ i,
results of typi ng no rmal (N) and tum or (T)
DNA for the two markers A and B located
as shown. Note the patterns of los5 of
heterozygosity.
marker: A B A B A B A B A B
References to the genes and diseases may be found in OMIM under the number cited. A longer list is
given in Tabl e 1 of Futreal PA et 03 1. (2001) Nature 409,850-852.
Some genes freq uently lose the function of one allele in tumors, but the
retained allele seems fully functional. Sometimes this is because the second allele
is inactivated by methylation, which is not detected by some of the experimental
methods used, but some cases do seem to be genuine one-hit events. It is not
unreasonable to postulate that there are tumor suppressor genes for which hap
loinsufficiency is enough to provide a growth advantage.
More unusually, there is at least one case in which three hits are seen. In famil
ial adenomatous polyposis (FAP or APC; OMIM 175100; described later in this
chapter) numerous premalignant polyps (adenomas) develop in the colon.
Certain inherited germ-line mutations are weak and confer an attenuated phe
notype with fewer polyps. Adenomas from such patients often have two somatic
mutations in addition to the inherited mutation. One somatic mutation inacti
vates the wild-type allele. This gives the cells a growth advantage, but they gain a
further advan tage from a chromosomal deletion that eliminates the partly func
tional germ-line APC allele.
tumor DNA ~
2 19.94 231.53 ........ length of (he PCR product
LOH analysis has been a majo r industry in cancer research during the 1990s, stromal tumor apparent
contamination cells LOH
but it is fair to ask whether the results have justified the effort. Advanced cancer
cells may show LOH at as many as one-quarter of all loci, so large samples are
needed to tease out the specific changes from the general background. Array
CGH and other genomewide app roaches have supplanted studies with single
!pTSGl
markers. The results suggestthe presence of a remarkably large number of tumor
suppressor genes, yet attempts to follow up such studies by position al cloning TSG - .
have had quite a low success rate. !J!TSG2
The tumor DNA used in LOH studies always co mes from a collection of cells,
and these may well not be genetically homogeneous. Agure 17.1 1 illustrates one
pitfall that arises from the inevitable contamination of primary tumor m aterial
by stromal (normal) ceils. It is better to sp eak of a quantitative allelic imbalance
rather than an all-or-nothing loss of heterozygosity. CeHlines can be used to cir
figu.rlli!: 17.11 A pitfall in interpreting
cumvent the proble m of heterogeneity. but tumor ceHlines normally h ave many
LOH data. The (rue event in this tu mor is
more chromosome abnormalities than primary tumor cells. A more general homozygo us deletion o f the TSG gene. In
proble m is that there is almost never any information on the karyotypes of the the region of hom ozygous deletion, all the
tumors used in LOH studies-and LOH is a prod uct of the chromoso mal instabil peR prod uct is derived from contaminating
ity and deficient repair of double-strand DNA breaks that are so characteristic of stromal m ate rial (g reen lines), w hich is
tumor cells. Maybe so me of the losses observed are by- products of specific pat genetically normal. Thus, LOH data show
terns of chromosomal instability rather than selectio n for loss of a tumor sup retention of heterozygosity at the TSG locus
pressor gene. (pale pin k), In the flanking regions one allele
is present in the rumor ce lls, so the peR
prod uct shows LOH (dark pink). The pattern
Tumor suppressor genes are often silenced epigenetically by suggests the existence o f two spurious
methylation tumor su ppressor loci, VlfSG1 and VITSG2.
The true situation wou ld be revealed by
Tumor suppressor genes may be silenced by deletion (reflected in loss ofhetero
fluore scence in situ hybridization (FISH) or
zygosity) or by pOint mutations, but a very co mmon third mechanism is methyla by immunohistochemical testing for the
tion of the promoter. This mech anism is seen with genes whose prom oters con TSG gene product.
tain CpG islands. Most CpG islands in normal cells are un methylated (see Chapter
11 for a discussion of exceptions to this). [n tumor cells, we frequently see meth
ylation of normally unmethylated CpG islands in the promoters of certain genes.
The effect is to prevent expression of the gene. Specific genes diffe r in their sus
ceptibility to this type of silencing (Figure 17.J2). Some tumor suppressor genes
(for example MSHZ) are often silenced by mutation but never by methylation. For
others (for exam ple MLHl) , meth ylation occurs as a common alternative to point
mutation, whereas in some (for example RASSFl A at 3p21, and HICl at 17pI3.3).
meth ylation is the only known mechanism causing tumor-specific loss of
fun ction.
""
,"""
R,I:;SFJ,J. rt
~ A~ ~
~ ~ w~ . . ~ ~~. II I'C~.,
'""
SM.otRCA31 I I I
J." r of P ,.
1 2 3 6
• 9 10 11 12 13
r
~ ~
" 15
ff=
16
COHl
fi: fi"w~ HS~ll ~
17
aRC",u
18 19 2
0 8
21
It'"UP,
22
s'U:'HCEI'I
x y
F1gure 17.12 Ways of silencing tumor suppressor genes. Genes shown in green are silenced only by mutation, and those in red only by methylation,
whereas those in purple may be Si lenced by either mechanism. [From Jones PA & Baylin S8 (2002) Nat. Rev. Genet. 3, 415- 4 28. With permissio n from
Macmillan Publishers Ltd.]
550 Chapter 17: Cancer Genetics
Any cell has three choices of behavior: it can progress through the cell cycle to
complete another round of division, it can opt out of the cell cycle to become a
non-dividing cell, or it can die (apoptosis). Some cells also have the option offol
lowing a program of differentiation, like the blood cell rypes shown in Figure 4.17.
Cells select one of these options in response to internal and external signals.
Dysregulation of the cell cycle is the cardinal feature of cancer cells. These cells
progress relentlessly through round after round of mitosis, even when it would be
more appropriate to pause, exit from the cycle, or commit suicide by apoptosis.
Not surprisingly, the genes that generate and interpret the signals controlling cell
cycle progression featUle strongly in lists of oncogenes and tumor suppressor
genes.
Progression through the cell cycle is controlled by cyclins and cyclin
dependent kinases and regulated at a series of checkpoints. The factors that con
trol the process were described in Chapter 4; they are briefly recapitulated here.
Progression is controlled by fluctuating levels of a series of cyclin proteins,
which are required to activate cyclin-dependent kinases (Cdks; see Figure 4.14B).
Upstream of the Cdks is a crowd of activators and inhibitors that act on the
cyclin-Cdk complexes. The whole network of kinases, cydins, inhibitors, and
activators integrates many different signals, ensuring a flexible and responsive
control of the cell cycle. As mentioned in Chapter 4, Gl is the phase in which cells
spend most of their time and do most of their work, even if they are destined
eventually to proceed to S phase and so through the cycle. Controls on progres
sion through Gl are especially important targets of oncogenic mutations. Some
of the players in these controls are described below.
Progression from one phase of the cycle to the next is controlled at a series of
checkpoints (Flgu{e 17.13). These ensure that a cell can only progress round the
cycle when it, and in particular its DNA, is in a suitable condition. There are three
main checkpoints.
• The GdS checkpoint. This checkpoint is especially important, because a cell
that passes the GrlS boundary is committed to mitosis. The checkpo int is
controlled by Cdk2/cyclin E. Entry into S phase is blocked when there is unre
paired DNA damage. Irreparable damage leads to apoptosis. Within S phase
there are additional checkpoints at which DNA damage prevents new origins
of replication from becoming active.
• The G2 /M checkpoint. Cells are blocked from entering mitosis unless DNA
G;;,fM checkpoint
replication and the repair of any damage are complete. Entry into mitosis
depends on the activation of Cdkllcyclin B by the phosphatase Cdc2Se. spindle checkpoint
Incomplete DNA replication or unrepaired damage generates a signal that
activates inhibitors of Cdc25C, thus preventing Cdkl from becoming active.
• The spindle (or mitotic) checkpoint. Separation of chromatids at anaphase
r
of mitosis is triggered by the anaphase-promoting complex (APC) or cyclo
some. This multi protein ubiquitin ligase degrades cyclins A and B, and (indi
rectly) the cohesin glue that holds sister chromatids together. Kinetochores
that are not attached to spindle microtubules secrete a signal that inhibits the S·phase
APe. If this Signaling is defective, chroma tids can start to separate before all of checkpoints
them have been correctly attached to spindle fibers, and there is then no way
of ensuring that exactly one chromatid of each chromosome goes into each
daughter cell. (Note that APC can refer either to the anaphase-promoting
GdS checkpoint
complex or to the APC tumor suppressor gene, described below, that is
mutated in familial and sporadic colon cancer. This is confuSing, because one 11gure 17.1 ] Cell cycle checkpoints.
function of the APCgene product is to stabilize the attachment of microtu A series of controls prevent the cell from
bules to chromosomes, and APC mutations cause chromosome instability. entering S phase (G 1/S checkpoint) or
But the Apc protein is not part of the anaphase-promoting complex.) mitosis (G2IM checkpoint) with unrepaired
DNA damage, or from starting anaphase of
In cancer cells, all the checkpoint mechanisms are rypically defective. Cells
mitosis until all chromosomes have been
replicate their DNA despite damage, enter mitosis with unrepaired damage, and correctly attached to the spindle fibers
become inefficient at segregating their chromosomes correctly. All of this desta (spindle checkpoint). During S phase, DNA
bilizes the genome and lays the ground for further evolution toward full-blown damage prevents the sequential activation
malignancy. of new origins o f replication.
CELL CYCLE DYSREGULATION IN CANCER 551
1
~ --T--
irreparable damage:
MOm' ---l P53) ( ~ trigger apoptosis
1
~ P14/1f1F I 1 reparable damage:
f-
\ CdKZ
cyc lin
,
I:'
pause for repair
G,
1 -- S
mechanisms of activation and inhibition
often involve the phosph orylation of serine
or tyrosine residues in the various proteins.
552 Chapter 17: Cancer Genetics
p14ARF p16INK~A
r n_ ~ A
COKN2A gene. This gene (also known as
MrS and INK4A) encodes two completely
unrelated proteins. p1 61NK4A is translated
'. /
------
from exons 1U, 2, and 3, and p1 4ARF from
exons 1~ la. 2 3 exons 1p, 2, and 3-but with a different
reading frame of exons 2 and 3. The two
gene products are active in the pRb and p53
p16 coding
sequence
p14 coding
sequence - untranslated
arms of cell cycle control, respectively, as
shown in Figure 17.14.
INSTABILITY OFTHE GENOME 553
I 7 8 9 10 11
II
12
6
-
II II
13 14 15 16 17 18
II
19 20
They probably arise in three ways:
21 22
I
x y
In many tumor cells the spindle checkpoint is defective. Chromatids can start
to separate before all of them are correctly attached to spindle fibers. There is
then no way of ensuring that exactly one chromatid of each chromosome
goes into each daughter cell. Cancer cells often contain extra centrosomes,
which may produce abnormal spindles. Failure of the spindle checkpoint is
probably the main source of the many numerical abnormalities seen in can
cer ceUs.
• Tumor ceUs are able to progress through the cell cycle despite having ume
paired DNA damage. Normally unrepaired damage generates a signal that
stalls the cell cycle until the damage is repaired, and irreparable damage trig
gers apoptosis. In cancer cells, either the damage signaling system or the
apoplatic response is often defective. Structural chromosome abnormalities
can be a by-product of attempts at DNA replication or mitosis with damaged
DNA.
• Tumor cells may replicate to the point that telomeres become too short to
protect chromosome ends, leading to all sorts of structural abnormalities.
Telomeres are essential for chromosomal stability
As we saw in Chapter 2, the ends of all human chromosomes carry a repeat
sequence (TTAGGG1". This forms a special DNA structure that binds specific pro
teins and protects chromosome ends from being treated by the cell as DNA
double-strand breaks. Telomere length declines by 50-100 bp with each cell gen
eration because of the inability of DNA polymerase to use the extreme 3' end of a
DNA strand as a template for replication (see Figure 2.13). Telomere length is
INSTABILITY OF THE GENOME 555
1""n,,:"-.J,~ O
assembly of by DNA double·strand breaks (DSB) and
repair complexes
phosphorylates numerous substrates, some
of which are involved in DNA repair
(e.g. Nibrin and BRCA 1). ATM also activates
the CHEK2 protein, which in turn activates
phosphocylaUoo CD pS3, the principal target of the DNA damage
~~~~) )
monom~
detection system. Phosphorylated pS3 acts
ATM dimer
(inactive)
1
phosphorylation
1M
(activa) er
many other
many other substrates
to block further progress through the cell
cycle and also promotes apoptosis.
CHEK2
(inactive)
1 (act',e)
PhoSPhorylation )
of damage. Any cell that is unable to repair DNA damage should be eliminated by
apoptosis, but in a whole population of such cells there must be selective pres
sure for variants that can escape death. The different types of damage and differ
ent repair mechanism s were described in Chapter 13 and are reviewed by
Hoeijmakers (see Further Reading) . Defects in each are associated with specific
forms of cancer:
• Nucleotide excision repair is defective in several cancer-prone syndromes,
particularly the various forms of xeroderma pigm entosum (XP; OMIM
278700). Patients with XP are homozygous for inherited loss-of-function
mutations in one or other of the genes involved in nucleotide excision repair.
They are un-able to repair DNA damage caused by ultraviolet radiation. Tiley
are exceedingly sensitive to sunlight and develop many tumors on exposed
skin.
• Base excision repair defects are not common in human cancers, but a survey
of patients with colon cancer by Farrington and colleagues (see Further
Reading) showed that rare homozygous defects in the MUTYH repair enzyme
were associated with a 93-fold increased risk of colon cancer. Overall, these
accounted for about 0.5% of all cases of colon cancer.
• Double-strand break repalr defects primarily affect homologous recombina
tion rather than nonllomologous end joining, although both mechanisms
require the ATM-NBSI-BRCAI-BRCAZ machinery mentioned above.
• Replication error repair defects cause a large generalized increase in muta
tion rates. These defects came to light when studies of familial colon cancer
revealed the phenomenon of microsatellite instability, described below.
nonmalignant DNA ~
16 0.50
567
1.73.19
369
L-JL
211.96 227 .50
16'50 1301
Figure 17.1' Microsatellite Instability.
~
Electropherograms of two PCR-amplified
~
microsate!lite markers. upper traces: blood
DNA. Lower traces: tumor DNA. Note the
lumor DNA 211. .99 227.53
160.49 173.21 extra peaks in the tum or DNA. (Courtesy of
20'75 467
378 197 Lise Hansen, University of Aarhus.)
558 Chapter 17: Cancer Genetics
aOata from only one group, and interpretations are ambig uous. bAtypica ll ate -onset lynch
synd rome I and endometrial cancer. For details, see Jiricny J & Nyst rom-Lahti M (2000) CUff. Opin.
i
,. /,.
~
hMutSCl
r
deletion loops. These are recognized by
hMutSa, a dimer o f MSH2 and MSH6
proteins, or sometimes by the MSH2-MSH3
(1) strip back exonuclease dimer hMutSp. The proteins transloca te
(2) resynthesize pol~e~ase along the DNA, bind the MlH l-PMS2 dimer
replication factors
hM utLa, then assemble the full repairosom e,
which strips back and resynt hesizes the
newly synthesized strand.
GENOMEWIDE VIEWS OF CANCER 559
consequence of the many cell divisions and genomic instability of cancer cells.
Passenger mutations probably greatly outnumber driver mutations. For example,
the glioblastoma multifomle study by Parsons et al. identified one or more non
silent coding sequence changes in 685 of the 20,661 genes studied, with a mean
of 47 mutations per tumor. Combining data from sequencing (including limited
testing of a further 83 tumor samples)' copy-number changes, and expression
data suggested that 42 ofthese genes might harbor driver mutations.
These and other similar results clearly show the complexity and heterogeneity
of tumorigenesis. Individual tumors have large numbers of mutations and, even
among tumors of the same type, there is only limited overlap. To extract signifi
cant patterns, the International Cancer Genome Consortium plan to sequence at
least 2000 genes in 250 samples each from 50 tumor types, a total of 12,500 sam
ples. The review by Stratton, Campbell, and Futreal gives a good picture of the
achievements of tllis work and the challenges it faces.
Further techniques provide a genomewide view of epigenetic
changes in tumors
Genomewide methylation patterns can be investigated by using specific anti
bodies to precipitate metllylated DNA. The sequences present in the immuno
precipitate are identified by hybridization to microarrays Or by large-scale
sequencing (ChIP-chip or ChIP-seq; see Chapter 11). Comparison with the meth
ylation pattern in the corresponding normal cells allows the cancer epigenome
to be characterized. Array-CGH can be used to perform the comparison directly.
Widespread hypomethylation is a feature of almost all tumor cells. Presumably,
the result is to activate many genes whose expression is normally repressed by
methylation. This is one ofthe earliest observable changes, and, as discussed in a
review by Feinberg, Ohlsson, and Henikoff, some workers speculate that this epi
genetic change is central to tumorigenesis.
Superimposed on the general hypomelhylation is hype/methylation of CpG
islands in the promoters of specific genes. This can be studied by hybrid izing
bisulfite-treated DNA to a custom microarray. As described previously (see Box
11.4), sodium bisulfite converts cytosine to uracil but leaves 5-methylcytosine
unchanged. Oligonucleotide probes on the micro array hybridize specifically to
either the converted or the unconverted sequence. With this technique, the
Cancer Genome Atlas Research Network tested 2305 genes for tumor-specific
promoter methylation in their studyofglioblastoma multiforme described above.
This revealed specific methylation at the promoter of MGMT, a gene encoding a
DNA repair enzyme that removes alkyl groups from guanine residues. Tumors
with this feature showed pervasive hypermutation, with a specific spectrum of
nucleotide sequence changes.
Tumors in many organs can be divided into those showing methylation of
many CpG islands (CpG island metllylation phenotype; CIMP') and those that
are CIMP-. It is claimed that CIMP' cancers form a clinically and etiologically
distinct subgroup, although the mechanism and implications of this are not cur
rently understood. The factor underlying these patterns may be activities of the
po/ycomb group of chromatin remodeling proteins (see Chapter 11).
series oftumors can be used to search for common patterns or meaningful differ
ences. Meticulous control of th e experimental conditions and very careful filter
ing and processing of th e raw data are necessary to produce reproducible results.
The processed data are displayed as a matrix in which each column shows data
for one sample and each row shows data for one gene. Cells of the matrix are
colored to indicate raised, lowered , or unaltered levels of expression.
The challenge is to extract meaningful patterns from the matrix. The object
may be to find gene expression signatures-patterns of overexpression or under
expression of a limited number of genes-that are correlated with known clinical
variables and may serve to predict that variable inan unknown case. Alternatively,
the object may be to identify unexpected between-sample similarities and differ
ences that can defin e new classes o f tumor. Hierarchical clustering methods rear
range the rows and columns of the matrix. to maximize similarities.
MicroRNAs (miRNAs) , small noncoding RNAs that are thought to have a regu
latory role, have also been targets of genomewide expression profiling. The pro
files show characteristic changes in different tumor types. Most miRNAs are
downregulated in most tumors. This suggests a general dedifferentiation of the
ceils, a reversion to a more embryonic state. MiRNA expression tends to increase
as cells differentiate.
Superimposed on an overall decrease in miRNA expression, cancer cells show
abnormal rniRNA expression profiles. Lu and colleagues reported profiles of 217
miRNAs in 334 samples of normal human tissue and various tumors (see Further
Reading). The profiles were remarkably diverse and performed better than mRNA
profiles at discriminating between different tumor types, and between tumor
and normal cells. Other researchers showed that overexpression of one miRNA,
miR-lOb, in nonmetastatic breast tumors triggered invasion. Clinical progression
of primary breast tumors correlated with miR-l0b levels. Other miRNAs
(miR-126 and miR-335) had the opposite effect: metastatic breast cancer cells
lose expression of these two miRNAs, and restoring their expression decreased
the metastatic potential of the cells.
The data are currently co nfusing, but some indi,~dual miRNA genes do fulfill
the criteria for o ncogenes or tumor suppressor genes. This may be because they
control the expression of one of the weU-known classical oncogenes or tumor
suppressor genes, or it might reflect some independent action of the miRNA.
Expression profiling has been widely and successfuUy used for identifyi ng dis
tinct categories of tumor (class discovery). Flgw-e 17.21 shows typical examples.
The eventual aim must be to develop prognostic and predictive ind icators.
Prognostic indicators wo uld predict the likely survival of the patient, regardless
of treatment. Pred ictive indicators would predict which treatments wo uld be
most likely to succeed "'~th that particular tumor. How far expression profiling
has moved toward such d irect clinical relevance is discussed in Chapter 19.
p,
(100%)
BI
(95%)
B,
(91%)
Co
(93%)
Ga
(9 2%)
KI
(99%)
(99%)
0,
(95%)
Pa
(99%)
LA
(91%)
LS
(95%)
Figure1?11 Examples of the use of expression arrays to characterize tumors. (A) Identifying the tissue of origin . Expression
p~tterns of 148 genes (rows) classify 100 tumors (columns) by origin . Pr, prostate; BI, bladder/ ureter; Br. breast; Co, cOlorectal; Ga.
gastroesophagus; Ki, kidney; Li, liver; Ov, ovary; Pa. pancreas; LA. lung adenocarcinoma; lS, lung squamous cell carcinoma. Red
indicates increa sed gene expression, blue decreased expression. (B) Distinguishinglwo types of diffuse large B-ceillymphoma.
Clustering of gene expression distinguishes germinal center-like tumors (orange bars) from activated B-like tumors (blue bars).
Five-year survival in the two groups is 76% and 16%, respecti vely. (C) Sensitivity to chemo therapy drugs. Expressio n profiles of
6817 g enes in 60 tumor cell lines were correlated with response to 232 drugs. The figure shows gene expression patterns (rows) In
30 cell lines (columns) that predict sensitivity or resistance to one drug, cytochalasin D. (D) Predicting the clinical outcome of breast
cancer. Expression patterns o f 2S,000 genes were scored in 98 primary breast cancers. The fig u re shows profiles of 70 prognost ic
marker genes (rows) in 78 cancers (columns). All patients had undergone surgery and radiotherapy; the solid yellow line divides
those predicted to require additional chemotherapy (above th e line) from those w ho do not need this unpleasant treatment.
[(A) from Su AI, Welsh JB, Sapinoso lM et al. (2001) Cancel Res. 61. 7388-7393. With permission from the American Association
for Cancer Research. (B) from Ali2adeh AA. Eisen MB, Davis RE et al. (2000) Nature 403, 503-511. With permiSSion fro m Macmillan
Publishers ltd. (C) from Staunton JE, Sionim OK. Coller HA et al. (2001) Proc. Nat} Acad. Sci. USA 98, 10787-10792. With permission
from the National Academy of Sciences. (D) data from van 't Veer U. Oai H, van de Vijver MJ et al. (2002) Nalure 415, 530-536. With
permission from Macmillan Publishers ltd.1
frequ ent in very early stage lesions, whereas others usually only appear later. The
prin cipal observations are as follows:
• The earliest detectable lesions, aberrant crypt foci, lack all APCexpression. In
people with FAP, who have one inherited loss-of-function mutation of APC in
every cell, about 1 epithelial cell in 106 develops into a polyp. This rate is con
sistent with loss of the second APCallele being the determining event.
• Early stage adenomas, like most cancer cells, show a global hyp omethylation
of CpG sequences in the DNA. Despite the overall hypomethylation, CpG
islands in the promoters of specific genes are hypermethylated, as mentioned
previously.
• About 50% of intermediate and late adenomas, but only about 10% of early
adenomas, have mutations in the KRAS oncogene. Thus, KRAS mutations may
often be required for progression from early to intermediate adenomas.
About 50% oflate adenomas and carcinomas show a loss of heterozygosity on
seems likely that the relevant gene is SMAD4 rather than the initial candidate,
DCC.
~ \
\ \ '"'. . --------\
,",
"\ ' -----
"-----
;~~~~. +;..
~-' ----- ,~
,~-------- '~
"
Figure 17.22 A model for the multi-step development of (olon cancer. In many tumors a loss, activation, or mutation of certain
genes is seen at particular histological stages, as described in the text. This is primarily a tool for thinking about how tumors develop,
rather than a firm description. Every (olore(tal cancer is likely to have developed through the same histological stages, but the
underlying genetic changes are more varied.
• Colorectal carcinomas, but not adenomas, have a very high frequency of TP53
loss or mutations.
These observations bave been summarized in the much-quoted 'Vogelgram'
of Figure 17.22. This scheme should not be over-interpreted; it is a model for
thinking about the multi-stage evolution of cancer, not a description of how it
always happens. As the statistics quoted above show, many colon cancers do not
show these changes, Indeed, only about 60% have APC mutations, and Lynch
syndrome I tumors follow a different route,
Similar but more rudimentary schemes can be made for other tumors for
which the data are not so rich. The early mutations in the multi-step evolution of
a tumor are more critical than the later ones, because there may be only a few
ways in which a small number of mutations can change the behavior of an other
wise normal cell that has all its defenses intact. The genomic instability of later
stages allows a much wider range of possibilities on which natural selection can
act. According to the gatekeeper hypothesis of Kinzler and Vogelstein, in a given
renewing cell population one particular gene is responsible for maintaining a
constant cell number. Mutation of a gatekeeper leads to a permanent imbalance
between cell division and cell death, whereas mutations of other genes have no
long-term effect if the gatekeeper is functioning correctly. The tumor suppressor
genes identified by studies of Mendelian cancers are likely to be the gatekeepers
for the tissue involved, for example RBI for retinal progenitor cells, NF2 for
Schwann cells, and VHL for kidney cells. Sporadic cases would then need two
successive mutations in the gatekeeper-but, as was shown earlier for retino
blastoma, this is compatible with the normal target cell population size and the
relative rarity of each specific sporadic cancer. Some common cancers such as
prostate or lung cancer do not have clean Mendelian forms. These pose a bit of a
problem for the gatekeeper hypothesis because they suggest that no single initial
mutation substantially increases the likelihood that a tumor will develop.
Interestingly, inherited APCmutations are almost always frameshifting (68%)
or nonsense (30%) changes, whereas the somatic mutations (the second hit in
familial cases, and both mutations in sporadic tumors) seldom inactivate the
gene fully. This fits well with the gatekeeper hypothesis. People with inherited
FAP remain normal and healthy until one cell suffers a second mutation; there
should be some difference between their inherited mutations and the somatic
mutations that are postulated to change cell behavior. However, this argument
seems specific to the APC gene. It is perhaps relevant that APC protein not only
regulates the level of ~-catenin, a critical controller of cell growth (see below), but'
also has a role in ensuring chromosomal stability, APC protein helps ensure the
attachment of spindle microtubules to the kinetochores of chromosomes. We
have already noted that even very early adenomas show chromosomal instability,
It is likely that a single APC mutation (of the right type) already confers a growth
advantage on colonic epithelium, whereas mutation of a single BRCA1I2 gene
gives no advantage to ductal epithelium. This may explain why APC mutation is
very common in sporadic colon cancer but BRCA1I2 silencing is seen in only
10-15% of sporadic breast cancers.
564 Chapter 17: Cancer Genetics
(A)
axin
(8)
------....
inactlve APe
),
FlgUJ.17.l:l The APe pathway in
GSK38
colorectal cancer. (A) One main role of the
APC protein is to bind l3-catenin (!l·Ca t),
I .),C which is then marked for destruction. Axin
collaborates in this process. GSK3B, glycogen
p<atenin ~. ~
synthase kinase 3B. (B) Loss of APe function
degradation .....
g
nucleus
the TCF/LEF (lymphoid enhancer binding
nucleus
factor) transcription factor, thus promoting
/ the transc ription of genes such as MYC and
eyelln D1. However. there are olher ways
of ac hieving the same effect (see the text).
(Adapted from Fearnhead NS. Britton MP
& Bodmer WF (2001) Hum.Mol. Genet.
~ 10, 721-733 . With permission from Oxford
Uni versity Press.)
INTEGRATING THE DATA: CANCER AS CELL BIOLOGY 565
L homozygous deletion
ir i
mutation In 49% mutation, mutation amplification amplification
amplification In 8% In 13% in 4%
1 1045%
l ! ! J
amplification
in 14%
MDM2
1 MDM4
amplification
in 7%
r --+
r
-l RAS P1(3)K f----
r-
senescence
mutation.
homozygous
deletion apoptosis
mutation,
homozygous
deletion
in 18%
mutation
in 2%
mutalion
1015%
L
mutation,
homozygous
deletion
i0 36%
1
in 35%
amplification
AKT in 2%
1
proliferation
survival
translation
r----- FOXO
mutati on
in 1%
Thus, behind the heterogeneity of molecular events, there is a much more Fig"r. 1 7.24 Altered signaling pathways
regular picture of pathways that must be inactivated for a tumor to develop. In in glioblastoma multiforme tumor cells.
their 2002 review, Hahn and Weinberg attempted to list all these pathways and A variety of different sequence or copy
draw up a 'route map' of cancer. The current generation of integrated genom number changes among pathway
components are seen in different individ ual
ewide investigations of different tumor types is providing many concrete exam tumors. (A) tn 87% of tumors, the pathways
ples (Figure 17,24), and the review by Vogelstein and Kinzler gives many others. through which p53 affects senescence and
Figure 12.11 gives an impression of just how complex a full picture will be. apoptosis are compromised by a variety
of diffe rent genetic changes. (B) In 88% of
Malignant tumors must be capable of stimulating angiogenesis tumors, various genetic changes affect the
pathways by which receptor tyrosi ne kina ses
and metastasizing (EGfR, ERBB2, PDGFRA, and MET) acting
Of the six capabilities that a malignant tumor must acquire, we have already seen though RAS and PI(3)K (phosphoinositide
the roles of activated oncogenes, deranged cell cycle control, p53 mutation, and 3·kinase) control cell proliferation and
reactivation of telomerase in achieving the first four capabilities in the list. The survival. Yellow indicates activating genetic
alteratio ns, with frequently altered genes
final two capabilities are angiogenesis and metastasis.
showing deeper shades of yellow. Blue
After a tumor reaches a volume of2-3 mm3 itbecomes dependent on the pro
indi cates inactivating mutations, with
duction of new blood vessels (angiogenesis) to obtain an adequate supply of oxy darker shades corresponding to a higher
gen and nutrients. The capability of sustained angiogenesis depends on an ang percentage of alterations. [Data from Cancer
iogenic switch that both up regulates pro-angiogenic factors and down- regulates Genome Atlas Research Network (2008)
angiogenesis inhibitors. Different types of tumor may use different molecular Nature 455, 1061-1068. With permission
strategies to activate the switch. Many cytokines and other signaling molecules from Macmillan Publishers Ltd.]
are involved. Members of the VEGF (vascular endothelial growth factor) family of
proteins have prominent roles in this. Expression is up regulated in many tumors,
and VEGF is the target of anti-cancer drugs such as Avastin®. Hypoxia triggers the
expression of many genes by stabilizing the transcription factor HIF]u (Hypoxia
Inducible Factor ] u). One hereditary cancer syndrome, von Hippel-lindau syn
drome (see Table 17.4) , is caused by the loss of a protein that normally keeps
HIFlu levels low by ubiquitylating it and flagging it for degradation.
Metastasis may not be a speCifically selected capability, because the ability to
metastasize is not clearly advantageous to any cell in a primary tumor. Some
capabilities that are advantageous to primary tumors also incidentally facilitate
metastasis. Tumors need to be highly vascularized, but this also increases the
opportunities for cells to leave the tumor and travel round the body. Primary
tumors need to be invasive to provide room for growth, but the decreased cell
adhesion and increased ability to disrupt tissues also aid metastasis. On the other
hand, a whole series of functions are not important to primary tumors but are
necessary to allow a newly seeded tumor cell to establish a colony in a new loca
tion. The review by Nguyen and Massague discusses these issues.
566 Chapter 17: Cancer Genetics
Systems biology may eventually allow a unified overview of tumor f lgwil 17.2..5 Pathways important for the
acquisition of cancer cell capabilities.
development
Thisfigure is an initial attempt to show
Each of the six capabilities listed by Hanahan and Weinberg may be acquired by the involvement of oncogenes and tumor
a variety of different genetic changes, but in each case acquiring it requires cer suppressor genes (shown in red) in many
tain specific pathways to be activated or inactivated. Figure 17.25, taken from the pathways that affect a cell's choice of
progressing through the cell cycle, exiting
review by Hanahan and Weinberg, shows so me of the target pathways. The capa
from the cycle, or undergoing apoptosis.
bilities are not necessarily acquired in the same order in different tumors, but the Many specific examples are discussed by
requirements for genomic instability and s uccessive clonal expansions at inter Hanahan and Weinberg, from whose paper
mediate stages impose a certain regularity on the process. thisfigure is taken . RTK, receptor tyrosine
All these pathways are used to steer a cell in certain directions in particular kinase; 7-TMR, 7-transmembrane segment
circumstances (Figure 17.2BA), Ufe would be very simple if signal and response cell surface receptor; ECM, extracellula r
were connected by a single unbranched linear pathway (Figure 17.26B) but this matrix. (From Hanahan D & Weinberg RA
seems never to be the case. Rather, multiple branching, overlapping, and par (200 0) (eft 100, 57-70. With permission from
tially redundant pathways control the behavior of cells (Figure 17.26C). Such Elsevier.]
complicated networks are probably necessary to confer stability and resilience
on the extraordinarily complex machinery of a cell. Experimentally, unraveling
the precise genetic circuitry of the controls is exceedingly difficult, partly because
of their complexity and partly because it is difficult to distinguish direct from
indirect effects in transfection or knockout experiments. The science of systems
biology is devoted to quantitative modeling of such networks, and one of its main
applications is in cancer. When a full quantitative description of the control net
work in each different cell type has become available, we will be in a position to
truly understand cancer.
CONCLUSION
Cancer is the result of a Darwinian microevolutionary process among the cells of
our body. Ju st as in the evolution of a population of animals, individual cells
acquire mutations at random, and natural selection favors any ceU containing
mutations that allow it to reproduce faster or resist death better than its neigh
bors.1b become a cancer cell, a cell must be able to continue replicating, inde
pendently of external growth and anti-growth signals. This requires several suc
cessive mutations to disable multiple regulatory pathways.
FURTHER READING 567
Genes that normally act to stimulate growth suffer activating mutations and (AI
/"'
are known as oncogenes; in contrast, those that normally act to restrain growth exlernal ~ . ./' stasis
and - . ..- differentiation
suffer inactivating mutations and are known as tumor suppressor genes. Genomic internal - .. mitosis
instability allows the more rapid accumulation of subsequent mutations. These signals
con sist of a small number of driver mutations and a much larger number of pas
senger mutations. Driver mutations contribute to the development ofthe tumor (BI
and are positively selected; passenger mutations are incidental by-products of
tumor development. Through many rounds of ill- controlled replication with high
- - + --.. .
familial cancer syndromes. This can happen because mutations in most tumor
suppressor genes have no phenotypic effect until both alleles are inactivated (as Figure 17.16 The options open to a
cell, and how it chooses. (A) In response
Knudson put it, they must suffer two hits). A single inactivating mutation can be
to internal and external signals, a cell
transmined through the germ line, so that it is present in every cell of the off chooses between stasis, mitosis, apoptosis,
spring. Such cells still function normally but are vulnerable to a single hit that and sometimes differentiation. (6) An
inactivates the remaining functional allele. This is a much more frequent occur imaginary cell in which signals are linked to
rence than two hits happening by chance on a single normal cell, so the result is responses by linear unbranched palhways
an inherited high risk of cancer. of stimulation ( ~ ) or inhibition (-1) . Human
Every tumor is a unique product of a separate microevolutionary process, and cells do not function like this. (e) In real celis,
contains a unique spectrum of driver and passenger mutations. However, all signals feed into a complex network of partly
tumors need to have acquired a set of specific capabilities: in addition to inde redundant interactions whose outcome is
pendence from external growth and anti-growth signals, tumors need to avoid not easy to predict analytically.
apoptosis, replicate indefinitely, develop their own blood supply (angiogenesis)'
and, in malignant tumors, invade other tissues to establish secondary tumors.
These capabilities depend on subverting the same network of Signaling pathways
in a given cell type. Thus, considered in terms of pathways, tumors are much less
heterogeneous than when the individual gene mutations they contain are listed.
Understanding cancer requires a quantitative and genomewide view of how the
signaling network functions in normal ceUs of each particular type. Only when
we have obtained such a view will we then understand how specific genetic or
epigenetic changes can destabilize the network and allow a tumor to develop.
FURTHER READING
The evolution of cancer Mitelman F, Johansson B & Mertens F (eds) (2002) Mitelman
Database of Chromosome Aberrations in Cancer. http://cgap.
Hanahan 0 & Weinberg RA (2000) The hallmarks of cancer. Cell
nci.nih.gov/Chromosomes/Mitelman
100,57-70. (An important review, introducing the concept of
Mitelman F, Johansson B & Mertens F (2007) The impact of
the six essential capabilities of a tumor cell and giving many
translocations and gene fusions on cancer causation. Nat.
examples.]
Rev. Cancer 7,233-244.
Rosen JM & Jordan ( T (2009) The increasing complexity of the
cancer ste m ce ll paradigm. Science 324,1670-1673. (A review Tumor suppressor genes
of the controvers ies and complexities surrounding the
Burkhart DL & Sage J (2008) Cellular mechanisms of tumour
concept of ca ncer stem cells.]
suppression by the retinoblastoma gene. Nat. Rev. Cancer
8,671-682. [An in-depth re view of the functions of the pRb
Oncogenes protein.)
Bishop JM (1983) Cellular oncogenes and retroviruses. Annu. Cavenee WK, Dryja TP, Phillips RA et al. (1983) Expression
Rev. Biochem. 52, 301-354. [A nice review of early work o n of recessive alleles by chromosomal mechanisms in
oncogenes.) retinobla stoma. Nature 305, 779-784. [The paper that
80xer LM & Dang CV (2001) Tran slocations invol ving c-myc and established Knud son's two-hit hypothesis. Some of the data
c-myc function . Oncogene 20, 5595-5610. (An in Sight into are in an ea rlier paper, Godbout R, Dryja TP, Squ ire J et aL
the complexities of the Burkitt lymphoma and other MYC (1983) Somatic inactivation of genes on chromosome 13 is a
translocations.) common event in retinoblastoma. Nature 304, 451 -453,J
Cancer Gene Census. http://www.sanger.ac.ukJgenetics/CGP/ Fodde R & Smits R (2002) One-hit carcinogenesis. Science 298,
Census [A database of genes for w hich mutations have been 761-763. [Examples of haploinsufficient tumor suppressor
causally implicated in cancer.) genes.]
Meyer N & Penn LZ (2008) Refiecting on 25 years with MYC. Nat. Knudson AG (2001) Two genetic hits (more or less) to cancer.
Rev. Cancer 8, 976- 990. [A historical review illustrating the Nat. Rev. Cancer 1, 157-162. (A review of his two-hit
many roles of one of the major oncogenes.] hypothesis.]
Chapter 18
Genetic Testing of
Individuals
KEY CONCEPTS
Genetic testing is a routine tool for almost all biomedical scientists. PCR-based
tests are the standard way of identifying pathogens and are the everyday working
tools of microbiologists and virologists. Emerging diseases are identified by
sequencing the pathogen's genome and are tracked by genetic tests. Animal and
plant breeders use genetic tes ts to guide their work. Ow- modern understanding
of evolution and de velopment is based on genetic tests. However, this being a
book on hum an molecular genetics, we will confine ourselves here to tests
focused on the human genome. Even here the fie ld is very extensive. Previous
chapters have covered many applicatio ns of genetic tests in resea rch, for exam
ple to map disease genes or susceptibility factors, or to identify the cause of a
disease. Here we consider the methodology of testing in human diagnostics and
forensics. Many of the principles have already been covered in Chapter 7.
When considering diagnostic tests, wherever possible we will use two of the
most common Mendelian diseases, cystic fibrosis (CF) and Duchenne muscular
dystrophy (DMD), to illustrate the various testing methods. Both CF and DMD
involve large genes with extensive allelic heterogeneity, but beyond that they
pose rather different sets of problems for DNA diagnosiS (Table 18.1). Between
them, they show many of the issues involved in testing for Mendelian diseases. As
always in this book, we concentrate on the principles and not the practical details.
Best practice guidelines for laboratory diagnosis of the commoner Mendelian
diseases are available on thewebsites oftheAmerican College of Medical Genetics
and the UK Clinical Molecular Genetics Society, among others. The reader inter
ested in specific procedures or conditions should consult these (see Further
Reading).
Genetic tests are unusual among clinical tests because a genetic test is nor
mally performed just once and the result forms a permanent part of a person's
health record. Thus, it is especially important to avoid mistakes in diagnostic
testing. Any laboratory offering clinical (or forensic) testing must have adequate
quality assurance measures in place. These are beyond the scope of this chap·
ter-any interested reader should consult the principles and guidelines pub
lished by the Organisation for Economic Co·operation and Development (OECD)
(see Further Reading).
When a clinician brings a sample of a patient's DNA to a laboratory for diag
nostic testing, there are three possible questions that the laboratory might be
asked to answer:
Does the patient have any mutation in any gene that would explain the dis·
ease? This is an unreasonable question to ask for routine diagnosis, even if it
might very occasionally be possible in a research context. Gene testing has to
be targeted. Even though the day may come when a persons entire genome
sequence forms part of his or her basic health record, we would still need to
know where to look amo ng the several million specific variants present in a
person's DNA. At best it might be reasonable to ask a laboratory to check
TABLE 18.1 THE CONTRASTING GENETICS OF CYSTIC FIBROSIS AND DUCHENNE MUSCULAR DYSTROPHY
Condition Cystic fibrosis Duchenne mu scular dystrophy
Characteri stics of gene fairly large gene: 250 kb genomic DNA, giant gene: 2400 kb geno mic DNA, 79 exons, 14 kb mRNA
27 exons, 6.5 kb mRNA
Spectrum of mutations almost all mutations are single-nucleotide 65% of mutations are deletions encompassing one or more
changes; a few large deletions complete exons; 5% duplica tions; 30% nonsense, splice site, or small
frameshifti ng mutations; miS5ense mutations are very unusual
Freq uency of new mutations new mutations are extremely rare new mutations are very frequent
Frequency of mosaicism mosaicism is not a problem mosaicism (see Chapter 3, p. 77) is common
Frequency of intragenie little intragenie recombination recombination hotspot (12% between markers at either end of the
recombination gene)
WHATTOTESTANDWHY 571
Source Comments
Mouthwash or buccal scrape noninvasive; usually enough DNA for several dozen PCR tests, but qual ity and quantity very variable
Skin, muscle, etc. for RNA studies; needsto be a tissue where the gene of interest is expressed; requires a biopsy, wh ich is invasive
and often unpleasant
Single cell from a blastocyst for pre-implantation diagnosis; technically very demanding
Amn iotic fluid a relatively poor source of fetal DNA compared with chorionic villi, but used for testing at 15-20 weeks of
pregnancy
Fetal DNA in maternal blood a promising alternative to chorionic villi; detectable from 6 weeks; paternal alleles only
Pathologica l specimens vital resource for genotyping dead people; also tumor, etc., biopsies; fixed paraffin-embedded tissue requires
specia l procedures for DNA purification
Guthrie card the cards used for neonata l screening have one or more spots of the baby's blood, not all of which are normally
used for the screening; if cards are archived they are a possible source of DNA from a deceased baby
572 Chapter 18: GenetieTesting of Individuals
attempts have been made to use fetal cells in the maternal circulation for less
invasive prenatal diagnosis. Unfortunately, the cells are present only in tiny num
bers, and the results have never been reliable enough for routine use. Free fetal
DNA is more promising, although only tests for paternally derived alleles are
informative, because the fetal DNA comprises only mayb e 3% of the free DNA in
the maternal blood circulation; the rest is maternal. Atchived pathological sam
ples are very important sources of DNA fro m deceased persons (although con
sent laws in some coun tries m ake it difficult to use them). When samples have
been formalin-fixed, special procedures a re necessary, and only short sequences,
ryp icaUI' 200 base pairs or less, can usually be recovered.
Southern blotting (see Figure 7.10) is still necessary fo r a few applications,
including as an alternative 10 FISH (see Chapter 2, p. 47) for testing for m ajor bal
anced rearrangements that wo uld not be picked up by array-CGH (see Cha pter 2,
p. 49) and, in some Circumstances, for fragile X full mutations (see below).
RNA orONA?
If a gene has to be scanned for unknown mutations, testing RNA by reverse tran
scriptase PCR (RT-PCR; see Chapter 8) offers several advantages. Direct testing of
DNA usually involves amplifying and testi ng each exon separately, and this can
be a major chore in a gene with many exons. A gene with 15 exons that encodes a
2.5 kb transcript can be scanned by RT-PCR in perhaps four overlapping seg
ments, whereas the same analysis on genomic DNA would require 15 PCR reac
tions and 30 sequencing runs (assuming that both strands of each PCR product
are sequenced, which wo uld be normal practice). [n addition, only RT-PCR can
reliably detect aberrant splicing. It is often hard to predict whether a DNA
sequence change will affect splicing; moreover, abe rrant splicing may result from
activation of a cryp tic splice site deep within an intron, which would not be
detected by the usual exo n-by-exon protocols for sequencing genomic DNA (see
Chapter 8).
However, analyzing RNA has disadvantages. RNA is much less convenient to
obtain and work with. Samples must be handled and processed with extreme
care to avoid degrading mRNA. [mportantly, the gene of interest may not be
expressed in a ny readily accessible tissue. A very low level of illegitimate tran
scripts may sometimes be detectable in tissues in which the gene is not normally
expressed, but reliable analysis of very low-level mRNAs can be difficul t. [n addi
tion, truncating mu ta tions usually result in unstable mRNA because of nonsense
mecliated decay (see Cha pter 13), so that the RT- PCR product from a heterozygous
person may show only the normal allele. Treating cultured ceUs or whole blood
with the translation inhibitor puromycin has been shown to inhibit nonsense
mediated decay, which may allow cDNA sequencing to detect transcripts that
include premature termination codons.
Functional assays
When considering pedigree patterns in Chapter 3, we simply distinguished two
classes of allele, normal and abnormal, or A and a. Assaying the function of a
gene might allow a similar distinction to be made in the laboratory- which is,
after all, the essential question in most diagnoses. Function might be tested at
the DNA level, for exam pIe by a cell transfectio n assay. Alternatively, the protein
product of a gene could be tested biochemically if its function can be adequately
assayed in the laboratory. The problem wi th functional assays is that they are
specific to a particular gene or protein; by conaast, DNA technology is generic.
This has obvious advantages for the diagnostiC laboratory, but in addi tion it
encourages technical development, because any new technique could be used
for a wide variety of problems.
~Iff\
~~
patient
T .", '" A C C T A ce .~ -G T N C A N eN " A N
~
A gene is normally scanned for mutations by sequencing
TABLE 18.3 SOME METHODS FOR SCANNING A GENE FOR MUTATIONS BEFORE SEQUENCING
Method Advantages Disadvantages
Heterodu plex gel mobility very simple to set up; cheap and simple to run sequences < 200 bp only; limited sensitivity; does not revea l
position of cha nge
dHPLC quick, high throughput; quantitative expensive equipment; does not reveal positio n of change
SSCP simple, cheap; sensitive w hen formatted for seque nces < 200 bp only; does not reveal position of change
capillary electrophoresis
DGGE high sensitivity choice of primers is critical; expensive GC-damped primers; does
not reveal position of change
Melting curve analysis high sensitivity proprietary system using a Light-Cyc1er~ machine
Protein truncat ion test high sensitivity for chain-terminating mutations; chain-terminating mutations only; expensive, difficu lt technique;
shows poSition of change usually needs RNA
dHPLC, denaturing high-performa nce liquid chromatog raphy; SSC?, single-strand conformation polymorphism; DGGE. denaturing gradient gel
electrophoresis.
report is issued the data must be checked by two separate individuals, and the
likely significance of any variant must be considered. Programs are available that
automatically report differences between the test sequence and a standard
sequence, provided that the sequence is of good qUality. DNA quality is criti cal
for aVOiding artifacts and for reliably detecting base substitutions in
heterozygo tes.
15-1
. This runs faster in the gel and gives the
heteroduplex bands seen in the lower panel.
The si ngle-stra nded DNA runs more slowly
in the same gel (upper panel). Lanes 1 and
-.
60.2
2 show the pattern typical of the wild-type
sequence; the mutations in these patients must
be elsewhere in the gene. Variant patterns
1:>4
-.
60.1 can be seen in lanes 3-8; sequencing revealed
mutatio ns p.G8SE. p.l8BS. p.R7SX. p.P67l,
p.E60X, and p.R7SQ respectively.
(6) Denaturing gradient gel elecrrophoresis.
Exon s of the CFTA gene from the genomic DNA
of subjects Pl and P2, two unrelated patients
with CF, are peR amplified in one or more
segments and run on 9% polyacrylamide gels
containing a gradient of urea-formaldehyde
12345678 1 2 3
denaturant The band from any amplicon that
contains a heterozygous variant usuaUy splits
into four sub-bands (arrows). In the lanes
shown, subject Pl (left lane in each panel) has
a varia nt in amplicon 6 and subject P2 (right
• Mismatched bases in heteroduplexes are sensitive to cleavage by chemicals or lanes) has variants in amplicons 17 and 24.
enzymes. The chemical clea/Jage of mismatch (CCM, Figure 18.3B) method is Characterization of the variants showed that
a sensitive method for mutation detection, with the advantages that quite subject P2 was heterozygous for R1 0700 (exon
large fragments (more than 1 kb in size) can be analyzed, the location of the 17). Other variants were nonpathogen ic. [(A)
mismatch is pinpointed by the size of the fragments generated, and variants courtesy of Andrew Wallace, St Mary's Hosp ital,
present in only 5% or so of the sample can be detected. However, it never Manchester. (B) courtesy of Hans Scheffer,
attained wide popularity because many of the protocols use toxic chemicals University of Groningen, The Netherlands,]
and aUare experimentally quite difficult.
7 140 210 280 350 420 490 560 630 700 770 840 910 980 1050 1120
1600
control
6 1400
1200
5 1000
'>
S 800
g 4
.~
600
1¥0 3 ~
400
~• E 200
• 2 •u 0
1 H\/ l
"•,
u
~
0
900 patient
~
BOO
700
°2 3 4 5 6 600
400
300
200
100
1 1
0
Figure 18 .3 Scanning the DMD and NF2 genes for mutations. (A) Sca nning (or DMD mutation by denaturing high-performance liquid chromatography
(dHPlC) . Exon 6 of the dystrophin gene gives different patterns in an affected male (blue trace) and a normal control (red trace). Sequencing revealed
the splice site mutation c.738+ 1G> T. For males, because this is an X-linked condition, test DNA must be mixed with an equal amount of normal DNA to
allow the formation of heteroduplexes. (6) Mutation scanning of the NF2 (neurofibromatosis 2) gene by chemical cleavage of mismatches. A ftuorescently
labeled meta-PCR product contained exons 6- 10 of the NF2 gene. The lower track shows the sample from a patient; a heterozygous intron 6 splice site
mutation 600-309 is revealed by hydroxylamine cleavage of the 1032 bp meta-PCR product to fragments of 813 + 239 bp (arrows). The upper track is a
control sample. {(Al courtesy of Richard Bennen, Children's Hospital, Boston. (6) courtesy of AndrewWaliace, St Mary's Hospital, Manchester.]
576 Chapter 18: Genetic Testing of Individuals
0
[l2A
12G GTCGTATCCllTGCC7TAC AG • • probe are indicated as - (white), + (light
green), or ++ (dark green).
12C GTCGTATCCt~ TGCCTTACAG • a H
12T GTCGTATCC~TGCCTTACAG 1 • •
TCGTATCCdaGCCTTACAGT • 2
13G
[13A TCGTATCC ~~ GCCTTACAGT , 2
+1
13C TCGTATCCakGCCTTACAGT • 2
13T TCGTATCCa1GcCTT ACAGT a H 1 •
<--
5' _ CCGG 3'
• In defining genetic changes in tumors, because tumor suppressor genes are 3' GGCC r: 5'
often silenced by methylation (see Chapter 17, p. 549). ----; M
PCR products are always unmethylated, because they are made from the nor
mal monomers supplied in the reaction mix. Thus, none of the methods described
J.
PCR product
so far gives any information on the methylation patterns in a DNA sample.
Genomewide studies of methylation use chromatin immunoprecipitation with RSlt.lr4I 18.6 Using restriction enzymes to
an antibody against methylated DNA (see Chapter 11, p. 353), but to study the analyze the methylation status of a DNA
methylation status of an individual sequence, two main methods are used: sequence. The restriction enzyme Hpall cuts
((GG, but only when the cytosines are not
• Restriction enzyme digestion. Hpall cuts only unmethylated CCGG, whereas
methylated . Mspl cuts the same sequence
MspJ cuts any CCGG, whether or not it is methylated. Thus, methylated regardless of methylation. If genomic
genomic DNA gives different patterns of restriction fragments with the two DNA is digested with Hpall before peR
enzymes. Alternatively, a PCR template containing a CCGG sequence can be amplification, any amplicon that contains a
digested with either Hpall or MspI before amplification. If the CCGG is meth ((GG sequence will give a product only if
ylated, the template will not be cleaved by HpaII , and the peR will yield a methylation of the ((GG site protects it from
product (Figure 18.6). digestion.
578 Chapter 18: GeneticTesting of Individuals
o Bisulfite modification. As described in Chap ter 11 (see Box 11.4), when DNA is
treated with sodium bisulfite, cytosine (but not 5-methylcytosine) is con
verted to uracil. The changed nucleotides can be identified in various ways.
The most obvious method is conventional sequencing of the bisulfite-treated
product. Cytosine and 5-methyJcytosine in the original genomic DNA show as
thymine and cytosine, respectively, in the sequence trace. Alternatively, PCR
pri mers can be designed to amplify speci fi cally either the alte red or unaltered
sequence. Often, however, a quantitative assay is required. The original
genomic DNA will be derived from many cells, and it is often rele vant to ask
how frequently a given CpG dinucleotide is methylated. Pyrosequencing (see
Figure 8.8) or melting curve analysis can be used for this purpose. All these
procedures require very careful optimization and control of conditions to
yield reliable data.
Method Comments
,
Restriction digestion of PCA-amplified DNA; check size only when the mutation creates or abolishes a natural restriction site or one engineered by
of prod ucts on a gel the use of specia l PCR primers (see Figure 18.7); tests for a single mutation
Hybridize peR-amplified DNA to allele-specific general method for specified point mutations; large arrays allow maSSively parallel
oligonucleotides (ASO) on a dot-blot or gene chip
scanning for a very large number of mutations or SNPs
Oligonucleotide ligation assay (OLA) general method for specified point mutations (see Figu re 18.9); a few dozen tests can be
multiplexed
Pyroseq uencing high-throughput method for identifying a few nucleotides at a specified position;
quantitative result (see Figure 8.8)
Mass spectrometry very fast « 1 second) versatile general method; quantitative result (see Box 8.8 and
Figure 8.26)
peR with primers located either side of a translocation successful amplificatiOn shows the presence of the suspected deletion or specified
breakpoint rearrangement
Check size of expanded repeat for dynamic repeat diseases (see Table 13.1 ) only; large expansions may req uire Southern
blots
580 Chapter 18: Genetic Testing of Individuals
Sickle"cell disease only this particular mutation produces the sickle"cell p.E6V in HBB gene (see Figure 7.9)
phenotype
Achondroplasia only G380R produces this pa rt icular phenotype; high two distinct changes, both causi ng p.G380R in
frequency of new mutan t cases FGFR3 gene
Huntington disease, myotonic dystrophy ga in -ot-func t ion m utations unsta ble expanded repeats (see Table 13.1 )
Fragile X common molecular mechanism: expan sion of an see Table 13.1; other mutation s occur, but are
unstable repeat rare
Charcot-Marie-Tooth disease (HMSN1) common molec ular mechanism: recombination dup lica tion at 1.5 Mb at 17pl1 .2 (see
between m isalig ned repeats Figu re 13.25); point mutations also occur
a.- and 13- thalassemia selection for heterozygotes leads to different ancestral see Figure 13. 19 (a-tha lassemia) and
m utations being common in different populations Table 18.6 (j3-thalassemia)
Tay-Sachs disease fo un der effect in Ashkenazi Jews; anci ent heterozygote two common HEXA mutations in Ashkenazim:
advantag e 4 bp in sertion in exon 11 (73%); exon 11
donor splice site G>C (15%)
Cystic fibrosis common ancestral mutations in northern European see Table 18.7 and Section 3.5
popu lations, ancient heterozygote advantage
See Section 13.4 (or a further discussion of the reasons why some diseases show a limited range of mutations, whereas o thers have extensive allelic
heterogeneity.
PCR primer 3' (----- CA r'1 A -- -5' 3'(- ---- CA T~ A --- 5'
peR product 5' - - - GT GAG J' A - T - - - 3' S-- -GTG T GTA CT - --3'
Seal
Flgure18.7 l ntroducing an artlticial diagnostic restriction site. An A> T mutation in the intron 4 splice site of the FAce gene
does no t create or abolish a restriction site. The peR prime r stops short of this altered base but has a single base mi smatch (red G)
in a non-critical position that does not prevent it from hybridizing to and am plifyi ng both the normal and mutant sequences. The
mismatc h in the primer introduces an AGTACT restriction si te for Seal into the PCR product from the norma l sequence but not the
mutant sequence. The Scal-digested product from homozygous normal (N), heterozygous (H), and homozygous mutant (M) patients
is shown. (Courtesy of Rachel Gibson, Guy's Hosp ital, London.)
TESTING FOR A SPECIFIED SEQUENCE CHANGE S81
Figure 1a.8 Multiplex ARMS test to detect 29 cystic fibrosis mutations. 1 234
~~~~,~,...------'----~
Each sample is tested with four multiplex mutation-specific peR reactions ABC DAB C DAB C DAB C D
(lanes A-OJ. Within a multiplex, each product is a different size. If one of the
29 mutations is present there is an extra band. The identity of the mutation is
--- --- ---
---- -- --- - ----
revealed by the position of the band in the gel, and by which track contains it.
Each tube also amplifies two control sequences-these are the top and bottom
bands in each gel track; they differ between tracks so that each multiplex carries ... ...
its own signature pattern. Note that the normal alleles of mutations other
than p.FS08del (band in lane B) are not tested for. Samples 1,2, and 3 show no
mutation-specific bands. In sample 4, the extra bands in lanes A and D show
the presence of p.F508del and c.'898+ 1G>A, respectively, showing that this
DNA comes from a compound heterozygote. (Courtesy of Michelle Coleman,
st Mary's Hospital, Manchester; data obtained using the Elucigene™ kit from
Orchid 8iosciences.)
to labeled PCR-amplified test DNA. The same reverse dot-blot principle is applied
on a massively parallel scale in DNA micro arrays (see above and Chapter 7,
p.207).
5'
tive for detecting variant alleles that are present at very low levels. Applications
would include testing for residual disease in cancer, testing for rare somatic
mutations, and testing for paternally derived alleles in fetal DNA extracted from 5' ----->
1
<--- 3'
maternal blood. The reaction uses a primer with a dideoxy 3' end that pairs at the
position of the mutant nucleotide. In the presence of pyrophosphate, DNA ligation creates peR tem plate
polymerase will remove the 3' dideoxynucleotide, in a reversal of the normal
polymerization reaction, but only if it is correctly paired with the test strand.
,
(B)
The polymerase then extends the primer in the normal way, using normal dNTPs. 3'
The variant nucleotide is thus read twice: once by the reverse polymerase reac 5'
tion, and again by the forward reaction. This dual selectivity results in an extremely ~ ,
low level of errors. 3' S
1000
500
J'l
900 .£
""
c
600 g
300 ~
.......----' ~
IC.390s l ~
1500
1000
500
DNA polymerase plus four differently labeled dd NTPs. The test DNA acts as tem Figure 18.10 Using the oligonucleotide
plate for the addition of a single labeled dideoxynucleotide to the primer (Figure ligation assay to test for 29 known
CF mutations. A multiplex OlA is performed
18_111. The nucleotide added to the primer is identified , for example, by funning
and the products are amplified by PCR.
the extended primer on a fluorescence sequencer, allowing the base at the posi Ligation oligonucleotides are designed so
tion of interest to be identified. Minisequencing can readily be adap ted to an that products for each mutation and its
array- based format (APEX; ar rayed primer extensjon) to allow the entire sequence normal counterpart can be distinguished
of a gene to be checked for base substitutions in a single operation. The sam e by size and color of label. A ligation product
idea, b ut using reversibly blocked fluorescently labeled monomers, is the prin from the splice site mutation 62 1+ 19:>t is
ciple behind the llIumina/Solexa ultra-high -throughput sequencing technology seen (red box). The person may be a ca rrier
(see Chapter 8, p. 220). After a single nucleotide has been added and identified, or may be a compound heterozygote w ith
the blocking groups can be removed wi th a suitable flash of light, and the cycle a second mutation that is not one of t he 29
can be repeated to identify the next nucleotide. detected by this kit. (Cou rtesy of Andrew
Wallace, St Mary's Hospital, Manchester.)
~
"'u
,,'-"
""0; " , Hu u ,
u~ u'-"
131\
u
~
0;
;8
~
U
0
:s" gw
~
"c
N "';;;:•" ,
'-"
~
u
w
'"
0;
u
iii .£
111
"000 g
~
'"
'"'" ill 8 "~ ~
'"a
" "'"
00
~
" «>
N "l
'" '"" '" '"
~
I I
1\ in the BRCA 1 gene. Eleven seg ments of
, A II I
I, , 1\ wild·type
the 8ReA 1 gene were amplified from the
test DNA in a multiplex peR reaction. Eleven
1\ (I ONA
specific pri mers were then added. each
\~~
with its 3' end adjacent to a nucleotide
to be queried, together with four dye
labeled ddNTPs and DNA polymera se. The
jl
A 1\ J c. 185de1AG polymera se added a single colored ddNTP
,I b /1, M -'1\ .
.4 .\ mutation to each primer, t he color depend ing on the
relevant nucleotide in the test DNA. Primers
were different lengths, so that the produ cts
{\ II A IIA 1 liJcj J1
c.S382insC
mutation
of this multiplex reac tion could be se parated
and identified by capillary electrophoresis.
The figure shows results of four samples,
three of which have mutations, each in
heterozygous form.IFrom Reviliion F,
f~ ~\ J\l, 1 ~l j
Ve rdlere A, Fournie r J et al. (2004) Gin. Chem.
II, c.300T>G
mutation
50, 203-206. With permi ssion from the
American AssociatIon for Clinical
e Chemistry Inc.1
TESTING FOR A SPECIFIED SEQUENCE CHANGE 583
pyrosequencing
As described in Chapter 8, pyrosequencing is a method of examining very short
stretches of sequence-typically 1-5 nucleotides- adjacent to a defined start
point. The main use is for SNP typing, in which only one or two bases are
sequenced. PYTOsequencing uses an ingenious cocktail of enzymes to couple the
release of pyrophosphate that occurs when a dNTP is added to a growing DNA
chain to light emission by luciferase (see Figure 8.8). A primer is hybridized to the
test DNA and offered each dNTP in turn. When the correct dNTP is present, so
thatthe primer can be extended, a flash of light is emitted. The method has been
developed into a machine that can automatically analyze 10,000 samples a day.
Output is quantitative, so that allele frequencies of a SNP can be estimated in a
single analysis of a large pooled sample. The same technology, applied on a m as
sively parallel scale, is tile basis of the Roche/ 454 technique for ultra-high
throughput sequencing (see Figure 8.9) .
.,:!U
1
PCR amplification/
complexity reduction
100 nucleotides long. A test takes less than a second in the machine. For small
oligonucleotides, the accuracy is sufficient to deduce the base composition - 60 Mb
directly from the exact mass. Alternatively, SNPs can be analyzed by primer exten
sion using mass-labeled ddNTPs. The spots of DNA to be analyzed can be arrayed 100
on a plate and the machine will automatically ionize each spot in turn. Current
systems can genotype tens of thousands of SNPs per day and, as with pyrose
quencing, samples can be pooled to measure allele frequencies directly. ~ 1
fragmentation and
end-labeling
tion would be impracticable and also, as the number of prim er pairs increases,
the number of undesired primer-primer interactions increases exponentially. In
highly multiplexed PCR reactions, much of the product consists of unwanted
artifactual primer-dimers. Second, an extremely high level of allele discrimina
tion is required if 500,000 SNPs are to be gena typed witho ut producing thou
sands of erroneous results. Distinguishing homozygotes for the two SNP alleles
may not be too difficult, but reliably identifying heterozygotes can be a serious
Flgur.18.12 SNP genotyping using
challenge. arrayed allele-specific oligonucleotides. In
The multiplexing problem is solved by arranging for the seque nces to be the Affymetrix system, genomic DNA is cut
amplified to carry universal adaptors on their ends, so that all sequences can w ith a restriction enzyme. Universal adaptors
be amplified by using a single set of primers. Different strategies are used to com (blue) are ligated to the fragments, allowing
bine single-primer amplification with high allelic discriminarion: them to be amplified using a single pair of
peR primers. PCR products are fragmented,
• In the Affymetrix system, universal adaptors are ligated onto restriction frag
labeled, and hybridized to 25-mer
ments of whole genomic DNA. After PCR amplification using a single pair of
oligonucleotides on the microarray, each of
primers, the PCR products are fragmented, labeled, and hybridized to oligo which specifically hybridizes to fragments
nucleotides on the microarray (Figure 10.12). Genotyping uses the same sys containing one allele of o ne specific SNP.
tem as mutation detection, described above. Forty probes per SNP are [From Matsuzaki H, Loi H, Dong S et al. (2004)
arranged in five quartets for the forward strand and five for the reverse strand. Genome Res. 14,414-425. With permission
The principle was illustrated in Figure 18.5. from Cold Spring HarbOr Laboratory Press.)
84 Chapter 18: Genetic Testing of Individuals
! annealing
sequences at its two ends (red), universal
primer-binding sequences (P 1 and P2, blue),
and a locus-specific tag sequence (green) .
·,V
rm'~a g
Two separate base-spedfic primer exten sion
reaction s are used, corresponding to the two
alleles of the SNP being assayed (here, A or
G). After Circularization and then cleavage to
e"enSlonl-O'
~
P'
ligation
I '
.".nslon
! Iigation
release the MIP in linear form, the products
of the two base-specific rea({ions are
amplified with universal primers labeled
P, .
.,
,ag
.,
--------+ tag
., P,
'ag lag
5" - _ - 3' 5' --+----< • 3'
F£gure 18.14 SNPgenotyping by the Illumina GoldenGate@method. Two allele II tliI!!!I: ~ II !I1111! II
specific (P, and Pl) oligonucleotides and one locus-specific (P3) oligonucleotide are
used, with a combination of allele -specific primer extension and DNA ligation to provide II!I ill lill g11111111111
reliable discrimination of alleles. As with the MIP system (see Figure 18.13), ligation
creates a PeR substrate that is amplified with universal dye-labeled primers, and the
J, annealing
3'
amplified with one of two universal dye-labeled primers (figure 18_14). The P, ~
LSO carries a tag sequence that is used to hybridize the PCR product to a spe
cific bead in a bead array.
ywng
5'~ T
IBg P, 3'
P, lextension
18.4 SOME SPECIAL TESTS ligation
-
! f'CR labeling
techniques P,
.ag p"
Generally, introns are much bigger than the exons they flank, and so random T
+p y..'
breakpoints in a gene will most probably lie within an intron. Thus, kilobase , .
scale partial deletions or duplications of a gene sequence, if they do not lie wholly
within one inrron, will most probably remove or duplicate one or more whole
exons. Such changes will not be apparent when genomic DNA is amplified exon
P,
4t.:::::!. C
.ag _'i.
P,
by exon. Tn heterozygous deletions of whole exons, the mutant allele will give no
product. Th e test sees only the normal allele, and the result looks entirely normal. 1A
P3
Similarly, duplications will not show up by these methods. PCR in its normal form
• tag
G- -______ _
is not qu antitative, so any extra yield of prod uct would not be noticed. RNA test
ing would solve the problem, but RNA may be difficult to obtain.
This is a problem that is particular to kilobase-scale deletions or duplications. ! capture
Small deletions or duplications that lie wholly within an exon will be apparent
when the exon is sequenced. Large-scale copy- number changes that might be
anywhere in the genome are efficiently detected by array-CGH. However, this is 'A G~
an expensive technique for routine diagnosis, and the normal BAC arrays have tags
low sensitivity for changes involving only a kilobase or so of DNA. When the can
didate gene is already known, the technique of choice for detecting whole-exon
deletions or duplications is multiplex ligation -dependent probe amplification
(MLPA).
The multiplex ligation-dependent probe amplification (MLPA) test
MLPA is a development of the oligonucleotide ligation assay (OLA) described
above. As in the OLA, pairs of oligonucleotides hybridize to adjacent locations on
the test DNA, and DNA ligase seals the ga p between them to produce a single
molecule. In the OLA the gap is positioned at a suspected va riant nucleotide and
ligation will occur only if there is an exact match; in MLPA the gap is positioned
at a (hopefully) invariant nucleotide in the test DNA, so ligation will always take
place if the template sequence is present. Ligation creates a PCR-amplifiable
molecule (see Figure 18.9). The presence of a PCR product signals the presence of
the appropria te matching sequence in the test DNA.
MLPA is a multiplex procedure with up to 45 probe pairs combined in some of
the availabl e kits. The individual PCR prod ucts are distinguished by length, so as
to generate a series of peaks when the products are run on a DNA sequencer
(I'"JgUJ'C 18.15). Although the PCR is not truly qua ntitative (absolute amo unts of
product vaty between probes), the relative amounts of a particular product in
two samples should reflect the relative amounts of the template in the two sam
ples. When results hom test and control samples are compared, the relati ve peak
heights give a direct readout of the dosage of each sequence in the test DNA rela
tive to the control DNA. No rmally, one ligation is used for each exon of a gene,
allowing whole -exon deletions and duplications to be detected.
586 Chapter 18: Genetic Testing of Individuals
(A)
Figure 18.1 ~ Using multiplex ligation
~--n:;~-- exon
.~J illLllilluuJJJI1un1l:IDlil
5 10
sample. Numbered peaks represent products
from each exon; peaks labeled c are control
probes_ (B) The same analysis on DNA from
a patient with breast cancer. In comparison
I- ' - - "--'---'---r I _ . - t-··· _··,··.....· - . t····,- .. _. '- 1 ....... '- I
with the control samp le, the exon 13 peak is
(8) only half the size. (Courtesy of MRC Holland.)
IA)
tra<;k 1 2 3 4 5 6 7 8 9 10
exon
48,--
51 -----... -- --- _ ......
--------=-
-'. /" '---a:::===_
~.!".,,~i,~·
43_
== - - -
--. 1_
45-
-........
-- ---
-----.--,-
50 ../'
53 %
47
60~
'/'j ... - ........
...... -
~~~~ - --
-.. - ~ .--
Flg ur .. 1 • •1 ~ Multiplex screen for
52/ dystrophin deletions in males, (AJ Products
of multiplex PCR amplification of nine exons,
pri mers using samples from 10 unrelated boys with
Duchenne/Becker muscular dystrophy. PCR
(8) primers have been designed so that each
exon, with some nanking intron sequen ce,
exon 43 44 45 46 47 48 49 50 51 52 53 60
- .........\ _ ,."- .. -tt-I~;---.~_ ."' _ ~
gives a different-sized PCR product.
track
1 -- -- . ...
Solid lines show exons definitely deleted,
2 -
~ "------====~
deletion running into untested exons.
6 - ,
or deletions of exom no t examined in this
7
test. Exon sizes and spacing are not to scale.
8
9
Compare with Figure 13.15. HAl courtesy of
10 ~-- _ -4
R. Mountford, Liverpool Women's Hospital.]
SOME SPECIAL TESTS 587
1 2 STR45
allele 1 -
I,
..
I,
-
II , II,
---
II, 11 4 ((11 111 2
'1C?T? ir
2 -'
- -
-
234
-
1 2 3 4 1
3 31
U ir ii 1r
l'
III III
1 2 1 2
the gene) will reveal 98% of all deletions. It is important to consider alternative
explanations for failure to amplify: maybe there was a technical failure ofthe PCR
or a base substitution in one of the primer-binding sites. Most deletions remove
more than one exon. Deletions of just a single exon and deletions that seem to
i '1r
Figure 18.17 Deletion carriers in a
affect non- contiguous exons need confirming. Deletions can be confirmed by OMD family revealed by apparent non
using alternative PCR primers or by MLPA. About 5% of dystrophin mutations are maternity. Pedigree (AI and results (B) of
duplications of one or more exons, and detecting these requires MLPA in males genotyping with the intragenic marker
as well as in females. STR4S. The affected boy, 1111, has a deletion
that includes STR45 (his lane on the gel
Apparent non-maternity in a family in which a deletion is segregating is blank). His mother, Ill, and his aunt, 113,
inherited no allele of STR45 from their
If a deletion is segregating in a DMD family, genotyping females for microsatel m othe r, 12, showing that the deletion is being
lites that map within the deletion may reveal apparent non-maternity, in which a transmitted in the family. 12 is apparently
mother has transmitted no marker allele to her daughter because of the deletion homozygous for this highly polymorphic
(figuJ"(' 18.17). In such families, non-maternity proves that a woman is a carrier, m arker (lane 2), but in fact is hemizygous
whereas heterozygosity (in the daughter or sister of a deletion carrier) proves tha t as a res ult of the deletion. The boy's other
a woman is not a carrier. Several markers suitable for this purpose have been aunt, 11 4, and his sister, 1112, are heterozygous
identified in the introns at deletion hotspots. The method works best in families for the marker and therefore do not carry
the de leti on. (C) Pedigree showing how the
in which there is an affected male in whom the deletion can first be defined. In
m arker genotypes of individuals 12, 112, and 113
principle, the same approach could be applied to tracking any other deletion, are hemizygous as a result of the deletion.
whether X-linked or autosomal, through a family. However, for any autosomal
deletion, a quantitative method would be needed to identify any he terozygous
deletion in the first place, and if such a method were available it would be simpler
to use the same method to test other family members.
,"
02151435
.. 021511
~ u.
02151270
,,,
- 0135634
~.
-
018S535
u. ...
/,1 11 ..I ,( 1 I !
018S391
.. 0135742
~ u.
D18S386
,,,
013S634
.. ,
1. II
,,' ,~
018S978
.. - 021S1411
'" ~. OR
0135628
".
~I II .1.1 J
Figun '8.'8 Prenatal diagnosis of trisomy 21 with QF-peR. Fetal DNA, obtained by chorionic villus biopsy or amniocentesis,
was ampli~ed in a multiplex PCR reaction using dye- labeled primers for four highly polymorphic microsatellites from each of
chromosomes 13, 18, and 21. The markers from chromosomes 13 and 18 each show two equal-sized peaks (if heterozygous) or one
larger peak (if homozygous). Markers from ch romosome 21 all show three peaks (02151411) or two peaks in a 2:1 size ratio (02151"
02151270, and 02151435) , (Courtesy of Susan Hamilton, St Mary's Hospital, Manchester,)
--
..l.ClB. .
'-
•
-
lanes 3, 4, 5, 7, and 8 are from affected people. lane 5 is a juvenile-onset case;
her father (lane 4) had 55 repeats but she has 86. lane 9 is an affec ted fetu s.
(CAG)"
•
diagnosed prenatally. (8) Myotonic d ystrophy. Southern blo t o f genomic DNA (epeat number: ..
digested with EeoRI and hybridized to a labeled probe consisting of part of
the OMPK gene. Bands o f9 kb or 10 kb are normal variants. the result of a
nonpathogenic insertion-deletion polymorphism. The grandfather (lane 4)
*-
-
86 - .
55-.
44
~
100,.
.
has cataract s but no other sign of myotonic dystrophy. His 10 kb band seems threshold
to be very slightly expanded compared with the same band from the normal of normal
m ale in lane 3, but thiS is not definite on the evidence of this gel alone. His alleles 16-
daughter (lane') has one normal and one definitely expanded 10 kb band; she
has classical adult-onset myotonic dystrophy. Her son (lane 6) has a massive
::
expansion and the severe congenital form of the disease. (C) Fragile X. The 1 2 3 4 5 6 7 8 9 10
DNA of the inactivated X chromosome in a female, and of any X chromosome IB)
ca rr yi ng the full mutation, is methylated. Genomic DNA is digested with a ~ O
combination of EeoRI and the methylation-sensitive enzyme EclXl, Southern /I
•
i 0
blotted, and hybridized to Oxl.9 or a similar probe. The X chromosome in a
normal male (lane 1) and the active normal X chromosome in a female (lanes
6
2,3,4, and 6) give a small fragment (labeled N). Un methylated pre-mutation
alleles (P) give a slightly larger band in lanes 4 and 5 (female pre-mutation
carriers) and lane 7 (a normal transmitting male). Methylated (inactive)
X-chromosome seq uences do nor cut with EcfXl and give a much larger band
(NM), and the fully expanded and methylated sequence gives a very la rge lOkb
9kb
smea red band (F) because of somatic mosaicism. [(Al courtesy of Alan Dodge, St
Mary's Hospital, Manchester. (B) and (el courtesy of Simon Ramsden, St Mary's
Hospi tal, Manchester.) 1 2 3 4 5 6
IC)
The mutation screen for some diseases must take account of feoRI EcfXl (eGG)" EcoRI
geographical variation ! !
Ox1.9
The population genetics of recessive diseases is often dominated by founder
effects or the effects of heterozygote advantage (see ClJapter 3, p. S9). The result
<2
ing limited diversity of mutations in a population can make genetic testing much
easier. ~- Thalassemia and cystic fibrosis are good examples. For both these con m
ditions, a very large number of different mutations in the relevant gene have been rti
described, bur in each case a handful of mutations account for most cases in any
particular population. With ~-thalassemia, DNA testing is not needed to diag
nose carriers or affected people-orthodox hematology does this perfectly well IF
but it is the method ofclJoice for prenatal diagnosis. Different mutations are pre INM
dominant in different populations rfable 18,6). Provided that one has DNA
samples from the parents and knows their ethnic origin, the parental mutations
can often be found by using only a small cocktail of specific tests, after which the
fetus can be readily checked.
!n cystic fibrosis, the p.FSOSdel mutation is the commonest in all European
populations and is believed to be of ancient origin. However, the proportion of all
mutations that are p.FSOSdel varies, being generally high in the north and west of Ip
Europe and lower in the south. Testing for cystic fibrosis mutations divides into
IN
two phases. First, a limited number of specified mutations, always including
p.F50Sdel but otherwise population-specific, are sought with the methods
described in Section IS.3. As Tahle 18.7 shows, there is no obvious natural cutoff 1234567
in terms of diminishing returns on testing for specific mutations. If this phase
fails to reveal the mutations, then, if resources allow, a screen for unknown muta
tions maybe instit1lted, using the methods describedinSection IS.2.Alternatively,
gene tracking (see below) may be used. The impact ofthis diversity on proposals
for population screening is discussed in Chapter 19.
Surprisingly often, when a recessive disease is particularly common in a cer
tain population, it turns out that more than one mutation is responsible. An
example is Tay-Sachs disease among Ashkenazi Jews, where there are two cam
mon HEXA mutations (see Table lS.S).!t is difficult to explain this situation except
by assuming there h as been a long-continuing h eterozygote advantage, favoring
the accumulation of mutations in the gene.
590 Chapter 1B: Genetic Testing of Individuals
selectio n favoring heterozygotes. Data courtesy of J Old, In stitute of Molecular Medicine, Oxford.
~o, complete absence of ~-globin chains; ~+, ~·globi n present but in in suffi Cie nt quantity. The
nomenclature of mutations used here is nonstandard (see Box 13.2, p. 407), but is widely used for
thalassemia.
p.FS08del and a few of the other relatively common mutations are probably ancient and
spread through selection favoring heterozygotes; the other mutations are probably recent, rare,
and highly heterogeneous. Cystic fibrosis is more homogeneous in this population than in most
othe rs. See Box 13.2 for nomenclature of mutations. Data courtesy of Andrew Wallace, St Mary's
Hospital. Manchester.
case in which this does happen. Worldwide, 20-50% of children with autoso mal
recessive profound congenital hearing loss have mutations in the GJB2 gene that
encodes connexin 26. Different specific mutations are common in different pop·
ulations-c.30deIG in Europe, c.235de1C in East Asia, c.167delT in Ashkenazi
Jews. A simple PCR test therefore provides the answer in a good proportion of
cases. For the remainder, it would be necessary first to sequence the whole GJR2
gene and then, ifresources allowed, to examine a large number of other genes.
In many other cases, no one gene accounts for a significant proportion of
cases. Learning difficulties present the ultimate challenge in this respect. Custom
chips are being developed to allow a large panel of genes to be screened; alterna·
tively, the new exon capture and ultra· fas t sequencing technologies may allow
dozens of genes to be sequenced at a reasonable cost. Technically, the challenge
is identical to the problem of screening a person's DNA for large numbers ofvari·
ants that confer susceptibility or resistance to common complex diseases.
However, data interpretation presents different problems. For susceptibility
screening, the problem is to know the combined risk from many variants, each of
which modifies risk to only a small degree. Risks cannot be simply added or mul
tiplied; combining them req uires a quantitative model of the effect of each vari
ant on overall cell biology. For heterogeneous Mendelian conditions this is not a
problem: one mutation is responsible for the con dition. The problem is the large
number of unclassified variants that will undoubtedly be identified.
592 Chapter 18: Genetic Testing of Individuals
18.5 GENETRACKING
Gene tracking was historically the first type of DNA diagnostic method to be
widely used. It uses knowledge of the map location of the disease locus, but not
knowledge about the actual disease gene. Most of the Mendelian diseases that
form the bulk of the work of diagnostic laboratories went through a phase of gene
tracking when the disease gene had been mapped but not yet cloned. Once the
gene had been identified, testing moved on to direct gene analysis. Huntington
disease, cystic fibrosis, and myotonic dystrophy are familiar examples. However,
gene tracking illay still have a ro le even when a gene has been cloned. In the set
ting of a diagnostic laboratory, it is not always cost effective to search all through
a large multi-exon gene to find every mutation. Moreover, there are always
cases in which the mu tation cannot be found. In these circwnstances, gene track
ing using linked markers is the method of choice. The prerequisites for gene
tracking are:
The disease should be well mapped, with no uncertainty about the map loca
tion, so that markers can be used that are known to be tightly linked to the
disease locus.
The pedigree structure and sample availability must allow the determination
Shown here are three stages in the investigation of a late· onset her unaffected chromosome. Her affected chromosome, inherited
autosomal dominant disease where, for one reason or another, direct from her dead father, must be the one that carries marker allele 1.
testing for the mutation is not possible. By typing Ilil and her father, we can work out which marker allele
Individuallill (arrow), who is pregnant, wishes to have a she received from her mother. If she is 2- 1 or 2- 3, it is good
presymptom atic test to show whether she has in herited news: she inherited marker allele 2 from her mother, which is the
the dis ease allele.The first step is to tell her mother's two grandmaternal allele. If she types as 1- 1 or 1- 3 it is bad news:
chromosomes apart. A marker, close ly linked to the disease locus, she inherited the grand paternal chromosome, which carries the
is found for which 11 3 is heterozygous (2- 1). disease allele.
Next we must establish phase- that is, work out which marker Note that it is the segregation pattern in the family. and not the
allele in 113 is segregating with the disease allele .The maternal actual marker genotype, that is im portant: if 1112 has the same marker
grandmother, I:z, is typed for the marker (2-4). Thus, 113 must have genotype, 2- 1, as her affected mother, thiS is good news, not bad
inherited marker allele 2 from her mother, which therefore marks news, for her.
IA) (6 ) Ic)
2-4
'"
/
p:s
~
disease. Four families each have a child
affected with a recessive disease. Direct
mutation testing is not possible (either
because the gene has not been cloned or
because the mutatio ns could not be found).
rIo ONA 2-1 2-1: 50'1f1 Qlant8 affected
500f.c chance homC1Z'J&OUs (A) No diagno sis is po ssible if there is
normal no sample from the affected child. (8) If
everybody has the same heterozygous
(C) (0 ) genotype for the marker, the result is
p:s p:g
2-1 no pre.:::lICtIOfl 2-2 1-1! hO/T'lOZYE:OU5 norm61
2-1: carner Cerror = 2~)
(e-rror '" &2)
not clinically u~ful. (el If the parents are
homozygou s for the marker, no prediction
is possible with this marker. (0) Both parents
are heterozygous ca rriers, and the genotype
of the affected child shows that in each
parent the pathogenic mutation is on the
chromosome that carries allele 2 of the
2-2 : Affecle<:l terror = 461 marker, allow ing a successful predictio n to
be made. The error rates shown are the risk
of predicting an unaffected pregnancy when
could have passed on the disease allele to the proband, and who mayor may not the fetu s is affected, o r vice versa, if the
have actually done so. The process always follows the same three steps: marker used shows a recombination fraction
1. Distinguish the two chromosomes in the relevant parent(s)-that is, find a 8with the disease locus. These examples
closely linked marker for which they are heterozygous. emphasize the need for both an appropriate
pedigree Structure (DNA must be available
2. Determine phase-that is, work out which chromosome carries the disease from the affected child) and informative
allele. marker types.
3. Work out which chromosome the proband received.
FIgure 18.20 shows gene tracking for an autosomal recessive disease. The
pedigrees emphasize the need for both an appropriate pedigree structure (DNA
must be available from the affected child) and informative marker types. Even if
the affected child is dead, if the Guthrie card used for neonatal screening can be
retrieved, sufficient DNA for PCR typing can usually be extracted from the rem
nants of the dried blood spot. Informativeness of the marker should not be a big
problem. With more than 20,000 highly polymorphic microsatellites mapped
across the human genome, it should always be possible to find informative mark 1 2
ers that map close to the disease locus. A'5
B' 3
Recombination sets a fundamental limit on the accuracy of gene
11
tracking 1 2 3
Because the DNA marker used for gene tracking is not the sequence that causes
the disease, there is always the possibility that recombination may separate the
"'·5-2
B' 3--<; '"'
8"'7
disease allele and the marker. This would lead to an erroneous prediction. The 111
recombination fraction, and hence the error rate, can be estimated from family 1 2
studies by standard linkage analysis (see Chapter 14). With almost any disease A·5 A-5-1
there should be a good choice of markers showing less than 1% recombination 8 -6 8 "3-7
with the disease locus. This follows from the observations that 1 nucleotide in 300
Figure 18.21 Gen@tracking in Duchenne
is polymorphic, and that loci 1 Mb apart show roughly 1% recombination (see
muscular dystrophy with flanking
Chapter 14, p. 446). Ideally, one uses an intragenic marker, such as a microsatel mark@rs.The family has been typed for
lite within an intron. two polymorph isms, A and B, that flank
Recombination between marker and disease can never be completely ruled the dystrop hin locus. Individual 1111 has
out, even for very tightly linked markers, but the error rate can be greatly reduced a recombination between marker locus
by using two marker loci situated on opposite sides of the disease locus. With A and OMO, but thi s does not confuse
such jlankingor bridging markers, a recombination between either of the mark the predi ction for 111 2 because we know
ers and the disease locus will also produce a marker-marker recombinant, which unambiguously that the disease allele in
can be detected (e.g. 11[1 in FIgure 16.21). If a marker-marker recombinant is her mother is carried on a chromo so me
bearing marker alleles A'"2 and 8*6.1112 can
seen in the consultand, then no prediction can be made abo ut inheritance ofthe
have inherited OMO only if she has a double
disease, but at least a false prediction has been avoided. Provided that no
recombination, one between marker A and
marker-marker recombinant is seen, the only residual risk is that of double OMO and another between DMD and market
recombinants. As we saw in Chapter 14, the true probability of a double recom B. If the recombination fractions are 8A and
binant is very low because of interference (see p. 445). Thus, the risk of an error 881 respective ly, t hen the probability of a
due to unnoticed recombination is much smaller than the risk of a wrong predic double recombinant is of the order 8A88,
tion due to human error in obtaining and processing the DNA samples. Perhaps which typically will be well under 1%.
594 Chapter 18: GeneticTesting of Individuals
a greater risk is unexpected locus heterogeneity, so that the true disease locus in
the family is actually different from the locus being tracked.
(E), given hypothesis Hi. An example will probably make this clearer
[Figure 1).
of all the hypotheses must sum to l.lt is not important at this 2-1 3
stage to worry about exactly what information you shou ld use to
decide the prior probability, as long as it is consistent across the
columns. You will not be using all the information (otherwise there
would be no point in doing the calculation because you would
already have the answer), and any information not used in the
"'
prior probability can be used later.
Using one item of information not included in the prior
probabilities, calcu late a conditional probability for each hypothesis: III~ is a carrier not a carrier
hypothesis. The conditional probability is the probability of prior probability 1(2 1( 2
the information, given the hypothesis, namely P(EjHi) [not the
conditional (1): DNA result 0.05 0.95
probability of the hypothesis given the information, P(HiIE)]. The
conditiona l probabil ities for the different hypotheses do not conditional (2): CK data 0.7 1
necessarily sum to 1.
joint probability 0.0175 0.475
The previous step can be repeated as many times as necessary
until all information has been used once and once only. The end final probability 0.0175/0.4925 0.475/0.4925
resu lt is a number of lines of conditional probabilities in each = 0.036 = 0.964
column.
Within each column, multiply together the prior and all the Flgwe 1 Calculating the risk that lib is a carrier of DMD.
cond itional probabilities. This gives a joint probability, Individual 1112 w ishes to know her risk of being a carrier of DMD,
P(Hi) x P(EIH i). The joint probabilities do not necessarily sum to which affected her brother Ill, and uncle II,. Serum creatine kinase
1 across the columns. (CK) testing (an indicator of subclinical muscle damage common in
If there are just two columns, the joint probabilities can be used DMD carriers) gave carrier-non-carrier odds of 0.7:1. A DNA marker
directly as odds. Alternatively, the joint probabilities can be that shows on average 5% recombination with DMD gave the
sca led to give final probabilities, which do sum to 1. This is done types shown in red. These form two conditional probabilities and
by dividing each jo int probability by the sum of all the joint allow the risk to be calculated, following the guidelines in th is Box,
probabilities, 'i.[P(Hi) x P(E!Hi)]. to give her overall risk of being a carrier as 3.6%.
GENE TRACKING 595
f ~
subjects, the program can ca(culate the third.
For linkage analysis, the program is given
(A) and (8), and calculates (C). For calculating
(8) segregation of ( (C) linkage relation between genetic risks, the program is given (B) and
markers In the pedigree ) disease and markers (C). and calculates (A).
Bayesian calculations give a quick answer for simple pedigrees, but the calcu
lations can get very elaborate for more complex ones. Few people feel fully confi
dent of their ability to wo rk through a complex pedigree correctly, although the
attempt is a valuable mental exercise for teasing out the factors contributing to
the final risk. An alternative is to use a linkage analYSis program.
Using linkage programs for calculating genetic risks
At first sight it may seem surprising that a program designed to calculate lod
scores can also calcnlate genetic risks, but in fact the two are closely related
ll'lgure 18.22). As described in Chapter 14, linkage analysis programs are gener
al-purpose engines for calculating the likelihood of a pedigree, given certain data
and assumptions. For calculating the likelihood of linkage we calculate the ratio
likelihood of data I linkage, recombination fraction e
likelihood of data I no linkage (e - 0.5)
For estimating the risk that a proband carries a disease gene, we calculate the
ratio
likelihood of data I proband is a carrier, recombination fraction e
likelihood of data I proband is not a carrier, recombination fraction e
As in Box 18.2, the vertical line I means given t/WI.
(A) (8)
M C Fl F2
--_f---- 0
~
i,
~
"
E •
<--- E <;
.,E g
• '1 2 3
- <--
-- <--
(SF/PO 5 +
055818 5 +
F13A7 6 +
075820 7 +
0851179 8 + + +
THO 1 11 + + + +
VWA 12 + + + +
0135317 13 +
FESlFP5 15 +
0165539 16 + +
,.
018551 18 + + +
0195433 19 r +
0215" 21 + + +
Amelogenin X, Y + + +
Current panels are COOlS and SGM+. The UK first panel and SGM panel are no longer used; they are
included to illustrate how testing panels have evolved over time. The match probability of a profile
is the probability that two unrelated Individuals would have the same profile by random chance. It is
calculated by multiplying together the chance of a match for each individ ual marker allele.
assuming p to be the same for every bin and ignoring the need to match on
intensity as well as position). Even for p =0.2, plOis only 10-7 It is imperative
that the same binning criteria be used for judging matches between two pro
files and for calculating p. The criteria can be arbitrary within certain limits,
but they must be consistent.
MOO=lW==~=~UO~_=2W2~=_=_
Molecular weight (bp)
= __ =3W=
Figur. '8.24 A DNA profile produced using the SGM+ set of markers. The SGM+ multiplex uses three d ifferent colored labels for
primers to make the results clearer. Numbers in boxes are the number or repeat units in each microsatellite allele. [From Jobling MA
& Gill P (2004) Not. Rev. Genet. S, 739-7 51 .With permission from Macmillan Publishers ltd.]
the complete genotype from a single definable ancestor. For the Y chromosome,
a panel of short tandem repeat markers is used, whereas mitochondrial geno
types are defined by SNPs. An interesting example was the identification of the
remains of the Russian tsar and his family, killed by the Bolsheviks in 191 7, by
comparing DNA profiles of excavated remains with those of living distant rela
tives (see Further Reading). These markers are also sometimes used in criminal
investigations, particularly when the question arises of whether the criminal
might be a relative of somebody who se profile is stored on a forensic database.
Ge netic markers provide a much more reliable test of zygosity. The extensive
but now obsolete literature on using blood groups for this purpose is summa
rized by Race & Sanger (see Further Reading). DNA profiling is nowadays the
method of choice. Th e Jeffreys fingerprinting p robe provided an immediate
impression: samples from MZ twins look like the same sample loaded twice, and
samples from DZ twins show differences. When single-locus markers are used, if
twins give the same types, then for each locus the probability that DZ twins wo uld
type alike is calcul ated. If the parents have been typed, this follows from Mendelian
principles; othetwise, the probability of DZ twins typing the same must be calcu
lated for each possible parental mating and weighted by the probability of that
mating calculated from population gene frequencies. The resultan t probabilities
for each (unlinked) locus are multiplied, to give an overall likelihood PI that DZ
twins wo uld give the same results with all the m arkers used. The probability that
the twins are MZ is then
Pm = m /[m+ (1 - m)Pd,
where m is the proportion of twins in the population who are MZ (about 0.4 for
same-sex pairs).
The match probabilities in Table 18.8 are the probability that an unrelated indi
vidual wo uld have a specific DNA profile-for example, the profile of DNA recov
e red from a crime scene. These are conserva tive estimates: some all eles are m ore
common than others, so the exact match probability will vary between different
profiles. For the current CODIS a nd SGM+ m arker sets, the chances of a fortu
itous match look so low as to make a DNA match irrefutable. However, three fac
tors may reduce the degree of certainty.
First, the chances are much higher that relatives would give a full match.
Because crim inality often ru ns in families. this is not a purely academic point.
Second, crime-scene samples often do not give a full profile. When the quan tity
of DNA is very small and / or is degraded, some alleles may fail to amplify (allele
drop-out). Below a certain threshold quantity of DNA (typically 100 pg, which is
the DNA content of about 17 cells), stochastic effects begi n to be significant.
Running the PCR for extra cycles or in a reduced volume may produce a profile
when the standard protocols fail, but the profile is often only partial. This so
called Low Co py Number (LCN) technique has been quite controversial. Some
major crimes have been solved with it, but there are questions about how reliable
it is. Finally, crime-scene samples often contain a mixture of the DNA of several
individuals. How can the individual profiles be teased out of a mixed profile?
Provided that the component DNAs are present in different amounts, th e peak
sizes can be used to disentangle the individual contributions. With LCN data this
is much more of a probl em, because stochastic effects mean that the relative
amounts of PCR products do not necessarily reflect the amounts of template. In
addition, with ve ry small amounts of DNA, even the m ost trjvial contamination
can be very significant.
The markers in the CODIS and SGM+ sets have been chosen, among other
reasons, because they give no personal information; they m erely serve to identify
a person. This helps make profiling less contentious. However, the DNA sample
potentially co ntains all sorts of personal information. One intriguing possibility
concerns the link between V-chromosome haplotypes and surnames. In societies
in which the surname is taken from the fa ther, such links should exist. Pilot stud
ies have indeed demonstrated linkage, at least for unusual surnames. Much of
600 Chapter 18: GeneticTesting of Individuals
our physical appearance is also encoded in our DNA. People dream of a DNA
photofit-a portrait (and maybe surname) of the criminal revealed by the DNA
left at the crime scene! Considerable developm ent work is taking place toward
that goal, but it remains far distant. Currently only eye and hair color can be even
roughly predicted. It is also worth remembering that, whatever might be possible
with a good blood sample, crime-scene DNA is often too limited in quantlty and
quality to allow extensive genotyping.
Courtroom issues
If the genotypes do all match, the court needs to know the odds that tlle criminal
is the suspect rather than a random member of the population. Of course, if the
alternative were the suspect's brother or the suspect's identical twin, the odds
would look very different. It is also important to remember that a DNA match
may prove beyond reasonable doubt iliat the suspect had been present at the
crime scene, but (apart perhaps in rape cases) it cannot prove what he or she did
iliere. By itself, a DNA match is not normally enough to secure a conviction.
The fate of DNA evidence in courts provides a fascinating insight into ilie dif
ference between scientific and legal cultures. Suppose that a suspect's DNA pro
file matches the scene-of-crime sample. There are still at least iliree obstacles to
a rational use of the DNA data.
First, the jury may simply not believe, or may perhaps choose to ignore, the
DNA data, as evidently happened in the 0.). Simpso n trial. Maybe they will decide
the incriminating DNA was planted-a not unreasonable proposition.
Second, an unscrupulous lawyer may try to lead the jury into a false probabil
ity argumen t, the so-called Prosecutor's Fallacy. This consists of confusing the
probability that the suspect is innocent, given the match, wiili the probability of
a match, given that the suspect is in nocent. The jury should consider the first
probability, not the second. AI; Box 16,3 shows, the two are very different.
Finally, objections may be raised to some of the principles by which DNA
based probabilities are calculated:
• The multiplicatiue principle, iliat ilie overall probabilities can be obtained by
multiplying ilie individual probabilities for each allele or locus, depends on
the assumption that genotypes are independent. If the population actually
consisted of reproductively isolated groups, each of whom had genotypes
that were fairly constant within a group but very different between groups,
Suppose that a court case has good evidence that the person whose DNA was found at the crime
scene committed the crime. The suspect's DNA prolile matches the crime-scene sample. Does that
make the suspect guilty? Consider two different probabilities:
The probability that the suspect is in nocent, give n the match.
The probab ility of a match, given that the suspect is inn ocent.
Using Bayesia n notation (see Box' 8.2) with M = match, G = suspect is guilty, and! = suspect
is innoce nt, the first probability is PAM and the second is PMI!. The Prosecutor's Fallacy consists of
arguing that the relevant probability is PMI! when in fact it is P!IM. The calculation below shows
how different these two probabilities are.
If the suspect were guilty, the samples would necessarily match: PMIG = ,. Let us suppose
that population genetic arguments say there is a 1 in 10° chance that a randomly selected person
would have the same profile as the crime-scene sample: PM1'= , 0-0. Suppose that the gu il ty
person could have been anyone of 107 men in the population. If there is no other evidence to
implicale him, he is Simply a random member of the population, and the prior probability that
he is gUilty (before conSidering the DNA evidence) is PG = 10- 7. The prior probability that he is
innocent is P! = , - 10- 7, which is very close to , .
Bayes's theorem tells us that
P,IM ~ (P, X PMII) / IIP,X PMII) + (PGx PMIG)]
= 10-6/(10-6+ 10-7 )
= , ,O/ Ll
~O.9.
CONCLUSION
In this chapter we have reviewed some ofthe many ways in which aDNAsequence
can be checked for the presence of known or unspecified variants.
For cases in which it is necessary to scan one or more candidate genes for any
variant, DNA sequencing is the main tooL Normally, genomic DNA is sequenced,
but cDNA may be preferable for large genes with numerous small exons, or if
abnormal splicing is suspected. However, RNA requires more careful handling
than DNA. DNA or RNA samples are generally amplified by PCR (RT-PCR in the
case of RNA). !fit is importantto reduce the sequencing load, various techniques
allow a quick scan of exons of a gene, allowing sequencing to be restricted to
those that show some variant. These techniques exploit differences in the prop
erties of DNA heteroduplexes or in the conformation of single-stranded DNA.
Alternatively, microarrays can be used to scan a whole gene sequence for vari
ants. The continuing rapid improvements in sequencing technology have made
all these alternatives to sequencing less attractive; for most diagnostic laborato
ries the main bottleneck in mutation analysis is now in analyzing the sequence
data rather than in generating it.
Another group of tests are designed to detect specific mutations in a particu
lar gene. Commercial kits for such tests are increasingly available. Allele-specific
oligonucleotide probes can be used in reverse dot-blot hybridization or, on a
large scale, as probes on a DNA micro array. In ARMS (amplification refractory
mutation system) testing, pairs of allele-specific primers are used in PCR to dis
criminate between normal and mutant sequences. In the oligonucleotide liga
tion assay (OlA), mutations prevent the DNA ligation of two paired oligonucle
otides and thus PCR amplification of the test sequence. Specialized sequencing
technologies such as pyrosequencing or minisequencing identify just one or a
few nucleotides adjacent to a primer, thus allowing specific mutations to be
detected. The same micro array techniques used for mutation detection can also
be used to search whole genomes for specific mutations or SNPs.
Special tests are required to identify heterozygous deletions or copy-number
changes in genomic DNA. Multiplex ligation-dependent probe amplification
(MLPA), a variant of the OlA test, is a popular method for detecting deletions or
duplications of exons within a gene. At the level of whole chromosomes, quanti
tative PCR of a panel of fluorescence-labeled microsateUites (QF-PCR) is often
used as an alternative to standard karyotyping for prenatal diagnosis ofthe com
mon fetal trisomies. For genomewide detection of copy-number changes, array
comparative genomic hybridization or high-resolution SNP chips can be used.
The older technique of gene tracking in a family pedigree is stili useful in some
situations, but care needs to be taken in the choice of DNA marker and calcula
tion of the risk.
DNA profiles, consisting of genotypes for a panel of polymorphic markers, are
of great use in identifying individuals and determining relationships. The older
DNA fingerprinting technique, based on hypervariable minisatellites, has been
superseded by single-locus profiling using standardized panels of polymorphic
microsatellites. The main use of DNA profiling is in criminal investigations, in
which profiling and the associated police DNA databases have revolutionized
practice but have also raised concerns about possible infringements of civilliber
ties. Courts have not always found it easy to handle the scientific questions raised
by DNA profiling within an adversariallegal ftamework. It is important that they
should recognize the limitations, as well as the powers, of DNA profiling. Other
uses for DNA profiling include resolving cases of disputed paternity, identifying
the origin of clinical samples (e.g. between mother and fetus, or transplant donor
and recipient), and determining the zygosity of twins.
The emphasis in this chapter has been on the techniques used to genotype a
sample; except for the problem of unclassified variants there was little need to
discuss the significance of the result. However, the same techniques can be used
in situations where there is much more uncertainty about how useful the results
might be. One such case is the use of genetic tests to predict somebody's response
to a drug. Individual variations in response often have a genetic basis, and there
are hopes that in the future genetic testingwili lead to more efficient prescribing,
with the ultimate goal of personalized medicine. Howrealistic is this goal? Further
questions arise from the use of genetic tests in population screening. This raises
FURTHER READING 603
FURTHER READING
Best practice guidelines program that combines the biophysical characteristics of
amino acids and protein mUltiple sequence alignments to
American College of Medical Genetics. httpJlwww.acmg.neV
predict where missense substitutions in genes of interest fall
Clinical Molecular Genetics Soc iety (UK). http://cmgsweb.shared
in a spectrum from deleterious to neutral.]
.hosting.zen.co.uk/
J. Craig Venter Institute: SIFT (Sorting Intolerant from Tolerant).
Cystic fibrosis mutation databa se. httpj/www.genet.5ickkids
http://sift.jcvi.org/[Predictswhetheranaminoacid
.on.ca/cftrl (The best source of overview information about
substitution affects protein function on the basis of sequence
cystic fibrosis mutations.]
homology and the physical properties of amino acids.}
Leiden muscular dystrophy pages. http://www.dmd.nl!
PolyP hen (Polymo rphism Phenotyping). genetics.bwh.harvard
[A comprehensive and well -maintained source of information
.edu/pphl [A tool that predicts the possible impact of an
on all molecular aspects of mu scular dystrophy.)
amino acid substitution on the structure and function of a
RNA analysis human protein.)
Andreutti-Zaugg C, Scott R & Iggo R (1997) Inhibition o f
Methods for testing for a specified sequence
nonsense-mediated messeng er RNA decay in clinical samples
facilitates detection of human MSH2 mutations with an Fakhrai-Rad H, Pourmand N & Ronaghi M (2002) Pyrosequencing:
in vivo fusion protein assay and conventional techniques. an accurate detection platform for single nucleotide
Cancer Res. 57, 3288-3293. [Deals with the use of puromycin polymorphisms. Hum. Mutat. 19,479-485.
to identify mRNAs containing premature termination Liu Q & Sommer SS (2004) PAP: detection of ultra rare mutations
codons.) depends on P* oligonucleotides: "sleeping beauties"
awakened by the kiss of pyrophosphorolysis. Hum. Mutar.
Methods for scanning a gene for mutations 23,426-436.
Cotto n RGH, Edkins E & Forrest S (eds) (1998) Mutation Detection: Monforte JA & Becker CH (1997) High-throughput DNA analysis
A Practical Approach. IRL Press. [Individual chapters review a by time-of-flight ma ss spectrometry. Nor. Med. 3, 360-362.
w ide range of methods.]
Keen J, Lester D, Inglehearn C et al. (1991) Rapid detection of
Large-scale 5NP genotyping
single base mismatches as heteroduplexes on Hydrolink gels. Affymetrix GeneChip. http://www.affymetrix.comltechnology/
Trends Genet. 7, 5. index.affx [Requires a log-in.]
Sheffield VC, Beck JS, Kwitek AE et al. (1993) The sensitivity of Fan J-B, Oliphant A, Shen Ret al. (2003) Highly parallel SNP
single strand conformational polymorphism analysis for the genotyping. Cold Spring Harbor Symp Quant Bioi. 68, 69-78.
detection of single base substitutions. Genomics 16, 325 - 332. Gut IG (2004) DNA analysiS by MALDI-TOF mass spect rometry.
Sheffield Vc, Beck JS, Nichol s Bet al. (1992) Detection of Hum. Mutat. 23,437 - 441.
multiallele polymorphi sms within gene sequences by Hardenbol p. Yu F, Belmont J et al. (2005) Highly multiplexed
GC-clamped denaturing gradient gel electrophoresis. Am.). molecular inversion probe genotyping: over 10,000 targeted
Hum. Genet. 50, 567-575. SNPs genotyped in a single tube assay. Genome Res,
van der Luijt R, Khan PM, Vasen H et al. (1994) Rapid detection 15,269-275 .
of translation-terminating mutation s at the adenomatou s IIlumina BeadArray. http://www.il lumina.com/pages.ilmn?ID=5
polyposis coli (APC) gene by direct protein truncation test. Stanssens P, Zabeau M, Meersseman G et al. (2004) High
Genomics 20, 1-4. throughput MALDI·TOF discovery of genomic sequence
Wallace AJ, Wu CL & Elles RG (1999) Meta-PCR: a novel method polymorph isms. Genome Res. 14, 126-133.
for creating chimeric DNA molecules and increasing the T6nisson N, Zernant J, Kurg A et al. (2002) Evaluating the
productivity of mutation sca nning techniques. Genet. Test. arrayed primer extension resequencing assay ofTPs3 tumor
3,173-183. suppressor gene. Proc. Nat/Acad. Sci. USA 99, 5503-5508.
Va rious authors (2002) The Chipping Forecast II. Nor. Gener.
Analysis of DNA methylation 32 (Sup pi), 465- 551 . [Reviews of microa rray-based
Ehrlich M, Nelson MR, Stanssens Pet al. (2005) Quantitative high technologies.]
throughput analysis of DNA methylation patterns by base
specific cleavage and ma ss spectrometry. Proc. Natl Acad. Sci. Gene tracking
USA 102, 15785-15790. Bridge PJ (1997) The Calculation Of Genetic Risks: Worked
Laird PW (2003) The power an d the promise of DNA methylation Examples In DNA Diagnostics, 2nd ed. Johns Hopkin s
markers. Nar. Rev. Cancer 3, 253-266. University Press. [A very detailed set of calculation s covering
Zesc hnigk M, Lich C, Buiting K et al. (1997) A Single-tube PCR test almost every conceivable situation in DNA diagn os tics .)
for the diagnosis of Angelman and Prader-Willi syndrome Young ID (2006) An Introduction To Risk Calculation In Genetic
based on allelic methylation differences at the SNRPN locus. Counseling. Oxford University Press. (A clinically oriented set
fur. J. Hum. Genet. 5, 94-98. [An example of methylati o n of risk calculations.)
speCific PCR.]
DNA profiling
The problem of unclassified variants Gill p, Ivanov PL, Kimpton C et al. (1994) Identification of the
International Agency for Research on Cancer: Align-GVGD. remains of the Romanov family by DNA analysis. Nat. Genet.
agvgd.iarc.fr/agvgd_input.php [A freely available, Web-based 6,130-135.
Chapter 19
Pharmacogenetics,
Personalized Medicine,
and Population Screening
KEY CONCEPTS
Medicine in the post-genomic era holds the promise of testing of individuals to deliver
personalized medicine and screening of populations to determine susceptibility to
comnlOn diseases.
Any clinical test, including genetic tests, should be evaluated by first considering its
analytical validity, which includes its sensitivity, specificity, and predictive value.
• A broader evaluation of a clinical test should consider clinical validity, clinical utility, and
ethical implications.
• The most crucial question about a proposed test is how often the result, in combination
with other available data, will move participants across a threshold for possible
further action.
• Pharmacogenetics is concerned with the way in which genetic variations between
individuals cause different responses to a drug.
• Many pharmacogenetic variations affect the rate of absorption or metabolism of drugs
(pharmacokinetics) .
Other pharmacogenetic variations affect the response of a drug target to a given level of
the drug (pharmacodynamics).
• One aim of personalized medicine is to tailor the drug and dosage that an individual
receives to his / her genotype. Obstacles to personalized prescribing include the inadequate
predictive power of most genotypes and the need to develop rapid real- time bedside
genotyping methods.
• Genotype-specific prescribing is closest to reality in cancer medicine, in which some drugs
are already prescribed only after a tumor specimen has been genotyped.
• Expression profiling of tumors may eventually guide treatment, but its value has yet to be
confirmed by large prospective studies.
• Testing for susceptibility to complex diseases is a growing but largely unregulated industry.
Tests marketed directly to consumers over the Internet are often not supported by evidence
of clinical utility.
• Population screening usually aims at defining a high-risk subset of the population. This will
almost certainly include many individuals who do not, in fact, have the condition being
screened for. People in the high-risk group are then offered some intervention, most usually
a definitive diagnostic test.
• Screening tests usually involve a trade-off between sensitivity and specificity, with an
arbitrary intervention level being defined to optimize the compromise. Screening programs
need very careful consideration to ensure that they do more good than harm.
.. A vision for the future role of genetics in medicine is to move from the current reactive
'diagnose and treat' system to a proactive 'predict and prevent' system. Whether this will be
achievable, and if so when it will happen, are extremely contentious and
unresolved questions.
606 Chapter 19: Pharmacogenetics, Personalized Medicine, and Population Screening
The simplest measures of test performance are itssens itivity and specificity. Using the numbers
defined in Table 1, the sensitivity is al(a + b); that is, the proportion of all people who have
the condition who are identified by the test. The specificity is defined as dl(c + d ), and is the
proportion of all people who do not have the condition where the test result correctly predicts
absence of the condition. Other measures include:
Fal se positive rate = d(a + c): the proportion of positive test results that wrongly identify a
False negative rate = b/{b + d): the proportion of negative test resul ts that wrongly identify a
Positive predictive value = 0/(0 + c): the proportion of positive test results that correctly
Negative predictive value:: d/(b + d ): the proportion of negative test resu lts that correctly
---
Condition present
Condition absent
Test positive a (
Test negative b d
these cases the analytical validity covers just laboratory quality issues. It should
not be taken for granted. In 2001, investigato rs in California sent hair samples
from the same person to six of the leadi ng US laboratories that offered hair min·
eral analysis to people concerned about nutritional deficiencies. The results from
the different laboratories were grossly disparate and overall not far from random.
Before relyingon results from a laboratory it is advisable to check that it is enrolled
in a reputable quality assurance scheme.
The measures become more interesting if the test is defined as the totality of
what the laboratory does to try 10 answer a clinical question. In Chapter 18 we
distinguished between testing for a specific predefined sequence change and
scanning an entire gene for mutations. The clinically relevant question is usually
whether any mutation has been detected anywhe re in the relevant gene. The sen
sitivity of the test will depend on how hard we look. Will we sequence every exon,
or will we rely on a method such as single strand conformation polymorph ism or
melting curve analysis? Will we use multiplex ligation-dependent probe am plifi
cation (MLPA) to check for whole exon deletions, in addition to sequencing
exo ns? Scanning the entire gene also raises the problem of unclassified variants,
discussed in Chapter 18. If a nonpathogenic variant is mistakenly reported as a
positive test result, the specificity of the test is reduced.
The likelihood ratio measures the likelihood that a perso n who Risk genotype present 800 700
has the susceptibility factor in question will deve lo p the disease,
compared with the likelihood of a person in the general population Risk genotype absent 200 ' 300
(who mayor may not have t he susceptibility factor), Alternatively. one -
could measure the likelihood relative to somebody who does not have The OR is (800 x 300)/(200 x 700) = 1.7. However, if the factor
the susceptibility factor. In either case, calculating the likelihood ratio were present in only 8% of (ases and 7% of controls, the OR in the
requires knowledge of the epidemiology of the disease. same study would be (80 x 930)/(920 x 70) == 1.15. A more intuitive
For case-control studies, the usual measure is the odds ratio (OR). measure of the relative risk, the frequency of the risk genotype in cases
In contrast with the likelihood ratio, this can be calculated directly compared with controls, is the same in both examples: 8/7 or 1.14.
from the results of the study, as shown in Table 1, without the need for Odds ratios approach that intuitive risk ratio when the absolute risk is
wider population data. Its clinical validity depends, of course, on the low (8% versus 80% in the examples above).
control group's being r~presentat ive of the wider population, as well as Odds ratios must be interpreted w ith great caution, especially
being matched to the cases for other factors that could independently for high-frequency risk alleles. It would be easy, but very wrong, to
produce an association with the disease. suppose that the odds ratio of 1.7 in our first example means that
having the risk genotype gives a person a 70% greater chance of
developing the disea se. However, the intuitive interpretation of an OR
Table 1 Categories in a case-control study of 1 is correct It means that a factor has no effect on risk.
Despite being less intuitive, ORs are used in case-control studies
Cases Controls
because of their statistica l properties, particularly in relation to logistic
regression, the statistical technique used to tease out single effects
Risk genotype present a c
from a complex set of factois. Note that the OR in Table 2 does not
Risk genotype absent b d change if the number of cases or controls is changed, provided that the
proportion of each who have the risk genotype remains unchanged.
If we consider a genetic risk allele, the OR will probably be different
In this study, for a person who has the risk genotype, the odds of for people heterozygous or homozygous for that allele. Unless there is
being affected (being a case rather than a control) are a:c or alc; l . For clear evidence of dominance or recessiveness, susceptibility alleles are
a person without the risk genotype, the odds are b:d or bld: l. The odds usually assumed to have a multiplicative effect. The ORs for genotypes
ratio is AA, Aa, and aa are assumed to scale in the ratio
1:r:r~.
(ale) = adlbe.
A per-allele OR is no rmally quoted.
(bl<fl
The mathematically adept can find some extensions of these
Imagine a study of 1000 cases and 1000 controls in which the risk calculations in the paper byWang and colleagues (see Further
factor is present in 80% of the cases but only 70% of (ontrols (Tabl. 2). Reading).
In the end, the value of a test must be judged by the changes it brings about.
Whatever the performance of a test according to the criteria outlined above, in
the end what matters is what it achieves for patients. Does it change what people
do? For most conditions there will be some threshold for action-for taking
X-rays, doing a biopsy, or prescribing a drug. The result of any single test will be
considered alongside all other relevant information-the age and sex of the
patient, their general health and social situation, and the results of any other tests.
A useful test is one that makes a substantial contribution to moving people across
the action threshold, in either direction. In this connection, absolute risk is much
more important than relative risk. A test may alter somebody's risk tenfold, but if
the effect is only to change the risk from 1 in 10,000 to 1 in 1000 it will probably
not make any practical difference. In contrast, a positive test for a BRCAI muta
tion has a relative risk of only about 7 (80% for a carrier versus 12% general popu
lation risk), butthe test result is important because of the high absolute risk.
cause an adverse reaction in a few individuals. Some of these differen tial effects
are due to environmental causes: a person's ability to absorb or metabolize a drug
may be changed by his illness or lifestyle. Sick people are often taking multiple
drugs, and some combinations may interact. But many differences are due to
genetic variation between people. Pharmacogenetics is the study of the roles of
specific genes in these effects, whereas pharmacogenomics uses genomewide
tools for the same purpose.
Pharmacokinetics studies the absorption, activation, metabolism, and excre
tion of drugs, whereas pharmacodynamics considers the actual target response.
In other words, pharmacokinetics addresses what the body does to a drug,
whereas pharmacodynamics addresses what the drug does to the body. Genetic
factors are relevant to both. There are at least four stages at which genetic varia
tions can affect a patient's response to a drug:
o Absorption--individuals may cliffer in their ability to transport an oral drug
into the bloodstream.
o Activation-many drugs are given in the form of a prodrug that must undergo
an enzymatic reaction, usually in the liver, to convert it to th e active fo rm.
o Target response-in different patients, the process or pathway targeted by the
drug may respond differently to a given local concentration of the drug.
Cata.bolism a.nd excretion-individuals often differ in the rate at which they
catabolize and dispose of the active drug molecule. Slow metabolizers will
have a longer or stronger response to a given concentration of the drug than
fast metabolizers will.
Adverse drug reactions are a serious problem. It has been estimated that they
are responsible for about 100,000 deaths a year in the USA, and for I in 15 of all
hospital admissions in the UK. Added to this is all the wasted time and money,
and the extra suffering involved when a patient fails to respond to a drug- an
especially severe problem with psychiatric drugs.
derivative of the substrate. The P450 cytochromes probably evolved to deal with
toxic plant metabolites, and variants may be important in determining the
response to chemicals in the diet and environment, as well as to prescribed drugs.
The enzym es are found mainly in the liver, but some occur in the small intestine
and elsewhere. Some drugs have an additional effect of inducing or inhibiting
specific P450 enzymes, which can lead to unforeseen interactions between these
drugs and those that are substrates of the enzyme. The Cytochrome P450 Drug
Interaction Table (see Further Reading) maintains comprehensive lists.
Humans have about 60 P450 genes, encoding enzymes that, between them,
are responsible for the phase I metabolism of maybe 60% of all prescribed drugs.
They are grouped into several families. Genes have names such as CYP2D6 (cyto
chrome P450 family 2, subfamily D, polypeptide 6). At least 10 different P450
enzymes have Sign ificant roles in phase I drug metabolism. Many of them can
metabolize a considerable range of different drugs, and individual drugs are often
substrates for more than one P450 enzyme. These wide and overlapping specifi
cities mean that there are not always close correlations between anyone P450
activity in a patient and his/ her handling of a specific drug. Nevertheless, some
drugs are metabolized by just one P450 enzyme, and in pharmacological research
these drugs can be used as probes to identify individuals with unusual functional
variants of that enzyme.
CYP2D6
Back in the 1970s, it was noted that some individuals showed markedly enhanced
sensitivity to the antihypertensive drug debrisoquine and to the anti-arrhythmic
drug sparteine. On investigation, these individuals showed high plasma levels of
the drugs but low urinary levels of the catabolism products, implying that they
were failing to metabolize and excrete the drugs. The cause was eventually identi
fied as low activity ofth e CYP2D6 enzyme. Individuals can be classified into poor,
intermediate, extensive, and ultra-rapid metabolizers. Figure 19_2 shows how
CYP2D6 activity governs the effective dose of the antidepressant drug nortriptyl
ine, which is metabolized by this enzyme.
CYP2D6 is involved in the metabolism of perhaps 25% of all drugs. Variation
in its activity has significant effects on the response to some beta-blockers used
to treat hypertension and heart disease, and to several psychiatric drugs includ
ing tricyclic antidepressants. Poor metabolizers are at risk of overdose effects of
these drugs. In contrast, CYP2D6 is also required to convert codeine into its active
form, morphine. Codeine is ineffective in pain relief for poor metabolizers,
whereas ultra-rapid metabolizers risk adverse effects sllch as sedation and
impaired breathing.
90 1
Figure 19..1 Pharmac.ogenetics of
80
o CYP2D6
o 24 48 72
hours
The CYP2D6 gene has nine exo ns and is located on chromosome 22q 13.
Sequencing the gene in poor metabolizers has revealed a variety of point muta
tions. mainly nonsense. frameshift. or splice site changes. and occasional com
plete gene deletions. Interestingly. the Human Genome Project chromosome 22
reference sequence happened to be from a deletion carrier. so CYP2D6 was not
recorded in the original sequence. Ultra-rapid metabolizers have increased num
bers of CYP2D6genes (Figure 19.3).
Other P4S0 enzymes
Another P450 cytochrome. CVP2C9. hydroxylates various drugs. including non
steroidal anti-inflammatory drugs. sulfonylureas. inhibitors of angiotensin
converting enzyme. and oral hypoglycemics. For example. rare poor metaboliz
ers have an exaggerated response to tolbutamide. a hypoglycemic agent that is
used to treat type 2 diabetes. The role of CYP2C9 variants in hypersensitivity to
warfarin is discussed below in the section on personalized medicine.
CVP2C19 metabolizes a variety of drugs. including anti-convulsants such as
mephenytoin. proton pump inhibitors such as omeprazole (used to treat stom
ach ulcers). proguanil (an antimalarialJ. and certain antidepressants. Subjects
can be classified phenotypically into poor. intermediate. and extensive metabo
Iizers (FIgure 19.4) Intermediate metabolizers are compound heterozygotes for
an active and an inactive allele; poor metabolizers have two inactive alleles. The
frequency of poor metaboHzers is 3-5% among Caucasians and African
Americans. but higher in Orientals and very high in Polynesians. Mephenytoin
and omeprazole are degraded by CYP2C I9. so poor metabolizers require a lower
dose. A particular risk for poor metabolizers is unacceptably prolonged sedation
CYP2C/
with a standard dose of diaze pam as a result of slow demethylation by CYP2C19.
In contrast. proguanil is a prod rug that requires activation by CYP2C I9 to form
the active molecule. cycloguanil; poor metabolizers therefore show a decreased
effect of that drug. ( 37% 1
CYP3A4 is another P450 cytochrome involved in the metabolism of maybe
40% of all drugs. Its activity in the liver varies up to thirtyfold between individu
als. but the cause of thls variability has been hard to identify. Coding sequence
variants are not common. The enzyme is inducible. and it may be that regulatory
effects are the main cause of the variability. A further complication is that the
activity of CYP3A4 strongly overlaps with that of yet another P450 enzyme.
CYP3A5.
•
o
poor metabolizer
Figure 19.4 Pop ulation frequencies of P450 activity variants.
The distributions differ in different populations. (Data for CYP2D6 adapted r·----; intermediate metabolizer
from Meyer UA (2004) Not Rev. Genet. 5, 669-67 6. Data for CYP2C9 for northern
Europeans from Service RF (2005) Science 308, 1858- ' 860. Data for CYP2Cl 9 L,"~j extensive metabolizer
forTaiwanese from Li ou YH, Lin CT, Wu YJ & Wu L5 (2006) J. Hum. Genet.
51,857-863.J
D ultra-rapid metabolizer
PHARMACOGENETICS AND PHARMACOGENOMICS 613
Other P450 enzymes with important roles in drug metabolism include CYPIA2
and maybe CYP2A6. Figure 19.4 shows the frequencies of variants for three of the
mos t important P450 enzymes.
24
~
u
~
,
13"
12 slow ra te of acetylation
0; Figure 19.5 A bimodal di stribution
~
adverse effect of the drug. Other drugs for which variations in acetylation rate SH
can be clinically important include procainamide (an anti-arrhythmic), hydrala
zine (an antihypertensive), dapsone (anti-leprosy). and several sulfa drugs.
H,N, ...--..
Y '-./
Jl,NH~
( _
. NH COOH
...............
and 0.3% are homozygous, for low-activity variants of TPMT. These individuals
require a lower dose of the drugs. Homozygotes can suffer life-threatening bone
marrow toxi city when given a standard dose of either drug. Three relatively com
~
~_ '~~~
. ' ~.
.~.' _
activating an intracellular signaling cascade.
Beta-l receptors are located mainly in
__ • . _ , p.389 j ' g(GIY
the heart and kidneys, and are the target
of many drugs, both beta-blockers and
l)
,tl
'-'t
':"'"" -
('
"'.... .
'-<\".0- 1:0
;J1" i beta-agonists. A common polymorphism,
p.389 Arg/Gly, affects the response to many
:x;p '
,,"Ox:cctlfOn:n:=J:XX
" ~
'I::==D:>=J=).:= :n:xn'/?
"
:s »
CO:<),. beta-blockers.
L:':::::..rJX£y:n::c..~xCD~--o:xd
I. ~'~"-'--~ r.a:::o:~u :o::u:a::o:::cco:, eaOH
However, much depends on precisely what response is being studied, The review
by Evans & McLeod (see Further Reading) gives details.
TheADRBl gene located on chromosome lOqZ4-Z6 encodes the beta-l recep
tor. This has only limited homology to the beta-Z receptor and is targeted by dif
ferent drugs (Figure 19.7). A common polymorphism, p.Arg389Gly, has been
tested for pharmacodynamic effects. The Gly389 allele is associated with a
decreased cardiovascular response to beta-blocker drugs. For example, Arg389
homozygotes had a much better response to the beta-blocker bucindolol than
those who are heterozygous or homozygous for the Gly389 allele.
Variations in the angiotensin-converting enzyme
The angiotensin -converting enzyme (ACE) is a peptidase that converts angi
otensin I into angiotensin II. The latter is an important regulator of blood pres
sure and has many other physiological functions. A polymorphic insertion/dele
tion of an Alu sequence in an intton of the ACE gene is associated with variations
in enzyme activity (Figure 19,8). DD (deletion) homozygotes have about twice
the level of circulating ACE than 11 (insertion) homozygotes.
ACE inhibitors such as enalapril and captopril are widely used drugs for treat
ing heart failure. Several reports suggest that ACE inhibitors are more effective in
DD than 1I patients. The insertion-deletion polymorphism has also been exten
sively studied for association with diseases. DD homozygotes are at increased
risk of myocardial infarction and coronary artery disease, and possibly also for
complications of type Z diabetes, but at slightly decreased risk of late-onset
Alzheimer disease.
Variations in the HT2RA serotonin receptor
The serotonin (S-hydroxytryptamine) receptor 2A has been a target for many
investigations because of the central role of serotonin in many neurological proc
esses. Response to psychiatric drugs is notoriously unpredictable, and several
studies have looked for associations between HT2RA variants and variations in
rrrtrtrfrrtrr'trrlY"fn~ru I allele
insertion ,
Alu
deletion
F1gure 19.8 An insertion-deletion
rrrrrtrfffl~'rm Dallele polymorphism in the ACE gene that
I vasoconstriction: encodes the angiotensin-converting
angiotensin-converting enzyme . / blood pressure i enzyme. Insertion (I) alleles have an
angiotensin
.
(Inactive)
I ! )
angiotensin II
DRVYIHPF
./ )
renal effects''
retenti on of Na", CI-
insertion of an A/u element into an intron
of the gene. Homozygotes for the deletion
allele (D) have higher levels of circulating
DRVYIHPFHl "- excretion of K" enzyme than those with the I allele.
t ~ 1'secretlon or Peptide sequences are shown with the
uloomde cls<M1d C~ . aldosteron e single·letter code.
616 Chapter' 9: Pharmacogenetics, Personalized Medicine, and Population Screening
response. Two variants, p.IleI97Val and p.His452Tyr, have been associated with a
poor response to the antipsychotic drug clozapine, whereas an intronic SNp,
rs7997012, has been associated with variable response to the antidepressant
citalopram. There are sometimes conflicting results regarding possible associa
tions of HT2RA variants with conditions including schizophrenia, obsessive
compulsive disorder, and alcohol dependence.
Malignant hyperthermia and the ryanodine receptor
Rare individuals have a life-threatening response to inhalation anesthetics such
as halothane or isoflurane. They suffer severe muscle damage (rhabdomyolysis)
and an extreme rise in temperature. This condition, knovvn as malignant hyper
thermia (MH), is genetic but heterogeneous. Many, but not all, cases have muta
tions in the ryanodine receptor gene, RYR1, on chromosome 19q13. This gene
encodes a calcium release channel in the muscle sarcoplasmic reticulum. It is
thought that MH results when a mildly temperature-sensitive ryanodine recep
tor allows excessive calcium flow, which causes the temperature to rise and initi
ates a catastrophic positive feedback loop.
VI tamin K
Warfarin
Warfarin is a powerful anticoagulant, used for patients at risk of embolism Or
thrombosis (and also to poison rats and mice by triggering internal bleeding). It
works by decreasing the availability of vitamin K. This is an essential cofactor for
the enzyme y-glutamyl carboxylase that activates blood clotting factors II (pro
thrombin), VI!, IX, and X. During these reactions, vitamin K is converted to the
inactive vitamin K epoxide, and this is recycled by the action of vitamin K epoxide
reductase (VKOR). Warfarin inhibits VKOR (Figure 19.9).
Achieving the correct level of anticoagulation is clinically very important. If
the level is too low, the patient remains at risk of thrombosis or embolism; how
ever, if it is too high, there is a risk of hemorrhage, which can be life -threatening.
The therapeutic vvindow (the range of doses that are efficacious and not harmful)
is narrow. After insulin, warfarin is the most common prescription drug respon
sible for emergency hospital admissions. The average cost of a bleeding episode
has been estimated as $16,000. But the effective warfarin dose varies up to twen
tyfold between individuals. This has made warfarin something of a test case for
the general utility of pharmacogenetically informed prescribing.
Warfarin as normally used is a mixture of two stereoisomers, R-warfarin and
S-warfarin. Both isomers are active, but the Sform is about three times as potent
as the R form. CYP2C9 is the principal enzyme that catalyzes the conversion of
S-warfarin to inactive 6-hydroxy and 7-hydroxy metabolites, whereas the oxida
tive metabolism of R-warfarin is catalyzed mainly by CYF IA2 and CYP3A4. Vari
ants in CYF2C9 and VKORCl (the gene encoding subunit 1 of VKOR) , together
vvith a limited set of environmental factors, can explain about 50- 60% ofthe vari
ability of response to warfarin. However, as Figure 19.10 shows, several other
LIVER CELL
<£YP3A~ I @P2~
~
~
R..wari'an n $w;J..-i;:mn· l
~ &hydroxywarfann
7.rlY(lrO~
.£~P2 C~
:0P2Cl~ ~
del1yorowarfarin
Figure 19.10 The catabolism of warfarin.
Warfarin, as prescribed, is a mixture of two
4-t¥lro~~artn] stereo isomers, R- and S-warfarin, each
of wh ich can be catabolized by several
P450 enzymes have a role in removing warfarin. In 2007. the US Food and Drug
Administration put wording on the label of warfarin that recommended (but did
not ma ndate) gena typing these two loci. Many clinicians, however, remain skep
tical about the value of gena typing their patients, because C¥P2C9 and VKORCl
genotypes do not fuU y predict the response. A major goal in pharmacogenetics is
to identify the remaining facto rs and develop a simple test that will accurately
predict the correct dose of warfarin for a patient. The review by Wadelius and
Piromohamed (see Further Reading) describes 30 other genes in which variation
might explain some of the varia ble effect, and discusses the prospects for better
prediction.
micradoslng study
a single very small dose is given to maybe 10-15 healthy VOlunteers to get preliminary data on the
pharmaCOdynamics and pharmacokinetics (this step m ay be omitted)
Nowadays companies try to avoid compounds that are metabolized by the most
variable enzymes. Many of the drugs described above, for which pharmacoge
netic effects are important, were developed yea rs ago when these effects were
less well understood. They are also now off-patent, so that the company has little
financial incentive to develop a genetic test.
Predicting adverse effects is particularly important. It can be both a financial
and a public relations disaster for a company if an adverse effect becomes appar
ent only after a drug has been broughtto market. There will probably be litigation
and claims for compensation. Vioxx® (rofecoxib), a nonsteroidal anti-inflamma
tory drug used to treat arthritis and other forms of pain, is a recent example. Vioxx
was withdrawn from the market in 2004 by its manufacturer, Merck, because of
evidence that high doses could increase the risk of heart attacks and stroke. Sales
in the previous year had generated $2.5 billion in revenue in the USA alone. If it
could be shown that only patients with certain genotypes were vulnerable to the
adverse effect, the company would have a huge incentive to market the drug in
combination with a genetic test. But identifying rare genotypes that predispose
to adverse effects can be difficult. If only a few patients in a phase III clinical trial
experience these effects, there will be little statistical power to identify the cause;
moreover, obtaining the necessary DNA for investigation may be complicated
and require extra consent.
("y"F
(A) signal molecule (8)
01 HN M cl
C~ ~ N ~ O~
± ) (' '-oJl,.l)___
1~ CYTOSOL
cross-phosphoryLation
1
by activated
kinase domains
'0
on the left. Note that these are not pictures
of microa rrays but displays of data. Eac h
column shows data from one tumor and
, \": . I . .':"1'(
'- ceNEt eac h row shows the exp ression level of
I'
.' ......
• •
-""-' one gene. The red -green color sca le shows
~ .'J
•
.n,.
"y'
...... "i
',.
'1' ....rJ._
.-
. .. PJ@""
,.
........ - ,
• •
~
C ~ CLl
c,." upregulated and green for down regulared
"'~I J
..,_ m~ level s. (From SorlieT, Tibshiran i R,
'"',,· 1 .......
i
."'~~,
'...
••
•
• ,.
f'" )
l I
I' J',
'
. ...
.•.
."
'P
I"
... J La" ~"".A,
.. ~
-,
~ t '
"~.' .
•
'
1
I \
I,
B3GNTS
Parker Jet at. (2003) Proc. Natl Acad. Sci. USA
100,8418- 8423. With permiSSion from t he
Nati ona l Academy of Sciences.!
subtype
I
basal ," IILJ • • • ... " ~ . . . .: : , ".. f .:.;r::" ..
normal
breast-like
TABLE 19.1 THE SIGNIFICANCE OF A GENETIC TEST RESULT FOR THREE DIFFERENT CONDITIONS
Test CAG repeat nu mber BRCAI /2 mutation TIC SNP rs7903 146 in TCF7L2 gene
Ris k if test res ult is positive or high-risk 100% 70-85% 6.5% 1.1 36% (bl
Risk If test result is negative or low-ri sk 0% 12% (population risk) 4.5% I.) 27%(bl
The risks for rype 2 diabetes are calculated from frequencies of the risk allele of 0.345 in ca ses and 0.261 in contro ls (Den mark Bsa mple of Helgason et
at (2007) Nar. Genet. 39, 218- 225J assu ming a person with an age and sex-specific risk before testing of 5% (a) Or 300m (b). Test resul t positive means that
a person has one copy of the high-risk allele.
per-allele odds ratios between 1.2 and 1.5), and even these estimates are likely to
be inflated." The point about inflation refers to the phenomenon ofstriking lucky,
discussed in Chapter 15 (see Figure 15.3).
As discussed in Chapter 15, it is not surprising that factors identified through
association studies are weak. To show associations with tagging SNPs, these fac
tors must have persisted in the population for hundreds of generations. Given
that they must therefore be evolutionarily insignificant, it requires something
special for them to be clinically significant. Possible special factors are an inter
action with some specific featu re of modern life, for example: smoking or a West
ern diet; a balanced effect whereby susceptibility to one disease is balanced by
resistance to ano ther; balancing selection in favor of heterozygotes; or an effect
entirely limited to people past reproductive age (over the age of 50 years).
TABLE 19.2 THE 10 STRONGEST ASSOCIATIONS FOUND BY THE WELLCOME TRUST CASE CONTROL CONSORTIUM STUDY
OF 500,000 SNPS IN SEVEN COMPLEX DISEASES
Disease ----------~-------------
( hromosomall ocation of fact or ,
Odds ratio 95% confidence Interval
I
Type 1 diabetes 6p21 (MHC) 5.49 i 4.83-6.24
As shown in Figure 15.9, no strong associations were detected fo r the seventh disease studied, hypertension. MHC, major histocompatibility complex.
Data from Welicome Trust Case Co ntrol Consortium (2007) Nature 447, 66 1- 678.
624 Chapter 19: Pharmacogeneti cs, Personalized Medicine, and Population Screening
1.0
plotted on a graph (Figure 19.14). The area under the curve is a measure of the
0.9
ability ofthe test to discriminate between people with and without the condition
being tested. A completely useless test would give a strillgh t line (area under 0.8
curve = O.S). The more useful a test is, the more the curve bulges up hom the 0.7
diagonal, and the greater the area under the curve. A test that discriminated per :~0.6
fectly would have an area under the curve of 1. .~0.5
o
Figure 19.14 shows the ROC curves from simulations by Janssens and col ~ 0 .4
leagues (see below and Further Reading) of genotyping two, three, four, or five 0 .3 - 2 genes (AUe 0.59)
susceptibility factors for a hypothetical complex disease. We see that the area 0.2 - 3 genes (AUe 0.64)
- 4 genes (AUe 0.68)
under the curve increases as more tests are added to the battery. The exact shape 0.1 - 5 genes (AUe 0.70)
of the curves, and the amount of increase in the area under the curve as a new 00
test is added, dep end on the extra risk identified by that test. The curves in the 0.2 0.4 0.6 0.8 1 .0
1 specificily
figure assume that the successive tests are of variants that confer relative risks of
I.S, 2.0, 2.S, 3.0, and 3.S, in that order. Note that the beneficial effect of adding Figun 19.14 Receiver-operating
eXlta tests is partly a consequence of the fact that each successive test identifies characteristic (ROC) curve for a simulated
more risk than the preceding test. If they all had the same effect on the overall test battery, showing the increasing
risk, the progression would be less marked. discrimination obtained by testing two,
It is important, however, to bring some populatio n genetics into the argu three, four, or five Independent disease
susceptibility factors. The ROC curve plots
ment. Suppose there are 20 independent two-all ele SNPs, each having one allele
sen silivity against (1 - specificity) for a test.
that confers a relative risk of 1.05 for a disease and is present a t a frequency (q) of The more discriminating a test is, the greater
0.3 in the population: SI % of peop le have at least one high-risk allele for anyone is the area under the curve (AUC), [From
SNP (q =0.3, P = 0.7, P' = 0.49, I - P' = 0.51 ). That single test result makes little Janssens AC, Pardo MC, Steyerberg EW &
difference to the person's risk. But if somebody has high-risk alleles at al l 20 lOCi, van Duijn CM (2004) Am. )J1um. Genet 74,
their risk will undoubtedly be high. Similarly, if they have only low-risk alleles at 585-588. With perm is s~from E(sevier.)
all 20 loci, their risk will be low. Such results would very probably have clinical
utility. But these are independent loci, and only one person in (0.51)20 = 1.4 per
million people would h ave the very high risk. One person in (0.49)20 = 0.6 per mil
lion people would have a very low risk; everybody else would be somewhere in
between.
Testing for several factors must provide more information than testing for just
a single factor. The ROC curves in Figure 19.14 illustrated this. Th e more tests that
are added to a battery, the greater is its power to detect significantly high or low
risks-but the smaller is the percentage of people who would show those ex
treme risks. In many ways, the larger a test battery is, the more it resembles test
ing for a Mendelian condition: it becomes good at defining rare people who are at
notably high (or lowl risk. But Mendelian testing is done only when the family
history or clinical signs suggest a high prior risk. The val ue of a test battery must
depend on its ability to report usefully above-average risks for a usefully large
minority of subjects.
In our example with 20 SNPs, we can use the binomial theorem to calculate
that 0.7% of people would have one or more high-risk alleles at 16 or more of the
20 loci, 7% would have high-risk alleles at 14 or m ore loci, and IS% at 13 or more
loci. We cannot, however, calculate the relative risk of these people simply by
doing sums with the relative risks at individual loci. This is biology, not arithme
tic. Different facto rs may affect the same pathway and either be redundant or act
synergistically. A very detailed systems biology understa nding of the pathogene
sis, coupled with extensive epidemiological data, would be necessary to interpret
these intermediate risks.
Figure 19. 15 shows the simulations by Janssens and colleagues (see Figure
19.14) in more detail. With the parameters used in this sim ulation, it is clear that
the chances of a perso n being iden tified as being at very high risk are extremely
small .
100
Figure 19.15 The distribution of risk
prior risk 2 genes 3 genes
80
after testing for two, three, four, or five
independent risk factors for a disease. As
60 more genes are included in the test battery,
~ 40 the test increasingly identifies a very small
•ro
~
percentage of people w ho are at very high
•
~
20
~ risk. In the final panel, the simulation allows
u ~
'0
a for an additiona l environmental risk factor.
~ 100
[From Janssens AC, Pardo MC, Steyerberg EW
4 ge nes 5 genes 5 genes & erNlfOnmen t
& van Duijn eM(2004) Am. J, Hum. Genet. 74,
"
ro
~
0
80
585-588. With permission from Elsevier.]
"- 50
40
20 )
00 20 40 60 80 1000 20 40 60 80 1000 20 40 60 80 100
genetic risk across the population, on the basis ofthe reported allele frequencies
and per-aUele odds ratios. The risk in the highest risk class (homozygous for the
high-risk allele at all seven loci) is 6.3 times the risk in the lowest risk class-but
only six women in a million would faU into either of these classes. The distrib u
tion of genetic risk in cases is only marginaUy different from the distribution in
the general population. The 10% of the population at highest risk accounts for
only 15% of cases.
Clearly, genotyping women for these seven factors would contribute little to
defining a high-risk group for special attention. Maybe these genotypes could be
used to modify the age at which women are offered routine mammographic
screening. This would be similar to the way in which maternal serum biomarkers
are used to modify the age threshold for prenatal testing for Down syndrome (see
Section 19.5). At present, in the UK all women over the age of 50 years are offered
mammographic screening every 3 years. Screening wo uld be more effective if
entry into the program were dictated by combined age and genetic risk, rather
than by age alone. Women with high-risk genotypes wo uld be offered screening
from an earlier age. The corollary (assuming the overall numbers screened remain
the same) is that women with low- risk genotypes should be denied screening
until some later age. Some of them will nevertheless develop breast cancer that
could have been picked up had they been screened at 50 years old, so this may
prove a difficult concept to sell to the public. The proposal could also be seen as
another instance of the tendency to focus on marginal genetic effects rather than
much more significant lifestyle factors. Calorie intake and weight have far more
influence on susceptibility to breast cancer than the genetic fa ctors discussed
here. Being 20 kg overweight at the menopause doubles a wo man's risk of breast
Cancer. A screening program based on age, weight, and diet would define a high
risk group far more effectively than one based on age and the currently known
genotypes; moreover, it would define a cohort ofwomen who wo uld be a natural
target for risk reduction programs.
~
c
:8ro 4
l
&
"0
c
o
~
a
o-0.4
0 .4
log relative risk
(B) affected cases , Flgur. 19. 16 Relative risk of sporadic
6 breast cancer, based on the predicted
combined effects of seven susceptibility
loci. (A) The disuibution of genetic risk in
the general population. (6) The distribution
~
c of risk in diagnosed cases. The seven factors,
gro 4 with their chromosomallocation and per
"5
a allele odds ratio, are SNPs rs2981582 (lOq,
& 1.26), '53803662 (16q, 1.20), rs13387042 (2q,
c 1.20), rs889312 (5q, 1.13), rs1 053485 (2q,
o
Data from Zheng SL, Sun J, Wiklund F et al. (2008) N. Engl. J. Med. 358, 91 0-919.
628 Chapter 19: Pharmacogeneti cs, Personalized Medicine, and Popu lation Screen ing
to consumers through the Internet. Typical offerings have titles such as Heart
health, Bone health, Immunogenomic profile, and Nutriti onal genetic profile.
A Dutch-US group reported in 2008 on the scientific bases of genomic profiles
offered by seven such companies (see Further Reading). Janssens and colleagues
identified the particular SNPs used for each profile and sought evidence that
these were significant risk fact ors for the claimed aspect of health. Their findings
did little to bolster confidence. They concluded, "This review shows that the
excess disease risk associated wirh many generic variants included in genomic
profiles has not been investigated in meta-analyses, or has been found to be min
imal or not significant .... Although genomic profiling may have potential to
enhance the effectiveness and efficiency of preventive interventions, to date the
scientific evidence fo r most associations betw"een genetic variants and disease
risk is insufficient to support useful applications."
Few would question the right of people to find out about their own genetic
constitution, as long as they do so at their own expense. People should, however,
be able to trust the claims that are made for the services they are buying. Pro
vided that the advertising is honest, people will opt to buy or not buy a test
according to personal whim. This may not accord with the calculations of health
economists. A news article in Nature (vol. 453, p. 570; 2008) repo rted the anec
dotal case of a young woman whose test result, from a company selling through
the Internet, claimed to change her risk of obesity from 32%, based on her age
and sex, to 34%. She was a satisfied customer ofthe service. The report quotes her
as saying, "It really does show that if I wasn't already taking care of myself I would
probably be overweight or diabetic."
Table 19.4 summarizes the situation in 2010 regarding susceptibility testing.
Whether the situation will be very different in the future is one of the biggest
debates in public health.
Environment and lifestyle do not usually have environment and lifesty le are often more
a major role important than the genetic factors covered by
a test
A si ngle genetic test is highly predictive a single genetic test has very little predictive
va lue
Genotype of one person may have implications no implications for relatives(susceptibi lity
for their relatives depends on pa rticular combinatio ns of
ge notypes, and recombi na tion during meiosis
breaks these up)
Tests assessed for sensitivity, specifiCity, and tests assessed for poten tia l profitability before
predictive value before they are made available they are offered
Testing laboratories subject to external quality testing laboratory may or may not be su bject to
assurance any quality assurance
POPULATION SCREENING 629
.••
;;
E
2.0
1.5
e
..
"•
1.0
~
0
Q
'0 0.5
x
."•
0
20 25 30 35 40 45
most babies are born to younger women, so too are most babies with Down syn
drome. A screening test with a cutoff based on maternal age alone has low
sensitivity.
Over the years, several biomarkers have been identified in maternal serum
that show different, although strongly overlapping, distributions in pregnancies
in which the fetus does or does not have Down syndrome. None of these would
individually be useful as a screening test, but combining age and measures of
several biomarkers can give a valuable composite risk. Several alternative tests
have been proposed. For example, a study of 46,193 pregnancies in London hos
pitals demonstrated the efficacy of a quadruple test using maternal age plus lev
els of alpha-fetoprotein, unconjugated estriol, human chorionic gonadotropin,
and inhibin-A (Table L9.5). Other biomarkers such as PAPP-A (pregnancy-associ
ated plasma protein A) or looking at the neck of the fetus by ultrasound (nuchal
translucency test! are used in some programs. The FASTER (First and Second Tri
mester Evaluation of Risk) study by the US National Institutes of Health com
pared several possible protocols. in many programs, the threshold for offering a
definitive diagnostic test would be a composite risk of Down syndrome of i in 300
or greater. Various options in the FASTER study gave detection rates of 90-95%
for a false positive rate of 5%.
Whatever the protocol, there is a trade-off between sensitivity and specifiCity.
Sensitivity can be maximized only at the cost of plunging more couples into con
siderable stress and anxiety, as well as triggering risky and invasive diagnostic
testing. In contrast, minimizing the number of false positives means that more
women who have been given a negative test result will nevertheless go on to have
a baby with Down syndrome. One important consideration is how early in preg
nancy the result can be given. A positive result in the first trimester probably
engenders less stress than one obtained later, and if the couple opt for termina
tion of an affected pregnancy, this is far less traumatic when performed earlier on
in the pregnancy.
Either test defines a high-risk group who then need a definitive diagnostic test. The quadruple test is
considerably more efficient than maternal age alone. Data from Wald NJ, Huttly WJ & Hackshaw AK
(2003) lancer 36 1, 835-836.
POPULATION SCREENING 631
A positive result must enable some preventive treatment, e.g. special d iet for PKU; review and
useful action choice of reproductive options in cystic fibrosi s carrier
screening
The test must have high sensitivity tests with many false negatives undermine confidence in
and specificity the program; tests with many false positives, even if these
are subsequently filtered out by a definitive diagnostic test,
can create unacceptably high levels of anxiety
The whole program must be socially subjects must opt in with informed consent; screen ing
and ethically acceptable without counseling is unacceptable; there must be no
pressure to terminate affected pregnancies; sc reening
must not be seen as discriminatory
The benefits of the program must it is unethical to use limited health care budgets in an
outweigh its costs inefficient way
TABLE 19.7 A TEST THAT PERFORMS WELL IN THE LABORATORY MAY BE USELESS FOR POPULATION SCREENING
Prevalence of True positives in True positives True negatives In False positives Positive predictive
condition population screened dete<ted by population detected by value
screening screening
different pathogenic mutations have been described . Even with the new ultra
high-throughput sequencing meth ods it would not b e feasible to test for all of
the m as a populalion screen. Screening would test for some subset of the com
moner mutations. PeopJe carrying rare mutations not included in the screening
panel would be false negatives. The question is, what sensitivity would be accept
able? FJgure 19./9 shows a flowchart for the screening program.
It is clear that simply testing for the most common muta tion, p.F508del. would
not p ro duce an acceptable program; more affected children would be born to
couples who tested negative on the screening program than would be detected
by the screening. Whether or not s uch a program we re financially cost-effective,
it would surely be socially unacce ptable. What conslitutes an acceptable program
is harder to defin e. One suggestion focuses on +1- couples; tha t is, couples with
one known carrier and the partner negative on all the tests. The partner still might
be a carrier of an undetectable or unidenlifiable rare mutatio n. Writing in the
journal Nature (see Further Reading) , Leo ten Kate suggested that an acceptable
screening program is one in which the risk for such +1- couples is no hlgherthan
in th e general population risk befo re screening. Fo r cystic fibrosis screening of
northern Europeans, that would require a sensitivi ty of about 96%. Rather than
screen everybody with such high sensitivity, it might be more effi cient ro use an
I
r tltJl! W I f.alsE!'...o,te;
9565
9M5 JI .,
131 3D'
392
I
I I
I.-Douse I~ SPOU5e ,~ ~D U 5e is
not a earner ~ c!tt'ller I'lCIt a carner
6 125 13 291
2 41 17 375
,_L]
(i.e. testing for p.FS08del only); red ligures
show results for a test with 90% sensitiVity.
Pink boxes represent ca ses that would be
r:-~~
:~t~~~~ i seen as successes for the sc reening program
l
i
(regardless of what action they then take),
~
3 2. 25 ! pale bl ue boxes represent failures. PND,
13 4 ,
I prenatal diagnosis.
POPULATION SCREENING 633
initial screen of fairly low sensitivity and involving very large numbers of sub
jects, but then to do a very thorough screen ofthe much smaller number of part
ners of the carriers identified in the irritial screen.
Choosing subjects for screening
gram for CF carriers would have to decide whom to test. As Table 19.8 shows,
there are several options, each of which has advantages and disadvantages.
TABLE 19.8 POSSIBLE WAYS OF ORGANIZING POPULATlON SCREENING FOR CARRIERS OF CYSTlC FIBROSIS
Group tested Advantages Disadvantages
Newborns easily organ ized no consequences for 20 yea rs; many families would forget
the result; unethical to test children
Schoolleavers easily organized; inform people before they start difficult to conduct ethically; risk of stigmatization of
relationships carriers
Couples from physician's couple is unit of risk; stresses physician's role in preventive difficult to contro l quality of counseling
lists medicine; pre-conception screening allows a full range of
reproductive options
Women in antenata l clinic easily organized; rapid results bombshell effect for carriers; partner may be unavailable;
time pressure on laboratory
Adult volunteers (drop-in few ethical problems bad framework for counseling; no targeting to suitable
CF center) users; inefficient use of resources?
634 Chapter 19: Pharmacogenetics, Personalized MediCine, and Population Screening
This argument has special force for couples who already have one affected child.
They often request prenatal testing, saying that however much they love that
child, they could not cope with having a second affected child. Whatever their
position on this argument, all civilized people must surely agree that society has
an obligation to look after people born with disabilities and do whatever is pos
sible to allow them a fuillife.
Some people worry that enabling people with genetic diseases to lead
normal lives spells trouble for future generations
If neonatal screening detects a baby with PKU, and the family is able to keep to
the low-phenylalanine diet, the baby will grow into a person who can lead a nor
mal life. That may well include becoming a parent. Phenotypically such people
are normal, but genetically they have PKU, and they will pass on one of their
mutant alleles to any child. [f they had not been treated, they could never have
become parents and their mutant allele would have died with them. [s this not
going against natural selection and storing up trouble for future generations'
It has to be true that helping those affected by a mutation to live normal lives
has the effect of increasing the number of mutant alleles in the next generation
the question is, by how much' A little population genetics helps here:
For recessive conditions, only a very small proportion of the disease genes are
carried by affected people. The great majority are in healthy heterozygotes.
The Hardy-Weinberg equation (see Chapter 3) gives the ratio of carriers to
affected in a population as 2pq:q2, where q is the frequency of the disease
allele and p= 1- q. Because each carrier has one copy of the disease allele and
each affected person has two, the proportion of disease alleles present in
affected people is 2cj21 (2pq + zcj2). This expression simplifies to just q when
numerator and denominator are both divided by Zq. So, for a recessive dis
ease affecting one person in 10,000 (cj2 = 1110,000; q= 0.01), only 1% of disease
alleles are in affected people. Whether or not we allow affected people to
transmit their disease alleles has very little effect on the frequency of the dis
ease in future generations. We do not need to think of measures to prevent
them from having children as the price of treatment .
• For dominant conditions, the effect on the gene frequency would be more
marked. Ifa condition is sufficiently serious that, untreated, it is incompatible
with reproduction, then it must be maintained by recurrent mutation. If
affected people were treated so well that they had the same reproductive suc
cess as unaffected people, then the frequency of the condition would double
in the next generation. On the other hand, if treatment were so successful,
would an increase in the number of affected people be such a disaster?
\"Cwo".fr
~j{!
MO$1'" It\~ FPt.C1"OIt:;.
I'\fIt~ NO"r 1../,nFvu..v
f"ItHJlcnve, .. '
Optimists believe that medicine will move hom a reactive 'diagnose and treat'
- diagnose and treat medicine
to a proactive 'predict and prevent' mod el. At present, we wait until we are unwell ~ - predict and prevent medicine
and then visit our physician for a cure. In the vision of the optimists, all this will ~
change. Some time when we are young and healthy, we will donate a DNA sample 1l
'0
that will be analyzed to identify our pattern of genetic disease susceptibility. •E cos t of
Maybe our whole genome will be sequenced, or maybe the DNA will be typed for
a few thousand SNPs that govern susceptibility to major diseases and reactions to "§ genotyping
J,
most drugs. However, being able to predict is not useful unless we are also able to
prevent. To fulfill the vision, the knowledge that we are susceptible to a certain o 10 20 30 40 50 6 0 70 80 90
disease must allow us to take effective meas ures to avoid it. age (year s)
If all this works, the pattern of health care spending will be revolutionized
(!'igure 19.21 ). A small extra expend iture early in life will bring huge and steadily Flgur. 19.11 Changing profile of health
care expenditure if medicine moves
increasing savings later on. Even in the shorter term, there will also be savings
from a diagnose and treat to a predict
because the pharmacogenetic profile will ensure that whenever we are ill, money and prevent model. The extra cost of
is not wasted on ineffective drugs or unnecessarily high doses. genotyping eve rybody at an early age
Pessimists (who like to call themselves realists) ask: where is the evidence for is more than offset by savings due to
any of this? Some of the strongest disease susceptibility factors and many of the personalized prescribing throughout life
variants unde rlying drug response have been known for decades but have made and the avoidance of late-onset complex
very little impact on the practice of medicine. The main reason is that the knowl diseases. The curves are indicative on ly and
edge does not lead to any useful ac tion. It would be easy to screen everybody for do not represent the resu lt of quantitative
the HLAB'27 tissue type, which is found in 6-8% of people in many populations. calculations.
B27 has been known for decades to be a very powerful susceptibility factor for a
serious disease, ankyl osing spondylitis. The relative risk is about SO-vastly
higher than any risk discovered by the current generation of genomewide asso
ciation studies. B27 typing is indeed used when there is a question abo ut diagno
sis, but there is no point in testing healthy people because although the relative
risk is high, the absolute risk is small and there is nothing that a B27-positive
person can be told to do to avoid the risk. To give another example, the strongest
association found by the Wellcome Trust Case Control Consortium (see Table
19.2) was between certain HlA types and type I diabetes. Again, this has been
known for decades. This is a common severe disease that is hugely costly to health
services. We have all the knowledge to screen for susceptibility-but nobody does
so because no useful action would follow.
With other complex diseases we do know strategies to reduce the risk-but
these do not depend on a person's genotype. Type 2 diabetes is a prime example.
The incidence is rocketing in many countries asadirectconsequence of unhealthy
lifestyles. Getting some exercise and losing weight are strong protective factors.
Intervention programs with high-risk people (defined by weight and lifestyle, not
by genotype) have demonstrated the benefits of exercise and weight loss. But the
advice to eat sensibly and get some exercise applies to everybody, regardless of
genotype. That is the dilemma faced by companies offering lifestyle genetic pro
filin g. Regardless of their results, they must offer the same generic healthy living
advice to everybody. There is no genotype that says it is OK to live on a diet of
French fries.
This argument cuts both ways. We all know that not smoking, eating senSibly,
controlling our weight, and getting some exercise are the keys to a healthy life. All
these options are available to everybody at zero or even negative cost-yet health
care budgets are increasingly eaten up by the consequences of people choosing
unhealthy lifestyles. Does this mean that we need genetic tests to identify those
people on whom the healthy living advice should be concentrated? Or does it
mean that many people will simply ignore any advice, regardless of how well
founded it is?
It may be that the pessimists simply lack visio n. Given the staggering rate of
progress of human molecular genetics, it is a brave person who claims dogmati
cally that nothing can be done. Perhaps a more prudent position is to accept
that prediction will come long before prevention, but not to rule out the possi
bility of effective prevention for at least some common complex diseases. If only
one or two of the currently intractable major co mplex diseases could be pre
vented through identifying and treating people at high genetic risk, it would be
a major advance in public health. Maybe the correct place for our heads is nei
ther in the sand nor in the clou ds, but somewhere in between, where we have a
clear vision.
636 Chapter 19: Pharmacogenetics, Personalized Medicine, and Population Screening
CONCLUSION
Clinical tests must be evaluated against several criteria. The analytical validity
measures how far the test measures the thing it seeks to measure. Typical consid
erations would be the sensitivity, specificity, false positive rate, and positive pre
dictive value. For a test of a continuously variable quantity, suitable measures
would be the test-retest reliability and the ability to report correctly the value in
a known test sample. Reputable laboratories enroll in external quality assurance
schemes to ensure that their results have good analytical validity.
Clinical validity measures the extent to which the test result predicts the clirti
cal outcome that it is testing. For tests of genotypes, the clinical validity depends
on the extent of genotype-phenotype correlation. Further criteria are the clinical
utility- how useful is the result to the patient?-and considerations of the ethics
of the test.
An important facet of individual genetic variation is the way in which indi
viduals differ in their response to many drugs. Pharmacogenetics considers the
effect of specific genotypes on pharmacokinetics (what the body does to a drug
by way of absorption, metabolic activation, and elimination) and pharmacody
namics (the variable response ofa target organ to a drug). There are many exam
pies in which a drug is either ineffective or dangerous to patients who have some
specific genetic variant. This has led to an interest in patient-specific prescribing,
in which the choice and dose of a drug would be influenced by a genetic test on
the patient. This is already a reality with some expensive anti-cancer drugs, but it
has been slow to become part of wider medical practice. Reasons for the slow
uptake include the limited predictive value of many pharmacogenetic tests, and
the logistic problems of integrating genetic testing into normal clinical routines.
A second area where common individual genetic differences are of clinical
interest is their role in modulating susceptibility to common complex diseases.
Previous chapters have described the scientific and technical breakthroughs that
have led to the identification ofJong lists of genetic susceptibility factors for many
complex diseases. However, attempts to use this knowledge as a basis for person
alized medicine seem premature. For most common diseases, the currently iden
tified genetic factors account for only a modest part of the individual differences
in susceptibility. Genotyping may be performed to satisfy individual curiosity,
but for most susceptibility tests the clinical validity is low, and no clirtical utility
has been established.
Measures of susceptibility find current use in population screening programs.
In contrast with diagnostic tests, population screening tests do not usually seek to
identify cases directly. Instead, the usual aim is to define a high-risk group ofpeo
pIe who can then be offered a definitive diagnostic test or some other useful inter
vention. The initial screening tests for genetic conditions are usually based on
clinical or biochemical findings, rather than tests of genotypes. Usually, there is a
trade-off between sensitivity and specificity in deciding where to draw the line
between a normal and a high-risk result. In contrast with diagnostic tests, large
scale population screening programs target people who had no particular prior
reason to be concerned about the condition being screened for. This makes it very
important to consider the social and ethical implications of any proposed screen
ing program. A poorly designed program could easily do more harm than good.
In the longer term, there are hopes that susceptibility testing could become
much more predictive and could be combined with preventive measures. This
would allow medicine to move from the current diagnose and treat model to a
predict and prevent model. People would be genotyped at some early age for a
whole range of susceptibility factors, their individual susceptibilities identified,
and appropriate preventive measures recommended. How far this vision could
ever be realized is a matter of considerable conuoversy. Progress will depend on
identifying stronger determinants of susceptibility and better means of preven
tion. The chances of achieving these goals will differ from one disease to
another.
Chapter 20
KEY CONCEPTS
The study of gene function and human genetic disease can be greatly aided by genetically
modified animal models, often known as transgenic animals.
To make a transgenic animal, exogenous DNA (known as a transgene) is inserted into the
germ line; the transgene may contain more than one gene and accompanying regulatory or
marker sequences.
• Transgenes are often introduced into a fertilized oocyte, whereupon they can randomly
integrate into the genome and be transmitted to all cells of the developing animal.
• Transgenes can also be introduced into embryos via cultured pluripotent stem cells that can
ultimately give rise to all cells of the animal.
In gene targeting a specific predetermined gene is genetically modified within cultured
pluripotent stem cells and the genetically modified stem cells are used to introduce the
mutation into the germ line.
Gene targeting may be designed to inactivate a gene Ca gene knOCkout) or to modify it; gene
targeting frequently exploits homologous recombination, a natural cellular process, to
replace specific DNA sequences within a gene by DNA sequences carried on an introduced
vector DNA.
Transgenes can also be directed to integrate within a specific target gene so that the target
gene is inactivated while transgene sequences are expressed under t11e regulation of the
promoter of the target gene Ca gene knock-in).
• Larger genomic changes can be brought about by chromosome engineering. Short
recognition sequences for a microbial recombinase are inserted by gene targeting into
specific positions in the chromosomes of pluripotent stem cells. A specific microbial
reco mbinase is then used to induce the recognition sequences to recombine to produce a
chromosomal deletion, inversion, or translocation.
Specific predetermined target genes can also be knocked down in animal cells at t11e RNA
level by using RNA interference or synthetic antisense oUgonucleotides to selectively target
transcripts of the gene to be knocked down.
Large-scale mutagenesis screens often use insertional mutagenesis, whereby specific
transgenes are inserted into the genome, causing inactivation of genes in a random or semi
random way. The transgene acts as a market to identify the inactivated gene.
Disease resulting ftom a loss of gene function is often modeled by using gene targeting to
inactivate the orthologous gene in a suitable animal model.
• Disease resulting ftom a gain of function can be modeled by making transgenic animals
with mutant transgenes.
640 Chapter 20: Genetic Manipulation of Animal s for Modeling Disease and Investigating Gene Function
Much of our knowledge of gene function and human disease has been obtained
or illuminated by studying model organisms. As outlined in Chapter 5, the con
servation of basic developmental processes between species has enabled the
study of model organisms to aid our understanding of human development. The
characteristics of a wide variety of model organisms were detailed in Section 10.1.
In this chapter, we explore how knowledge of the genome sequences of humans
and of various animals has facilitated the genetic manipulation of animal models
to understand gene function and to mimic human disease states.
20.1 OVERVIEW
Unicellular models-notably Escherichia coli and the yeas t Saccharomyces cere
uisia..-have been invaluable in shedding light on how genes function and on
many general cellular functions (see Box 10.1 and Table 10.1). About 30% of
human genes known to be involved in disease have fun ctional homologs in yeast,
and yeasts have also been of great help in modeling some kinds of human disease
that are caused by defects in basic cellular function s, such as abnormalities of
protein glycosylation.
As described in Chap ter 12, studies of cultured invertebrate and mammalian
cells have also been inval uable for investigating gene function and understand
ing pathogenesis. However, many types of cells are not easy to culture, and cul
tured cells can take us only so far-they cannot reveal much about interactions
between different cell types. If we are to begin to understand our biology and to
model disease we also need to study whole organisms.
Animal models are extremely valuable in providing insights into gene func
tion and disease processes, allowing detalled physiological assessments and
investigations of the cellular and molecular basis of disease. They are also cru
cially important in allowing us to test drugs and other novel therapies before con
ducting clinical trials on human subjects.
TABLE 20.1 PRINCIPAL ANIMAL CLASSES USED TO MODEL HUMAN DISEASE AND INVESTIGATE GENE FUNCTION
Organism class Advantages as models of disease and gene function Principal disadvantages and practical limitations
Nonhuman closely resemble humans at physiological, biochemical, and prim ate experimentation is ethically unacceptable to many
primates (see genetic levels; high intelligence; great apes have similar brai ns people: primate colonies are very expensive to maintain;
Table 10.2) to those of huma ns; important for neuroscience research generally lo ng generation times and small numbers of
(understand ing brain fu nction, etc.) and modeling res ista nce to offspring are a major limitation to genetiC studies, but
certain disea ses marmosets, which have a relatively short gestation period
and are reasonably prolific, may be more amenable.
especially after successful germ li ne transmission of
transgenes first reported in 2009 (see the text)
Other mammals resemble humans qu ite closely at different levels; sheep and mamma ls other than rodents have comparatively long
(see Table 10.2 pigs, and to a lesser extent rats, are more su ited to physiological generation times and low n umbers of offspring, and
and Box 10.3) studies than mice; dogs provide models of genetic disorders experimentation on them is not ethically acceptable to some
and genetics of behavior; cats are good models for studying people; rodents are not good models for studying normal
host-gene-pathogen interactions; rodents have short human brain function and disorders of hig her brain funct io n,
generation times and la rge numbe rs of offspring, making them such as Alzheimer disease; they also show important genetic
more suited to genetic analyses; mice have been especially and biochemical differences from humans, and their short
amenable to genetic m an ipu la tion, with large numbers of generation times may make them less suited to modeling
disease models; rats have provided better models of complex late-onset d isorders
disorders than mice
Other vertebrates chkkens, frogs (Xenopus), and zebrafish are good models for genetic manipulation has been limited in non-mammalian
(see Box 10.3) studying early developmental processes; the large eggs of vertebrates other than zebraftsh; some models are relatively
Xenopus are useful for biochemical analyses; zebrafish are expensive to maintain and have moderately lo ng generation
inexpensive to mainta in and are particularly su ited to genetic times and moderate offspring n umbers; more distantly
analyses, w ith short generation tim es and la rge numbers of related to humans than mamma ls are, and lack counterparts
offspring; they have a transparent embryo that facilitates for some human genes
mutant screening, and good genetic manipulation
Invertebrates the nematode C. elegGns and fruit fly D. melanogaster are invertebrates are evo lutionarily very distant from human s
(see Box 10.2) inexpensive to maintain and are extremely well su ited to and so are physiologically remote, even though some
genetic analyses, having very short generation times and cellular processes are reasonably conserved; they lac k
very large numbers of offspring, and being amenable to counterparts for a conSiderable number of human genes
sophisticated genetic manipulations; both organisms have
been very important for explo ring the functions of the sizeable
number of human genes th at have homologs in the worm
or fruit fly, and they are often very useful in providing cellular
models with wh ich to understand human disease processes
disorders such as the mdx muscular dystrophy mouse model, and of more com ·
plex disorders such as the NOD non·obese diabetic mouse.
The vast majority of animal disease models have been produced by human
intervention at different levels. For some models, disease has been induced by
treating animals with chemicals or infectious agents. In others, autoimmune dis
ease has been induced by challenging the immune system with suitable antigens;
for example, injection ofpristane, a synthetic adjuvant oil, into the dermis of rats
has produ ced a model of arthritis. Some other models have been obtained by
selective breeding to obtain animal strains that are genetically susceptible to dis·
ease. However, most animal models are produced by some kind of genetic
modification.
The techniques of genetic modification of animals will be described in this
chapter. In Section 20.2 we describe strategies in whiclJ the aim is to insert an
exogenous gene into an animal's germ line and then express it to help study gene
function or to induce disease. Other strategies simply rely on modifying endog·
enous genes within the germ line to produce animals with different mutant phe·
notypes, and two main approaches are used. Sometimes, as detailed in Section
20.3, the aim is to specifically alter the expression of a single predetermined
endogenous gene to produce a disease model andlor understand the function of
that gene. Alternatively, as described in Section 20.4, endogenous genes can be
mu tated at random to generate a large series of mutants whose phenotypes are
then analyzed and categorized.
642 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene FUnction
•
....
-
Perez VL, Blair HJ, Rodriguez-Andres ME et
al. (2007) Development 134, 2903-2912. With
permission from The Company of Biologists.]
MAKING TRANSGENIC ANIMALS 643
may be analyzed to gain insights into abnormal cellular functions, and relevant
permanent cell lines may be produced for convenient analysis of the mutant
phenotype at the cellular level.
In large-scale mutagenesis screens, large numbers of mutants are produced,
and those that produce easily ascertained phenorypes tend to be over-repre
sented. Mutants with major defects in limbs or craniofacial structures are easier
to identify than those with minor alterations to some internal organs, but at least
in zebrafish abnormalities of some internal organs such as the heart are more
readily observed because the embryo is transparent. As we will see in Section
20.4, standardized phenoryping is important for large-scale phenoryping
studies.
I @( ,
sperm
~ure 20.2 Genetically modified mice
primordial can be produced by a variety of routes.
germ cells j ______ ~ zygote blastocyst
The boxes linked by gray arrows at the top
..--:. ooc~e·nSI ~ ~ p
(l ~\
show components of the mouse life cycle
and represent the multiple stages at which,
l'-____-;;/'_
. ./ "-
?<..
____ exogenous DMA
I
cell
transduction shown in green arrows). Red arrows show
the input and transport of exogenous DNA
,-~-ct.""-
u.....~ te Ion
_1
transfe,
for all other routes. The most w idely used
gene transfer methods-microinjection infO
the male pronucleus o f the fertilized oocyte,
~~ ~~ 9
" and transfection of embryonic stem (ES) cells
followed by injection of genetically modified
ooma(~ ""~
cells ~ ~
~~ ES cells into a blastocyst-are shown by thick
red arrows. 1(51, intracytoplasmic sperm
injection. For convenience, so me methods
described in the text are not shown.
644 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
!
holding pipet1e nick In chromosomal DNA
of a desired DNA done. The introduced
.; ·---JJk
o
DNA integrates at a nick (single *stranded
DNA break) that has occurred randomly
DNA clone in the chromosomal DNA. The integrated
l integratlon transgene usually consists of multiple head
... .
--=-~4~I' ~;;~$
of pseudopregnanl
female of pseudopregnant foster females (they
will have been mated with a vasectomized
e!'
:,~.:.!.:.::.::.: ...:. ~.:.~:. male to initiate physiological changes in the
.......... -......"".-.
.
.-.................
....._.__....... ........
. '... '," '.
'
'
female that stimulate the development of
the implanted embryos). DNA analysis of
tail biopsies from resulting newborn mice
checks for the presence of the deSired DNA
transgenic mouse sequence.
MAKING TRANSGENIC ANIMALS 645
Transgenes can also be inserted into the germ line via germ cells,
gametes, or pluripotent cells derived from the early embryo
Various other routes have also been used to introduce trans genes into the germ
line, in addition to the fertilized oocyte (zygote) route. Some of them are less
widely used but can be useful in some circumstances and can be popular with
non-mammalian species.
development. ~
oviduct
mammary gland
The technology itself is not new. It has been used for more than 50 years to
l
isolate
clone amphibians, and it has been possible to produce cloned mammals from
embryonic cells since the 1980s. In 1995, it was shown for the firsttime that mam
mals could be cloned from the nuclei of cultured ceils, and in 1997 the first mam
1
isolale and @@
ovulated
oocytes
.. , J
mal cloned from an adult ceil was produced, a sheep called Dolly. cui lure cells
como'o
Dolly was produced by transferring a nucleus from a mammary gland cell ~! chromosome:;
from a Finn Dorset sheep into an enucleated oocyte taken from a Scottish black
face sheep. The resulting artificially fertilized oocyte was allowed to develop to
the blastocyst stage before implantation into the uterus of a foster mother (Figure
20.4). Out of a total of 434 oocytes, only 29 developed to the transferable blasto cullure In enucleated oocyte
cyst stage, and of these only one developed to term, giving rise to Dolly. Dolly has seru~e pleted
1. Inject
the same nuclear genome as the Finn Dorset sheep donor of the mammary gland medium Into oocytes
cell, so they are considered to be clones. However, their mitochondrial genomes , under zona
')I pelluCida ;A)
are different because Dolly inherited her mtDNA from the Scottish blackface ••• )\.y'
sheep from which the enucleated oocyte was derived. Subsequently, a variety of donor Go cells ...r1' I .
/- ~ fuslon
different mammals have been cloned by the nuclear transfer procedure, includ 2. electric
ing mice, rats, cats, dogs, horses, mules, and cows. pulso ~
We will consider here the nucleartransfer method as a way to generate trans
genic animals. The essential point is that if the donor nucleus is taken from a renucleated oocyte
somatic cell that has been genetically manipulated to contain a transgene of
interest, the animal that develops from the manipulated oocyte will be trans
genic. Transgenic livestock have been produced in this way with trans genes that
~
. ':
'
,':,
produce therapeutic proteins, as described in Chapter 21. It is also possible to , ;
modify a specific predetermined endogenous gene in a somatic cell, such as a
fibroblast, and then use somatic cell nuclear transfer to make an animal with the ,-"' C' ' blastocyst
mutation in all its cells, as in a pig model of cystic fibrosis that will be describ ed
in Section 20.5. We consider the ethical implications of human cloning in .•.. ~ -.,.. ~.-
-
implanl in uterus
blackface sheep
Exogenous promoters provide a convenient way of regulating ~Ilr. 20.4 The first successful attempt
transgene expression at mammalian cloning from adult cells
resulted in a sheep called Dolly. (A) The
To make a useful transgenic animal, the transgene not only has to be inserted donor nuclei were derived from a cell line
into the host's genome but also has to be expressed to produce functional prod established from adult mammar y gland
ucts. Some regulatory elements that influence transgene expression are found cells o f a Finn Dorset sheep. The donor
within the host genome, but important elements are provided with the trans cells were deprived of serum before use,
gene, most notably a suitable promoter and a polyadenylation site. forcing them to exit fro m the cell cycle and
enter a quiescent state known as Go. in
In transgenic animals, it is often desirable to express the transgene in particu
which only minimal transcriptio n occurs.
lar tissues or at particular developmental stages that are appropriate for the Nuclear transfer was accomplished by fusing
intended function of the transgene. To achieve a desired expression pattern, the individual som atic cells to enucleated,
trans gene will have a suitable tissue-specific or stage-specific promoter. metaphase II-arrested oocytes from a
Maximum control over transgene expression is provided by inducible promoters Scottish blackface sheep. Eggs are normally
that can be switched on and off according to need, usnally by controlling the sup fertilized by transcriptional!y inactive sperm
ply of a particular chemical ligand. Typically, the transcription factors that regu whose nuclei are presumably programmed
late such promoters are struct urally modified by this ligand. by transcription factors and other chromatin
Several naturally inducible promoters have been used, but endogenous pro proteins available in the egg, and so the
moters are generally disadvantaged by leakiness (a high background expression Go nucleus may represent the ideal basal
levelL and by relatively low levels of induction, unwanted co-activation of other state for reprogramming. (8) Dolly with her
tirst born, Bonnie. [(8) courtesy of the Roslin
endogenous genes that respond to the same ligand, and differential rates of
Institute, Edinburgh.]
uptake and elimination of the ligand in different organs. More robust inducible
promoter systems have therefore been developed by using non-mammalian,
MAKING TRANSGENIC ANIMALS 647
1 Hsp90 J.
and so prevents the expression of transgene
A. However, if doxycycline (an analog of
o(TetR)
p f-r:=:
'-'-~---:-1
~-l§f1Sgene A I~ ~ inactive tetracycline) is subsequently provided, it
binds to the Tet repressor protein, causing
~
transgene
tam:;f~C69
it to change its conformation so that it
• OFF
no longer binds to the tetO operator. As a
doxycycline
result, expression of the previously silenced
p~
transge ne A is switched on. Note that
toxicity effects reflecting high-level
"O·~~I transgeneA!- ~ .... constitutive expression of the repressor limit
~$ the use of this syste m in some cells.
(f~ 4 active
(8) Tamoxifen-inducible expression. Here
coding sequences for the ligand-binding
domain of a mutant mouse estrogen
-6~·'~~~r- transgene
receptor (ER) are fused to a transgene of
interest (TG) in an expression vec tor. The
ON expressed ER-TG fu sion pro tein is bound by
the Hsp90 inhibitory protein complex that
pre vents it from performing its function.
The mutant ER ligand-binding do main is
not recognized by estrogen but will bind to
often bacterial elements, o r by engineering mammalian proteins so that they are tamoxifen, an estrogen analog. Following
incapable of respo nding to endogenous inducers. Two widely used systems are tamoxifen binding, the fUSion protein is
described below. released from the Hsp90 complex and the
protein of interest is activated, even though
Tetracycline-regulated inducible transgene expression it is part of a fusion protein. P, promoter.
This system is based on the E. coli tetoperon, and induction occurs at the level of
transcription . Two transgenes are involved. One includes the E. coli tetR gene and
is designed to express the Tet repressor protein constitutively. The other includes
a gene of interest that has been modified so that the tetO operator sequence is
inserted between the promoter and the coding sequence. The Tet repressor
protein will bind to the tetO operator specifically and prevent expression of the
downstream coding sequence. However, gene expression can be restored by
providin g doxycycline, a tetracycline analog, which binds to the Tet repressor
protein, inducing a conformational change and dissociation of the Tet repressor
protein (F1gure 20.5A).
Tetracycline-regulated inducible transgenes are widely used, but because
expression is induced at the transcriptional level there may be a significant delay
(while the RNA is translated and the active protein is produced) before a response
to induction is seen. A similar delay will occur between the removal of the induc
tive stimulus and return to the basal state after RNA transcripts and protein have
been broken down.
Tamoxifen-regulated inducible transgene expression
Where tapid induction and decay are essential, an inducible system that works at
the protein level is needed. One popular system is based on the post -transla tional
regulation of the estrogen receptor. Like other nuclear hormone receptors, the
estrogen receptor is bou nd by the Hsp90 protein inhibitory complex that keeps it
in an inactivated state within the cytoplasm until its ligand, estrogen, enters the
cell. Estrogen binds to the C-terminal ligand-binding domain of the estrogen
receptor, causing a change of conformation. As a result, the Hsp90 inhibitor is
released and the activated estrogen receptor translocates to the nucleus to acti
vate certain target genes (see Figure 4.8 for the structure of nuclear hormone
receptors, and Figure 4.9 fo r the mechanism of activation by ligands).
Tamoxifen-inducible expression uses a mutant ligand-binding domain of the
mouse estrogen receptor Esrl. The mutant domain does not bind its naturallig
and, estrogen, at physiological concentrations but does bind the synthetic ligand
648 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
(A)
endogenous gene
~/::-/===1X~=~~F=lI!~
(( _ '~~
129
blastocyst ! gene targeting
~targeu ng construct1
!
re·implant
into fo ster
l selection for
gene-targeted
ES cells H
(B)
mutated endogenous gene
x
genetically
modified
ES celis ,•
1 +
mating
~
Figure 10.6 Genetic targeting of embryonic stem cells to introduce mutations into the mouse germ line. (A) An embryonic
stlffll (ES) cell line is made by excising blastocysts from the oviducts of a suitable mouse strain (such as 129), and cells from the
blastocyst'S inner cell mass are cultured to eventua lly give an ES cell line (see Box 21.2 for the details). ES celis can be genetically
modified while in culture by transfection of a linearized recombinant plasmid containing DNA sequences that are identical to certain
sequences in a predetermined endogenous gene, except for a desired genetic modification. Homologous recombi nation allows
sequence swapping so that the targ eted gene is mutated in the ES cell's genome. Only a very few celis will undergo homologous
recombination to produce the expected modification of the targeted gene, but they can be selected for and amplified. The modified
ES celis are then injected into isolated btastocysts of another mouse strain such as (57810/ J, which has black coat coloration that
is recessive to the agouti color of the 129 strain; the cells are then implanted into a pseudopregnant foster mother of the same
strain as the blastocySL Subsequent development of the introduced chimeric blastocyst results in a chim eric offspring containing
two populations of cells that derive from different 2ygotes (here, 129 and (57B10/J, normally evident by the presence of differently
colored coat patches). (B) Backcrossing of chimeras can produce mice that are heter02ygous for the genetic modification (bottom
left). Subsequent interbreeding of heterozygous mutants generates homozygotes.
-----if
'/'MN)
~~~~~n_~_~~:~_~
\ \.gene·targ9ting c.onst(UCl co nfers resistance to neomycin, and its
'- ' . analog G418; and the Herpes simplex
thymidine kinase gene (rk). (A) In some
J, J, gene-targeting events that occur by
~r . -
mutated endogenous gene
/1-- (r:-=r'\,
/ '- homologous recombination, double
crossover leads to incorporation oFthe nee
, b
promoters (Pl. The targeting procedure
results in a double crossover (X) and in thiS
case an inaCtivating deletion of a large
E2 En
part of exon 1 that is usually designed to
,, H ~ result in a shift in the translational reading
~ b:'knOck:in-l-'~f-'-"~___""'
celis with the correct gene-targeting event,
.: and a thymidine kinase (rk) gene marker
~~.Jv targeting helps select against random integration as
vector illustrated in Figure 20.7.
652 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
--------~~~I±I==~~c~z==~
) .__~ ---- that included a lacl reporter (encoding the
enzyme p-galactosidase) and a neo marker
gene plus target homology sequences. The
latter corresponded to a 6 kb upstream
(6) (C)
region preceding exon 1 (green line) plus a
5' region of exon 1 terminating in the ATG
translation initiator trinucleotide and then
a 1 kb region immediately followin g exon
1 (lilac line). lmegration of the transgene
inactivated the Evc gene by deleting the
rest of the exon 1 coding sequence and also
tb brought the lael gene under the regulation
of the endogenous Eve promotor. (B, C)
ExpresSion images show p-galactosidase
activity in Eve+!- embryos and represent
eKpression in the whole El 5.5 embryo
(15 .5 days post coitum) (B) and in a sagittal
paraffin section ofthe head of an E15.5
embryo (e). Note the strong expression
in the mouth and tooth-forming areas in
addition to the developing skeleton. mx,
of the recognition sequence, Recognition sites for microbial recombinases can be maxilla; ma, mandible; mc, Meckel's cartilage;
engineered easily into gene-targeting vectors and can be inserted into specific tb, temporal bone. [Courtesy of Judith
locations in an animal genome. The microbial recombinase can be supplied con Goodship, Newcastle University; from Ruiz
Perez VL, Blair HJ, Rodriguez-Andres ME et
ditionally so that it is expressed only in certain tissues or at certain developmen al. (2007) Developmenr 134, 2903-2912. With
tal stages in the animal, or so that it is inducible. To do this, another transgene is permission from The Company of Biologists.J
required in which the microbial gene coding for a recombinase is ligated to regu
latory components such as a suitably regulated promoter or inducible gene
expression system, such as the tetracycline-inducible system.
In genetically modified mice, the most widely used site-specific recombina
tion systems are the Cre-loxP system originating from bacteriophage PI, and the
FLP- FRT system from the 2 ~m plasmid of Saccharomyces cerevisiae. Both the Cre
recombinase and the FLP (fiippase) recombinase recognize specific target
sequences that are 34 base pairs long, respectively the loxP sequence and the FRT
sequence (FLP recombinase target). The size of these microbial sequences is such
that when introduced into an animal genome they will be quite distinctive (the Figure 20.10 Structure of the loxP and
odds that a sequence identical to the loxP recognition sequence will exist by FAT recognition sequences, Recombinases
chance in an animal genome of 3000 Mb and 40% GC are less than I in 1020). that catalyze site-specific recombination
Although their sequences are different, loxP and FRT have essentially the typically recognize and cleave a sequence
same structure, comprising inverted l3 bp repeats separated by a central asym that has inverted repeats flanking a central
metric 8 bp spacer (FigureZO.1 0)_If two copies of the recombinase target site are asymmetric core. The Cre recombinase
from bacteriophage Pl recognizes the /oxP
located on the same DNA molecule and in the same orientation, recombination
(locus of X-over, Pl) and the FLP (flippase)
results in excision of the intervening sequence; if the loxP sites are in opposite recombina se from the 5. cerevisiae 2 M-m
orientatjons, recombination produces an inversion, Site-specific recombination plasmid recogni zes the FRT (flippase
between two target sequences on different DNA molecules is also possible. recognition target) sequence. Both the loxP
The Cre- loxP system has been applied in many different ways in genetically and FRT recognition sequences are 34 bp
modified mice. It can allow the site -specific integration of transgenes, the condi long, with 13 bp inverted repeats (within
tional activation and inactivation of transgenes, and the deletion of unwanted arrows) flanking a central asymmetric 8 bp
sequence (shown in bold). A recombinase
monomer binds ( 0 both inverted repeatS,
J.
5 ' A'VVI..CTTCGTAT1\ GCATACAT TA7ACGAAGTTAT 3 '
and the two monomers form an active dimer
/oxP that asymmetrically cleaves the central core
3 ' ThTTGAAGCAT.a,T CGTATGTA JI.TATGCTTCAATA 5 '
sequence (breakpoints are shown by vertical
t green arrows) within. The central sequence
J.
5' GAAGTTCCTATTC TCTAGAAA GTATAGG AACT'l' C 3'
confers orientation: for example, in one
orientation /oxP has the central sequence
FRT 3' CTTCAAGGATMG AGATCTTT CJ<.TATCCTTGAAG S' shown (GeATAeAT); in the other orientation
t the central sequence is ATGTATGC.
TARGETED GENOME MODIFICATION AND GENE INACTIVATION IN VIVO 653
~~ ~~
Fig...". 20.11 Conditional gene
inactivation using Cre-loxP site-specific
x recombination. The mouse shown at
the top left has been subjected to gene
~~:J C~0 targeting so as to introduce two loxP
sequences at positions flanking a specific
genomic DNA sequence, A, that may be a
mouse with target locus 0\ transgenic mouse with
tissue-specific promoter, P
be expected to cause gene inactivation.
The f10xed A sequence will be speCifically
1 r /o.,p
,A
deleted in the presence of ere recombinase.
To introduce ere recombinase, a mouse
~
I __
C're "
,. ~\
t:,,,,p is required that has previously been
constructed to have a ere transgene ligated
to a tissue-specific promoter P of interest
(top right). By breeding the two types of
~oJ
1,,"P
mouse it is possible to obtain offspring
containing both the 10xP-fianked target locus
plus the ere transgene (bottom left). ere
offspring with flaxed
target plus ere transgene
•
tissue- or ce ll-s pecifi c recombinase wi!l be produced in the desired
deletion of A at ta rget locus type of tissue and will cause recombination
between the two 10xP sequences in cells
of that tissue, leading to deletion of the
A sequence and tissue-specific gene
marker genes. The most important applications, however, are conditional gene inactivation.
inactivation and conditional recombination leading to major rearrangements in
chromosomal DNA.
Conditional gene inactivation
Some genes have vitally important roles in cells or during development. A
homozygous knockout for such a gene will typically result in embryonic lethality,
but the heterozygous gene knockout may not show a phenotype (the presence of
one functional allele may be enough for the organism to function normally). To
study such a gene, a collditiollal gene knockout is often made. The gene is
designed to be inactivated in only a selected tissue or group of cells or at a desired
developmental stage so that the effect of homozygous gene inactivation can be
studied.
Conditional gene inactivation typically involves replacing exons of an endog~
enous gene with a homologous gene segment flanked by loxP sequences (the
segment is said to be floxed). Mice carrying the floxed target sequence are then
mated with a stmin of mice carrying a Cretransgene under the control of a tissue
specific or developmental stage-specific promoter (Figure 20.11 ).
Offspring of the cross that contain both a floxed target sequence and aCre
transgene are identified and analyzed. Depending on the promoter regulating
the Cre transgene, the Cre recombinase should be expressed only in certain cells
or at certain stages of development so that viable mutants can be analyzed.
Inducible Cre transgenics have also been generated. The Cre trans gene is
designed to have a Tet operator tetO and to be repressed by the binding of a Tet
repressor protein (see Figure 20.5A), but it can be switched on by supplying doxy
cycline in the animal's drinking w,ater:' Alternatively, ere is expressed constitu
tively as a fusion protein with a mutant estrogen receptor ligand-binding domain
and is activated at the protein level by tamoxifen (Figure 20.5B).
Chromosome engineering
Inducible site-specific recombination has also been applied to produce deletions
of specific mouse subchromosomal regions (from tens of kilo bases up to tens of
megabases) and to engineer specific chromosome translocations. Gene targeting
is first used to integrate loxP sites at the desired chromosomal locations.
Subsequently, transient expression of Cre recombinase induces a selected chro
mosomal rearrangement, such as a chromosomal translocation after loxP sites
have been targeted to different chromosomal DNA molecules (Figure 20. 121\).
Alternatively, engineering two loxP sites into the same chromosomal DNA
can result in excision or inversion of the intervening sequence (Figure 20.12B).
Excision occurs when the loxP sites are in the same orientation and is essentially
irreversible because it generates a single loxPsite. Inversion occurs when the loxP
654 Chapter 20: Genetic Manipulation of An imals for M odeling Disease and Investigating Gene Function
1 1
t,an,molecula' at predetermined pOSitions on the two
recombination intramolecular
recombination chromosomes (ch romosomes 12 and 1S
in this example). Subsequent exposure of
12
---=r
these chromosomes to ere recom binase
~
X cor
15
bj: results in chromosome breakage at the
1
10xP sites and re-joining to produce the
,J, desired translocation chromosome.
•
C>-;D=='t-- - 12/15
_ 15/ 12 !
h g f (B) Targeted insertion of 10xP sites permits
intrachromosomal deletion (left) or an
inversion (right) by intrac hromosomal
~hgled .C ba ~ recombination . To generate a deletion, the
deletion < = _:l
flanking loxP sites need to be in the sa me
InverSion
orientation. To generate an inversion, the
loxP sites must be in opposite orientations.
Becau se rwo intact loxP sites remai n, in this
~ IOXP Xcrossover • centromeres
case there is the possibi lity of a secondary
recombina tion between the 10xP sequences
causing the inverted segment to flip back
again; to minimize this possibility, two
sites are in opposite orientations and yields two loxPsites that are indistinguish different 10xP sequences can be used
able from the original loxP pair, so that the inversion could be reversed by a sub (see the text).
sequent reco mbination. To avoid this, loxPsequences have been mutated to gen
erate two slightly different mu tant loxP sequences that can be used together to
promote inversion in the desired direction.
Such chromosome engineering offers the exciting possibility of creating
novel mouse lines with specific chromosomal abnormalities for genetic studies.
A drawback has been that the frequency of such events has been very low.
Recently, however, methods using transposons seem particularl y suited to chro
mosome engineering, and these are described in Section 20.4.
assembled by fusing coding sequences for different types of zinc finger to pro
duce a particular combination of zinc fingers with the desired DNA-binding
sequence specificity [Figure 20.13A).
Like restriction nucleases, zinc finger nuc1eases work as dimers. To make a
double-strand DNA break, the two monomers bind to and cut complementary
DNA strands. Standard restriction nucleases typically work as homodimers (with
the result that both monomers recognize the same DNA sequence and the recog
nition sequence is palindromic). However, zinc finger nucleases are deliberately
designed to work as heterodimers, and each monomer is designed to recognize
different DNA sequences that are located within a very short distance from each
other in the target gene (Figure 20.13B). A zinc finger nuclease heterodimer com
posed of monomers each with four zinc fingers (as shown in Figure 20.13) effec
tively binds a 24 (12 + 12) nucleotide recognition sequence that can be expected
to occur just once in a genOlne.
The induced double-strand break can be repaired by different natural cellular
dsDNA repair pathways (as described in Box 13.3). One s uch pathway, the non
h omologous e nd joining (NHEJ) pathway, is natu rally prone to error: because the
broken ends are simply joined together, the joining process is frequently acco m
panied by a loss or gain of nucleotides (Figure 20.13C). For the vast majority of
locations in the DNA sequence there would be no serious consequences, but
when the double-strand break is designed to occur in a coding sequence, the
result is often gene inactivatio n. Successful gene targeting with zinc finger nucle
Figure 20".1 3 Targeted gene inactivation
ases was first conducted in zebrafish in 2008 and in rats in 2009. In both cases using genetically engineered zinc finger
mR NAs encoding zinc finger nucleases were injected into one-cell embryos, and nudeases, (A) Zinc finger nucleases (ZFNs)
target genes were inactivated after inaccurate repair by the NHEJ DNA repair contain a series of zinc fingers joined by an
pathway. amino acid linker to a DNA-cleaving domain.
Individual zinc fingers bind to specific triplet
(A)
sequences in DNA, and combinations
DNA-cleaving
of different zinc finger modules can be
domain monomer
N~~ assembled in a defined order to bind to
specific DNA sequences of interest, (8) A
In finge rs double-strand break is introduced into
(B)
functionall y important sequence in a specific
site ZFN reCOQ:nltion site
gene with a pa ir of different ZFNs. ln this
example the ZFNs each have four zinc fingers
and are designed to spedfically recognize
neigh boring 12 bp sequences at the target
5'====
3'
====3' 5' 5"-===
3'- ====~
J,reS&liOn
with nucleotides being deleted or inserted
that may cause gene ina ctivation. (D) In the
homologous recombination (HA) pathway,
the 5' ends of the DSB are first uimmed back
5' - - - - 3'
3'·- - 5' (re.section), allowing strand invasion by donor
DNA strands. A plasmid with homolog ous
repaired, but with deletIon
== ~n,-==
U -
repaired. but with insertion
::: 4 ~ 11invasion by donor
DNA strand and
repair
sequence (bold lines) that is designed to
have an inactivating mutation (red asterisk)
may be used as the donor DNA, acting as a
t em plate for synthesizing new DNA from a
p ortion of the homologou s sequence (thick
5' ! 3' blue lines) to replace the previously existing
~ S sequence. As a result, the selected mutatiOn
I !
repaired sequence with is copied into the gene sequence, causing
inactivating mutation gene inactivati on.
656 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
The alternative major pathway for repairing double-strand DNA breaks, the
homologous recombination (ER) pathway, offers a general and more powerful
way of artificially inducing faulty DNA repair. In this ceUular pathway the broken
DNA ends are trimmed back and new DNA, copied from a donor DNA strand
with the homologous sequence, is inserted (Figure 20.13D). To induce an inacti
vating mutation, a plasmid can be provided that contains a portion of the target
gene sequence spanning the location of the double-strand break but altered to
include a specific inactivating mutation. The HR pathway would use the plasmid
as a donor template DNA to direct repair and in so doing would be able to copy
the inactivating mutation into the repaired sequence (see Figure 20.13D). This
approach allows exquisite control over the mutagenesis procedure, unlike the
random inactivation of the NHEJ pathway.
Targeted gene knockd own at the RNA level involves cleaving the
gene transcripts or inhibiting their translation
Gene targeting has resulted in numerous informative gene knockouts, but the
procedure is labo rious and time-consuming. Alternative ways of inactivating
genes in vivo have therefore been tried.
Quick and comparatively easy methods enable a gene's function to be selec
tively inhibited at the RNA level. The downside is tha t such methods rarely pro
duce the complete abolition of gene function that is possible with gene targeting
at the DNA level. Instead, the net effect is usually a major decrease in gene expres
sion and is described as a gene knockdown, to distinguish it from a gene knock
out. To knock down a gene, cellular RNA interference pathways (see Box 9.6) are
exploited to selectively degrade transcripts from a gene, or an exogenous oligo
nucleotide or RNA base-pairs with transcripts from the target gene, inhibiting its
expression.
I I
individual mononucleotide repeat units have a carbon-based ring to which is H:zC ..... N /CH:z
attached one of the four bases, and the repeat units are joined by phosphorus H
containing bonds. However, the ribose or deoxyribose sugar is replaced by a ring
that has an additional nitrogen atom (blue shading), and successive morpholine
units are connected by phosphorodiamidate linkages (high lighted) instead of (B)
phosphodiester bonds. For more information on morphoiJno synthesis, see the I
Gene Tools website at http://www.gene-tools.com/. H2C 0 base
.....C/ ..... C/
H/ I I ..... H
nuclease attack, enabling them to remain intact and function much longer than
y
H2 C CH 2
standard oligonucleotides or polynucleotid es.
Phosphorodiamidate morpholino oligonucleotides (often called morpholinos) O=P-NH2
are synthetic polymers of modified nucleotides in which bases are attached to I phosphoroolamidate
o linkage
morpholine rings instead of to a ribose or deoxyribose sugar. The morpholines I
H2C 0 base
of adjacent nucleotides are connected by non-ionic phosphorodiamidate link ..... C/ ..... C/
ages instead of the usual phosphodiester bonds (Figure 20,14; compare with H/ I I ..... H
Figure 1.5).
Morpholinos are very stable and are not degraded in cells. They are usually
H2 C
y CH 2
designed to have a sequence of about 25 nucleotides and hybridize to their tar O = P - NH2
gets under conditions that would prevent the equ ivalent normal oligonucleotides I
o
from binding. They may be directed to bind to a 5' untranslated region (VTR) I
sequence close to the translation start site, so that they inhibit translation.
Alternatively, they are designed to bind and blockade a splice site in the precursor
RNA, causing alternative splicing.
Morpholinos are widely used to make gene knockdowns to investigate gene
function in vertebrate embryos. Like other oligonucleotides, morpholinos do not
easily cross membranes, so that delivery is not straightfOIward. Microinjection
into embryos from a wide variety of species has been possible, but most work has
been done in zebrafish embryos, which have the advantage of being transparent.
One disadvantage is that morpholinos may have only a transient effect when
introduced into the single- cell zygote and are diluted out after cell division and
growth.
r~
~galactosidase neomycin
cells for resistance to the neomycin analog
G418 and for ~-galactosida se expression.
Th e mutated genes can be identified by PCR
phosphotran sferase assays such as 5' RACE.
locations, they sometimes integrate into genes and can inactivate them.
Retrotransposons encode a reverse transcriptase and they move via an RNA
intermediate. DNA transposo ns move as DNA and encode a DNA transposase
that binds to short inverted terminal repeats (see Figure 9.20).
DNA transposons have been widely used in germ-line mutagenesis in D. mel
anogasterand C. elegans. The 2907 bp P element of D. melal10gasteris a celebrated
example that has revolutionized the study of gene function in the fruit fly. It
moves by a cut-and-paste mechanism: each element needs to be excised from
the DNA before moving to a new location (as opposed to the copy-and-paste
mechanisms used by retrotransposons; see Figure 10.9). Germ-line mutations
caused by transposon mutagenesis are molecularly tagged, greatly facilitating
their identification.
Until very recently, transposons were not available to mutagertize vertebrate
genomes in a meaningful way. Transposable elements had been isolated from
several vertebrate species, but nearly all of the elements were found to be inac
tive or had low frequencies of transposition. Transposons with broad host ranges
were sought for use as effective mutagens in a wide range of vertebrate
genomes.
Two recently developed transposon systems that offer high-frequency trans
position in vertebrate genomes are described below. In addition to applications
involving germ-line mutagenesis, vertebrate transposon systems can be used in
the mutagenesis of somatic cells for cancer gene discovery and in modeling
somatic cancers. They can also be used in standard transgenesis via pronuclear
microinjection, offering both high efficiency and single-copy transgenes.
Insertional mutagenesis with the Sleeping Beauty transposon
A defective transposon in salmonid fish, presumed to have been inactive for mil
lions of years, was resuscitated by molecular engineering and named Sleeping
Beauty (Figure 20.17). It can transpose in cells of a wide variety of vertebrate spe
cies, but it is popularly used to produce genetically modified mice by inserting
transgenes into the mouse genome, causing insertional inactivation of genes.
Because its sequence is rather different from the endogenous mouse transposons,
it is readily identifiable.
Although Sleeping Beauty shows no integration preference with respect to
gene structure. it tends to reintegrate into sites located within a short distance
(often less than 3 Mb) from the locus it transposes from. This phenome non,
called local hopping, is a feature of many transposons (the D. melanogaster PI
transposon usually reintegrates within 100 kb of its donor site). The local hop
ping tendency can be exploited by designing mod ified Sleeping Beauty trans
posons to permit region-specific mutagenesis and chromosome engineering
(see the review by Carlson & Largaespada in Further Reading).
660 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
inverted/direct repeats
::II W~l
its transposon subs trate.The transposase
recognizes inverted repeat or direct repeat
elements that flank the DNA that will be
transposed, and then excises the transposon
!
from its original location (arrow 1), leaving
behind a canonical three-base footprint,
C(AfT )G. The excised transposon can
subsequently re-insert (arrow 2) at a new
location that can be anywhere that a TA
dinucleotide is present This TA is duplicated
on re-insertion. IAdapted from Carlson eM &
larg aespada DA (2005) Nat. Rev. Genet.
6, 568-580. With permiSSion from Macmillan
Publishers ltd .]
piggyBac-mediated transposition
Many DNA transposons, including the P element of D. mefa.rlOgaster, are nonfunc
tional outside their natural hosts- presumably they need some host cell factor to
transpose. Others, however, have a broad host range. piggyBac (PB) is a DNA
transposon originally derived from the cabbage looper moth, but PB element-like
sequences have been found in diverse genomes from fungi to mammals. PB
transposition is not so dependent on host cell factors, and PB transposes effi
ciently in human and mouse cell lines and also in mouse germ-line cells.
PB transposons insert into the tetra nucleotide TTAA and regenerate the
sequence on excision. Like Sleeping Beauty, piggyBac displays preferential inte
gration into actively transcribed loci but has a higher transposition efficiency;
there is no evidence for local hopping but a bias toward reintegration within
intragenic regions.
The genetic background of a mouse descri bes the genetIc A useful example of the importa nce of genetic background
constitution (all alleles at all loci) except for the mutated gene of is the Min (multiple intestinal neoplasia) mouse that was generated
interest and a very small amou nt of other genetiC materIal (generally by ENU mutagenesis and results from mutations in the mouse
from one or two other mouse strains), The genetic background is Ape gene. Mutations in the orthologous human gene, APC, cause
important because it can influence the phenotype of a muta nt allele in adenomatous polyposis coli and related colon cancers (OMIM 175100)
different ways. and the Min mouse has been regarded as a good model for such
Generally, a mutant allele of interest is maintained on one to a disorders.
small n umber of backgrounds that are considered advantageous Th e genetic background, however, drastically modiAes the
(e.g. they d isplay strong phenotypes, and breed well). Thus, whereas p henotype of the Min mouse. For example, the number of colonic
ES cells used to make ch imeras are traditionally of the 129 strain, 'n
polyps in mice carrying the ApcM mutation is striking ly dependent
knockout mice are not usually kept on a pure 129 background on the strain of mouse. Thus. B6 mice heterozygous for the ApcMin
because they do not breed well, and they show abnormalities in their mutation are very susceptible to developing intestinal polyps, but
anatomy, immunology, and behavior. Consequently, knockout lines are offspring of these mice mated with AKR/J, MA/MyJ, or CAST strains
backcrossed, often w ith the robust C57BU6 strain to increase fertility are significantly less susceptible. The latter three strains have strai n
and assist phenotype ana lysis. unique ApcMm modi fier loci, named Mom 1 (modifier of Minl). Similar
When identical mutations are placed on different genetic phenotypiC variability is found in human adenomatous polyposis coli
backgrounds, large differences may be seen in the m utant phenotype. families, in w hich different members of the same family may have
At the core of the differences between the different genetic strikingly different tumor phenotypes although they possess identical
backgrounds are a small number of modjfier genes that somehow mutations in the APC gene.
interact with the mutant allele.
C57BLl6 ES cells are generally less efficient than 129 ES cells in colonizing
developing embryos and so usually give rise to low·level chimeras, but once they
have done this there is no problem in achieving germ-line transmission. Recently.
a C57BLl6-derived strain of mouse ES cells known as JMB has been shown to be
highly competent for germ· line transmission and has been genetically modified
to show a dominant pale (agouti) coat color like the 129 strain. Because of these
advantages. the JMB strain is likely to be used widely in large· scale mouse
knockouts.
Although the next few years will see a massive increase in the number of pub·
licly available mouse knockouts. much work will be required to move from knock
outs in ES cells to the time· consuming development and phenotyping of the
reSUlting knockout mice. For the large mutagenesis screens. the challenge of ana
lyzing the many facets of the phenotype-collectively called the phenome-is
being addressed. and various phenotyping standardization protocols have been
developed and are being applied.
Humanized alleles
Many single-gene disorders show considerable mutational heterogeneity, For
disorders arising from a loss of function of a polypeptide-encoding gene, most
chain-terminating mutations can be expected to be null alleles, but the effects of
some splicing mutations and some other point mutations may be more difficult
to predict. Disease alleles found at high frequen cy may warrant the effort need ed
to engineer the pathogenic human mutation into the sequence of the mouse
allele, thereby creating a humanized allele (Box 20,2 gives some examples).
BOX 20.2 MODELING RECESSIVE DISORDERS IN MICE: THE EXAMPLE OF CYSTIC FIBROSIS
Cystic fibrosis (CF; OMIM 219700) is a recessive disorder that i5 Other investigato rs deleted an early exon (exon 1 or exon 2)
particularly common in populations of Caucasian ancestry. It is caused to cause early premature termination. Again, there was no lung
by mutations in the CFTR (cystic fibrosis transmembrane regulator) pathology. However, in the deleted exon 1 model, the severity of the
gene, which has 27 exons, specifying a polypeptide of 1480 amino intestinal di sease was observed to vary significantly depending on the
acid residues. The 6.FS08 mutation, a trinucleotide deletion in exon 10, genetic background. This prompted t he originators of the first exon
is very freq uent (about 70% of CF alleles). Other relatively common 10 deletion model to make an inbred congenic strain by repeatedly
alleles result in the amino acid substitutions GS5 1D and G4SOC. backcrossing with C57BU6 until the contribution of the ES cell strain
In the absence of a suitable animal model, the identification of was limited to a small region including the efrr allele. The resulting
CFTR as the CF locus prompted efforts to make a mouse knockout mouse now developed the hoped-for lung pathology.
model. Different types of mutation were generated in the mouse eftr Attempts to model human alleles with a single amino acid
gene, and the resulting knockout mice were produced on various deletion or substitution produced milder phenotypes because the
genetic backgrounds. Not unexpectedly, the different CF models mutant protein had some resid ual functional activity. Two
showed very considerable differences in phenotype (Table 1). attempts to model the .1F50S mutation produced different
Initial anempt5 focused on mutating exon 10, either deleting phenotypes because of differences in the amount of expression of
it to cause a downstream shift in t he translational read ing frame or the engineered Cftr1lF508 allele. Other models have been produced for
introducing an insertional mutation. Th ree such models failed to show the missense mutations. A mouse with the G55 1D mutation did have
evidence of lung pathology (see Table 1). Two m utations seemed to be some intestinal disease but less than for the null mice, and a mouse
null alleles, resulting in an absence of (ftr from mutant homozygotes, with the G480C allele had a very mild phenotype. The milder models
and did result in severe intestinal phenotypes; the other model had can nevertheless provide valuable information on how the
some residual Citr mRNA because of a leaky mutation. Cftr gene functio ns.
TABLE 1 SOME OF THE DIFFERENT GENE·TARGETING STRATEGIES USED TO MAKE MOUSE MODELS OF CYSTIC
FIBROSIS
Mutant allele rNatur. of
mutation a
eftr mRNA
(percentage
Genetic background
(ES cell strain-
Phenotype
mlUNC .1 ExlO 0 E14TG2a-C57BU6 less t han 5% survive to maturity; no lung pathology even In
adulthood. but severe intestinal disease
mlHGU Ins. ExlO 10 129- MF1 a leaky mutation by which some transcripts underwent exon
skipping to produce some functional mRNA; the resulting mild
phenotype included mi ld intestinal obstruction; about 95% of
mice survive to maturity
mlCAM I~ "Exl0 0 mixed: 129-MF1 and similar pathology to the m 7UNC on a mixed genetic background,
129-C57BU6 except has additional pathology in the lacrimal91and of the eye
mlHSC .1Exl 0 129-CDl severe phenotype with 30% survival but, after subsequent
crossing with variety of different mouse strains, the survival rates
and severity of phenotypes are seen to vary w idely
I
mlEUR <lf508 100 129-FBV disease mild (possi bly because of high expression of the mutant
mlKTH <lF50B very low 129- C57BU6 survivaJ rate only about 40% because virtually no Cftf mRNA of
any kind is produced in intestine
"tt.Ex, deletion of exon; Ins., insertional mutation, For historical reasons, the 6th and 7th exons of the human CFTR and mouse Cftr genes are
labeled as exons 6a and 6b, and so in this table exons 10 and 11 are in fact the 11th and 12th exons of the Cftr gene, respectively,
bAs a resu lt of extensive backcrossing with C57BL/6 and selection for the original Cftr allele, only a very small region (including the eftr gene)
orig inated from the El4TG2a ES cells, and the vast majority of the DNA is from the C57BU6 strain.
TABLE 20.2 ANIMAL MODELS FOR MODELING HUNTINGTON DISEASE PATHOGENESIS AND EXPLORING THE FUNCTION OF
THEHDGENE
Animal Models and application(s)
Mouse (see also gene knockout to investigate effects of loss of huntingtin protein function; conventional transgenics that express
Fig ure 20.19) mutant huntingtin to study pathogenesis; knock-ins to study pathogenesis when mutant mouse huntingtin is
regulated by the endogenous promoter
Rat conventional transgenics that express mutant huntingtin; transgenics in whic h viral vectors were used to deliver
mutant transgenes to brain
Rhesu s macaque transgenics with mutant HIT gene encoding expanded polyglutamine tract
Drosophila melanogaster enhancer and suppressor screens to identify genes that modify polyglutami ne toxicity; phenotype rescue by
in hibition of transcriptional repressor complexes
Caenorhabditis e/egans RNA i screen to identify genes that modulate polyglutamine aggregation
Saccharomyces cerevjsiae enhancer and supp ressor screens to identify genes that modify polyglutamine toxicity
666 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
~"~i~ "-"
'-.~
{:! on the long arm of human chromosome
Mmu16 A2 ..... ~ 21 (Hsa21) mostly have counterparts on
~
~~A
Al C3.1 the distal pa rt of mouse chromosome 16
~ . A2
(MmuI6) (pal e green boxes). Genes on the
81 ' B2 A3
83
f?;;; ~ boxes) or 10 (MmulO) (yellow boxes).
0
82 ,0.;:" I..."i
(B) Extent of trisomy in segme ntal trisomy
84
85
El.l
El.2
83 ~ .t" , 6 mouse models. The widely used T5650n
E1.3 B4 C3.3 mouse model is trisomic for a distal region
C1.1 E2
e1.2
Hsa21
E3 B~'
B j of mouse chromosome 16 that spans about
B .
cld E. Cl
60% of the region showing conservation
C3 .1 of synteny with Hsa21. However, it is also
E5 C2 C4
C3.2
trisomic for a more than 5.8 Mb region on
C3.3
C3 mouse chromosome 17. Complementary
C.
ru 01
02
components of the trisomic region in Ts65 0n
are represented in the Ts1 Cje (small proximal
component) and Ms 1Ts65 (large distal
03 component) models . Th e Dp(16)lYu model
was generated very recently by long-range
difficult to model because of the limited conservation of human-mouse synteny ch romosome engineering with Cre-loxP
(gene order; see Figure 10.16). Nevertheless, considerable effort has been devoted Site-specific recombination; it carries an
to making mouse models of trisomy 21 (Down syndrome). accurate duplication of the fu1i22.9 Mb
In trisomy 21, the pathogenesis results from the dosage sensitivity of a few region on mouse chro mosome 16 that
genes on chromosome 21, for which the presence of three copies instead of two shows conserved synteny with Hsa21.
is sufficient to produce abnormal gene regulation. Because 21p is essentially
made up of heterochromatin and ofrRNA genes that are also present in multiple
copies on four other chromosomes, the problem lies with genes on 21g. Most of
these genes have counterparts on mouse chromosome 16 (Figure 20.20A).
However, mouse trisomy 16 does not resemble human trisomy 21; trisomy 16
mice die in utero and are not useful models. Importantly, most genes on mouse
chromosome 16 have orthologs on chromosomes other than human chromo·
some 21. Genes in the distal 4.5 Mb of human 21q have orthologs on mouse
chromosome 17 or 10.
To obtain more accurate models, various mice with segmental trisomy 16
have been produced. Earlier models relied on using random mutagenesis and
retrieving mice with partial trisomy 16 arising thro ugh translocations. TheTs65Dn
mouse has been a very widely used model of trisomy 21 because of its learning
and behavior deficits. However, it is trisomic fo r less than 60% of the human
chromosome 21 syntenic region on mouse chromosome 16. Additionally, ilis trio
somic for a region of mouse chromosome 17. Recently, long· range chromosome
engineering using Cre-loxP recombination has produced a promising model
with a duplication of the entire 22.9 Mb region on mouse chromosome 16 that is
syntenic with human chromosome 21 (see Figure 20.20).
As a radical alternative to duplicating mouse chromosome segments,
tcanschromosomic mice have been generated that carry a single almost com·
plete human chromosome 21. To do this, chromosomes from human fibroblast
cells were transferred into mouse ES cells by using microceU·medialed chromo·
some tralls!er (see Chapter 16, p. 523, for a description of this technique). With
the use of a marker gene, those ES cells that had picked up human chromosome
21 were selected . Some of the resulting ES cells had a nearly complete human
chromosome 21 and were used to make an ane uploid transchromosomic mouse
line, Tcl, that had various characteristics of trisomy 21 <Figure 20.2 1). Although
the Tcl transchromosomic mouse model appeared promising, it has not been
easy to work with. Random loss of the HsaZ l fragment during development
means that analyses are complicated by the variable levels of mosaicism in differ·
ent tissues.
© ~:::,
(AI
739 1141
Of
arrest 739 or 1141
cells In metaoha se
harvest microcells
I
6 -;"a-d;a~te ) 6 ~"
microcells
PEG select )
G4 1&resistant ES
cell colonies
®
transchromosomic
ES cell , freely
)
Inject recipient
blastocysts with
{ranschromosom ic
ES cells
chimeric mice ,
Hsa21 in ES ce ll
human cell line wilh segregating Hsa21 derived tissue
neo-tagged Hsa21
Figure 20 ,11 Making the transchromosomic mouse line Tc1 to model Down syndrome.
(Al Using gene targeting, a neomycin resistance gene (neo) was inserted into human chro mosome
21 (Hsa21) in a human cell line. Cells carrying a neo-tagged Hsa21 were arrested in metaphase and
then centrifuged to isolate microcells carrying one or just a few chromosomes. The microce!is were
fused to murine embryonic stem (ESl cells. After selection with G4 18 for uptake of the neo gene,
ES cells were screened to identify the extent of Hsa2 1 present. Fema le mouse ES cell lines were
injected into mouse blastocysts to generate female chimeric mice. The latter were bred to establish
the Tel mouse stra in, w hich carries a freel y segregating Hsa2 1. PEG, poly(ethylene glycol).
(8) Fluorescence in situ hybridization (FISH) of metaphase ch romosomes from Tcl mouse
splenocytes. Chromosomes are revealed by DAP1 sraining (blue), and chromosome painti ng was
performed w ith probes speCific for human chromosome 21 (Hsa2 1) and mouse ch romosome X
(M muX). (Courtesy of Elizabeth Fisher and Victor Tybulewicz, University College London.)
TABLE 20.3 UNEXPECTED PHENOTYPES IN MICE FOLLOWING INACTIVATION OF ORTHOLOGS OF HUMAN TUMOR
SUPPRESSOR GENES
Mouse gene Tumors in standard knockout mouse Human tumors arising from loss of orthologous gene function
Apc(Min,L\7 16, mu ltiple polyps in small intesti ne polyps in colon progress ing to carcino m a
MBO alleles)
Ink4a fibro sarcoma, lym phoma, squamous ce ll ca rcinoma fam ilial me la no ma, sporadic pan creatic and b rain tumors
Nf2 o steosarcoma, fibrosarcom a, lung adenocarci noma, schwannomas, m eni ngom as, epend ymomas, gliomas
hepatocellular carcinoma
Tp53 osteosarcoma, lympho ma, soft-ti ss ue sa rcoma breast carcino ma, brain, sa rco mas, le ukemia
5econd·g eneration models usi ng cond itional in activation o f the above tumor sup pressor genes and/ or inac tivation of additi ona l genes involved in
tumorigenesis have g enerally produced good models; see the text and the reviews by Rangarajan & Weinberg (2003) and Frese & Tuveson (2007)
under Fu rther Reading.
Animal m odels are models-inevitably they are di fferent from us. We Differences in cell p hysiology and developmental pathways. For
o utline below some of the ways in whi ch humans differ from mice that example, melanocytes are found in the o utermost layer of human
can limit th e usefulness of m o use models. ski n, but in mice t hey are confined to hair fol li Cles, maki ng it
Differences in gene catalogs . Humans and mice share about 99% difficult to model disorders such as melano m a by Simply exposing
of genes. Often the differences refl e<:t differential expansion of mice to excess ultraviolet (UV) irradiation.
a large gene family in the two species (see Tabl e 10.8) . However, Differences in brain structure and capacity. The size and com plexity
there are some sign ificant differences in genes w ith a single or j ust of the h uman brain far exceed s that of t he mou se, and our uniqu e
a few copies in hum ans, and in some cases a human gene has no cognitive capacities mean that aspects of brain fu nctio n and bra in
cou nterpart in rodents (having been deleted from rodent li neages disease ca n never be accurately modeled in rodents.
during evolution) but yet is sufficiently important in humans to Differences in sensitivity to infectious agents. Viruses t y pically g ain
be a d isease locus. For exam ple, Leri-Weill dyschondreostosis acces s to ce lis by binding to specific cell surface receptors that
(OMI M 127300), Langer m eso m elic dysp lasia (OMIM 249700), and can differ significantly between species, w hich is w hy huma n
Kallma n syndrome type I (OMIM 3087(0) are caused by mutation s immu nodefi ciency virus (HIV) and rhinoviruses (respon sible for
in the SHOX o r KA L genes that do not have ortho logs in roden ts. th e com mon cold) d o not infe ct m ouse celts.
Differences in genetic background. Human po pu lati ons are m ostly Differences in longevity. Humans live abo ut 30 times as long as
outbred; laboratory mi ce are high ly inbred. When making mo use mice. The pathogenesis of som e human d isorders that manifest in
model s, the mutant phenot ype can vary enormously, depend ing mid d le to old age could involve tem poral components, ma king It
o n the genetic backg rou nd (see Box 20.1 J. d ifficult to replicate the phenotype in short· Jived mice.
Differences in gene regulation and gene expression. TI ssue-specific Differences in cell division. Humans are about 3000 times larger
t ra nscripti onal regulation has d iverg ed very sig nificantly between than mice, with a correspondingly larger n umber of cell s. In an
h um an and m o use. 50, even if the products of a pair o f hu man average lifetim e t here are about 10 16 human mitoses but about
and m ouse o rthologs seem high ly conse rved , mod ulation of 1011 mouse mitoses. but much the same incidences of developing
regu latory sequences m ay cause them t o be expressed in different ca nce r. Possibly, so m e d istin ct cancer·preventing m echanism s are
ways and interact d ifferently with other gene prod ucts. intri nsic t o human cell s.
Differences in biochemical and metabofic parh ways. The metabo lic Differences in telo meres and telomerase. Mouse telomeres are
rate of mice is about seven t imes t hat of hu mans and th ere are much lo nger t ha n human telomeres (about 40-60 kb versus
Sig nificant human-mou se differences in various biochemical abo ut 10 kb). As celis divide, telomeres get progressively shorter,
pathways, such as in d rug metabolism . Fo r example, important and telo mere eros ion has been impli cated in cell senescence.
d ifferences in xenobiotic receptor and cytochro m e P450 Telo merase, the enzym e that maintains and extends telomeres, is
expressio n leve ls, tissue distrib ut ion, en zymatic activ it ies, ligand fu nctionally active in m ost mouse cells but is scarcely de tectable
affinities, substrate specifi cit ies, and so on, can lead to markedly in many huma n ad ult somatic cell s.
d ifferent responses to d rugs in mice and humans. The re are Altho ugh in some areas the biologies of humans and mi ce are simply
also importa nt human- mou se differences in the ex p ression of too divergent to accurately recapitulate certain aspects of human
g lycosylating enzym es, w ith im pli cations for protein di stribut ion disease, some human- m ouse di fferences can be overcome to some
and fun ction. extent by m aking humanized m ice (see the text).
670 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
TABLE 20,4 EXAMPLES OF ZEBRAFISH DISEASE MODELS PRODUCED BY GENE KNOCKDOWN USING MORPHOLINO
OLIGONUCLEOTIDES
-
Disease OMIMnumber'l Disease locus Zebrafish phenotype
Ba rfh syndrome 302060 TAZ dose-dependent letha lity, severe developmental and growth retClrdation,
bradycardia, pericardial effusions, and gene ralized edema
Joubert syndrome type 5 610188 (fP290 (NPHP6) defects in retinal, cerebellar, and otic cavity development
Lenz microphthalmia 300166 BCOR developmental perturbations of the eye, skeleton, and centra l nervous system
aOnline Mendel ian Inheritance in Man database at http://www.ncbi.nlm .nih.gov/s ites/entrez?db=omim. Data on zebrafish models extracted from
Kari G, Rodeck U & Dicker AP {200l) Clin. Pharmacal. Ther. 82, 70-80.
672 Chapter 20: Genetic Manipulation of Animals for Modeling Disease and Investigating Gene Function
breakthrough has been in the extended provision of pluripotent stem cell lines to
enable gene targeting in non · murine vertebrates.
New promising ES cell lines have been developed from rat bl astocys ts, and
stable rabbit ES cell lines have also recently been reported. An alternative and
exciting new route has been to induce somatic cells to dedifferentiate so as to
generate pluripotent stem cells that are operationally identical to ES cells. Such
induced pluripotent stem (iPS) cells may serve as alternatives to ES cells for per
forming gene targeting. iPS cell lines have been repo rted from a variety of experi
men tal animals including mice, rats, pigs, and rhesus monkeys.
The diffic ulties in accurately replicating many human phenotypes in rodent
models h ave prompted considerable interest in n onhuman primate models.
Altho ugh transge nesis has been po ssible in some cases, such as the rhesus
macaque model of Huntington disease described above, germ-line transmission
of functi onal transgenes was not successful until 2009, wh en transgene tra nsmis
sion was reported in marmosets. Together with the development of n onhuman
primate iPS cells, these breakthroughs may lead to a significant increase in the
use of nonhuman primate models.
Human iPS cells have also recently been made and in Chapter 21 we will detail
the technology for making human iPS cells, because of their therapeutic poten
tial and their exciting potential for providing human cellular models of di sease.
CONCLUSION
Animal models are vital for our exploration of how genes function in cells, in
understanding human disease, and in testing novel drugs and therapies. The
ability to modify the genome of model animals has greatly assis ted research
efforts.
The use and usefulness of animal models depends on various factors, includ
ing evolutionary relatedness to humans, ethical concerns, and practical advan
tages and limitations. Fly and worm models have been inval uable for under
standing gene fu nction and m odeling disease at the cellular level. Beca use they
are particularly amenable to genetic m anipulation, mice and zebra fish are impor
tant vertebrate models.
Genetic manipulation of animals normally invo lves introdu cing exogenous
DNA into the germ line to create a transgenic animal. Pronuclear microinjection
of exogenous DNA into fertilized zygotes is often used to study the effect of
expressing an exogenous gen e tha t may be normal or a mu tant. An alternative is
to modify endogenous genes, either by gene targeting (modifying a specific pre
determined gene) or by some kind ofrandom mutagenesis. Gene targeting often
exploits the cellular process of homologous recombination to replace important
sequences in an endogenous gene by a mutant version.
Site-specific recombination gives control over where a gene is inserted and
when it might be expressed. Use of systems such as Cre-IoxP allows conditional
knockouts, in which genes essential for early development can be switched offin
specific tissues o r at a desired developmental stage. This technology can also be
used to engineer larger chromosomal changes such as deletions, insertions, and
translocations.
Gene function can also be altered by targeting RNA transcripts for destruction
by RNA interference pathways or blocking thei r expression by using morpholino
antisense oligonucleotides. Alth ough these m ethods often have transient effects
and may allow some residual gene function, they are comparatively easy to do
and can produce some useful models quite quickly.
Large-scale mu tagenesis often uses random mutagenesis with mutagenic
chemicals or insertional inactivation by a known transgene sequ ence, when the
position of the disrupted gene can be identified by the inserted sequence.
New genetic technologies are increasing the range of disease models. Until
recently, knocking out gene function in vertebrates was very largely performed in
mice, but the use of zinc finger nucleases has extended gene targeting to species
such as zebrafish and rats. The development ofiPS cells from somatic cells opens
the prospect of widening the range of model species still further. Finally, human
iPS cells can also offer the prospect of cellular human models of human disease.
We consider this aspect in Chapter 21 within the context of biomedical applica
tions of human iPS cells.
Chapter 21
Genetic Approaches to
Treating Disease
KEY CONCEPTS
Genetic techniques can be applied in various ways in treating disease, regardless of whether
the disease has a genetic origin or not.
Genetics is being used extensively in the drug discovery process to identify new drug targets.
• Therapeutic recombinant proteins produced by expressing cloned DNA are safer than those
sourced from animal and human sources.
Genetic engineering of vaccines may involve inserting genes for desired antigens into a viral
vector and using either the modified vector or the purified antigen as the vaccine.
Alternatively, DNA can be used as a vaccine by direct injection into muscle cells.
Cells lost by disease or injury can be replaced by suitable stem cells that have been directed
to differentiate into the correct specialized cell type.
Nuclear reprogramming involves fundamental alteration ofthe chromatin structure and
gene expression pattern so as to convert differentiated cells into less specialized cells
(dedif!erentiation) or to a different type of differentiated cell (transdif!erentiation).
Autologous cell therapy involves conversion of a patient's cells into a type of cell that is
deficient through disease or injury. In principle this can be done by artificially stimulating
the mobilization of existing stem cells in the patient or by nuclear reprogramming.
Skin fibroblasts from a patient can be dedifferentiated to make patient-specific totipotent or
pluripotent stem cells that can be differentiated to required cell types for use in autologous
cell therapy and for studying disease. In therapeutic cloning the nucleus from a patient's cell
is transplanted into an enucleated egg cell from a donor, giving rise to a totipotent cell. More
conveniently, the skin cells can simply be induced to express key transcription factors that
will convert them to pluripotent stem cells.
Gene therapy involves transferring genes, RNA, or oligonucleotides into the cells of a patient
so as to alter gene expression in some way that counteracts or alleviates disease. The
patient's cells are often genetically modified in culture and then returned to the body.
• Gene therapy strategies often seek to compensate for underproduction of an important
gene product. Other strategies are designed to block the expression of a mutant gene that
makes a harmful gene product, or to induce alternative splicing so that a harmful mutation
does not get included in a mRNA, or to kill harmful cells, for example in cancer gene
therapy.
• Viral vectors are commonly used ill gene therapy because they have good gene transfer
rates, but they sometimes provoke strong immune responses and also abnormal activation
of cellular genes such as proto-oncogenes.
• Zinc finger nucleases can be designed to make just one double-strand DNA break within the
genome of intact cells. Cellular pathways that repair double-strand DNA breaks can then be
induced to replace a sequence containing a harmful gene mutation with the normal allelic
sequence.
678 Chapter 21: Genetic Approaches to Treating Disease
In the preceding chapters in this part ofthe book, we have looked at how genetics
can be used to test individuals and populations for genetic diseases and suscep
tibility for complex diseases. Chapter 20 explored genetic manipulation of ani
mals to model human diseases. In this chapter, we look at how genetics can be
employed to treat diseases. A chapter on disease treatment might reasonably
consider two quite separate matters: the treatment of genetic disease, and the
genetic treatment of disease. We therefore consider briefly what these two areas
cover before describing how genetic techniques are being used directly to treat
disease.
Despite much work, however, relatively few genetic diseases have wholly sat
isfactory treatments.
,-~, ,~~,
~
11 ,Q
19
r!$\
~ ..
".. . :'
small-molecule drugs: production of production of production of
-t6L2R}-
~~
~C't~
, ~ ,~' ,' 3N
Flgure11.1 Some of the many different
- -~-- ways in which genetic technologies are
production of genetiC modification testing of treatment In genotyping of patients
used in the treatment of disease. See
reagents for of patient or donor genetically modified in personalized
gene therapies cells in gene and animals medicine the text for detail. Il2R, interieukin type 2
cell therapy receptor.
680 Chapter 21: Genetic Approa ches to Treating Disease
Drugs are often small chemical compounds {hat are extracted from and correlate with the drugs that caused that effect. Often, however,
natural sources or chem ically synthesized. Generally a drug targer is the search for compounds is a d irected one: the object is to search for
some naturally existing molecule or cellular structure that is involved drugs that react with a defined target or disease phenotype.
in a pathology of interest, and which the intended drug is desig ned Various types of cultured cells, and immortalized aneuploid cell
to act on to treat the pathology. The molecular targets that have lines, can be used for drug screening. Recently isolated stem cell lines
been considered most druggable have traditionally been proteins. allow the possibility of controlled differentiation of euploid cells to
Frequently they are receptors (notably G-protei n-coupled receptors), a desired differentiated cell type. As described in Section 21.3, skin
enzymes (especially protein kinases, phosphatases, and proteases, or fibroblasts from any individua l can now also be reprogrammed to
protein channels (voltage-gated or ion-gated). Sma ll- molecule drugs make pluripotent stem cells, and so disease-specific pluripotent stem
fit tightly into crevices on the surface of their target proteins, t hereby cells are being produced that can be directed to differentiate to give
modulating their function. desired human disease cells for d rug screening.
Drug screening frequently involves high-rhroughput screening. Anima l disease models offer a physiological disease environment.
With the use of liquid handling devices and robotics, hund reds Mice and rats are important in the testing of drug therapies before
of thousands to millions of different chemical compounds are launching clinical trials, but rodent embryos are impractical for
individually assayed for their ability to react with some kind of high-throughput drug screening. The culture conditions of small
biological material in specified wells of numerous microtiter dishes. invertebrates such as C. elegans and D. melanogGster allow high
The biological materia l has traditionally involved a cell-free system, throughput drug screening in a physiological context, but they are
microbes (bacteria, yeasts), or cultured cell lines, but more recently limited to drugs that recogn ize evolutionarily very highly conserved
in vivo animal drug screening has also been possible. Searching for targets. Zebra fish embryos offer the advantage of both being
compounds may occur in a non-directed approach with normal cells vertebrates and also transparent (allowing live tissue imaging) and are
or tissues in which the object is to identify any phenotypes produced becoming increasing ly important in drug screening.
GENETIC APPROACHES TO DISEASE TREATMENT USING DRUGS, RECOMBINANT PROTEINS, AND VACCINES 681
Insu li n diabetes
Leptin obesity
Erythropoietin anemia
lines (or microbial bioreactors). [nstead, transgenic animals have been used
because large amounts of a recombinant human protein can be produced in
ways that simplify its purification. One way is to arrange that the desired protein
is secreted in the animal's mille. An expression vector is designed with the coding
sequence for the human protein under the control of a promoter from a gene that
normally makes a milk protein, and the recombinant DNA is inserted into the
animal's germ line (Figure 21.2).
mammary gland
s peCific promoter
\ transfeclion
)
-'~?
transgene
X- celis goat (or cow,
rabbit, etc. )
egg ----:nucleated1-~~~s
egg
recombinant proteins in the milk of
t ransgen ic livestock. To express the
desired recombinant protein X in the milk
donor fusion
of a transgenic livestock anima l such as a
goat, a suita bl e transgene is constructed
with a promoter that will cause the gene to
-' be expressed and sec reted in the animal's
transge nk: embryo
milk. The transgene can be inserted into
J. the animal's germ line by using pronuclear
l couect milk and of the tran sgene are mated, and female
purify protein X offspring serve as t he production herd.
Th e thera peutic protein is expressed in the
animal 's milk but needs to be purified by
sequenti al chromatog raphy processes that
ultimately deliver a prod uct that is more
than 99.99% pure. pA, polyadenylation site.
GENETIC APPROACHES TO DISEASE TREATMENT USING DRUGS, RECOMBINANT PROTEINS, AND VACCINES 6B3
tA) rodent mAb (8) chimeric Ab (C) humanized Ab FlgUl. 21 .3 Antibody engineering.
(A) Classic monoclonal antibodies (mAbs) are
monospecific rodent antibodies synthesized
by hybridomas, (B) Chimeric VIC antibodies
(Abs) are gene tically engineered to have
rodent variable region sequences containing
the critically important hypervari able
complementarity determining region (CDR)
DO
joined to human constant region sequences.
(el Humanized antibodies Can be eng inee red
so that a\l the molecule except the CDR is of
human origin. (D) More recently, It has been
possible to obtain fully human antibodies
(D) fully human Ab IE} scFv Ab KEY:
rodent human
by different routes (see the text and Figure
n
21.4). (El Single-chain variable fra gment
0 --D ~
constant (C) region domains
}--+
locus plus the entire human Igy light-chain
human IgH, loci. This HAC was then introduced into
IgK mrniloci mouse ES celis that had previously been
.,ectroporation
subjected to rounds of Cre-/oxP gene
targeting to delete the endogenou smouse
Ig loci. The resulting mouse can generate
human a(tl flclal Immunoglobulins by rearrangements of the
IgH
( chromosome (HAC)
IgK / introduced human 19 loci, and so can be
used to produce human mAbs of any desired
mouse ES cells containing specificity. For full details see US patent
human Ig loci 704 1870, accessible by searc hing the patent
database at http://www.pate ntstorm.us.
J.
J.
myeloma
cells
i--"\ (
y
'\ ( --""\
spleen
cells
I , hybridomas
V 'J V
,J, J. J.
l0~ ~ anU·X mAbs
TABLE 21.2 PARTLY OR FULLY HUMAN THERAPEUTIC MONOCLONAL ANTIBODIES (mAbsl APPROVED BY THE US FOOD
AND DRUG ADMINISTRATION AS OF MID-200S
Disease category mAb generic name (trade name) mAb class b Disease treated
complem ent-S ecul izuma b (Sol i ris~) humanized paroxysmal nocturnal hemoglobinuria
aC011 a, C020, C025, C033, and C052, white blood cell antigens; IL2R, interleukin type 2 receptor; IgE, immunoglobulin E; TNF-a, tumor necrosis
factor a; EGFR, epidermal growth factor receptor; HER2, human epidermal growth factor receptor 2; VEGF, vascular endothelial growth factor;
GPllb!llla, a platelet integrin; RSV, respiratory syncytial virus; NSCL, non-small-cell lung. bSee Figure 213 for structures.
chronic viral infections. Several aptamers are currently being tested in preclinical
and clinical trials, and recently the aptamer pegaptanib (Macugen®) received
FDA approval for the treatment ofneovascular (wet) age-related macular degen
eration, a principal cause of blindness in adults. Loss of vision in this disorder
arises by the formation of new blood vessels (angiogenesis) and leakage from
blood vessels, both of which are promoted by vascular endothelial growth factor
(VEGF). Pegaptanib aptamers specifically bind to VEGF and inhibit its angiogen
esis functions.
Pluripotent embryonic stem cells are derived from the early embryo, a stage
in development at which cells are naturally still unspecialized. Later, as tissues
form in development, cells become more and more specialized but some stem
cells are retained in the fetus and in adults. They are needed to regenerate tissue.
Our blood, skin, and intestinal epithelial cells, for example, have limited lifetimes
and are replaced regularly by new cells produced by differentiation from stem
cells. Stem cells are also used in tissue repair. Although many stem cells derived
from fetal and adult tissues are tissue-specific, mesenchymal stem cells are found
in a wide range of tissues and may constitute a subset of cells associated with
minor blood vessels (pericytes).
Embryonic stem cells
Cells of the inner cell mass of a blastocyst can be isolated and cultured to gener
ate embryonic stem (ES) cell lines (Box 2 1.2). Mouse ES cells have been studied
for many yea rs but, after the isolation of human ES cell lines was first reported in
1998, stem cell therapy was propelled to the forefront of both medical research
and ethical debate. ES cells have clear advantages for cell therapy because they
can be grown comparatively readily in culture and propagated indefinitely
and can give rise to cell lineages. Different ES cell lines show significant variation
in some properties, however, such as their potential for differentiation. Some ES
cell tines readily differentiate toward mesodermal lineages, for example, whereas
others do not.
The use of ES cell lines is contentious because the methods used to isolate
them have traditionally involved the destruction of a human embryo to obtain
cells in tile inner cell mass. Nevertheless, some human ES cell lines have been
made without destroying an embryo. For example, individual cells (blastomeres)
can be carefully isolated from very early embryos in such a way that the embryo
remains viable, and a few human ES cell lines have been generated from indi
vidual blastomeres isolated in this way.
Tissue stem cells
Deriving stem cells from adult human tissues and adult and cord blood Ilas not
been contentious. Tissue stem cells we re initially thought to be comparatively
-
Embryonic stem (ES) cell lines are cultured cells that are derived from
cells from the inner cell mass OCM) of the blastocyst. For laboratory
produced in this way can be used for research purposes with the
of cells from the ICM in culture dishes. Tra ditiona lly, the inner surface
fibroblasts that have been treated so that they do not divide. The
feeder cells provide a sticky surface to which the ICM cells can
attach and they secrete growth factors and other chemicals that will
stimulate the growth of the ICM cells. Surviving ICM cells that divide
and proliferate are collected and re-plated on fresh culture dishes and
plastic under the right experimental conditions. For example, when bone mar
row stem cells are placed in the environment of liver tissue they seem to be
induced to give rise to new liver cells. However, when rigorously tested, there is
little evidence to support claims of this type. Instead, tissue stem cells are now
known to be almost always rather limited in their differentiation potential. An
exception is provided by the adult germ line-pluripotent germ-line stem cells
have recently been derived from spermatogonial cells of the adult human testis
and show many similarities to ES cells. Many tissue stem cells are also difficult to
grow in culture.
While normal mobilization of an individual's stem cells can help make small
repairs to different degrees, large injuries and serious disease pose too much of a
challenge for the body's natural repair processes. But what if we could artificially
enhance the normal stem cell response to repair' Bone marrow cells that are
mobilized in response to disease or injury are an attractive target for artificially
enhanced mobilization. They include mesenchymal stem cells, which can
become bone or cartilage cells, and endothelial progenitor cells, which produce
the cells that make up our blood vessels and help with tissue repair. A promising
report published in early 2009 showed that administration of the drug Mozobil®
and a growth factor was found to elicit a hundredfold increase in the release of
endothelial and mesenchymal progenitor cells from the bone marrow in mice.
Nuclear reprogramming entails altering nuclea r gene expression Each nucleus is potentially subject to regulation by factors that
to bring about a profound change in the differentiation status of normally regulate gene expression in the o ther nucleus. One nucleus
a speCialized cell. It may cause a somatic cell to regress so that it predominates-typically the nucleus of the larger and more actively
acquires the properties of an unspecialized embryo or a pluripotent or dividing cell- and the other is reprogrammed to adopt the pattern
progenitor cell (dedifferentiation), or it may cause a lineage switch so of gene expression of the dominant nucleus. Thus, if actively dividing
that one type of differentiated somatic cell changes into another type pluripotent ES cells are fused with somatic cells, the somatic cell
(tran sdifferentiarion) . The fertilized egg cell provides a natura l example nucleus will be reprogrammed to the pluripotent state (Figure 1,
of nuclear reprogramming: the sperm is a highly differentiated cell and path B).
the nucleu s is hIghly condensed until reprogrammed by factors In th e Specific transcription factors can induce pluripotency
committed to their fate, was thou ght to be irreversible. The cloning of a sheep
called Dolly changed that view, because the nucleus of an adult somatic cell was
successfully reprogrammed, giving rise to a totipotent cell.
As detailed in Section 20.2, the cloning of Dolly was made possible by isolat
ing nuclei from mammary gland cells from a Finn-Dorset sheep and then uans
planting individual nuclei into enucleated eggs from a Scottish blackface sheep
(somatic cell nuclear transfel~see Figure 20.4). In one out of 434 attempts, fac
tors in the egg cytoplasm someh ow reprogrammed the somatic cell nucleus to
return it to the totipotent state. The resulting cell was stimulated to develop to the
blastocyst stage and was then implanted in the uterus of a foster mother to give
rise to Dolly. As Dolly's nuclear genome was identical to that of the Finn-Dorset
sheep, she was considered to be a clone of that sheep. A variety of other mam
mals have now been cloned by somatic cell nuclear transfer. However, cloned
mammals have a very high incidence of abnormalities, most probably as a result
of inappropriate epigenetic modifications.
Although naturai human clones- identical twins and triplets, for example
are reasonably con1IDOll, artificial hun1an cloning is considered ethically unjusti
fiable (Box 21.4) and is banned by legislation in many countries. Nevertheless,
somatic cell nuclear transfer offered for the first time the possibility of making an
ES cell line from any patient willing to donate some easily accessible skin fibro
blasts. Individual nuclei would be isolated from the skin cells and transferred into
unfertilized eggs that had been donated by donors and subsequently enucleated.
BOX 21 ,3 CONT.
pathway to produce a different but related cell type (Figure 1, path megakaryocyte/ eryth ro id pathway. As a resu lt, transfection of a gene
e), or by direct cell conversion without an obvious intermediate expreSSing the GATA- 1 transcription factor into myeloid cells can
dedifferentiation step (Figure 1, path D) . cause transd ifferentiation toward megakaryocyte/erythroid cells;
Ectopic expression of just a single type of transcription factor ectopic exp ression of PU. 1 in meg akaryocyte/ erythroid cells stimu lates
ca n cause transdifferentiation. Thus, for example, the transcription transdifferentiation into myeloid cell s (Figure 1, path C). Efficient
factors PU.l and GATA-l are mutually antagonistic in the d evelopm ent direct conversion of adult pancreati c exocrine cells to beta cells can be
of blood cell lineages. If PU.l predomina tes, differentiation occurs achieved by transfecting genes for three tranScription factors that are
toward myeloid cells, but if GATA·1 dominates, cells go down the essential for beta cell function: Ngn3, Pdxl, and Mafa (Figure 1, path OJ.
o
--t epidermiS
~ neural cells
8~
G
~ macrophage
~ lymphocyte
~mu s cle
/ embryo
_ _-
- l ) exocrine
a;r-pancreas
unfertilized leM
egg blastocyst
C ;=;-::_.
~ ~-:.
33( 0
...,
J
ES cells - ~~
4 I
~
' diffe,enu"ed
~ cells
..---'l)
~
Flgu... 1 Experimental routes for nuclear reprogramming in mammals. Normal steps in development and tissue differentiation are shown by
black arrows. Experimental procedures are shown as b lue arrows. (Al Nuclear transfer into an unfertilized egg reprograms a somatic nucleus to
dedifferentiate to give a totipotent cell. (B) A somatic cell is stimulated to dedifferentiate to become a pluripotent stem cell that closely resembles
an ES cel l. (el lineage switching is a form of transdifferentiation that involves dedifferentiation followe d by differentiation down a different
differentiation pathway. (Dl Transdifferentiation by d irect conversion of one somatic cell into another. ICM, inner cell mass.
692 Chapter 21: Genetic Approa ches to Treating Disease
In principle, reproductive cloning to produce a human clone could cloned animals born have serious abnormalities, most probably
provide the only option for a couple to have children in certain cases. because the epigenetic modifications of the donor cell are not reliably
For example, a woman with a serious mitochondrial disorder could reprogrammed. Allowing human reproductive cloning to proceed
avoid transmission of the disease by becoming pregnant with a with current imperfect technology would be grossly unethical because
donated oocyte into which a nucleus had been transplanted from of the early deaths and major abnormalities that could be expected.
one of her or her partner's somatic cells. The technology could also Conceivably, advances in knowledge might remove this objection,
meet more general needs such as the production of cloned individuals although it is not clear how we could know without conducting
to replace a dying child or other loved one. However, human unethical experiments.
reprod uctive cloning is banned by legislation in many countries Even if the technology were perfectly efficient and safe and
because of powerful ethical arguments. had no health consequences, the second major ethical argument in
Of the two m ajor ethical arguments, the practical argument principle says that humans should be valued for themselves and not
is the more powerful one. It points to the low success rate of all treated as instruments to achieve a purpose.
mammalian cloning experiments. and to the fact that many of the
The resulting cells would be stimulated to develop into blastocysts, and inner cell
mass cells would be removed to make ES cells. This type of approach became
known as therapeutiC clolling, to distinguish it from the banned reproducti/Je
cloning (Figure 21.5).
Therapeutic cloning has been controversial. Not only does it involve the
destruction of an unfertilized human egg, but also, by providing the technology
for cloning human blastocysts, it has raised fears that human reproductive clon
ing could be facilitated. In addition, obtaining eggs from human donors involves
an invasive and potentially dangerous procedure. The attraction of therapeutic
cloning was that it offered the possibility of making patiellt-specific ES cells that
could, in principle, be used for autologous cell therapy, and disease-specific ES
cells that could be used for studying pathogenesis and in drug screening as
described below. In practice, cloning human blastocysts by somatic cell nuclear
transfer proved to be extraordinarily inefficient-thus far, there is no report of a
human ES cell line made by this method.
unfertilized
egg
rv
\.....;'
donor is transplanted into an enucleated
egg. Factors in the oocyte cytoplasm are
able to reprog ram the somatic nucleus so
•
transferred of the blastocyst, which is then destroyed.
Into egg The retrieved cells are cultured to establish a
pl uripotent ES cell line that can be induced
T ----+ differentiation
to differentiate to different types of somatic
celt . The large red X signifi es that using
cloned human blastocysts for reproductive
•
ES cells
19% production of the first cloned mammal, a sheep called Dolly, by somatic cell nuclear transfer; see Figure 20.4 (Nature 380.
64-66; PMID 8598906)
2004 chemical screens identify small molecules that cause reversal of differentiation. such as reversi ne, which can cause myoblasts
to ded ifferentiate into progenitor cells (J. Am. Chem. Soc. 126, 4 10-411; PMID 1 4719906 and Nat. Biotechnol. 22, 833-840;
PMID 15229546)
2004 reprogramming of nuclei transferred from olfactory sensory neurons to eggs to generate totipotent cells and subsequ ently
cloned mice (Nature 428,44-49; PMID 14990966)
- 2004 stepwise reprogramming of B celis into m acrophages-a n early example of a lineage switch «elf 117, 663-676;
PMI D 15 163413)
-2005 nuclear reprogramming of somatic cell s after fusion w ith hum an ES cells (SCience 309, 1369-1373; PMID 161 23299)
2()()6 reprogramm ing of cultured mouse embryonic and adult fibroblasts to pluripotent stem cells following transfectlon by genes
encoding just four transcription factors (CeI/126, 663-676; PMID 16904174)
2()()7 induction of p luripotency in human somatic cells (CeI/131, 861-872; PMID 18035408 and Science 31 8, 1917-1920;
PMID 18029452)
2008 transdifferentiation by in vivo reprogramming of adult pancreatic exocrine cells to beta cells (Nature 455, 627-632;
PM ID 18754011)
2008 nu clear reprogrilmming provides the first patient-specific and disease-specific p luripotent stem cells (Celf 134, 877- 886;
PMID 18691 744 and Science 32 1, 12 18-1 221; PMID 18669821 )
2009 generation of fertil e viable mice derived exclusively from iPS cells (Nature 461, 86-90; PMID 1967224 ' , Nature 461, 91-94;
PMID 19672243, and eellStem Celi S, 135- 138; PMID 19631602)
2009 generation of IPS cells using recombinant proteins on ly (Cell Srem Cell 4, 381-384; PMID 19398399)
2010 reprogramming of fibroblasts to give functional neurons (Vie rbuchen T et at, Nafure 463, 103 5-1041 ; PMID 20107439)
iPS cell, induced pluripotent stem cell; PMID, PubMed identifier number.
694 Chapter 21: Genetic Approaches to Treating Disease
differentiated cells
For some disorders, the use of iPS cell-derived cells also means that drugs can
now be tested directly on the relevant human cells. Drugs can be tested for th eir
ability to reduce the progression of the pathogenesis or even to counteract it in
some way, and the toxicity levels can be determined directly, rather than relying
on animal studies.
The potential ofiPS cells for efficient autologous cell therapy remains unclear.
Safety concerns were an issue with the original technology, which used retroviral
vectors to transfect reprogramming genes, one of which was a known oncogene,
c-Myc. However, new methods do not use c-Myc, and avoid genome integration
by using non-integrating vectors or by transducing cells with the pluripotency
inducing transcription factor proteins themselves, or miRNAs and other small
molecules involved in the same pathways. Performing efficient precisely directed
differentiation remains a major challenge. Like ES celis, undifferentiated iPS cells
can give rise to tumors, and there is considerable ignorance about the molecular
details of the long, multi-step differentiation pathways required to convert plu
ripotent stem cells to differentiated somatic cells.
Transd ifferentiation
An alternative to iPS cells is to use lineage- reprogramming methods that would
result in limited dedifferemiation to produce oligopotent progenitors of a desired
cell type, or a direct cell conversion (transdifferenliation) to produce the desired
cell type (see Box 21.3 Figure l, pathways C and D). For example, pancreatic exo
crine cells could be reprogrammed to make pancreatic beta cells that could pos
sibly be induced to join pancreatic islets to replace beta cells lost in type 1 diabe
tes. Byjusttakingone step back in differentiation or a sideways step, differentiation
toward producing the desired cell type would be much simpler or not even
required. Additionally, disruption to the epigenetic marks that are set during
PRINCIPLES AND APPLICATIONS OF CELL THERAPY 695
TABLE 21 .3 SOME PRINCIPAL TARGETS FOR STEM CELL THERAPY AND EXAMPLES OF SUCCESSFUL CELL REPLACEMENT
~
Disease/injury Cell affected
----------------------------------------~~
Stem cell therapy Comments/references
Leu kemia/ lymphoma hematopoietic allogeneic bone marrow transplantation from successful over many decades
stem cells killed by matched donor
chem otherapy
Eye injury/ disease limbal stem cells in limbal stem cells are taken from the healthy Pe llegrini et al. (1 997) Lancet 349, 990- 993;
one eye eye, expanded ex vivo, and transplanted back PMID 9 100626 and Kolli et al. (2010) Stem
into the affected eye Cells in press; PM ID 20014040
Spinal cord injuries neurons tran splantation of fetal neural stem cells o r ES sOme evidence for successful treatment;
cell-derived oligodendrocyte progen itors first target for approved clinical tria ls using
ES cells
Heart disease cardiomyocytes (hea rt d irect injection of bone marrow stem cells conflicting views o n the degree of success,
muscle cell s) (including hematopoietic stem cells and and uncertain ty regarding the underlying
mesenchymal stem cells) cause of any clinical improvement
Stroke neuro ns human fetal neu ral stem cells used as cell first clinical trial underw ay
source
Parkinson disease midbrain dopaminerg ic fetal neuronal cells used as cell source difficulties in ensuring that grafted neurons
neuro ns do not themselves develo p disease after a
few years
Rat Parkinson d i sease m idbrain d opaminergic iPS cells were differentiated into neural Wernig et al. (2008) Proc. Natl Acad. Sci. USA
model neu rons precursor cells that were transplanted into the 105,5856-586 1; PMID 18391 196
fetal mouse bra in; t he transplanted cell s w ere
functionally integrated, causing improvements
in symptoms
Brain disease (in mouse glia l celJs as a resu lt of human feral glial progenitor cells were injected Windrem et al. (2008) Cel/Stem Ce1/2,
shiverer mutant) mye lin deficiency into the nervous system to correct abnormal 553-565; PMID 18522848
brain development
Bl indness (in Crx-deficient degenerat ion of human ES cells were differentiated to give Lamba et al. (2009) Cell Stem Cell 4, 73-79;
mice) photoreceptors retinal ce lls that were transplanted into the PMID 19128794
retinas of the m o use Crx mutant and restored
some v isual function
Germ-line gene therapy involves making a genetic change that can disease genes are carried by affected people; the great majority
be transmitted down the generations. This would most probably be are in healthy heterozygotes. For a recessive disease affecting one
done by genetic manipulation of a pre-implantation embryo. but it person in 10,000 (q2 :=; 1/10,000; q = 0.01) only 1% of disease alleles
m ight occur as a by-product of a treatment aimed at somatic cells are in affected people. Whether or not we stop affected people
that incidentally affected the patient's germ cells, Pure somatic gene from transmitting their disease genes (either by germ-line therapy
therapy treats only certain body cells of the patient without having any or by the cruder option of sterilization as the price of treatment)
effect on the germ line. Technical difficulties currently make germ-line has very little effect on the frequency of the disease in future
therapy unrealistic. Even if these problems are solved. ethical concerns generations.
will remain: genetic man ipulation of the germ line is prohibited by law For fully penetrant dominant conditions, all the disease alleles
in many countries. are carried by affected people, and for X-linked recessives the
The main argument against germ-line therapy is that these proportion is one-third. But the dream of eliminating the disease
treatme nts are necessarily experime nta l. We cannot foresee every once and for all from the population falls down because, as the
consequence, and the risk is minimized by ensuring that its effects equations in Box 3.7 show, most serious dominant or X-linked
are confined to the patient we are treating. This would imply that diseases are maintained in the population largely by recurrent
once we have enough experience of somatic therapy, it might mutation.
be ethical to proceed to germ-line therapy. However, even if the A third, and cogent, objection is that germ-line therapy is simp ly
initial treatment were performed with informed consent, later not necessary. Candidate couples would most probably have
generations are given no choice. This leads to the view that we dominant or recessive Mendelian disorders (re<:urrence risk 50%
have a responsibility not to inflict our ideas or products on future and 25%, respectively). Given a dish containing half a dozen
generations, and so germ-line therapy will always be unethical. IVF embryos from the couple, it would seem crazy to select the
The argument in favor of germ-line therapy is that it solves the affected ones and subject them to an uncertain procedure, rather
problem once and for all. Why leave the patient's descendants at than simply to select the 50% or 75% of unaffected ones for re
risk of a disease if you could eq ually well eliminate the risk? implantation.
The argument in favor needs to be set against the population The argument that somatic therapy is less risky than germ-line
genetic background. The Hardy-Weinberg equations set out in therapy seems incontrovertible. In particular, the safer non-integrating
Chapter 3 (p. 84) are highly relevant here; in addition there is a vectors could not be used for germ-line therapy. We are less convinced
strong practical argument that germ-line therapy is unnecessary. by the genera l argument that it is unethical to impose our choices on
For recessive conditions, only a very small propor tion of the future generations.
The designer baby catchphrase encapsulates fWO sets of worries. First, only a handful of embryos, and usually two or three are implanted
people w ill use in vitro fertilization and pre-implantation diagnosis to maximize the chance of success. 5ele<:tion on multiple criteria is
to select embryos with certain desired qualities and reject the rest simply not compatible with having enough embryos to implant. More
even though they are normal. This contrasts with the use of the same generally, nature has endowed us with a simple and highly agreeable
procedure to avoid the birth of a baby with a serious d isease. Second, method of making babies that is very effective for the large majority
people will use the thera peutic technologies described in this chapter, of couples, and it is hard to imagine most people abandoning this in
not to treat disease but forgeneticenhancemenr; that is, endowing favo r of a long~ drawn- out, unpleasant. and highly invasive procedure
genetically normal people with superior qualities. that is very expensive and has a low success rate.
The first scenario is already with us in the form of pre-implantation Genetic enhancement is a difficult question - or it will become
sex selection and a few highly publicized cases in which a couple have one, once we have any idea wh ich genes to enhance. On the one hand,
sought to ensure that their next child can provide a perfectly matched parents are supposed to do what they can to give their children a
transplant tissue to save the life of a sick child. Current cases involve good start in life; on the other hand, options available only to the rich
a transplant of stem cells from cord blood. That would do no harm to are often seen as buying an unfair advantage. Perhaps fortunately, we
the baby-it would be different if it were proposed to take a kidney. are a long way from identifying suitable genes, even if the techniques
It is often suggested that these cases are the start of a slippery slope for using them were available. Attempts to produce genetically
that leads inevitably to demands for very extensive spec ification of the enhanced animals have not been a success and in some cases have
genotype of the baby- as envisaged in the phrase designer baby. The been spectacular failures. In the long term, the possibilities must
reality is different. Simply selecting for HLA compatibility means that be immense, and there will surely be very difficult ethical issues to
only one in four embryos is selected. Most IVF procedures produce confront.
698 Chapter 21: Genetic Approaches to Treating Di sea se
~_ ~8Q
gene A gene A 8C1
diseased cells normal phenotype diseased cells correct normal phenotype
(lack gene (increase In with mutant allele endogenous (mutation corrected)
A product) gene A product ) (Ar.lul) making gene
harmful product
A" U'
~ ----+ 08
•,
) I
-:~::;}
l
V.
antisense
oligonucleotide, 0
toxin gene .~
cells killed
by toxic prodrug
metabolites
8
horizontal bar) of gene A. In other cases
.,
,j ""'I .......~ ~
f0
shown in (B) and (C) a mutant gene (Anlut)
, makes a harmful product that is shown
• I •
gene
\ r _. I
, _.
nondiseased cens.
"I
.. •
greatly overexpress a normal antigen in a way that can provoke weak: immune
responses. Immunotherapy is designed to amplifY the naturally weak
immune response to cancer cells.
in vivo ex vivo
Flgur. 21.8 /n vivo and ex vivo gene
~ X:::.J-- ~- }- therapy. In ex vivo gene therapy, cells are
remove 1
.J,.. "~
e,,x. ~
cells; ~~-::J
removed from the patient, genetically
modified in some way in the laboratory, and
c:::;;~
cells ~ am~i~
,-:,~ ~ C!:i~
Ii.. ~:t3
•. -_ ,,~
returned to the patient. This allows just the
appropriate cells to be treated, and they
pat ient's cells some cells ~ can be checked to make sure they have the
A now X+ --.. "'-
_!Co .,_'" correct genetic modification before they are
X+ cells returned to the patient. For many tissues,
this is not possible and the cells must be
return celiS to ooay modified within the patient's body (in vivo
gene therapy),
700 Chapter 21: Genetic Approaches to Treating Disease
TABLE 21,4 VIRA L VECTORS OF USE IN MAMMALIAN GENETRANSFECTION AND GENE THERAPY
Virus class Viral genome Cloning Interaction Target cells Transgene Comments
capacity with host expression
genome
y-Retroviruses ssRNA 7-8 kb integrating dividing celts only long-lasting moderate vector yield 3 ;
(oncoretroviruses) risk of activation of
cellula r oncogene
Lentiv iruses ssRNA; -9 kb upto8kb integrati ng dividing and (eng-last ing high vecror yield<l; risk of
non-dividing cells; and hi gh-leve l oncogene activation
tropism varies expression
Adenoviruses dsDNA; often 7.5 kb, non-integrating divid ing and transient but high- hig h vector yield 3 ;
upto38kb but up to non-divid ing cells level expression immunogenicity can be
35 kb for a maj or problem
gutless vectors
Adena-associated ssDNA; 5 kb < 4.5 kb non-integrating dividin g and high-level high vector yield J ;
viruses non-dividin g cells expression in small cloning capacity
medium to long but immunogen icity is
te rm (yea r) less significant t han for
adenovirus
Herpes simplex dsDNA; >30 kb non-integrating centra l nervous potential for long- able to establish lifelong
virus 120-200 kb system lasting expression latent infections
aHigh vector yield, 1012 transducing units/ml; moderate vector yield, 10 10 transdUCing units/mL
Some viruses that are used to make gene therapy vectors are naturally patho
genic, and some can potentially generate strong immune responses. For safety
reasons, viral vectors are generally designed to be disabled so that they are
replication-defective. However, replication-competent viruses have sometimes
been used as therapeutic agents, notably oncolytic viruses for treating cancers.
Where a virus is naturally immunogenic, the viral vectors are modified in an
effort to reduce or eliminate the immunogenicity.
Retroviral vectors
Retroviruses-RNA viruses that possess a reverse transcriptase-deliver a nu
cleoprotein complex (preintegration complex) into the cytoplasm of infected
cells. This complex reverse- transcribes the viral RNA genome and then integrates
the resulting cDNA into a single, but random, site in a host cell's chromosome.
Retroviruses offer high efficiencies of gene transfer but can be generated at titers
that are not so high as some other viruses, and they can afford moderately high
gene expression. Because they integrate into chromosomes, long-term stable
trans gene expression is possible, but uncontrolled chromosome integration con
stitutes a safety hazard because promoter/enhancer sequences in the recom
binant DNA can inappropriately activate neighboring chromosomal genes.
y-Retroviral vectors are derived from simple mouse and avian retroviruses
that contain th,.ee transcription Wlits: gag, pol, and enll (Figure 2 L.9). In addition,
a cis-acting RNA element, 'If, is important for packaging, being recognized by viral
proteins that package the RNA into infectious particles. Because native y-retrovi
ruses transform cells, the vector systems need to be engineered to ensure that
they can produce only permanently disabled viruses. y- Retroviruses cannot get
their genomes through nuclear pores and so infect dividing cells only. This limi
tation can, however, be turned to advantage in cancer treatment. Actively divid
ing cancer cells in a normally non-dividing tissue such as brain can be selectively
infected and killed without major risk to the normal (non-dividing) cells.
Lentiviruses are complex retroviruses that have the useful attribute of infect
ing non-dividing as well as dividing cells, and can be produced in titers that are a
hundredfold greater than is possible for y-retroviruses. In addition to expressing
late (post-replication) mRNAs from gag, pol, and enlltranscription units, six early
702 Chapter 21: Genetic Approaches to Treating Disease
gag/pol
•
env
) 0 0 0
~0 0 0 0 0
cP. gaS/poi
proteins
with integration. Recombinant viral
genomes are packaged into infective but
construct construct tJ. D.. D..6 ,6 6 repfkatio n~ defjcient virus particles, which
nucleus 6 .6. f:J.. L:l. bud o ff from the cell and are re<overed from
env proteins the supernatant. (Bl Lentiviruses, such as the
~~w
pol, and envgenes, they possess a variety
~~~
vpu, ror, rev, and net. that encode proteins
viral proteins are produced before replication of the virus, The early proteins
include two regulatory proteins, tat and rev, that bind specific sequences in the
viral genome and are essential for viral replication, Like other retroviruses, they
allow long-term gene expression, Results with marker genes have been promis
ing, shovving prolonged in vivo expression in muscle, liver, and neuronal tissue.
Most lentiviral vectors are based on HN. the human immunodeficiency virus,
and much work has been devoted to eliminating unnecessary genes from the
complex HN genome and generating safe packaging lines while retaining the
ability to infect non-dividing cells. Lentiviruses appear to have a safer chromo
some integration profile than 'Y-retroviruses; self- inactivating lentivirus vectors
provide an additional layer of safety.
Adenoviral and adeno-associated virus (AAV) vectors
Adenoviruses are DNA viruses that cause benign infections of the upper respira
tory tract in humans. As with retroviral vectors, adenoviral vectors are disabled
and rely on a packaging cell to provide vital fun ctions, The adenovirus genome is
relatively large, and gudess adenoviral vectors, which retain orily the inverted ter
minal repeats and packaging sequence, can accommodate up to 35 kb of thera
peutic DNA (Agure 21-10).
Adenovirus virus vectors can be produced at much higher titers than ,,(-retro
viruses and so transgenes can be highly expressed. They can also efficiently
transduce both dividing and non-dividing cells. A big disadvantage is their
immunogenicity. Even though a live re plication-competent adenovirus vaccine
has been safely administered to several million US army recruits over several
decades (for protection against natural adenoviral infections) , unwanted immune
reactions have been a problem in several gene therapy trials, as described below.
Moreover, because these vectors are non-integrating, gene expression is short
PRINCIPLES OF GENE THERAPY AND MAMMALiAN GENETRANSFECTION SYSTEMS 703
,.1
(B)
transcription units II to lS are focused on
ITR ITR producing structural proteins for packaging
rep Ie cap I ~ the genome. In the lirst series of adenovirus
vectors, ea rly transcription units, notably
( - - --- ------------ ---- ---------)
REGION REPLACED E1, were elimin ated and could be replaced
by insert DNA, but the more recent gutless
term; repeated administration would be necessary for sustained expression, vec tors can replace all except the terminal
which could only exacerbate the immune response. repeats and the 'If sequence. (B) Adeno
Adeno-associated viruses (AAVs) are nonpathogenic single-stranded DNA associated virus (MV) has a small, very
viruses. They are completely unrelated to adenoviruses; their name comes from simple 4.7 kb single-stranded DNA genome
their reliance on co-infection by an adenovirus (or herpes) helper virus for repli with terminal inverted repeats (lTRs), and just
two open reading frames (ORFs). The rep ORF
cation. The genome contains only two genes: the rep gene makes proteins that
encodes various proteins required for the
control viral replication, structural gene expression, and chromosome integra AAV life cycle, and the cap ORF makes the
tion; the cap gene encodes capsid structural proteins. Multiple different sero three capsid proteins. In a recombi nan t AAV
types of MV have been isolated and some have usefully narrow tropism. For vector, the rep and cap ORFs are replaced by
example, the MV9 variant is highly tropic for the spinal cord and brain astro a deSired transgene. The functionsnecessary
cytes, and may be useful in future treatments of spinal cord injuries. Another for production of virus particles (including
important advantage ofMV vectors is that they can permit robust in vivo expres rep) are provided by a packaging cell.
sion of transgenes in various tissues over several years while exhibiting little
immunogenicity and little or no toxicity or inflammation.
Unmodified human MV integrates into chromosomal DNA at a specific site
on 19q13.3-qter. This highly desirable property would provide long-term expres
sion without the safety risks of random insertional mutagenesis. Unfortunately,
the specificity of integration is provided by the rep protein, and because the rep
gene needs to be deleted in the constructs used for gene transfer, chromosomal
integration of M V vectors OCCLUS randomly (MV is disadvantaged by having a
small genome; even vectors in which 96% of the MV genome has been deleted
accept inserts with a maximum size ofjust 4.5 kb).
Other viral vectors
Herp es simplex viruses are complex viruses containing at least 80 genes; they are
tropic for the central nervous system (CNS). They can establish lifelong latent
infections in sensory ganglia, in which they exist as non-integrated extrachromo
somal elements. The latency mechanism might be explOited to allow the long
term expression of transferred genes, in the hope that they will spread through a
synaptic network. Their major applications would be in delivering genes into
neurons for the treatment ofdiseases such as Parkinson disease and CNS tumors.
Vectors have a high insert capacity. Practical vectors are still at an early stage of
development.
Modified vaccinia virus Ankara (MVA) is a highly attenuated vaccinia virus
that has been used safely as a smallpox vaccine. Infection with MVA results in
rapid replication of viral DNA, but it is largely non-p ropagative in human and
mammalian cells. Replication -competent MVA vectors have been constructed
and used largely for transferring suicide genes to kill tumor cells in cancer gene
therapy.
704 Chapter 21: Genetic Approaches to Treating Disease
to DNA and so ca use the DNA to be significantly cOinpacted. To form DNA nano
N-terminal cysteine to which the PEG is covalently bound). Within this complex,
the DNA forms a very condensed structure. Because of their much reduced size,
to dividing and non-dividing cells and have a plasmid capacity of at least 20 kb.
r~
cytosol non-dividing cellsthe precise mechanism
cytoplasmic of entry into the nucleus is unclear. [From
delivery
nucleus Simoes S, Filipe A, Faneca H et at (2005)
Experr Opin. Drug Deliv. 2, 237-254.
Taylor & Francis ltd.)
to kill hannful cells, but instead to counteract the harmful effects of genes within
cells. Some conditions are caused by gain-oI-function mutations (Chapter 13,
p. 428) or a dominant-n egative effect (Chapter 13, p. 431), and the problem is a
resident gene that is doing something positively harmful. Diseases that might
benefit from sllch strategies could include cancers that arise through activated
oncogenes, dominant Mendelian conditions other than those caused by loss of
gene function, and autoimmune diseases. Infectious diseases might also be
treated by targeting a pathogen-specific gene or gene product.
In principle, two groups of strategies could be used to counteract a gene's
harmful effects. One way is to specifically block expression of the harmful gene
by downregulating transcription, by destroying the transcript, or by inhibiting
the protein product. In Section 21.2 we outlined various methods that inhibit
gene expression at the protein level; here, we will consider methods that inhibit
or cleave specific RNA transcripts. A second type of strategy seeks to restore the
normal gene function in some way. Eitherthe DNA sequence is altered to correct
a genetic defect, or alternative gene splicing is artificially induced so as to bypass
a causative mutation at the RNA level.
Whatever method is used, some sort of agent or construct mllst be delivered
to a target cell and made to function there. In that respect, the problems of effi
cient delivery and expression are similar to those described in the previous sec
tion. In the case of dominant diseases or activated oncogenes, there is the addi
tional problem of designing an agent that will selectively attack the mutant allele
without affecting the normal allele.
ensure that sufficient intact antisense oligonucleotides (AOs) arrived at the target
tissue, clinical applications focused on applying the AOs directly to diseased tis
sue. In 1998, Vitravene® (fomivirsen) became the first FDA-approved therapeutic
AO; it is applied directly to the eyes to treat cytomegalovirus retinitis in AlDS
patients: The cornea (and anterior chamber of the eye) is an immunologically
privileged site and so a high concentration of reagents can be administered
directly by injection.
Subsequent technological developments produced a second generation of
AOs based on nucleic acid analogs tha t are much more chemically stable than
normal oligonucleotides, such as 2-0-methyl phosphorothioate oligonucle
otides, morpholino phosphorothioate oligonucleotides, and locked nucleic acids
(Figure 21.12). As we saw in Section 20.3, morpholino oligonucleotides are now
widely used to knock down genes in some model organisms, but there have
been few therapeutic applications that exploit morpholinos or other second
generation AOs to block gene expression. Although second-generation AOs are
chemically stable, off-target effects can occur when the AO inhibits RNAs other
than the target.
In addition, as for any other oligonucleotide, efficient delivery of AOs to cells
has been a Signifi cant problem. In an effort to increase the in vivo uptake of AOs
by cells, some second-generation AOs, such as the morpholino oligonucleotides,
have been modified by conjugating peptides to them; however, this can some
times provoke immune reactions.
Therapeutic ribozymes
Aseparate wave of RNA therapeutics focused on RNA enzymes (ribozymes), such
as the plant hammerhead ribozyme, that naturally cleave other RNA target mol
ecules. Genetically modified ribozymes have been made in which a catalytic,
RNA-cleaving domain from a natural ribozyme is fused to an antisense RNA
sequence designed to bind to transcripts from a specific target gene of interest.
The modified ribozyme would bind to the transcripts of the target gene and
cleave them. Ribozymes have, however, not been so effective in practice: clinical
trials to treat cancer and other diseases have had rather disappointing
outcomes.
(A) (8)
o___"(ybase o\Oybase
I\OH 'e
I \ o.. .
o
°I
o=p-o- o=p- s
I
I
o\Oybase o\Oybase
I
r----'OH r----'oMe
° °,
(C)
(0) ~
O=P- N
I
o
/
- --(r base
"-
'gO Figure 21.11 Second-generation
antisense oligonucleotides. (A) The
normal ribose phosphate backbo ne
of RNA. (B) Structure of a 2-0-methyl
phosphorothioate oligonuclemide.
N
°I
o=p -0
(el Structure of a morpholino
phosphorothioate oligo nucleotide.
~
(D) Structure of an oligonucleotide known
as a locked nucleic ac id.
RNA AND OLIGONUCLEOTIDE THERAPEUTICS AND THERAPEUTIC GENE REPAIR 707
Therapeutic siRNA
[n the 1990s, the discovery of RNA interference (RNAi) transformed the potential
for specific inhibition of gene expression and provided a stimulus to the develop
ment of RNA therapeutics. Experimentally, the object is to produce a dauble
stranded RNA with a base sequence corresponding to a transcribed part of the
target gene, and so provoke natural cellular defense pathways to destroy tran
scripts of the specific gene. Long double-stranded RNA cannot be used for this
purpose in human (or other mammalian) cells because it provokes interferon
responses that result in nonspecific inhibition of gene expression (see Chapter 12,
p. 391). However, short double-stranded RNA is highJy efficient at inducing
knockdown of the expression of specific genes.
Performing RNAi in mammalian cells often involves transfecting rransgenes
that express a short hairpin RNA. After transfection into cells, the expressed RNA
is cleaved by the cytoplasmic enzyme dicer to give a double-stranded siRNA
about 21-23 nucleotides long (see Figures 12.3 and 12.4). Other RNAi therapies
deliver just the naked siRNA.
siRNA technology is highly efficient at gene knockdown in mammalian cells
and is transforming our understanding of human gene function. There is there
fore considerable excitement about its therapeutic potential. Promising thera
peutic targets wo uld include viral infections, cancers (targeting oncogenes), and
neurodegenerative disease (targeting harmful mutant alleles). The method has
been used with some success In treating various animal models of disease, but
delivery is comparatively inefficient because of the reliance on non-viral vector
systems.
Some early clinical trials using siRNA have focused on the immunologically
privileged eye. One principal target is the VEGF gene that underlies macular
degeneration (as described in the section on therapeutic aptamers on page 686)
and has involved simply injecting naked siRNA. Applying RNAi to target diseases
with less localized pathology is a much more formidable task. However, systemic
and reasonably effective delivery has been possible in animal models by attach
ing siRNAs to cholesterol, to ferry the siRNA through the bloodstream.
1 ,'rT--i'''''''',--;--,''~ N
genes within cells. See Chapter 20 p. 654
and Figu re 20 .13 for the background to
zinc linger nucleases. A pair of zinc finger
nucleases (ZFNS) each containing four zinc
fin gers can be designed to specifically
bind within a cell to a target sequence
binding of two ZFNs; dimeric
N that includes a defective mutation (red
DNA·cleaving domains (OCD)
5'-------- --------5'
3'
-3' as templates for synthesizing new DNA to
repla<:e the previously existi ng sequence.
A plasmid with a rele vant segment of the
~ = ::::
11 Invasion by donor
DNA strand and
,epa;,
normal gene sequence (bold blue lines) may
be provided as a donor DNA. By acting as a
template for synthesizing new DNA from a
portion of the homologous sequence (thick
blue lines), it facilitates replacement of the
5' 3' pathogenic point mutation by the normal
3' 5' sequence.
repaired sequence
BOX 21 .8 SOME OF THE MillaR UPS liND DOWNS IN THE PRIICTICE OF GENE THERAPY
1990 com mencement of th e first clin ical trial involving gene therapy, using recombin an t y-ret roviruses to overcome an autosoma l
recessive form of severe com bin ed immun odeficiency (SClD) due to deficie ncy of adenosine dea m inase (ADA); t he tria l was hailed as
a success but pati en ts had also been t reated in paralle l wit h standard enzyme replacemen t using polyethylene glycol (PEGl-ADA
1999 deat h of Jesse Gelsinger In Septem ber 1999, ju st a few days after receivi ng recom binant adenoviral particles by intra-hepat ic
inj ection in a cli nical t ria l of ge ne th erapy for orni th ine t ranscarbamylase deficiency; a massive immune response to the adenovirus
particles resulted in multiple orga n fa ilu re
2000 the fi rst unambiguous gene therapy success involved using recombinant "tretroviruses to treat an X-linked form of SClO (see
Figu re 21.1 6); howeve r, several of the treated chi ldren went on to develop leukemia, w ith one death, as a result of insertional
act ivation of cellular oncoge nes (see t he main text)
2006 transient success in using y-retroviral gene therapy to treat two ad ult patients w ith chronic granu lomatous d isease, a recessi ve
disorder that affect s phagocyte function causing immunodeficiency; despite initial success, both patients went on to show
t ransgene silencing. and also myelod ysplasia as a result of insertional activation of cellular genes (see the m ain text); 27 mon ths
after gene therapy one pa tient died from overw helming sepsis; see Ott et at (2006) Na r. Med. 12, 401 -409; PMIO 165829 16 and
Stein et at (2010) Nat. Med. 16, 198-204; PMIO 20098431
2006 report o f limited success in gene therapy for hemophilia B: MV2 vectors were used successfully t o t ransd uce hepat ocytes and
express a Factor IX transgene; however, an im mune resp onse led to destruction of th e transduced ce ll s; see Man no et al.
Nat. Mod. 12. 342- 34 7; PM ID 16474400
2009 successful g ene the ra py for ADA deficiency reported by Aiuti et aL N. Engl. J. Med. 360, 447-458; PMI O 19179314
2009 fi rst report of successful ge ne therapy for a central nervous system disorder, X-l inked adrenoleukodystrophy, and the fir st using a
lent ivira l vector; see Cartie r et a!. Science 326, 818-823; PMIO 19892975
2009 repor t of successful in vivo ge ne t herapy usin g retinal injection of recombina nt AAV to trea t Leber congenita l amauros is, a form
of ch ildhood bli ndn ess; the significa nt post-treatment gain in vision was found to be maintai ned after one yea r; see Ci dec iya n et al.
N. Engl. 1. Med. 361, 725-7 27; PMID 19675341
benefits. By contrast, dominant d isorders- in which one ofthe two alleles is nor
mal and makes a gene product- present much greater challenges. Getting the
correct gene dosage to treat haploinsuffi ciency could be problematic, and coun
teracting the effect of a mutant allele that makes a harmful product would require
very efficient and selective silenci ng of the mutant allele, while leaving the nor
mal allele unaffected.
monogenic naked!
diseases plasmid
7.9% DNA
17.7%
CD34+ stem cel ls enriched retroviral vector carrying Fig UN 21 .. 15 Ex vivo gene therapy
by magnetic bead-antibody IL2RG gene for X-linked severe combined
immunodeficiency disease (X-SClDl. Bone
marrow was removed from the patient and
antibody affini ty was used to enrich for cells
expressing the CD34 antigen, a marker of
he ma topo ietic stem cells. To do this, bone
marrow cell swe re mixed with paramagnetic
.~
beads coated with a CD34-specific
monoclonal antibody; beads containing
./ 1 -
bound cell swere removed by using a
magnet. The transduced stem cells were
expanded in culture before being returned
to the patient. For detail s, see Cavazzana
Calvo M et at (2000) SCience 288, 669-672;
PMIO 10784449 and Hacein-Bey-Abina 5 et
al. (2002) N. Eng/,). Med. 346, 11 85- 1193;
marrow aspirated und /
general anesthetic er - ' " PMID 11 961 146,
infuse 14-38 million transduce cells in plastic bag
ceils per kg body weight for 3 days; cells multiply
5--8-fold
Figure 2L8), Disorders res ulting from recessively inherited defects in white blood
cell fun ctio n were among the first targets for gene therapy White blood cells have
key roles in the immune system (see Tables 4,7 and 4,8), and defects in their func
tion can give rise to severe immunodeficiencies.
The first definitive gene therapy success came in treating severe combined
immunodefici ency (SClD), in which both Band T lymphocyte function is defec
tive, Patients are extremely vulnerable to infectious disease, and in some cases
they have been obliged to live in a sterile environment, The most common form
of SClD is due to inactivating mutations in the X-linked JURG gene that makes
the Yc common gamma subunit for multiple interleukin receptors, including
interleukin receptor 2, Lymphocytes use interleukins as cytokines to signal to
each other and to other immune system cells, and so lack oftheyc cytokine recep
to r subunit had devastating effects on lymphocyte and immune system function.
Another common form of SClD is due to adenosine deaminase (ADA) deficiency;
the resulting build-up of toxic purine metabolites kills T cells. Because T cells
regulate B-cell activiry, both T- and B-cell function is affected.
scm gene therapy has involved ex vivo y-retroviral transfer of JURG or ADA
coding sequences into autologous patient cells (see Box 2L8) . To aid the chances
of success, bone marrow cells we re used because they are comparatively enriched
in hematopoietic stem cells, the cells that gi ve rise to all blood cells (see Figure
4,17); a further refinement "'as to select for cells expressing CD34, a marker of
hematopoietic stem cells (Figure 21.1 6), By 2008 Fischer et aJ. reported that 17
out of 20 X-linked SClD patients and 11 out of 11 ADA-deficient SClD patients
had been successfully treated and retained a functional immune system (for
more than 9 years after treatment in the earliest patients).
With the use of retrovirus vectors, genes could be inserted into chromosomes;
the therapeutic DNA would be stably transmitted after cell division, giving long
lasting trans gene expression, This clear advantage was eroded by the lack of con
trol over where the transge nes integrated. In one clinical trial for X-linked scm,
four patients developed T-acute lymphoblastic leukemia after retroviral integra
tion because promoter/enhancer sequences in the inserted DNA inapp ropriately
activated a neighboring LM02 proto-oncogene. Activation of LM02 is now
kno;vn to promote the self-renewal of pre-leukemic thymocytes, so that commit
ted T cells accumulate additional genetic mutations required for leukemic
transformation,
Simil ar protocols were applied to treating patients with an X-linked form of
chronic granulomatous disease (CGD) . Patients with CGD have immunodefi
ciency arising from mutations in any of four genes that encode subunits of the
NADPH oxidase complex, NADPH oxidase is involved in making free radicals and
other toxic small molecules that phagocytes use to kill the microbes that they
engulf; a defective NADPH oxidase results in defective phagocyte fun ction.
712 Chapter 21: Genetic Approaches to Treating Disease
Ovarian cancer t umo r cells gene addition to restore tumor intraperitoneal inj ect ion of retrovirus or adenovirus encoding full
suppressor gene function length eDNA encoding p53 or BReA 1, wit h the hope of restoring cell
cycle control
Ovarian ca ncer tumor cells oncogene inactivation inject adenovirus encodi ng a scFv antibody to ErbB2; hope to
inactivate a growth signal
Malignant tumor-infiltrating genetic manipulation of t umor extractTILs from surgically removed tumor and expand in culture;
melanoma lymphocytes cel ls to trigger apoptosis infectTILs ex vivo w ith a re t roviral vector expressi ng TNF-a, in fuse
!TIl') into patient; hope that Tl l s will target remaining tumor celis, and the
TN F-o: wil l kill them (see Figure 21.7E fo r the principle)
Vari ous tumors tumor cell s increase antigenicity of tumor transfect tumor cell s w ith a retrovirus expressing a cell surface
cells so that immune system antigen such as HlA-B7 or a cytokine such as IL-12, Il-4, GM-CSF, or
destroys tumor y-interfero n; hope that this enh ances the immunogenicity of the
rumor, so that the host immune system destroys it; often done ex vivo
with lethally irradiated tumor cells (see Figure 21.7E for the prinCiple)
Prostate cancer dendritic cells modify dendritic cells to treat autologous dendrit ic cells with a tumor antigen or eDNA
enhance tumor-specific expressing the antigen, to prime them to mou nt an enhanced
immu ne reaction immune respo nse to the tumor cells (see Figure 21.7E for t he
principle)
Maligna nt glioma tumor cells genetic modification o f tumor inject a retroviru s expressing thymidine kinase (TK) or cytosine
(brain tumor) cells so that they convert a deaminase (CD) into the tum or; only the dividing tumor cells, not
non-toxic prod rug into a toxic the su rrou nd ing non-dividin g brain cells, are infected; then treat
compou nd that ki lls them with ganciclovir (TK-positive cells convert th is to the toxic gancidovir
t riphosphate) or 5-fluorocytosine (CD-positive cell s convert it to
the toxic 5-fluorouracill; virally infetted (dividing ) cells are killed
selectively (see Figure 21.1 7)
Head and neck tumor cells use o f oncolytic viruses that are inject ONYX-015 engineered adenovirus into tumor; the virus can on ly
tumors engineered to selectively kill replicate in p53-deficlent ce ll s, so it selectively lyses tumor cells; this
tumor cells treatment was effective when combined with syst em ic chemotherapy
TNF, tumor necrosis factor; IL, interJeukin; GM -GSF, granulocyte/ macrophage colony-stimulating factor. See the National Institute of Health clinica l trials
database at http://cli nica ltri als.gov! fora comprehensive survey.
714 Chapter 21: Genetic Approaches to Treating Disease
• Use of oncolytic viruses that are engineered to kill tumor cells selectively
Genetic modification of tumor cells so that they, but not surrounding non
tumor ceils, convert a non-toxic prod rug into a toxic compound that kills
them (Figure 2 l.J 7).
As at 2009, three cancer immunotherapies were at the phase III stage of clinical
trials. Also at the phase [l] stage are adenoviral-based transfer of a herpes simplex
virus thymidine kinase suicide gene for treating glioblastoma (see Figure 21.17)
and adenoviral-based expression of p53 for the treatment of head and neck can
cers and for U-Fraumeni (OMIM 151623) tumors.
A general problem is that the methods for killing cancer cells are far from
100% efficient, and even when the efficiency is very high so that tumors shrink,
they can grow again as surviving cancer cells proliferate. The idea of curing can
cer is being replaced by a more realistic long-term managem en t of cancer
disorders.
CONCLUSION
Genetic technologies can be used in the treatment of disease, regardless of
whether the disease has a genetic cause or not. This can be part of a treatment
regime that involves conventional drugs or vaccines, but genetic manipulations
can also be used in the production of drugs and vaccines. Genetic engineering
has also been used to make therapeutic proteins. Expression cloning in microor
ganisms, cultured cells, or transgenic livestock (in which the protein ofimerest is
expressed in milk or eggs) avoids the health risks associated with harvesting these
proteins from animal or human sources. Genetic engineering has also been
applied to make partly or fully human monoclonal antibodies. They are more
stable in human serum than rodent monoclonal antibodies and are consequently
beller suited for therapeutic purposes.
Genetic interventions are also being used in two developing and intercon
nected therapeutic strategies: cell replacement therapy (to replace cells lost
through disease or injury, enabling repair of damaged tissues) and gene therapy
(involving genetic modification of a patient's cells). Stem cells are important in
cell therapy because they are both self-renewing and also capable of being dif
ferentiated to make new tissue cells. Bone marrow transp lants are an established
form of stem cell therapy to treat blood disorders and hematopoietic stem cell
depletion after chemotherapy.
To avoid immune rejection, autologous cells are prefered for cell therapy. In
principle, they can be generated by artificially enhancing the mobilization and
differentiation of stem cells in the patient or by using laboratory nuclear repro
gramming methods to reverse the normal developmental progression of cells
toward specialization, wWch has opened new research and therapeutic opportu
nities. The transplant of the nucleus from an adult somatic cell into an enucle
ated oocyte allowed the cloning of Dolly the sheep and various other types of
animal species, and could in principle be used to make patient-specific embry
onic stem cells; however, the procedures are technically challenging and ethically
contentious.
Nuclear reprogramming of skin fibroblasts trom patients by ectopic expres
sion of specific transcription factors offers a new route to obtaining patient
specific and disease-specific pluripotent stem cells, with few ethical concerns.
716 Chapter 21: Genetic Approaches to Treating Disease
Such induced pluripotent stem (iPS) cells can be differentiated to provide hwnan
ceUular models of disease and are being used to test drugs; in principle they
could provide sources of cells for autologous cell therapy. Transdifferentiation
techniques hold the promise of making more subtle changes to cell lineage (e.g.
altering pancreatic exocrine cells into beta cells to replace those lost in type 1
diabetes).
In gene therapy, the aim is to transfer a DNA, RNA, or oligonucleotide into a
patient's cells to counteract or alleviate disease. Transfer into cells is often
achieved with virus vectors, which offer high efficiency of gene transfer and high
level. sometimes long-lasting, transgene expression. Safety issues are a concern
because integration into chromosomes is uncontrolled and can accidentally
activate a neighboring proto-oncogene. Non-viral vectors are safer, but gene
transfer rates are low and expression is often transient and comparatively weak.
The genetic modification could theoretically be directed to germ-line cells (a
permanent, transmissible modification). but for ethical reasons germ-line gene
therapy is not being attempted, and all current trials involve somatic gene ther
apy. The aim is often to add a gene that can functionally replace a defective gene,
but some therapies seek to block expression of a mutant or ectopically expressed
gene at the RNA level, usually using RNA interference, or at the protein level.
Another class of gene therapy strategy seeks to restore the function of a defec
tive gene, either at the RNA level by inducing exon slcipping so that a causative
mutation is bypassed, or at the DNA level by using zinc finger nucleases. These
enzymes are genetically engineered to act as restriction nucleases that cleave
genomic DNA at just one site, such as at the location of an inactivating point
mutation. The double- strand DNA break can trigger a cellular repair pathway
that uses homologous recombination to replace the mutant sequence by a
homologous wild-type sequence provided by a transfected plasmid.
Most clinical u-ials of gene therapy have sought to u-eat cancer, and here the
strategy is to kill cancer cells selectively; however, success has been limited. The
first clearly successful gene therapies have been for recessive blood cell disor
ders, which are more amenable because the cells are highly accessible and can be
genetically modified outside the body before being returned to the patient, and
because even a small amount of transgene expression can often bring about
some clinical benefit.
FURTHER READING
General Cardinale A & Biocca 5 (2008)The potential of intracellular
antibodies for therapeutic targeting of protein ~ misfo lding
Costa T, Scriver CR & Childs B (1985) The effect of Mendelian diseases. Trends Mol. Med. 14, 373-380.
disease on human health: a measurement. Am.). Med. Genet. Hoogenboom HR (2005) Selecting and screening human
21,231-242.
recombinant antibody libraries. Nat. Biotechno/. 23,
Treacy EP. Valle D & Scriver CR (2001)Treatment of genetic 1105- 1116.
disease. In The Metabolic And Molecular Basis Of Inherited Kim SJ, Park Y & Hong HJ (2005) Antibody engineering for the
Disease, 8th ed. (Scriver CR, Beaudet AL, Sly WS, Valle MD development of therapeutic antibodies. Mol. Cefl20, 17-29.
eds). McGraw-HilI. Lillica SG, Sherman A, McGrew MJ et al. (2007) Oviduct-specific
expression of two therapeutic proteins in transgenic hens.
Genetic approaches to identifying novel drugs and Proc Natl Acad. Sci. USA 104, 1771-1776.
drug targets Reichert JM (2008) Monoclonal antibodies as innovative